betterwithdata_cleaning_4/exploration/MEPS Scraping.ipynb

279 lines
12 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Medical Expenditure Panel Survey Data Scrape\n",
"### Because SAS never dies...###\n",
"----\n",
"\n",
"The MEPS data set is fairly easy to obtain. All the ZIP files are stored at a consistent URL, making our lives fairly easy.\n",
"\n",
"What's more challenging is actually making sense of the data. We ran into two issues:\n",
"\n",
"- The data comes in the SAS XPORT format. R is the only language we could find actually capable of reading these files. In order to open the data, we had to turn them into CSV for everyone to use.\n",
"- Fixed-width formatting means there are plenty of values that make no sense: \"-9\" is actually a flag for \"the interviewer did not record this data.\"\n",
"\n",
"This notebook contains the code needed to scrape the MEPS site for turning data into CSV format."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"library(httr)\n",
"library(foreign)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Normalize PUF\n",
"\n",
"In order to actually look up data, each file has a unique Public Use File (PUF) code associated with it. However, in order to download the actual file, the PUF has to be translated a bit.\n",
"\n",
"- Remove the `C-` in all names - `HC-178F` becomes `h178f`\n",
"- Remove the leading 0 in all names - `HC-048E` becomes `h48e`\n",
"- Lowercase"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"normalize_puf <- function(full_puf) {\n",
" stage_1 <- gsub(\"C-0\", \"\", full_puf)\n",
" stage_2 <- gsub(\"C-\", \"\", stage_1)\n",
" \n",
" tolower(stage_2)\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download PUF\n",
"\n",
"The download process ended up being a bit easier than expected. All the files are located in a single directory, making the job of scraping trivial. Each **PUF** has two zip files associated with it:\n",
"- The ASCII fixed-width data\n",
"- The binary fixed-width data (in XPORT format)\n",
"\n",
"The XPORT format contains information needed to reconstruct the column headers, so we use that to make our lives easy. However, we have to use a specfic library (foreign) to actually read the data. Finally, write everything to a CSV.\n",
"\n",
"One point of improvement in the future would be to remove the intermediate files. For right now, we're not going to worry too much about them."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"download_puf <- function(short_puf) {\n",
" puf_base <- \"http://meps.ahrq.gov/mepsweb/data_files/pufs/\"\n",
" puf_suffix <- \"ssp.zip\"\n",
" \n",
" zip_filename <- paste0(short_puf, \"ssp.zip\")\n",
" filename <- paste0(short_puf, \".ssp\")\n",
" puf_url <- paste0(puf_base, zip_filename)\n",
" download.file(puf_url, zip_filename)\n",
" \n",
" # unzip\n",
" unzip(zip_filename, files = filename)\n",
" saveName <- paste0(short_puf, \".csv\")\n",
"\n",
" # read sas file and return as csv file\n",
" mydata <- read.xport(filename)\n",
" write.table(mydata, file = saveName, sep = \",\")\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Actual Downloads\n",
"Now that the functions are defined, we can put them to use. Two example vectors have been provided to download all available data for emergency room visits, and for outpatient care over all available years."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "ERROR",
"evalue": "Error in eval(expr, envir, enclos): object 'w' not found\n",
"output_type": "error",
"traceback": [
"Error in eval(expr, envir, enclos): object 'w' not found\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Loading required package: xml2\n",
"\n",
"Attaching package: 'XML'\n",
"\n",
"The following object is masked from 'package:rvest':\n",
"\n",
" xml\n",
"\n",
"Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'Warning message:\n",
"In is.na(data): is.na() applied to non-(list or vector) of type 'NULL'"
]
},
{
"ename": "ERROR",
"evalue": "Error in lookup.xport(file): file not in SAS transfer format\n",
"output_type": "error",
"traceback": [
"Error in lookup.xport(file): file not in SAS transfer format\n"
]
}
],
"source": [
"# Example usage: Download data by PUF\n",
"#\n",
"# Emergency Visit PUF's\n",
"# pufs <- c('HC-160E','HC-152E','HC-144E','HC-135E','HC-126E','HC-118E','HC-110E','HC-102E','HC-094E','HC-085E','HC-077E','HC-067E','HC-059E','HC-051E','HC-033E','HC-026E','HC-016E','HC-010E')\n",
"\n",
"# Outpatient Visits\n",
"# pufs <- c('HC-160F','HC-152F','HC-144F','HC-135F','HC-126F','HC-118F','HC-110F','HC-102F','HC-094F','HC-085F','HC-077F','HC-067F','HC-059F','HC-051F','HC-033F','HC-026F','HC-016F','HC-010F')\n",
"\n",
"# puf_downloads = c()\n",
"# for (puf in pufs) {\n",
"# download_puf(normalize_puf(puf))\n",
"# }\n",
"\n",
"########\n",
"\n",
"# Example usage: Scrape the MEPS web page for all PUF ids\n",
"# Feel free to modify the URL being used - Any of the links on [this page](http://meps.ahrq.gov/mepsweb/data_stats/download_data_files.jsp)\n",
"# are fair game.\n",
"\n",
"#install.packages(\"rvest\")\n",
"library(rvest)\n",
"html <-\n",
" read_html(\n",
" \"http://meps.ahrq.gov/mepsweb/data_stats/download_data_files_results.jsp?cboDataYear=All&cboDataTypeY=2%2CHousehold+Event+File&buttonYearandDataType=Search&cboPufNumber=All&SearchTitle=Emergency+Room+Visits\"\n",
" )\n",
"\n",
"library(XML)\n",
"\n",
"# to scrap all the file name from web and form html links\n",
"doc <- htmlParse(html, asText = TRUE)\n",
"plain.text <- xpathSApply(doc, \"//a\", xmlValue)\n",
"data <- c()\n",
"for (i in 1:length(plain.text))\n",
"{\n",
" if (substr(plain.text[i], 1,3) == \"HC-\")\n",
" {\n",
" data[i] <- tolower(plain.text[i])\n",
" }\n",
" originalData <- data[!is.na(data)]\n",
" \n",
"}\n",
"\n",
"mydata <- c()\n",
"for (i in 1:length(originalData))\n",
"{\n",
" mydata[i] <- gsub(\"c-0\", \"\", originalData[i])\n",
" mydata[i] <- gsub(\"c-\", \"\", mydata[i])\n",
" \n",
"}\n",
"\n",
"# to scrap all .ssp data file \n",
"for (i in 1:length(mydata))\n",
"{\n",
" download_puf(mydata[i])\n",
"}"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.2.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}