betterwithdata_cleaning_4/exploration/.ipynb_checkpoints/MEPS Scraping-checkpoint.ipynb

160 lines
5.1 KiB
Plaintext
Raw Normal View History

{
2015-11-07 16:33:42 -05:00
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Medical Expenditure Panel Survey Data Scrape\n",
"### Because SAS never dies...###\n",
"----\n",
"\n",
"The MEPS data set is fairly easy to obtain. All the ZIP files are stored at a consistent URL, making our lives fairly easy.\n",
"\n",
"What's more challenging is actually making sense of the data. We ran into two issues:\n",
"\n",
"- The data comes in the SAS XPORT format. R is the only language we could find actually capable of reading these files. In order to open the data, we had to turn them into CSV for everyone to use.\n",
"- Fixed-width formatting means there are plenty of values that make no sense: \"-9\" is actually a flag for \"the interviewer did not record this data.\"\n",
"\n",
"This notebook contains the code needed to scrape the MEPS site for turning data into CSV format."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"library(httr)\n",
"library(foreign)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Normalize PUF\n",
"\n",
"In order to actually look up data, each file has a unique Public Use File (PUF) code associated with it. However, in order to download the actual file, the PUF has to be translated a bit.\n",
"\n",
"- Remove the `C-` in all names - `HC-178F` becomes `h178f`\n",
"- Remove the leading 0 in all names - `HC-048E` becomes `h48e`\n",
"- Lowercase"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"normalize_puf <- function(full_puf) {\n",
" stage_1 <- gsub(\"C-0\", \"\", full_puf)\n",
" stage_2 <- gsub(\"C-\", \"\", stage_1)\n",
" \n",
" tolower(stage_2)\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download PUF\n",
"\n",
"The download process ended up being a bit easier than expected. All the files are located in a single directory, making the job of scraping trivial. Each **PUF** has two zip files associated with it:\n",
"- The ASCII fixed-width data\n",
"- The binary fixed-width data (in XPORT format)\n",
"\n",
"The XPORT format contains information needed to reconstruct the column headers, so we use that to make our lives easy. However, we have to use a specfic library (foreign) to actually read the data. Finally, write everything to a CSV.\n",
"\n",
"One point of improvement in the future would be to remove the intermediate files. For right now, we're not going to worry too much about them."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"download_puf <- function(short_puf) {\n",
" puf_base <- \"http://meps.ahrq.gov/mepsweb/data_files/pufs/\"\n",
" puf_suffix <- \"ssp.zip\"\n",
" \n",
" zip_filename <- paste0(short_puf, \"ssp.zip\")\n",
" filename <- paste0(short_puf, \".ssp\")\n",
" puf_url <- paste0(puf_base, zip_filename)\n",
" download.file(puf_url, zip_filename)\n",
" \n",
" # unzip\n",
" unzip(zip_filename, files = filename)\n",
" saveName <- paste0(short_puf, \".csv\")\n",
"\n",
" # read sas file and return as csv file\n",
" mydata <- read.xport(filename)\n",
" write.table(mydata, file = saveName, sep = \",\")\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Actual Downloads\n",
"Now that the functions are defined, we can put them to use. Two example vectors have been provided to download all available data for emergency room visits, and for outpatient care over all available years."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "ERROR",
"evalue": "Error in eval(expr, envir, enclos): object 'pufs' not found\n",
"output_type": "error",
"traceback": [
"Error in eval(expr, envir, enclos): object 'pufs' not found\n"
]
}
],
"source": [
"# Emergency Visit PUF's\n",
"# pufs <- c('HC-160E','HC-152E','HC-144E','HC-135E','HC-126E','HC-118E','HC-110E','HC-102E','HC-094E','HC-085E','HC-077E','HC-067E','HC-059E','HC-051E','HC-033E','HC-026E','HC-016E','HC-010E')\n",
"\n",
"# Outpatient Visits\n",
"# pufs <- c('HC-160F','HC-152F','HC-144F','HC-135F','HC-126F','HC-118F','HC-110F','HC-102F','HC-094F','HC-085F','HC-077F','HC-067F','HC-059F','HC-051F','HC-033F','HC-026F','HC-016F','HC-010F')\n",
"\n",
"puf_downloads = c()\n",
"for (puf in pufs) {\n",
" download_puf(normalize_puf(puf))\n",
"}"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.2.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}