Merge branch 'master' of https://github.com/bspeice/betterwithdata_cleaning_4

2026-07-02 01:43:06 -04:00 · 2015-11-07 16:35:56 -05:00
parent af76258ce6 4f2598eb4b
commit a6f1a4be96
2 changed files with 241 additions and 16 deletions
@@ -1,6 +1,159 @@
 {
- "cells": [],
- "metadata": {},
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Medical Expenditure Panel Survey Data Scrape\n",
+    "### Because SAS never dies...###\n",
+    "----\n",
+    "\n",
+    "The MEPS data set is fairly easy to obtain: all the ZIP files are stored at a consistent URL.\n",
+    "\n",
+    "What's more challenging is actually making sense of the data. We ran into two issues:\n",
+    "\n",
+    "- The data comes in the SAS XPORT format. R is the only language we could find that is actually capable of reading these files, so we had to convert them to CSV for everyone to use.\n",
+    "- Fixed-width formatting means there are plenty of values that make no sense: \"-9\" is actually a flag for \"the interviewer did not record this data.\"\n",
+    "\n",
+    "This notebook contains the code needed to scrape the MEPS site and convert the data into CSV format."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "library(httr)\n",
+    "library(foreign)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Normalize PUF\n",
+    "\n",
+    "Each file has a unique Public Use File (PUF) code associated with it, which is used to look up the data. However, in order to download the actual file, the PUF code has to be translated a bit:\n",
+    "\n",
+    "- Remove the `C-` in all names - `HC-178F` becomes `h178f`\n",
+    "- Remove the leading 0 in all names - `HC-048E` becomes `h48e`\n",
+    "- Lowercase"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "normalize_puf <- function(full_puf) {\n",
+    "    stage_1 <- gsub(\"C-0\", \"\", full_puf)\n",
+    "    stage_2 <- gsub(\"C-\", \"\", stage_1)\n",
+    "    \n",
+    "    tolower(stage_2)\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Download PUF\n",
+    "\n",
+    "The download process ended up being a bit easier than expected. All the files are located in a single directory, making the job of scraping trivial. Each **PUF** has two zip files associated with it:\n",
+    "- The ASCII fixed-width data\n",
+    "- The binary fixed-width data (in XPORT format)\n",
+    "\n",
+    "The XPORT format contains the information needed to reconstruct the column headers, so we use that to make our lives easy. However, we have to use a specific library (foreign) to actually read the data. Finally, we write everything to a CSV.\n",
+    "\n",
+    "One point of improvement in the future would be to remove the intermediate files. For right now, we're not going to worry too much about them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "download_puf <- function(short_puf) {\n",
+    "    puf_base <- \"http://meps.ahrq.gov/mepsweb/data_files/pufs/\"\n",
+    "    puf_suffix <- \"ssp.zip\"\n",
+    "    \n",
+    "    zip_filename <- paste0(short_puf, puf_suffix)\n",
+    "    filename <- paste0(short_puf, \".ssp\")\n",
+    "    puf_url <- paste0(puf_base, zip_filename)\n",
+    "    download.file(puf_url, zip_filename)\n",
+    "    \n",
+    "    # unzip\n",
+    "    unzip(zip_filename, files = filename)\n",
+    "    saveName <- paste0(short_puf, \".csv\")\n",
+    "\n",
+    "    # read sas file and return as csv file\n",
+    "    mydata <- read.xport(filename)\n",
+    "    write.table(mydata, file = saveName, sep = \",\")\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Actual Downloads\n",
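+    "Now that the functions are defined, we can put them to use. Two example vectors are provided below, both commented out: one downloads all available data for emergency room visits, the other for outpatient care, over all available years. Uncomment one of them before running the cell; otherwise `pufs` is undefined and the loop fails."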
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "ename": "ERROR",
+     "evalue": "Error in eval(expr, envir, enclos): object 'pufs' not found\n",
+     "output_type": "error",
+     "traceback": [
+      "Error in eval(expr, envir, enclos): object 'pufs' not found\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Emergency Visit PUF's\n",
+    "# pufs <- c('HC-160E','HC-152E','HC-144E','HC-135E','HC-126E','HC-118E','HC-110E','HC-102E','HC-094E','HC-085E','HC-077E','HC-067E','HC-059E','HC-051E','HC-033E','HC-026E','HC-016E','HC-010E')\n",
+    "\n",
+    "# Outpatient Visits\n",
+    "# pufs <- c('HC-160F','HC-152F','HC-144F','HC-135F','HC-126F','HC-118F','HC-110F','HC-102F','HC-094F','HC-085F','HC-077F','HC-067F','HC-059F','HC-051F','HC-033F','HC-026F','HC-016F','HC-010F')\n",
+    "\n",
+    "puf_downloads = c()\n",
+    "for (puf in pufs) {\n",
+    "    download_puf(normalize_puf(puf))\n",
+    "}"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "R",
+   "language": "R",
+   "name": "ir"
+  },
+  "language_info": {
+   "codemirror_mode": "r",
+   "file_extension": ".r",
+   "mimetype": "text/x-r-source",
+   "name": "R",
+   "pygments_lexer": "r",
+   "version": "3.2.2"
+  }
+ },
 "nbformat": 4,
 "nbformat_minor": 0
 }
@@ -1,16 +1,87 @@
 {
 "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Medical Expenditure Panel Survey Data Scrape\n",
+    "### Because SAS never dies...###\n",
+    "----\n",
+    "\n",
+    "The MEPS data set is fairly easy to obtain: all the ZIP files are stored at a consistent URL.\n",
+    "\n",
+    "What's more challenging is actually making sense of the data. We ran into two issues:\n",
+    "\n",
+    "- The data comes in the SAS XPORT format. R is the only language we could find that is actually capable of reading these files, so we had to convert them to CSV for everyone to use.\n",
+    "- Fixed-width formatting means there are plenty of values that make no sense: \"-9\" is actually a flag for \"the interviewer did not record this data.\"\n",
+    "\n",
+    "This notebook contains the code needed to scrape the MEPS site and convert the data into CSV format."
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 60,
+   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "library(httr)\n",
-    "library(foreign)\n",
-    "        \n",
+    "library(foreign)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Normalize PUF\n",
+    "\n",
+    "Each file has a unique Public Use File (PUF) code associated with it, which is used to look up the data. However, in order to download the actual file, the PUF code has to be translated a bit:\n",
+    "\n",
+    "- Remove the `C-` in all names - `HC-178F` becomes `h178f`\n",
+    "- Remove the leading 0 in all names - `HC-048E` becomes `h48e`\n",
+    "- Lowercase"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "normalize_puf <- function(full_puf) {\n",
+    "    stage_1 <- gsub(\"C-0\", \"\", full_puf)\n",
+    "    stage_2 <- gsub(\"C-\", \"\", stage_1)\n",
+    "    \n",
+    "    tolower(stage_2)\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Download PUF\n",
+    "\n",
+    "The download process ended up being a bit easier than expected. All the files are located in a single directory, making the job of scraping trivial. Each **PUF** has two zip files associated with it:\n",
+    "- The ASCII fixed-width data\n",
+    "- The binary fixed-width data (in XPORT format)\n",
+    "\n",
+    "The XPORT format contains the information needed to reconstruct the column headers, so we use that to make our lives easy. However, we have to use a specific library (foreign) to actually read the data. Finally, we write everything to a CSV.\n",
+    "\n",
+    "One point of improvement in the future would be to remove the intermediate files. For right now, we're not going to worry too much about them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
    "download_puf <- function(short_puf) {\n",
    "    puf_base <- \"http://meps.ahrq.gov/mepsweb/data_files/pufs/\"\n",
    "    puf_suffix <- \"ssp.zip\"\n",
@@ -27,29 +98,30 @@
    "    # read sas file and return as csv file\n",
    "    mydata <- read.xport(filename)\n",
    "    write.table(mydata, file = saveName, sep = \",\")\n",
-    "}\n",
-    "\n",
-    "normalize_puf <- function(full_puf) {\n",
-    "    stage_1 <- gsub(\"C-0\", \"\", full_puf)\n",
-    "    stage_2 <- gsub(\"C-\", \"\", stage_1)\n",
-    "    \n",
-    "    tolower(stage_2)\n",
    "}"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Actual Downloads\n",
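+    "Now that the functions are defined, we can put them to use. Two example vectors are provided below, both commented out: one downloads all available data for emergency room visits, the other for outpatient care, over all available years. Uncomment one of them before running the cell; otherwise `pufs` is undefined and the loop fails."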
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 63,
+   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "ERROR",
-     "evalue": "Error in lookup.xport(file): file not in SAS transfer format\n",
+     "evalue": "Error in eval(expr, envir, enclos): object 'pufs' not found\n",
     "output_type": "error",
     "traceback": [
-      "Error in lookup.xport(file): file not in SAS transfer format\n"
+      "Error in eval(expr, envir, enclos): object 'pufs' not found\n"
     ]
    }
   ],
@@ -58,7 +130,7 @@
    "# pufs <- c('HC-160E','HC-152E','HC-144E','HC-135E','HC-126E','HC-118E','HC-110E','HC-102E','HC-094E','HC-085E','HC-077E','HC-067E','HC-059E','HC-051E','HC-033E','HC-026E','HC-016E','HC-010E')\n",
    "\n",
    "# Outpatient Visits\n",
-    "pufs <- c('HC-160F','HC-152F','HC-144F','HC-135F','HC-126F','HC-118F','HC-110F','HC-102F','HC-094F','HC-085F','HC-077F','HC-067F','HC-059F','HC-051F','HC-033F','HC-026F','HC-016F','HC-010F')\n",
+    "# pufs <- c('HC-160F','HC-152F','HC-144F','HC-135F','HC-126F','HC-118F','HC-110F','HC-102F','HC-094F','HC-085F','HC-077F','HC-067F','HC-059F','HC-051F','HC-033F','HC-026F','HC-016F','HC-010F')\n",
    "\n",
    "puf_downloads = c()\n",
    "for (puf in pufs) {\n",