{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### My first Kaggle competition\n", "\n", "It's time! After embarking on a Machine Learning class this semester, and with a Saturday on which I don't have much planned, I wanted to put this class and training to work. It's my first competition submission. I want to walk you guys through how I'm approaching this problem, because I thought it would be really neat. The competition is Banco Santander's [Santander Customer Satisfaction][1] competition. It seemed like an easy enough problem that I could actually make decent progress on it.\n", "\n", "# Data Exploration\n", "\n", "First up: we need to load our data and do some exploratory work. Because we're going to be using this data for model selection prior to testing, we need to make a further split of the training data. I've already gone ahead and done this work; please see the code in the [appendix below](#Appendix).\n", "\n", "[1]: https://www.kaggle.com/c/santander-customer-satisfaction" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# Record how long it takes to run the notebook - I'm curious.\n", "from datetime import datetime\n", "start = datetime.now()\n", "\n", "dataset = pd.read_csv('split_train.csv')\n", "dataset.index = dataset.ID\n", "# Separate the features from the label; pass axis by keyword, since newer\n", "# pandas versions no longer accept it positionally.\n", "X = dataset.drop(['TARGET', 'ID', 'ID.1'], axis=1)\n", "y = dataset.TARGET" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 1], dtype=int64)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.unique()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "369" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"len(X.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, so there are only [two classes we're predicting][2]: 1 for unsatisfied customers, 0 for satisfied customers. I would have preferred this to be something more like a regression, or predicting multiple classes: maybe the customer isn't the most happy, but is nowhere near closing their accounts. For now though, that's just the data we're working with.\n", "\n", "Now, I'd like to make a scatter matrix of everything going on. Unfortunately as noted above, we have 369 different features. There's no way I can graphically make sense of that much data to start with.\n", "\n", "We're also not told what the data actually represents: Are these survey results? Average time between contact with a customer care person? Frequency of contacting a customer care person? The idea is that I need to reduce the number of dimensions we're predicting across.\n", "\n", "## Dimensionality Reduction pt. 1 - Binary Classifiers\n", "\n", "My first attempt to reduce the data dimensionality is to find all the binary classifiers in the dataset \\(i.e. 0 or 1 values\\) and see if any of those are good \\(or anti-good\\) predictors of the final data.\n", "\n", "[2]: https://www.kaggle.com/c/santander-customer-satisfaction/data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "111" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = X.columns\n", "b_class = []\n", "for c in cols:\n", " if len(X[c].unique()) == 2:\n", " b_class.append(c)\n", " \n", "len(b_class)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So there are 111 features in the dataset that are a binary label. Let's see if any of them are good at predicting the users satisfaction!" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
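<table>\n", "<thead>\n", "<tr><th></th><th>Accuracy</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>count</th><td>111.000000</td></tr>\n",
"<tr><th>mean</th><td>0.905159</td></tr>\n",
"<tr><th>std</th><td>0.180602</td></tr>\n",
"<tr><th>min</th><td>0.043598</td></tr>\n",
"<tr><th>25%</th><td>0.937329</td></tr>\n",
"<tr><th>50%</th><td>0.959372</td></tr>\n",
"<tr><th>75%</th><td>0.960837</td></tr>\n",
"<tr><th>max</th><td>0.960837</td></tr>\n",
"</tbody>\n", "</table>\n", "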