Merge pull request #16 from bspeice/docusaurus

Docusaurus
This commit is contained in:
bspeice 2024-11-10 13:44:05 -08:00 committed by GitHub
commit 582e03cff3
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
211 changed files with 46032 additions and 426 deletions


@ -1,6 +0,0 @@
FROM mcr.microsoft.com/vscode/devcontainers/ruby:0-2.7-bullseye
RUN wget https://github.com/errata-ai/vale/releases/download/v2.21.0/vale_2.21.0_Linux_64-bit.tar.gz -O /tmp/vale.tar.gz \
&& cd /usr/local/bin \
&& tar xf /tmp/vale.tar.gz \
&& rm /tmp/vale.tar.gz


@ -1,13 +1,19 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the README at:
// https://github.com/microsoft/vscode-dev-containers/tree/v0.245.0/containers/ruby
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/typescript-node
{
"name": "Ruby",
"build": {
"dockerfile": "Dockerfile"
},
"name": "Node.js & TypeScript",
"image": "mcr.microsoft.com/devcontainers/typescript-node:1-22-bookworm",
"runArgs": ["--userns=keep-id"],
"remoteUser": "vscode",
"containerUser": "vscode",
"workspaceMount": "source=${localWorkspaceFolder},target=/workspaces/${localWorkspaceFolderBasename},type=bind,Z"
"containerUser": "node",
"containerEnv": {
"HOME": "/home/node"
},
"customizations": {
"vscode": {
"extensions": [
"ChrisChinchilla.vale-vscode"
]
}
},
"onCreateCommand": "bash -c 'mkdir -p ~/.local/bin && wget -qO- https://github.com/errata-ai/vale/releases/download/v2.21.0/vale_2.21.0_Linux_64-bit.tar.gz | tar xz -C ~/.local/bin'"
}

.gitignore (vendored): 29 lines changed

@ -1,8 +1,21 @@
_site/
.swp
.sass-cache/
.jekyll-metadata
.bundle/
vendor/
.styles/
.vscode/
# Dependencies
/node_modules
# Production
/build
# Generated files
.docusaurus
.cache-loader
# Misc
.DS_Store
.env.local
.env.development.local
.env.test.local
.env.production.local
.styles
npm-debug.log*
yarn-debug.log*
yarn-error.log*


@ -2,6 +2,6 @@ StylesPath = .styles
MinAlertLevel = suggestion
Packages = Microsoft, write-good
[*]
[*.{md,mdx}]
BasedOnStyles = Vale, Microsoft, write-good
write-good.E-Prime = NO


@ -1,24 +0,0 @@
---
layout: page
---
<style type="text/css" media="screen">
.container {
margin: 10px auto;
max-width: 600px;
text-align: center;
}
h1 {
margin: 30px 0;
font-size: 4em;
line-height: 1;
letter-spacing: -1px;
}
</style>
<div class="container">
<h1>404</h1>
<p><strong>Page not found :(</strong></p>
<p>The requested page could not be found.</p>
</div>

CNAME: 1 line changed

@ -1 +0,0 @@
speice.io

Gemfile: 29 lines changed

@ -1,29 +0,0 @@
source "https://rubygems.org"
# Hello! This is where you manage which Jekyll version is used to run.
# When you want to use a different version, change it below, save the
# file and run `bundle install`. Run Jekyll with `bundle exec`, like so:
#
# bundle exec jekyll serve
#
# This will help ensure the proper Jekyll version is running.
# Happy Jekylling!
gem "jekyll", "~> 3.8.3"
gem "texture"
# If you want to use GitHub Pages, remove the "gem "jekyll"" above and
# uncomment the line below. To upgrade, run `bundle update github-pages`.
# gem "github-pages", group: :jekyll_plugins
# If you have any plugins, put them here!
group :jekyll_plugins do
gem "jekyll-feed", "~> 0.6"
gem "jekyll-remote-theme"
end
# Windows does not include zoneinfo files, so bundle the tzinfo-data gem
gem "tzinfo-data", platforms: [:mingw, :mswin, :x64_mingw, :jruby]
# Performance-booster for watching directories on Windows
gem "wdm", "~> 0.1.0" if Gem.win_platform?


@ -1,78 +0,0 @@
GEM
remote: https://rubygems.org/
specs:
addressable (2.7.0)
public_suffix (>= 2.0.2, < 5.0)
colorator (1.1.0)
concurrent-ruby (1.1.6)
em-websocket (0.5.1)
eventmachine (>= 0.12.9)
http_parser.rb (~> 0.6.0)
eventmachine (1.2.7)
ffi (1.12.2)
forwardable-extended (2.6.0)
http_parser.rb (0.6.0)
i18n (0.9.5)
concurrent-ruby (~> 1.0)
jekyll (3.8.6)
addressable (~> 2.4)
colorator (~> 1.0)
em-websocket (~> 0.5)
i18n (~> 0.7)
jekyll-sass-converter (~> 1.0)
jekyll-watch (~> 2.0)
kramdown (~> 1.14)
liquid (~> 4.0)
mercenary (~> 0.3.3)
pathutil (~> 0.9)
rouge (>= 1.7, < 4)
safe_yaml (~> 1.0)
jekyll-feed (0.13.0)
jekyll (>= 3.7, < 5.0)
jekyll-remote-theme (0.4.2)
addressable (~> 2.0)
jekyll (>= 3.5, < 5.0)
jekyll-sass-converter (>= 1.0, <= 3.0.0, != 2.0.0)
rubyzip (>= 1.3.0, < 3.0)
jekyll-sass-converter (1.5.2)
sass (~> 3.4)
jekyll-seo-tag (2.6.1)
jekyll (>= 3.3, < 5.0)
jekyll-watch (2.2.1)
listen (~> 3.0)
kramdown (1.17.0)
liquid (4.0.3)
listen (3.2.1)
rb-fsevent (~> 0.10, >= 0.10.3)
rb-inotify (~> 0.9, >= 0.9.10)
mercenary (0.3.6)
pathutil (0.16.2)
forwardable-extended (~> 2.6)
public_suffix (4.0.4)
rb-fsevent (0.10.3)
rb-inotify (0.10.1)
ffi (~> 1.0)
rouge (3.17.0)
rubyzip (2.3.0)
safe_yaml (1.0.5)
sass (3.7.4)
sass-listen (~> 4.0.0)
sass-listen (4.0.0)
rb-fsevent (~> 0.9, >= 0.9.4)
rb-inotify (~> 0.9, >= 0.9.7)
texture (0.3)
jekyll (~> 3.7)
jekyll-seo-tag (~> 2.1)
PLATFORMS
ruby
DEPENDENCIES
jekyll (~> 3.8.3)
jekyll-feed (~> 0.6)
jekyll-remote-theme
texture
tzinfo-data
BUNDLED WITH
2.1.4

README.md (new file): 41 lines

@ -0,0 +1,41 @@
# Website
This website is built using [Docusaurus](https://docusaurus.io/), a modern static website generator.
### Installation
```
$ yarn
```
### Local Development
```
$ yarn start
```
This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.
### Build
```
$ yarn build
```
This command generates static content into the `build` directory, which can be served by any static content hosting service.
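To preview the production build locally, the default Docusaurus scaffold also ships a `serve` script (assuming it is still present in `package.json`):
```
$ yarn serve
```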
### Deployment
Using SSH:
```
$ USE_SSH=true yarn deploy
```
Not using SSH:
```
$ GIT_USER=<Your GitHub username> yarn deploy
```
If you are using GitHub Pages for hosting, this command is a convenient way to build the website and push to the `gh-pages` branch.


@ -1,44 +0,0 @@
# Welcome to Jekyll!
#
# This config file is meant for settings that affect your whole blog, values
# which you are expected to set up once and rarely edit after that. If you find
# yourself editing this file very often, consider using Jekyll's data files
# feature for the data you need to update frequently.
#
# For technical reasons, this file is *NOT* reloaded automatically when you use
# 'bundle exec jekyll serve'. If you change this file, please restart the server process.
# Site settings
# These are used to personalize your new site. If you look in the HTML files,
# you will see them accessed via {{ site.title }}, {{ site.email }}, and so on.
# You can create any custom variable you would like, and they will be accessible
# in the templates via {{ site.myvariable }}.
title: speice.io
description: The Old Speice Guy
email: bradlee@speice.io
baseurl: "" # the subpath of your site, e.g. /blog
url: "https://speice.io/" # the base hostname & protocol for your site, e.g. http://example.com
github_username: bspeice
# Build settings
markdown: kramdown
# theme: texture
remote_theme: thelehhman/texture
plugins:
- jekyll-feed
- jekyll-remote-theme
include: [_pages]
permalink: /:year/:month/:title.html
# Exclude from processing.
# The following items will not be processed, by default. Create a custom list
# to override the default setting.
# exclude:
# - Gemfile
# - Gemfile.lock
# - node_modules
# - vendor/bundle/
# - vendor/cache/
# - vendor/gems/
# - vendor/ruby/


@ -1,23 +0,0 @@
{% if page.layout == 'post' %}
{% comment %}Thanks to https://www.bytedude.com/jekyll-previous-and-next-posts/{% endcomment %}
<div class="container">
<hr>
<div class="post-nav">
<div>
{% if page.previous.url %}
<a href="{{page.previous.url}}">&laquo;&nbsp;{{page.previous.title}}</a>
{% endif %}
</div>
<div class="post-nav-next">
{% if page.next.url %}
<a href="{{page.next.url}}">{{page.next.title}}&nbsp;&raquo;</a>
{% endif %}
</div>
</div>
</div>
<script type="text/javascript"
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
{% endif %}


@ -1,7 +0,0 @@
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<link rel="stylesheet" href="{{ "/assets/css/style.css" | relative_url }}">
<link rel="stylesheet" href="{{ "/assets/css/fonts.css" | prepend: site.baseurl }}">
<title>{{ page.title | default: site.title }}</title>
{% seo %}


@ -1,7 +0,0 @@
<div class="navbar">
<a href="{{ "/" | prepend: site.baseurl }}">Home</a>
<span class="separator"></span>
<a href="{{ "/about/" | prepend: site.baseurl }}">About</a>
<span class="separator"></span>
<a href="{{ "/feed.xml" | prepend: site.baseurl }}">RSS</a>
</div>


@ -1,15 +0,0 @@
<div class="container">
<h2>{{ site.title }}</h2>
<h1>{{ site.description }}</h1>
<ul class="social">
{%- if site.texture.social_links.github -%}
<a href="https://github.com/{{ site.texture.social_links.github }}"><li><i class="icon-github-circled"></i></li></a>
{%- endif -%}
{%- if site.texture.social_links.linkedIn -%}
<a href="https://linkedin.com/{{ site.texture.social_links.linkedIn }}"><li><i class="icon-linkedin-squared"></i></li></a>
{%- endif -%}
{%- if site.texture.social_links.twitter -%}
<a href="https://twitter.com/{{ site.texture.social_links.twitter }}"><li><i class="icon-twitter-squared"></i></li></a>
{%- endif -%}
</ul>
</div>


@ -1,13 +0,0 @@
---
layout: page
title: About
permalink: /about/
---
Developer currently living in New York City.
Best ways to get in contact:
- Email: [bradlee@speice.io](mailto:bradlee@speice.io)
- Github: [bspeice](https://github.com/bspeice)
- LinkedIn: [bradleespeice](https://www.linkedin.com/in/bradleespeice/)


@ -1,15 +0,0 @@
@font-face {
font-family: 'JetBrains Mono';
src: url('/assets/font/JetBrainsMono-Regular.woff2') format('woff2'),
url('/assets/font/JetBrainsMono-Regular.woff') format('woff');
font-weight: normal;
font-style: normal;
}
@font-face {
font-family: 'Lato';
src: url('/assets/font/lato-regular-webfont.woff2') format('woff2'),
url('/assets/font/lato-regular-webfont.woff') format('woff');
font-weight: normal;
font-style: normal;
}


@ -1,119 +0,0 @@
---
---
// Import the theme rules
@import "theme";
body {
max-width: 100%;
overflow-x: hidden;
font-family: 'Lato', sans-serif;
}
.navbar {
color: $gray;
}
.separator {
margin-right: .45rem;
margin-left: .25rem;
color: #000;
&:after {
content: '\00a0/';
}
}
header {
padding-top: 80px;
padding-bottom: 0;
};
header h1,h2 {
color: #000;
}
.post-description {
color: #555;
}
.post-container a {
color: #555;
border-bottom-color: $gray;
border-bottom-style: dotted;
border-bottom-width: 1px;
position: relative;
display: inline-block;
padding: 1px 1px;
transition: color ease 0.3s;
&::after {
content: '';
position: absolute;
z-index: -1;
width: 100%;
height: 0%;
left: 0;
bottom: 0;
background-color: $gray;
transition: all ease 0.3s;
}
&:hover {
color: #fff;
border-bottom-style: solid;
&::after {
height: 100%;
}
}
}
body pre {
font-size: 15px;
}
pre.highlight, code {
font-family: 'JetBrains Mono', monospace;
}
div.highlighter-rouge {
// Default theme uses `width: 100vw`, which while cool, does cause the page
// to exceed screen width and trigger horizontal scrolling. No bueno.
width: 99vw;
}
.post-date {
// On the front page, make sure titles don't force wrapping the date box content
text-align: right;
white-space: nowrap;
}
blockquote {
color: #555;
right: 100px;
margin-left: 0;
padding-left: 1.8rem;
border-left: 5px solid $gray;
}
.post-nav {
/* Insert your custom styling here. Example:
font-size: 14px;
*/
display: flex;
margin-top: 1em;
margin-bottom: 1em;
}
.post-nav div {
/* flex-grow, flex-shrink, flex-basis */
flex: 1 1 0;
}
.post-nav-next {
text-align: right;
}
th, td {
border-bottom: 1px solid $gray;
padding: 0.75em;
}

babel.config.js (new file): 3 lines

@ -0,0 +1,3 @@
module.exports = {
presets: [require.resolve('@docusaurus/core/lib/babel/preset')],
};


@ -0,0 +1,59 @@
Title: Welcome, and an algorithm
Date: 2015-11-19
Tags: introduction, trading
Modified: 2015-12-05
Category: Blog
Hello! Glad to meet you. I'm currently a student at Columbia University
studying Financial Engineering, and want to give an overview of the projects
I'm working on!
To start things off, Columbia has been hosting a trading competition that a
partner and I are competing in. I'm including a notebook of the algorithm
we're using, just to give a simple overview of a miniature algorithm.
The competition is scored in 3 areas:
- Total return
- [Sharpe ratio](1)
- Maximum drawdown
Our algorithm uses a basic momentum strategy: in the given list of potential
portfolios, pick the stocks that have been performing well in the past 30
days. Then, optimize for return subject to the drawdown being below a specific
level. We didn't include the Sharpe ratio as a constraint, mostly because
we were a bit late entering the competition.
I'll be updating this post with the results of our algorithm as they come along!
---
**UPDATE 12/5/2015**: Now that the competition has ended, I wanted to update
how the algorithm performed. Unfortunately, it didn't do very well. I'm planning
to make some tweaks over the coming weeks, and do another forward test in January.
- After week 1: Down .1%
- After week 2: Down 1.4%
- After week 3: Flat
And some statistics for all teams participating in the competition:
| Statistic | Value |
|--------------------|--------|
| Max Return | 74.1% |
| Min Return | -97.4% |
| Average Return | -.1% |
| Std Dev of Returns | 19.6% |
---
{% notebook 2015-11-14-welcome.ipynb %}
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>
[1]: https://en.wikipedia.org/wiki/Sharpe_ratio


@ -0,0 +1,293 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Trading Competition Optimization\n",
"\n",
"### Goal: Max return given maximum Sharpe and Drawdown"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from IPython.display import display\n",
"import Quandl\n",
"from datetime import datetime, timedelta\n",
"\n",
"tickers = ['XOM', 'CVX', 'CLB', 'OXY', 'SLB']\n",
"market_ticker = 'GOOG/NYSE_VOO'\n",
"lookback = 30\n",
"d_col = 'Close'\n",
"\n",
"data = {tick: Quandl.get('YAHOO/{}'.format(tick))[-lookback:] for tick in tickers}\n",
"market = Quandl.get(market_ticker)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Calculating the Return\n",
"We first want to know how much each ticker returned over the prior period."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'CLB': -0.0016320202164526894,\n",
" 'CVX': 0.0010319531629488911,\n",
" 'OXY': 0.00093418904454400551,\n",
" 'SLB': 0.00098431254720448159,\n",
" 'XOM': 0.00044165797556096868}"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"returns = {tick: data[tick][d_col].pct_change() for tick in tickers}\n",
"\n",
"display({tick: returns[tick].mean() for tick in tickers})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Calculating the Sharpe ratio\n",
"Sharpe: ${R - R_M \\over \\sigma}$\n",
"\n",
"We use the average return over the lookback period, minus the market average return, over the ticker standard deviation to calculate the Sharpe. Shorting a stock turns a negative Sharpe positive."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'CLB': -0.10578734457846127,\n",
" 'CVX': 0.027303529817677398,\n",
" 'OXY': 0.022622210057414487,\n",
" 'SLB': 0.026950946344858676,\n",
" 'XOM': -0.0053519259698605499}"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"market_returns = market.pct_change()\n",
"\n",
"sharpe = lambda ret: (ret.mean() - market_returns[d_col].mean()) / ret.std()\n",
"sharpes = {tick: sharpe(returns[tick]) for tick in tickers}\n",
"\n",
"display(sharpes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Calculating the drawdown\n",
"This one is easy - what is the maximum daily change over the lookback period? That is, because we will allow short positions, we are not concerned strictly with maximum downturn, but in general, what is the largest 1-day change?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'CLB': 0.043551495607375035,\n",
" 'CVX': 0.044894389686214398,\n",
" 'OXY': 0.051424517867144637,\n",
" 'SLB': 0.034774627850375328,\n",
" 'XOM': 0.035851524605672758}"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"drawdown = lambda ret: ret.abs().max()\n",
"drawdowns = {tick: drawdown(returns[tick]) for tick in tickers}\n",
"\n",
"display(drawdowns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Performing the optimization\n",
"\n",
"$\\begin{align}\n",
"max\\ \\ & \\mu \\cdot \\omega\\\\\n",
"s.t.\\ \\ & \\vec{1} \\omega = 1\\\\\n",
"& \\vec{S} \\omega \\ge s\\\\\n",
"& \\vec{D} \\cdot | \\omega | \\le d\\\\\n",
"& \\left|\\omega\\right| \\le l\\\\\n",
"\\end{align}$\n",
"\n",
"We want to maximize average return subject to having a full portfolio, Sharpe above a specific level, drawdown below a level, and leverage not too high - that is, don't have huge long/short positions."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Optimization terminated successfully.'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"\"Holdings: [('XOM', 5.8337945679814904), ('CVX', 42.935064321851307), ('CLB', -124.5), ('OXY', 36.790387773552119), ('SLB', 39.940753336615096)]\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Expected Return: 32.375%'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Expected Max Drawdown: 4.34%'"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import numpy as np\n",
"from scipy.optimize import minimize\n",
"\n",
"#sharpe_limit = .1\n",
"drawdown_limit = .05\n",
"leverage = 250\n",
"\n",
"# Use the map so we can guarantee we maintain the correct order\n",
"# sharpe_a = np.array(list(map(lambda tick: sharpes[tick], tickers))) * -1 # So we can write as upper-bound\n",
"dd_a = np.array(list(map(lambda tick: drawdowns[tick], tickers)))\n",
"returns_a = np.array(list(map(lambda tick: returns[tick].mean(), tickers))) # Because minimizing\n",
"\n",
"meets_sharpe = lambda x: sum(abs(x) * sharpe_a) - sharpe_limit\n",
"def meets_dd(x):\n",
" portfolio = sum(abs(x))\n",
" if portfolio < .1:\n",
" # If there are no stocks in the portfolio,\n",
" # we can accidentally induce division by 0,\n",
" # or division by something small enough to cause infinity\n",
" return 0\n",
" \n",
" return drawdown_limit - sum(abs(x) * dd_a) / sum(abs(x))\n",
"\n",
"is_portfolio = lambda x: sum(x) - 1\n",
"\n",
"def within_leverage(x):\n",
" return leverage - sum(abs(x))\n",
"\n",
"objective = lambda x: sum(x * returns_a) * -1 # Because we're minimizing\n",
"bounds = ((None, None),) * len(tickers)\n",
"x = np.zeros(len(tickers))\n",
"\n",
"constraints = [\n",
" {\n",
" 'type': 'eq',\n",
" 'fun': is_portfolio\n",
" }, {\n",
" 'type': 'ineq',\n",
" 'fun': within_leverage\n",
" #}, {\n",
" # 'type': 'ineq',\n",
" # 'fun': meets_sharpe\n",
" }, {\n",
" 'type': 'ineq',\n",
" 'fun': meets_dd\n",
" }\n",
"]\n",
"\n",
"optimal = minimize(objective, x, bounds=bounds, constraints=constraints,\n",
" options={'maxiter': 500})\n",
"\n",
"# Optimization time!\n",
"display(optimal.message)\n",
"\n",
"display(\"Holdings: {}\".format(list(zip(tickers, optimal.x))))\n",
"\n",
"expected_return = optimal.fun * -100 # multiply by -100 to scale, and compensate for minimizing\n",
"display(\"Expected Return: {:.3f}%\".format(expected_return))\n",
"\n",
"expected_drawdown = sum(abs(optimal.x) * dd_a) / sum(abs(optimal.x)) * 100\n",
"display(\"Expected Max Drawdown: {0:.2f}%\".format(expected_drawdown))\n",
"\n",
"# TODO: Calculate expected Sharpe"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}


@ -0,0 +1,173 @@
# Trading Competition Optimization
### Goal: Max return given maximum Sharpe and Drawdown
```python
from IPython.display import display
import Quandl
from datetime import datetime, timedelta
tickers = ['XOM', 'CVX', 'CLB', 'OXY', 'SLB']
market_ticker = 'GOOG/NYSE_VOO'
lookback = 30
d_col = 'Close'
data = {tick: Quandl.get('YAHOO/{}'.format(tick))[-lookback:] for tick in tickers}
market = Quandl.get(market_ticker)
```
# Calculating the Return
We first want to know how much each ticker returned over the prior period.
```python
returns = {tick: data[tick][d_col].pct_change() for tick in tickers}
display({tick: returns[tick].mean() for tick in tickers})
```
{'CLB': -0.0016320202164526894,
'CVX': 0.0010319531629488911,
'OXY': 0.00093418904454400551,
'SLB': 0.00098431254720448159,
'XOM': 0.00044165797556096868}
# Calculating the Sharpe ratio
Sharpe: ${R - R_M \over \sigma}$
To calculate the Sharpe ratio, we take the ticker's average return over the lookback period, subtract the market's average return, and divide by the ticker's standard deviation. Shorting a stock turns a negative Sharpe positive.
```python
market_returns = market.pct_change()
sharpe = lambda ret: (ret.mean() - market_returns[d_col].mean()) / ret.std()
sharpes = {tick: sharpe(returns[tick]) for tick in tickers}
display(sharpes)
```
{'CLB': -0.10578734457846127,
'CVX': 0.027303529817677398,
'OXY': 0.022622210057414487,
'SLB': 0.026950946344858676,
'XOM': -0.0053519259698605499}
# Calculating the drawdown
This one is easy: what is the maximum daily change over the lookback period? Because we allow short positions, we are not concerned strictly with the maximum downturn, but more generally with the largest one-day move in either direction.
```python
drawdown = lambda ret: ret.abs().max()
drawdowns = {tick: drawdown(returns[tick]) for tick in tickers}
display(drawdowns)
```
{'CLB': 0.043551495607375035,
'CVX': 0.044894389686214398,
'OXY': 0.051424517867144637,
'SLB': 0.034774627850375328,
'XOM': 0.035851524605672758}
# Performing the optimization
$\begin{align}
max\ \ & \mu \cdot \omega\\
s.t.\ \ & \vec{1} \omega = 1\\
& \vec{S} \omega \ge s\\
& \vec{D} \cdot | \omega | \le d\\
& \left|\omega\right| \le l\\
\end{align}$
We want to maximize average return subject to having a full portfolio, Sharpe above a specific level, drawdown below a level, and leverage not too high - that is, don't have huge long/short positions.
```python
import numpy as np
from scipy.optimize import minimize
#sharpe_limit = .1
drawdown_limit = .05
leverage = 250
# Use the map so we can guarantee we maintain the correct order
# sharpe_a = np.array(list(map(lambda tick: sharpes[tick], tickers))) * -1 # So we can write as upper-bound
dd_a = np.array(list(map(lambda tick: drawdowns[tick], tickers)))
returns_a = np.array(list(map(lambda tick: returns[tick].mean(), tickers))) # Because minimizing
meets_sharpe = lambda x: sum(abs(x) * sharpe_a) - sharpe_limit
def meets_dd(x):
portfolio = sum(abs(x))
if portfolio < .1:
# If there are no stocks in the portfolio,
# we can accidentally induce division by 0,
# or division by something small enough to cause infinity
return 0
return drawdown_limit - sum(abs(x) * dd_a) / sum(abs(x))
is_portfolio = lambda x: sum(x) - 1
def within_leverage(x):
return leverage - sum(abs(x))
objective = lambda x: sum(x * returns_a) * -1 # Because we're minimizing
bounds = ((None, None),) * len(tickers)
x = np.zeros(len(tickers))
constraints = [
{
'type': 'eq',
'fun': is_portfolio
}, {
'type': 'ineq',
'fun': within_leverage
#}, {
# 'type': 'ineq',
# 'fun': meets_sharpe
}, {
'type': 'ineq',
'fun': meets_dd
}
]
optimal = minimize(objective, x, bounds=bounds, constraints=constraints,
options={'maxiter': 500})
# Optimization time!
display(optimal.message)
display("Holdings: {}".format(list(zip(tickers, optimal.x))))
expected_return = optimal.fun * -100 # multiply by -100 to scale, and compensate for minimizing
display("Expected Return: {:.3f}%".format(expected_return))
expected_drawdown = sum(abs(optimal.x) * dd_a) / sum(abs(optimal.x)) * 100
display("Expected Max Drawdown: {0:.2f}%".format(expected_drawdown))
# TODO: Calculate expected Sharpe
```
'Optimization terminated successfully.'
"Holdings: [('XOM', 5.8337945679814904), ('CVX', 42.935064321851307), ('CLB', -124.5), ('OXY', 36.790387773552119), ('SLB', 39.940753336615096)]"
'Expected Return: 32.375%'
'Expected Max Drawdown: 4.34%'


@ -0,0 +1,222 @@
---
slug: 2015/11/welcome
title: Welcome, and an algorithm
date: 2015-11-19 12:00:00
last_update:
date: 2015-12-05 12:00:00
authors: [bspeice]
tags: []
---
Hello! Glad to meet you. I'm currently a student at Columbia University studying Financial Engineering, and want to give an overview of the projects I'm working on!
<!-- truncate -->
To start things off, Columbia has been hosting a trading competition that a partner and I are competing in. I'm including a notebook of the algorithm we're using, just to give a simple overview of a miniature algorithm.
The competition is scored in 3 areas:
- Total return
- [Sharpe ratio](https://en.wikipedia.org/wiki/Sharpe_ratio)
- Maximum drawdown
Our algorithm uses a basic momentum strategy: in the given list of potential portfolios, pick the stocks that have been performing well in the past 30 days. Then, optimize for return subject to the drawdown being below a specific level. We didn't include the Sharpe ratio as a constraint, mostly because we were a bit late entering the competition.
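As a rough sketch of that selection step (not the exact code we used in the competition; `daily_returns` here is a hypothetical DataFrame of daily returns per ticker), a 30-day momentum screen could look like this:

```python
import pandas as pd

def momentum_screen(daily_returns: pd.DataFrame, lookback: int = 30, top_n: int = 5):
    """Rank tickers by cumulative return over the lookback window, keep the top performers."""
    window = daily_returns.tail(lookback)
    cumulative = (1 + window).prod() - 1  # total return per ticker over the window
    return cumulative.sort_values(ascending=False).head(top_n).index.tolist()

# e.g. picks = momentum_screen(daily_returns)
```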
I'll be updating this post with the results of our algorithm as they come along!
---
**UPDATE 12/5/2015**: Now that the competition has ended, I wanted to update how the algorithm performed. Unfortunately, it didn't do very well. I'm planning to make some tweaks over the coming weeks, and do another forward test in January.
- After week 1: Down .1%
- After week 2: Down 1.4%
- After week 3: Flat
And some statistics for all teams participating in the competition:
| Statistic | Value |
|--------------------|--------|
| Max Return | 74.1% |
| Min Return | -97.4% |
| Average Return | -.1% |
| Std Dev of Returns | 19.6% |
---
## Trading Competition Optimization
**Goal: Max return given maximum Sharpe and Drawdown**
```python
from IPython.display import display
import Quandl
from datetime import datetime, timedelta
tickers = ['XOM', 'CVX', 'CLB', 'OXY', 'SLB']
market_ticker = 'GOOG/NYSE_VOO'
lookback = 30
d_col = 'Close'
data = {tick: Quandl.get('YAHOO/{}'.format(tick))[-lookback:] for tick in tickers}
market = Quandl.get(market_ticker)
```
## Calculating the Return
We first want to know how much each ticker returned over the prior period.
```python
returns = {tick: data[tick][d_col].pct_change() for tick in tickers}
display({tick: returns[tick].mean() for tick in tickers})
```
```
{'CLB': -0.0016320202164526894,
'CVX': 0.0010319531629488911,
'OXY': 0.00093418904454400551,
'SLB': 0.00098431254720448159,
'XOM': 0.00044165797556096868}
```
## Calculating the Sharpe ratio
Sharpe: ${R - R_M \over \sigma}$
To calculate the Sharpe ratio, we take the ticker's average return over the lookback period, subtract the market's average return, and divide by the ticker's standard deviation. Shorting a stock turns a negative Sharpe positive.
```python
market_returns = market.pct_change()
sharpe = lambda ret: (ret.mean() - market_returns[d_col].mean()) / ret.std()
sharpes = {tick: sharpe(returns[tick]) for tick in tickers}
display(sharpes)
```
```
{'CLB': -0.10578734457846127,
'CVX': 0.027303529817677398,
'OXY': 0.022622210057414487,
'SLB': 0.026950946344858676,
'XOM': -0.0053519259698605499}
```
## Calculating the drawdown
This one is easy: what is the maximum daily change over the lookback period? Because we allow short positions, we are not concerned strictly with the maximum downturn, but more generally with the largest one-day move in either direction.
```python
drawdown = lambda ret: ret.abs().max()
drawdowns = {tick: drawdown(returns[tick]) for tick in tickers}
display(drawdowns)
```
```
{'CLB': 0.043551495607375035,
'CVX': 0.044894389686214398,
'OXY': 0.051424517867144637,
'SLB': 0.034774627850375328,
'XOM': 0.035851524605672758}
```
## Performing the optimization
$$
\begin{align*}
max\ \ & \mu \cdot \omega\\
s.t.\ \ & \vec{1} \omega = 1\\
& \vec{S} \omega \ge s\\
& \vec{D} \cdot | \omega | \le d\\
& \left|\omega\right| \le l\\
\end{align*}
$$
We want to maximize average return subject to having a full portfolio, Sharpe above a specific level, drawdown below a level, and leverage not too high - that is, don't have huge long/short positions.
```python
import numpy as np
from scipy.optimize import minimize
#sharpe_limit = .1
drawdown_limit = .05
leverage = 250
# Use the map so we can guarantee we maintain the correct order
# So we can write as upper-bound
# sharpe_a = np.array(list(map(lambda tick: sharpes[tick], tickers))) * -1
dd_a = np.array(list(map(lambda tick: drawdowns[tick], tickers)))
# Because minimizing
returns_a = np.array(list(map(lambda tick: returns[tick].mean(), tickers)))
meets_sharpe = lambda x: sum(abs(x) * sharpe_a) - sharpe_limit
def meets_dd(x):
portfolio = sum(abs(x))
if portfolio < .1:
# If there are no stocks in the portfolio,
# we can accidentally induce division by 0,
# or division by something small enough to cause infinity
return 0
return drawdown_limit - sum(abs(x) * dd_a) / sum(abs(x))
is_portfolio = lambda x: sum(x) - 1
def within_leverage(x):
return leverage - sum(abs(x))
objective = lambda x: sum(x * returns_a) * -1 # Because we're minimizing
bounds = ((None, None),) * len(tickers)
x = np.zeros(len(tickers))
constraints = [
{
'type': 'eq',
'fun': is_portfolio
}, {
'type': 'ineq',
'fun': within_leverage
#}, {
# 'type': 'ineq',
# 'fun': meets_sharpe
}, {
'type': 'ineq',
'fun': meets_dd
}
]
optimal = minimize(objective, x, bounds=bounds, constraints=constraints,
options={'maxiter': 500})
# Optimization time!
display(optimal.message)
display("Holdings: {}".format(list(zip(tickers, optimal.x))))
# multiply by -100 to scale, and compensate for minimizing
expected_return = optimal.fun * -100
display("Expected Return: {:.3f}%".format(expected_return))
expected_drawdown = sum(abs(optimal.x) * dd_a) / sum(abs(optimal.x)) * 100
display("Expected Max Drawdown: {0:.2f}%".format(expected_drawdown))
# TODO: Calculate expected Sharpe
```
```
'Optimization terminated successfully.'
"Holdings: [('XOM', 5.8337945679814904),
('CVX', 42.935064321851307),
('CLB', -124.5),
('OXY', 36.790387773552119),
('SLB', 39.940753336615096)]"
'Expected Return: 32.375%'
'Expected Max Drawdown: 4.34%'
```


@ -0,0 +1,21 @@
Title: Autocallable Bonds
Date: 2015-11-27
Category: Blog
Tags: finance, simulation, monte carlo
Authors: Bradlee Speice
Summary: For a final project, my group was tasked with understanding three exotic derivatives: The Athena, Phoenix without memory, and Phoenix with memory autocallable products.
[//]: <> "Modified:"
My only non-core class this semester has been in Structured Products. We've been surveying a wide variety of products, and the final project was to pick one to report on.
Because these are all very similar, we decided to demonstrate all 3 products at once.
What follows below is a notebook demonstrating the usage of [Julia](http://julialang.com) for Monte-Carlo simulation of some exotic products.
{% notebook 2015-11-27-autocallable.ipynb language[julia] %}
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

Binary file not shown (image added, 72 KiB).

File diff suppressed because one or more lines are too long (image added, 81 KiB).

Binary file not shown (image added, 73 KiB).


@ -0,0 +1,121 @@
New SVG file (121 lines): a Gadfly line chart of five simulated price paths, x-axis "Months", y-axis "Value", color key 1 through 5. Full SVG markup omitted; rendered size 20 KiB.


@ -0,0 +1,473 @@
---
slug: 2015/11/autocallable
title: Autocallable Bonds
date: 2015-11-27 12:00:00
authors: [bspeice]
tags: []
---
For a final project, my group was tasked with understanding three exotic derivatives: The Athena, Phoenix without memory, and Phoenix with memory autocallable products.
<!-- truncate -->
My only non-core class this semester has been in Structured Products. We've been surveying a wide variety of products, and the final project was to pick one to report on.
Because these are all very similar, we decided to demonstrate all 3 products at once.
What follows below is a notebook demonstrating the usage of [Julia](http://julialang.com) for Monte-Carlo simulation of some exotic products.
---
```julia
using Gadfly
```
# Athena/Phoenix Simulation
## Underlying simulation
In order to price the autocallable bonds, we need to simulate the underlying assets. Let's go ahead and set up the simulation first, as this lays the foundation for what we're trying to do. We're going to use [JNJ](http://finance.yahoo.com/q?s=jnj) as the basis for our simulation. This implies the following parameters:
- $S_0$ = \$102.2 (as of time of writing)
- $q$ = 2.84%
- $r$ = [.49, .9, 1.21, 1.45, 1.69] (term structure as of time of writing, linear interpolation)
- $\mu$ = $r - q$ (note that this implies a negative drift because of current low rates)
- $\sigma$ = $\sigma_{imp}$ = 15.62% (from VIX implied volatility)
We additionally define some parameters for simulation:
- `T`: The number of years to simulate
- `m`: The number of paths to simulate
- `n`: The number of steps to simulate in a year
```julia
S0 = 102.2
nominal = 100
q = 2.84 / 100
σ = 15.37 / 100
term = [0, .49, .9, 1.21, 1.45, 1.69] / 100 + 1
###
# Potential: Based on PEP
# S0 = 100.6
# σ = 14.86
# q = 2.7
###
# Simulation parameters
T = 5 # Using years as the unit of time
n = 250 # simulations per year
m = 100000 # paths
num_simulations = 5; # simulation rounds per price
```
### Defining the simulation
To make things simpler, we simulate a single year at a time. This makes it easy to add in a dividend policy and to update the drift each year to match the term structure. Within each year, the underlying follows geometric Brownian motion (GBM).
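Concretely, each simulated step applies the exact log-normal update (the same update the `simulate_gbm` function below performs), with $Z$ a standard normal draw:

$$
S_{t + \Delta t} = S_t \exp\left[\left(\mu - \frac{\sigma^2}{2}\right)\Delta t + \sigma \sqrt{\Delta t}\, Z\right]
$$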
```julia
simulate_gbm = function(S0, μ, σ, T, n)
# Set the initial state
m = length(S0)
t = T / n
motion = zeros(m, n)
motion[:,1] = S0
# Build out all states
for i=1:(n-1)
motion[:,i+1] = motion[:,i] .* exp((μ - σ^2/2)*t) .* exp(sqrt(t) * σ .* randn(m))
end
return motion
end
function display_motion(motion, T)
# Given a matrix of paths, display the motion
n = length(motion[1,:])
m = length(motion[:,1])
x = repmat(1:n, m)
# Calculate the ticks we're going to use. We'd like to
# have an xtick every month, so calculate where those
# ticks will actually be at.
if (T > 3)
num_ticks = T
xlabel = "Years"
else
num_ticks = T * 12
xlabel = "Months"
end
tick_width = n / num_ticks
x_ticks = []
for i=1:round(num_ticks)
x_ticks = vcat(x_ticks, i*tick_width)
end
# Use one color for each path. I'm not sure if there's
# a better way to do this without going through DataFrames
colors = []
for i = 1:m
colors = vcat(colors, ones(n)*i)
end
plot(x=x, y=motion', color=colors, Geom.line,
Guide.xticks(ticks=x_ticks, label=false),
Guide.xlabel(xlabel),
Guide.ylabel("Value"))
end;
```
### Example simulation
Let's go ahead and run a sample simulation to see what the functions got us!
```julia
initial = ones(5) * S0
# Using μ=0, T=.25 for now, we'll use the proper values later
motion = simulate_gbm(initial, 0, σ, .25, 200)
display_motion(motion, .25)
```
![](./_notebook_files/_notebook_6_0.svg)
### Computing the term structure
Now that we've got the basic motion set up, let's start making things a bit more sophisticated for the model. We're going to assume that the drift of the stock is the difference between the implied forward rate and the quarterly dividend rate.
We're given the yearly term structure, and need to calculate the quarterly forward rate to match this structure. The term structure is assumed to follow:
$d(0, t) = d(0,t-1)\cdot f_{i-1, i}$
Where $f_{i-1, i}$ is the quarterly forward rate.
```julia
forward_term = function(yearly_term)
# It is assumed that a yearly term structure is passed in, starting at year 0
# This implies a nominal rate above 0 for the first year!
years = length(yearly_term)-1 # because we start at 0
structure = [(yearly_term[i+1] / yearly_term[i]) for i=1:years]
end;
```
### Illustrating the term structure
Now that we've got our term structure, let's validate that we're getting the correct results! If we've done this correctly, then:
```
term[2] == term[1] * structure[1]
```
```julia
# Example term structure taken from:
# http://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield
# Linear interpolation was used for years in between the periods given, assuming real-dollar
# interest rates
forward_yield = forward_term(term)
calculated_term2 = term[1] * forward_yield[1]
println("Actual term[2]: $(term[2]); Calculated term[2]: $(calculated_term2)")
```
```
Actual term[2]: 1.0049; Calculated term[2]: 1.0049
```
### The full underlying simulation
Now that we have the term structure set up, we can start doing some real simulation! Let's construct some paths through the full 5-year time frame. To do this, we simulate one year at a time and use the forward rate over each year to compute that year's drift, so there will be 5 yearly simulations chained together.
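In other words, the drift used over year $i$ is the forward rate for that year less the dividend yield:
$\mu_i = f_{i-1, i} - 1 - q$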
```julia
full_motion = ones(5) * S0
full_term = vcat(term[1], forward_yield)
for i=1:T
μ = (full_term[i] - 1 - q)
year_motion = simulate_gbm(full_motion[:,end], μ, σ, 1, n)
full_motion = hcat(full_motion, year_motion)
end
display_motion(full_motion, T)
```
![](./_notebook_files/_notebook_12_0.svg)
### Final simulation
We're now going to build out the full motion that we'll use for pricing our autocallable products. It will be largely the same as above, but with far more sample paths.
```julia
full_simulation = function(S0, T, n, m, term)
forward = vcat(term[1], forward_term(term))
# And an S0 to kick things off.
final_motion = ones(m) * S0
for i=1:T
μ = (forward[i] - 1 - q)
year_motion = simulate_gbm(final_motion[:,end], μ, σ, 1, n)
final_motion = hcat(final_motion, year_motion)
end
return final_motion
end
tic()
full_simulation(S0, T, n, m, term)
time = toq()
@printf("Time to run simulation: %.2fs", time)
```
```
Time to run simulation: 5.34s
```
## Athena Simulation
Now that we've defined our underlying simulation, let's try to price an Athena note. Athena has the following characteristics:
- Automatically called if the underlying is above the **call barrier** at observation
- Accelerated coupon paid if the underlying is above the **call barrier** at observation
- The coupon paid is $c \cdot i$ with $i$ as the current year, and $c$ the coupon rate
- Principal protection down to a **protection barrier** at maturity; all principal is at risk if this barrier is not met
- Observed yearly
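As a quick sketch of the payoff that the pricing function below implements (with $N$ the nominal, $c$ the coupon, and the growth at the forward rates between the call date and maturity omitted for brevity):
$\text{payoff} = \begin{cases} N + c \cdot i & \text{called at observation year } i \\ N & S_T > \text{protection barrier} \\ N \cdot S_T / S_0 & \text{otherwise} \end{cases}$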
```julia
call_barrier = S0
strike = S0
protection_barrier = S0 * .6
coupon = nominal * .07
price_athena = function(initial_price, year_prices, call_barrier,
protection_barrier, coupon, forward_structure)
total_coupons = 0
t = length(year_prices)
for i=1:t
price = year_prices[i]
if price ≥ call_barrier
return (nominal + coupon*i) * exp((prod(forward_structure[i:end])-1)*(t-i))
end
end
# We've reached maturity, time to check capital protection
if year_prices[end] > protection_barrier
return nominal
else
put = (strike - year_prices[end]) / strike
return nominal*(1-put)
end
end
forward_structure = forward_term(term)
price_function = (year_prices) -> price_athena(S0, year_prices,
call_barrier, protection_barrier, coupon, forward_structure)
athena = function()
year_indexes = [n*i for i=1:T]
motion = full_simulation(S0, T, n, m, term)
payoffs = [price_function(motion[i, year_indexes]) for i=1:m]
return mean(payoffs)
end
mean_payoffs = zeros(num_simulations)
for i=1:num_simulations
tic()
mean_payoffs[i] = athena()
time = toq()
@printf("Mean of simulation %i: \$%.4f; Simulation time: %.2fs\n", i, mean_payoffs[i], time)
end
final_mean = mean(mean_payoffs)
println("Mean over $num_simulations simulations: $(mean(mean_payoffs))")
pv = final_mean * (exp(-(prod(forward_structure)-1)*T))
@printf("Present value of Athena note: \$%.2f, notional: \$%.2f", pv, nominal)
```
```
Mean of simulation 1: $103.2805; Simulation time: 5.59s
Mean of simulation 2: $103.3796; Simulation time: 5.05s
Mean of simulation 3: $103.4752; Simulation time: 5.18s
Mean of simulation 4: $103.4099; Simulation time: 5.37s
Mean of simulation 5: $103.3260; Simulation time: 5.32s
Mean over 5 simulations: 103.37421610015554
Present value of Athena note: $95.00, notional: $100.00
```
## Phoenix without Memory Simulation
Let's move on to pricing a Phoenix without memory. It's very similar to the Athena product, with the exception that we introduce a coupon barrier so coupons are paid even when the underlying is below the initial price.
The Phoenix product has the following characteristics (example [here](https://www.rbccm.com/usstructurednotes/file-780079.pdf)):
- Automatically called if the underlying is above the **call barrier** at observation
- Coupon paid if the underlying is above a **coupon barrier** at observation
- Principal protection down to a **protection barrier** at maturity; all principal is at risk if this barrier is not met
- Observed yearly
Some example paths (all assume a call barrier at the current price and a coupon barrier some level below that):
- At the end of year 1, the stock is above the call barrier; the note is called and you receive the nominal plus the coupon for that year.
- At the end of year 1, the stock is above the coupon barrier, but not the call barrier; you receive the coupon. At the end of year 2, the stock is below the coupon barrier; you receive nothing. At the end of year 3, the stock is above the call barrier; the note is called and you receive the nominal plus a coupon for year 3.
We're going to re-use the same simulation, with the following parameters:
- Call barrier: 100%
- Coupon barrier: 80%
- Coupon: 6%
- Capital protection down to 60% (at maturity)
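One way to write the coupon accrual that the pricing function below applies at each observation year $i$ (a sketch, with $C_i$ the accrued coupons, $c$ the coupon, and $f_{i-1,i}$ the forward rate over the year):
$C_i = C_{i-1} \cdot e^{f_{i-1,i} - 1} + c \cdot \mathbb{1}\{S_i \geq \text{coupon barrier}\}$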
```julia
call_barrier = S0
coupon_barrier = S0 * .8
protection_barrier = S0 * .6
coupon = nominal * .06
price_phoenix_no_memory = function(initial_price, year_prices, call_barrier, coupon_barrier,
protection_barrier, coupon, forward_structure)
total_coupons = 0
t = length(year_prices)
for i=1:t
price = year_prices[i]
if price ≥ call_barrier
return (nominal + coupon + total_coupons)*exp((prod(forward_structure[i:end])-1)*(t-i))
elseif price ≥ coupon_barrier
total_coupons = total_coupons * exp(forward_structure[i]-1) + coupon
else
total_coupons *= exp(forward_structure[i]-1)
end
end
# We've reached maturity, time to check capital protection
if year_prices[end] > protection_barrier
return nominal + total_coupons
else
put = (strike - year_prices[end]) / strike
return nominal*(1-put)
end
end
forward_structure = forward_term(term)
price_function = (year_prices) -> price_phoenix_no_memory(S0, year_prices,
call_barrier, coupon_barrier, protection_barrier, coupon, forward_structure)
phoenix_no_memory = function()
year_indexes = [n*i for i=1:T]
motion = full_simulation(S0, T, n, m, term)
payoffs = [price_function(motion[i, year_indexes]) for i=1:m]
return mean(payoffs)
end
mean_payoffs = zeros(num_simulations)
for i=1:num_simulations
tic()
mean_payoffs[i] = phoenix_no_memory()
time = toq()
@printf("Mean of simulation %i: \$%.4f; Simulation time: %.2fs\n", i, mean_payoffs[i], time)
end
final_mean = mean(mean_payoffs)
println("Mean over $num_simulations simulations: $(mean(mean_payoffs))")
pv = final_mean * exp(-(prod(forward_structure)-1)*(T))
@printf("Present value of Phoenix without memory note: \$%.2f", pv)
```
```
Mean of simulation 1: $106.0562; Simulation time: 5.72s
Mean of simulation 2: $106.0071; Simulation time: 5.85s
Mean of simulation 3: $105.9959; Simulation time: 5.87s
Mean of simulation 4: $106.0665; Simulation time: 5.93s
Mean of simulation 5: $106.0168; Simulation time: 5.81s
Mean over 5 simulations: 106.02850857209883
Present value of Phoenix without memory note: $97.44
```
## Phoenix with Memory Simulation
The Phoenix with Memory structure is very similar to the Phoenix, but as the name implies, has a special "memory" property: **It remembers any coupons that haven't been paid at prior observation times, and pays them all if the underlying crosses the coupon barrier**. For example:
- Note issued with 100% call barrier, 70% coupon barrier. At year 1, the underlying is at 50%, so no coupons are paid. At year 2, the underlying is at 80%, so coupons for both year 1 and 2 are paid, resulting in a double coupon.
You can also find an example [here](https://www.rbccm.com/usstructurednotes/file-781232.pdf).
Let's go ahead and set up the simulation! The parameters are nearly the same (the coupon rate here is 7%), and we can expect the value to go up because of the memory attribute.
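In terms of the code below, when the coupon barrier is crossed at year $i$, the memory payment covers every observation since the last coupon was paid:
$\text{memory coupon}_i = (i - i_{\text{last}}) \cdot c$
so in the example above, the year-2 payment is $2c$, a double coupon.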
```julia
call_barrier = S0
coupon_barrier = S0 * .8
protection_barrier = S0 * .6
coupon = nominal * .07
price_phoenix_with_memory = function(initial_price, year_prices, call_barrier,
coupon_barrier, protection_barrier, coupon, forward_structure)
last_coupon = 0
total_coupons = 0
t = length(year_prices)
for i=1:t
price = year_prices[i]
if price > call_barrier
return (nominal + coupon + total_coupons)*exp((prod(forward_structure[i:end])-1)*(t-i))
elseif price > coupon_barrier
####################################################################
# The only difference between with/without memory is the below lines
memory_coupons = (i - last_coupon) * coupon
last_coupon = i
total_coupons = total_coupons * exp(forward_structure[i]-1) + memory_coupons
####################################################################
else
total_coupons *= exp(forward_structure[i]-1)
end
end
# We've reached maturity, time to check capital protection
if year_prices[end] > protection_barrier
return nominal + total_coupons
else
put = (strike - year_prices[end]) / strike
return nominal*(1-put)
end
end
forward_structure = forward_term(term)
price_function = (year_prices) -> price_phoenix_with_memory(S0, year_prices,
call_barrier, coupon_barrier, protection_barrier, coupon, forward_structure)
phoenix_with_memory = function()
year_indexes = [n*i for i=1:T]
motion = full_simulation(S0, T, n, m, term)
payoffs = [price_function(motion[i, year_indexes]) for i=1:m]
return mean(payoffs)
end
mean_payoffs = zeros(num_simulations)
for i=1:num_simulations
tic()
mean_payoffs[i] = phoenix_with_memory()
time = toq()
@printf("Mean of simulation %i: \$%.4f; Simulation time: %.2fs\n",
i, mean_payoffs[i], time)
end
final_mean = mean(mean_payoffs)
println("Mean over $num_simulations simulations: $(mean(mean_payoffs))")
pv = final_mean * exp(-(prod(forward_structure)-1)*(T))
@printf("Present value of Phoenix with memory note: \$%.2f", pv)
```
```
Mean of simulation 1: $108.8612; Simulation time: 5.89s
Mean of simulation 2: $109.0226; Simulation time: 5.90s
Mean of simulation 3: $108.9175; Simulation time: 5.92s
Mean of simulation 4: $108.9426; Simulation time: 5.94s
Mean of simulation 5: $108.8087; Simulation time: 6.06s
Mean over 5 simulations: 108.91052564051816
Present value of Phoenix with memory note: $100.09
```

View File

@ -0,0 +1,16 @@
Title: Testing Cramer
Date: 2015-12-26
Category: Blog
Tags: futures, data science
Authors: Bradlee Speice
Summary:
[//]: <> "Modified: "
{% notebook 2015-12-26-testing_cramer.ipynb %}
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

View File

@ -0,0 +1,428 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import requests\n",
"import pandas as pd\n",
"import numpy as np\n",
"from dateutil import parser as dtparser\n",
"from dateutil.relativedelta import relativedelta\n",
"from datetime import datetime\n",
"from html.parser import HTMLParser\n",
"from copy import copy\n",
"import Quandl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Testing Cramer\n",
"\n",
"Pursuant to attending a graduate school studying Financial Engineering, I've been a fan of the [Mad Money][3] TV show featuring the bombastic Jim Cramer. One of the things that he's said is that you shouldn't use the futures to predict where the stock market is going to go. But he says it often enough, I've begun to wonder - who is he trying to convince?\n",
"\n",
"It makes sense that because futures on things like the S&P 500 are traded continuously, they would price in market information before the stock market opens. So is Cramer right to be convinced that strategies based on the futures are a poor idea? I wanted to test it out.\n",
"\n",
"The first question is where to get the future's data. I've been part of [Seeking Alpha][2] for a bit, and they publish the [Wall Street Breakfast][3] newsletter which contains daily future's returns as of 6:20 AM EST. I'd be interested in using that data to see if we can actually make some money.\n",
"\n",
"First though, let's get the data:\n",
"\n",
"# Downloading Futures data from Seeking Alpha\n",
"\n",
"We're going to define two HTML parsing classes - one to get the article URL's from a page, and one to get the actual data from each article.\n",
"\n",
"[1]: http://www.cnbc.com/mad-money/\n",
"[2]: http://seekingalpha.com/\n",
"[3]: http://seekingalpha.com/author/wall-street-breakfast?s=wall-street-breakfast"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class ArticleListParser(HTMLParser):\n",
" \"\"\"Given a web page with articles on it, parse out the article links\"\"\"\n",
" \n",
" articles = []\n",
" \n",
" def handle_starttag(self, tag, attrs):\n",
" #if tag == 'div' and (\"id\", \"author_articles_wrapper\") in attrs:\n",
" # self.fetch_links = True\n",
" if tag == 'a' and ('class', 'dashboard_article_link') in attrs:\n",
" href = list(filter(lambda x: x[0] == 'href', attrs))[0][1]\n",
" self.articles.append(href)\n",
" \n",
"base_url = \"http://seekingalpha.com/author/wall-street-breakfast/articles\"\n",
"article_page_urls = [base_url] + [base_url + '/{}'.format(i) for i in range(2, 20)]\n",
"\n",
"global_articles = []\n",
"for page in article_page_urls:\n",
" # We need to switch the user agent, as SA blocks the standard requests agent\n",
" articles_html = requests.get(page,\n",
" headers={\"User-Agent\": \"Wget/1.13.4\"})\n",
" parser = ArticleListParser()\n",
" parser.feed(articles_html.text)\n",
" global_articles += (parser.articles)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"class ArticleReturnParser(HTMLParser):\n",
" \"Given an article, parse out the futures returns in it\"\n",
" \n",
" record_font_tags = False\n",
" in_font_tag = False\n",
" counter = 0\n",
" # data = {} # See __init__\n",
" \n",
" def __init__(self, *args, **kwargs):\n",
" super().__init__(*args, **kwargs)\n",
" self.data = {}\n",
" \n",
" def handle_starttag(self, tag, attrs):\n",
" if tag == 'span' and ('itemprop', 'datePublished') in attrs:\n",
" date_string = list(filter(lambda x: x[0] == 'content', attrs))[0][1]\n",
" date = dtparser.parse(date_string)\n",
" self.data['date'] = date\n",
" \n",
" self.in_font_tag = tag == 'font'\n",
" \n",
" def safe_float(self, string):\n",
" try:\n",
" return float(string[:-1]) / 100\n",
" except ValueError:\n",
" return np.NaN\n",
" \n",
" def handle_data(self, content):\n",
" if not self.record_font_tags and \"Futures at 6\" in content:\n",
" self.record_font_tags = True\n",
" \n",
" if self.record_font_tags and self.in_font_tag:\n",
" if self.counter == 0:\n",
" self.data['DOW'] = self.safe_float(content)\n",
" elif self.counter == 1:\n",
" self.data['S&P'] = self.safe_float(content)\n",
" elif self.counter == 2:\n",
" self.data['NASDAQ'] = self.safe_float(content)\n",
" elif self.counter == 3:\n",
" self.data['Crude'] = self.safe_float(content)\n",
" elif self.counter == 4:\n",
" self.data['Gold'] = self.safe_float(content)\n",
" \n",
" self.counter += 1\n",
" \n",
" def handle_endtag(self, tag):\n",
" self.in_font_tag = False\n",
"\n",
"def retrieve_data(url):\n",
" sa = \"http://seekingalpha.com\"\n",
" article_html = requests.get(sa + url,\n",
" headers={\"User-Agent\": \"Wget/1.13.4\"})\n",
" parser = ArticleReturnParser()\n",
" parser.feed(article_html.text)\n",
" parser.data.update({\"url\": url})\n",
" parser.data.update({\"text\": article_html.text})\n",
" return parser.data\n",
"\n",
"# This copy **MUST** be in place. I'm not sure why,\n",
"# as you'd think that the data being returned would already\n",
"# represent a different memory location. Even so, it blows up\n",
"# if you don't do this.\n",
"article_list = list(set(global_articles))\n",
"article_data = [copy(retrieve_data(url)) for url in article_list]\n",
"# If there's an issue downloading the article, drop it.\n",
"article_df = pd.DataFrame.from_dict(article_data).dropna()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fetching the Returns data\n",
"\n",
"Now that we have the futures data, we're going to compare across 4 different indices - the S&P 500 index, Dow Jones Industrial, Russell 2000, and NASDAQ 100. Let's get the data off of Quandl to make things easier!"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# article_df is sorted by date, so we get the first row.\n",
"start_date = article_df.sort_values(by='date').iloc[0]['date'] - relativedelta(days=1)\n",
"SPY = Quandl.get(\"GOOG/NYSE_SPY\", trim_start=start_date)\n",
"DJIA = Quandl.get(\"GOOG/AMS_DIA\", trim_start=start_date)\n",
"RUSS = Quandl.get(\"GOOG/AMEX_IWM\", trim_start=start_date)\n",
"NASDAQ = Quandl.get(\"GOOG/EPA_QQQ\", trim_start=start_date)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Running the Comparison\n",
"\n",
"There are two types of tests I want to determine: How accurate each futures category is at predicting the index's opening change over the close before, and predicting the index's daily return.\n",
"\n",
"Let's first calculate how good each future is at predicting the opening return over the previous day. I expect that the futures will be more than 50% accurate, since the information is recorded 3 hours before the markets open."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Articles Checked: \n",
" DJIA NASDAQ RUSS SPY\n",
"Crude 268 268 271 271\n",
"DOW 268 268 271 271\n",
"Gold 268 268 271 271\n",
"NASDAQ 268 268 271 271\n",
"S&P 268 268 271 271\n",
"\n",
"Prediction Accuracy:\n",
" DJIA NASDAQ RUSS SPY\n",
"Crude 0.544776 0.522388 0.601476 0.590406\n",
"DOW 0.611940 0.604478 0.804428 0.841328\n",
"Gold 0.462687 0.455224 0.464945 0.476015\n",
"NASDAQ 0.615672 0.608209 0.797048 0.830258\n",
"S&P 0.604478 0.597015 0.811808 0.848708\n"
]
}
],
"source": [
"def calculate_opening_ret(frame):\n",
" # I'm not a huge fan of the appending for loop,\n",
" # but it's a bit verbose for a comprehension\n",
" data = {}\n",
" for i in range(1, len(frame)):\n",
" date = frame.iloc[i].name\n",
" prior_close = frame.iloc[i-1]['Close']\n",
" open_val = frame.iloc[i]['Open']\n",
" data[date] = (open_val - prior_close) / prior_close\n",
" \n",
" return data\n",
"\n",
"SPY_open_ret = calculate_opening_ret(SPY)\n",
"DJIA_open_ret = calculate_opening_ret(DJIA)\n",
"RUSS_open_ret = calculate_opening_ret(RUSS)\n",
"NASDAQ_open_ret = calculate_opening_ret(NASDAQ)\n",
"\n",
"def signs_match(list_1, list_2):\n",
" # This is a surprisingly difficult task - we have to match\n",
" # up the dates in order to check if opening returns actually match\n",
" index_dict_dt = {key.to_datetime(): list_2[key] for key in list_2.keys()}\n",
" \n",
" matches = []\n",
" for row in list_1.iterrows():\n",
" row_dt = row[1][1]\n",
" row_value = row[1][0]\n",
" index_dt = datetime(row_dt.year, row_dt.month, row_dt.day)\n",
" if index_dt in list_2:\n",
" index_value = list_2[index_dt]\n",
" if (row_value > 0 and index_value > 0) or \\\n",
" (row_value < 0 and index_value < 0) or \\\n",
" (row_value == 0 and index_value == 0):\n",
" matches += [1]\n",
" else:\n",
" matches += [0]\n",
" #print(\"{}\".format(list_2[index_dt]))\n",
" return matches\n",
" \n",
" \n",
"prediction_dict = {}\n",
"matches_dict = {}\n",
"count_dict = {}\n",
"index_dict = {\"SPY\": SPY_open_ret, \"DJIA\": DJIA_open_ret, \"RUSS\": RUSS_open_ret, \"NASDAQ\": NASDAQ_open_ret}\n",
"indices = [\"SPY\", \"DJIA\", \"RUSS\", \"NASDAQ\"]\n",
"futures = [\"Crude\", \"Gold\", \"DOW\", \"NASDAQ\", \"S&P\"]\n",
"for index in indices:\n",
" matches_dict[index] = {future: signs_match(article_df[[future, 'date']],\n",
" index_dict[index]) for future in futures}\n",
" count_dict[index] = {future: len(matches_dict[index][future]) for future in futures}\n",
" prediction_dict[index] = {future: np.mean(matches_dict[index][future])\n",
" for future in futures}\n",
"print(\"Articles Checked: \")\n",
"print(pd.DataFrame.from_dict(count_dict))\n",
"print()\n",
"print(\"Prediction Accuracy:\")\n",
"print(pd.DataFrame.from_dict(prediction_dict))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This data is very interesting. Some insights:\n",
"\n",
"- Both DOW and NASDAQ futures are pretty bad at predicting their actual market openings\n",
"- NASDAQ and Dow are fairly unpredictable; Russell 2000 and S&P are very predictable\n",
"- Gold is a poor predictor in general - intuitively Gold should move inverse to the market, but it appears to be about as accurate as a coin flip.\n",
"\n",
"All said though it appears that futures data is important for determining market direction for both the S&P 500 and Russell 2000. Cramer is half-right: futures data isn't very helpful for the Dow and NASDAQ indices, but is great for the S&P and Russell indices.\n",
"\n",
"# The next step - Predicting the close\n",
"\n",
"Given the code we currently have, I'd like to predict the close of the market as well. We can re-use most of the code, so let's see what happens:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Articles Checked:\n",
" DJIA NASDAQ RUSS SPY\n",
"Crude 268 268 271 271\n",
"DOW 268 268 271 271\n",
"Gold 268 268 271 271\n",
"NASDAQ 268 268 271 271\n",
"S&P 268 268 271 271\n",
"\n",
"Prediction Accuracy:\n",
" DJIA NASDAQ RUSS SPY\n",
"Crude 0.533582 0.529851 0.501845 0.542435\n",
"DOW 0.589552 0.608209 0.535055 0.535055\n",
"Gold 0.455224 0.451493 0.483395 0.512915\n",
"NASDAQ 0.582090 0.626866 0.531365 0.538745\n",
"S&P 0.585821 0.608209 0.535055 0.535055\n"
]
}
],
"source": [
"def calculate_closing_ret(frame):\n",
" # I'm not a huge fan of the appending for loop,\n",
" # but it's a bit verbose for a comprehension\n",
" data = {}\n",
" for i in range(0, len(frame)):\n",
" date = frame.iloc[i].name\n",
" open_val = frame.iloc[i]['Open']\n",
" close_val = frame.iloc[i]['Close']\n",
" data[date] = (close_val - open_val) / open_val\n",
" \n",
" return data\n",
"\n",
"SPY_close_ret = calculate_closing_ret(SPY)\n",
"DJIA_close_ret = calculate_closing_ret(DJIA)\n",
"RUSS_close_ret = calculate_closing_ret(RUSS)\n",
"NASDAQ_close_ret = calculate_closing_ret(NASDAQ)\n",
"\n",
"def signs_match(list_1, list_2):\n",
" # This is a surprisingly difficult task - we have to match\n",
" # up the dates in order to check if opening returns actually match\n",
" index_dict_dt = {key.to_datetime(): list_2[key] for key in list_2.keys()}\n",
" \n",
" matches = []\n",
" for row in list_1.iterrows():\n",
" row_dt = row[1][1]\n",
" row_value = row[1][0]\n",
" index_dt = datetime(row_dt.year, row_dt.month, row_dt.day)\n",
" if index_dt in list_2:\n",
" index_value = list_2[index_dt]\n",
" if (row_value > 0 and index_value > 0) or \\\n",
" (row_value < 0 and index_value < 0) or \\\n",
" (row_value == 0 and index_value == 0):\n",
" matches += [1]\n",
" else:\n",
" matches += [0]\n",
" #print(\"{}\".format(list_2[index_dt]))\n",
" return matches\n",
" \n",
" \n",
"matches_dict = {}\n",
"count_dict = {}\n",
"prediction_dict = {}\n",
"index_dict = {\"SPY\": SPY_close_ret, \"DJIA\": DJIA_close_ret,\n",
" \"RUSS\": RUSS_close_ret, \"NASDAQ\": NASDAQ_close_ret}\n",
"indices = [\"SPY\", \"DJIA\", \"RUSS\", \"NASDAQ\"]\n",
"futures = [\"Crude\", \"Gold\", \"DOW\", \"NASDAQ\", \"S&P\"]\n",
"for index in indices:\n",
" matches_dict[index] = {future: signs_match(article_df[[future, 'date']],\n",
" index_dict[index]) for future in futures}\n",
" count_dict[index] = {future: len(matches_dict[index][future]) for future in futures}\n",
" prediction_dict[index] = {future: np.mean(matches_dict[index][future])\n",
" for future in futures}\n",
" \n",
"print(\"Articles Checked:\")\n",
"print(pd.DataFrame.from_dict(count_dict))\n",
"print()\n",
"print(\"Prediction Accuracy:\")\n",
"print(pd.DataFrame.from_dict(prediction_dict))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Well, it appears that the futures data is terrible at predicting market close. NASDAQ predicting NASDAQ is the most interesting data point, but 63% accuracy isn't accurate enough to make money consistently.\n",
"\n",
"# Final sentiments\n",
"\n",
"The data bears out very close to what I expected would happen:\n",
"\n",
"- Futures data is more accurate than a coin flip for predicting openings, which makes sense since it is recorded only 3 hours before the actual opening\n",
"- Futures data is about as acccurate as a coin flip for predicting closings, which means there is no money to be made in trying to predict the market direction for the day given the futures data.\n",
"\n",
"In summary:\n",
"\n",
"- Cramer is half right: Futures data is not good for predicting the market open of the Dow and NASDAQ indices. Contrary to Cramer though, it is very good for predicting the S&P and Russell indices - we can achieve an accuracy slightly over 80% for each. \n",
"- Making money in the market is hard. We can't just go to the futures and treat them as an oracle for where the market will close.\n",
"\n",
"I hope you've enjoyed this, I quite enjoyed taking a deep dive in the analytics this way. I'll be posting more soon!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@ -0,0 +1,329 @@
```python
import requests
import pandas as pd
import numpy as np
from dateutil import parser as dtparser
from dateutil.relativedelta import relativedelta
from datetime import datetime
from html.parser import HTMLParser
from copy import copy
import Quandl
```
# Testing Cramer
While attending graduate school studying Financial Engineering, I've been a fan of the [Mad Money][1] TV show featuring the bombastic Jim Cramer. One of the things that he's said is that you shouldn't use the futures to predict where the stock market is going to go. But he says it often enough that I've begun to wonder - who is he trying to convince?
It makes sense that because futures on things like the S&P 500 are traded continuously, they would price in market information before the stock market opens. So is Cramer right to be convinced that strategies based on the futures are a poor idea? I wanted to test it out.
The first question is where to get the futures data. I've been part of [Seeking Alpha][2] for a bit, and they publish the [Wall Street Breakfast][3] newsletter which contains daily futures returns as of 6:20 AM EST. I'd be interested in using that data to see if we can actually make some money.
First though, let's get the data:
# Downloading Futures data from Seeking Alpha
We're going to define two HTML parsing classes - one to get the article URLs from a page, and one to get the actual data from each article.
[1]: http://www.cnbc.com/mad-money/
[2]: http://seekingalpha.com/
[3]: http://seekingalpha.com/author/wall-street-breakfast?s=wall-street-breakfast
```python
class ArticleListParser(HTMLParser):
"""Given a web page with articles on it, parse out the article links"""
articles = []
def handle_starttag(self, tag, attrs):
#if tag == 'div' and ("id", "author_articles_wrapper") in attrs:
# self.fetch_links = True
if tag == 'a' and ('class', 'dashboard_article_link') in attrs:
href = list(filter(lambda x: x[0] == 'href', attrs))[0][1]
self.articles.append(href)
base_url = "http://seekingalpha.com/author/wall-street-breakfast/articles"
article_page_urls = [base_url] + [base_url + '/{}'.format(i) for i in range(2, 20)]
global_articles = []
for page in article_page_urls:
# We need to switch the user agent, as SA blocks the standard requests agent
articles_html = requests.get(page,
headers={"User-Agent": "Wget/1.13.4"})
parser = ArticleListParser()
parser.feed(articles_html.text)
global_articles += (parser.articles)
```
```python
class ArticleReturnParser(HTMLParser):
"Given an article, parse out the futures returns in it"
record_font_tags = False
in_font_tag = False
counter = 0
# data = {} # See __init__
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.data = {}
def handle_starttag(self, tag, attrs):
if tag == 'span' and ('itemprop', 'datePublished') in attrs:
date_string = list(filter(lambda x: x[0] == 'content', attrs))[0][1]
date = dtparser.parse(date_string)
self.data['date'] = date
self.in_font_tag = tag == 'font'
def safe_float(self, string):
try:
return float(string[:-1]) / 100
except ValueError:
return np.NaN
def handle_data(self, content):
if not self.record_font_tags and "Futures at 6" in content:
self.record_font_tags = True
if self.record_font_tags and self.in_font_tag:
if self.counter == 0:
self.data['DOW'] = self.safe_float(content)
elif self.counter == 1:
self.data['S&P'] = self.safe_float(content)
elif self.counter == 2:
self.data['NASDAQ'] = self.safe_float(content)
elif self.counter == 3:
self.data['Crude'] = self.safe_float(content)
elif self.counter == 4:
self.data['Gold'] = self.safe_float(content)
self.counter += 1
def handle_endtag(self, tag):
self.in_font_tag = False
def retrieve_data(url):
sa = "http://seekingalpha.com"
article_html = requests.get(sa + url,
headers={"User-Agent": "Wget/1.13.4"})
parser = ArticleReturnParser()
parser.feed(article_html.text)
parser.data.update({"url": url})
parser.data.update({"text": article_html.text})
return parser.data
# This copy **MUST** be in place. I'm not sure why,
# as you'd think that the data being returned would already
# represent a different memory location. Even so, it blows up
# if you don't do this.
article_list = list(set(global_articles))
article_data = [copy(retrieve_data(url)) for url in article_list]
# If there's an issue downloading the article, drop it.
article_df = pd.DataFrame.from_dict(article_data).dropna()
```
# Fetching the Returns data
Now that we have the futures data, we're going to compare across 4 different indices - the S&P 500 index, Dow Jones Industrial, Russell 2000, and NASDAQ 100. Let's get the data off of Quandl to make things easier!
```python
# article_df is sorted by date, so we get the first row.
start_date = article_df.sort_values(by='date').iloc[0]['date'] - relativedelta(days=1)
SPY = Quandl.get("GOOG/NYSE_SPY", trim_start=start_date)
DJIA = Quandl.get("GOOG/AMS_DIA", trim_start=start_date)
RUSS = Quandl.get("GOOG/AMEX_IWM", trim_start=start_date)
NASDAQ = Quandl.get("GOOG/EPA_QQQ", trim_start=start_date)
```
# Running the Comparison
There are two things I want to test: how accurate each futures category is at predicting the index's opening change relative to the prior close, and how accurate it is at predicting the index's daily return.
Let's first calculate how good each future is at predicting the opening return over the previous day's close. I expect that the futures will be more than 50% accurate, since the information is recorded 3 hours before the markets open.
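As a condensed sketch of the accuracy metric used below (an assumption about intent - the actual `signs_match` function below also has to align article dates with trading dates), a prediction counts as a hit when the futures return and the index return share the same sign:
```python
import numpy as np

def sign_hit_rate(futures_returns, index_returns):
    """Fraction of observations where the two return series agree in sign."""
    futures_returns = np.asarray(futures_returns, dtype=float)
    index_returns = np.asarray(index_returns, dtype=float)
    return np.mean(np.sign(futures_returns) == np.sign(index_returns))

# Example: three of the four days agree in sign, so the hit rate is 0.75
print(sign_hit_rate([0.01, -0.02, 0.005, 0.01], [0.02, -0.01, -0.003, 0.015]))
```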
```python
def calculate_opening_ret(frame):
# I'm not a huge fan of the appending for loop,
# but it's a bit verbose for a comprehension
data = {}
for i in range(1, len(frame)):
date = frame.iloc[i].name
prior_close = frame.iloc[i-1]['Close']
open_val = frame.iloc[i]['Open']
data[date] = (open_val - prior_close) / prior_close
return data
SPY_open_ret = calculate_opening_ret(SPY)
DJIA_open_ret = calculate_opening_ret(DJIA)
RUSS_open_ret = calculate_opening_ret(RUSS)
NASDAQ_open_ret = calculate_opening_ret(NASDAQ)
def signs_match(list_1, list_2):
# This is a surprisingly difficult task - we have to match
# up the dates in order to check if opening returns actually match
index_dict_dt = {key.to_datetime(): list_2[key] for key in list_2.keys()}
matches = []
for row in list_1.iterrows():
row_dt = row[1][1]
row_value = row[1][0]
index_dt = datetime(row_dt.year, row_dt.month, row_dt.day)
if index_dt in list_2:
index_value = list_2[index_dt]
if (row_value > 0 and index_value > 0) or \
(row_value < 0 and index_value < 0) or \
(row_value == 0 and index_value == 0):
matches += [1]
else:
matches += [0]
#print("{}".format(list_2[index_dt]))
return matches
prediction_dict = {}
matches_dict = {}
count_dict = {}
index_dict = {"SPY": SPY_open_ret, "DJIA": DJIA_open_ret, "RUSS": RUSS_open_ret, "NASDAQ": NASDAQ_open_ret}
indices = ["SPY", "DJIA", "RUSS", "NASDAQ"]
futures = ["Crude", "Gold", "DOW", "NASDAQ", "S&P"]
for index in indices:
matches_dict[index] = {future: signs_match(article_df[[future, 'date']],
index_dict[index]) for future in futures}
count_dict[index] = {future: len(matches_dict[index][future]) for future in futures}
prediction_dict[index] = {future: np.mean(matches_dict[index][future])
for future in futures}
print("Articles Checked: ")
print(pd.DataFrame.from_dict(count_dict))
print()
print("Prediction Accuracy:")
print(pd.DataFrame.from_dict(prediction_dict))
```
Articles Checked:
DJIA NASDAQ RUSS SPY
Crude 268 268 271 271
DOW 268 268 271 271
Gold 268 268 271 271
NASDAQ 268 268 271 271
S&P 268 268 271 271
Prediction Accuracy:
DJIA NASDAQ RUSS SPY
Crude 0.544776 0.522388 0.601476 0.590406
DOW 0.611940 0.604478 0.804428 0.841328
Gold 0.462687 0.455224 0.464945 0.476015
NASDAQ 0.615672 0.608209 0.797048 0.830258
S&P 0.604478 0.597015 0.811808 0.848708
This data is very interesting. Some insights:
- Both DOW and NASDAQ futures are pretty bad at predicting their actual market openings
- NASDAQ and Dow are fairly unpredictable; Russell 2000 and S&P are very predictable
- Gold is a poor predictor in general - intuitively, gold should move inversely to the market, but it appears to be about as accurate as a coin flip.
All said, though, it appears that futures data is important for determining market direction for both the S&P 500 and Russell 2000. Cramer is half-right: futures data isn't very helpful for the Dow and NASDAQ indices, but is great for the S&P and Russell indices.
# The next step - Predicting the close
Given the code we currently have, I'd like to predict the close of the market as well. We can re-use most of the code, so let's see what happens:
```python
def calculate_closing_ret(frame):
# I'm not a huge fan of the appending for loop,
# but it's a bit verbose for a comprehension
data = {}
for i in range(0, len(frame)):
date = frame.iloc[i].name
open_val = frame.iloc[i]['Open']
close_val = frame.iloc[i]['Close']
data[date] = (close_val - open_val) / open_val
return data
SPY_close_ret = calculate_closing_ret(SPY)
DJIA_close_ret = calculate_closing_ret(DJIA)
RUSS_close_ret = calculate_closing_ret(RUSS)
NASDAQ_close_ret = calculate_closing_ret(NASDAQ)
def signs_match(list_1, list_2):
# This is a surprisingly difficult task - we have to match
# up the dates in order to check if opening returns actually match
index_dict_dt = {key.to_datetime(): list_2[key] for key in list_2.keys()}
matches = []
for row in list_1.iterrows():
row_dt = row[1][1]
row_value = row[1][0]
index_dt = datetime(row_dt.year, row_dt.month, row_dt.day)
if index_dt in list_2:
index_value = list_2[index_dt]
if (row_value > 0 and index_value > 0) or \
(row_value < 0 and index_value < 0) or \
(row_value == 0 and index_value == 0):
matches += [1]
else:
matches += [0]
#print("{}".format(list_2[index_dt]))
return matches
matches_dict = {}
count_dict = {}
prediction_dict = {}
index_dict = {"SPY": SPY_close_ret, "DJIA": DJIA_close_ret,
"RUSS": RUSS_close_ret, "NASDAQ": NASDAQ_close_ret}
indices = ["SPY", "DJIA", "RUSS", "NASDAQ"]
futures = ["Crude", "Gold", "DOW", "NASDAQ", "S&P"]
for index in indices:
matches_dict[index] = {future: signs_match(article_df[[future, 'date']],
index_dict[index]) for future in futures}
count_dict[index] = {future: len(matches_dict[index][future]) for future in futures}
prediction_dict[index] = {future: np.mean(matches_dict[index][future])
for future in futures}
print("Articles Checked:")
print(pd.DataFrame.from_dict(count_dict))
print()
print("Prediction Accuracy:")
print(pd.DataFrame.from_dict(prediction_dict))
```
Articles Checked:
DJIA NASDAQ RUSS SPY
Crude 268 268 271 271
DOW 268 268 271 271
Gold 268 268 271 271
NASDAQ 268 268 271 271
S&P 268 268 271 271
Prediction Accuracy:
DJIA NASDAQ RUSS SPY
Crude 0.533582 0.529851 0.501845 0.542435
DOW 0.589552 0.608209 0.535055 0.535055
Gold 0.455224 0.451493 0.483395 0.512915
NASDAQ 0.582090 0.626866 0.531365 0.538745
S&P 0.585821 0.608209 0.535055 0.535055
Well, it appears that the futures data is terrible at predicting market close. NASDAQ predicting NASDAQ is the most interesting data point, but 63% accuracy isn't accurate enough to make money consistently.
# Final sentiments
The data comes out very close to what I expected:
- Futures data is more accurate than a coin flip for predicting openings, which makes sense since it is recorded only 3 hours before the actual opening
- Futures data is about as accurate as a coin flip for predicting closings, which means there is no money to be made in trying to predict the market direction for the day given the futures data.
In summary:
- Cramer is half right: Futures data is not good for predicting the market open of the Dow and NASDAQ indices. Contrary to Cramer though, it is very good for predicting the S&P and Russell indices - we can achieve an accuracy slightly over 80% for each.
- Making money in the market is hard. We can't just go to the futures and treat them as an oracle for where the market will close.
I hope you've enjoyed this, I quite enjoyed taking a deep dive in the analytics this way. I'll be posting more soon!

View File

@ -0,0 +1,327 @@
---
slug: 2015/12/testing-cramer
title: Testing Cramer
date: 2015-12-26 12:00:00
authors: [bspeice]
tags: []
---
While attending graduate school studying Financial Engineering, I've been a fan of the [Mad Money][1] TV show featuring the bombastic Jim Cramer. One of the things that he's said is that you shouldn't use the futures to predict where the stock market is going to go. But he says it often enough that I've begun to wonder - who is he trying to convince?
<!-- truncate -->
It makes sense that because futures on things like the S&P 500 are traded continuously, they would price in market information before the stock market opens. So is Cramer right to be convinced that strategies based on the futures are a poor idea? I wanted to test it out.
The first question is where to get the futures data. I've been part of [Seeking Alpha][2] for a bit, and they publish the [Wall Street Breakfast][3] newsletter which contains daily futures returns as of 6:20 AM EST. I'd be interested in using that data to see if we can actually make some money.
First though, let's get the data:
## Downloading Futures data from Seeking Alpha
We're going to define two HTML parsing classes - one to get the article URLs from a page, and one to get the actual data from each article.
[1]: http://www.cnbc.com/mad-money/
[2]: http://seekingalpha.com/
[3]: http://seekingalpha.com/author/wall-street-breakfast?s=wall-street-breakfast
```python
class ArticleListParser(HTMLParser):
"""Given a web page with articles on it, parse out the article links"""
articles = []
def handle_starttag(self, tag, attrs):
#if tag == 'div' and ("id", "author_articles_wrapper") in attrs:
# self.fetch_links = True
if tag == 'a' and ('class', 'dashboard_article_link') in attrs:
href = list(filter(lambda x: x[0] == 'href', attrs))[0][1]
self.articles.append(href)
base_url = "http://seekingalpha.com/author/wall-street-breakfast/articles"
article_page_urls = [base_url] + [base_url + '/{}'.format(i) for i in range(2, 20)]
global_articles = []
for page in article_page_urls:
# We need to switch the user agent, as SA blocks the standard requests agent
articles_html = requests.get(page,
headers={"User-Agent": "Wget/1.13.4"})
parser = ArticleListParser()
parser.feed(articles_html.text)
global_articles += (parser.articles)
```
```python
class ArticleReturnParser(HTMLParser):
"Given an article, parse out the futures returns in it"
record_font_tags = False
in_font_tag = False
counter = 0
# data = {} # See __init__
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.data = {}
def handle_starttag(self, tag, attrs):
if tag == 'span' and ('itemprop', 'datePublished') in attrs:
date_string = list(filter(lambda x: x[0] == 'content', attrs))[0][1]
date = dtparser.parse(date_string)
self.data['date'] = date
self.in_font_tag = tag == 'font'
def safe_float(self, string):
try:
return float(string[:-1]) / 100
except ValueError:
return np.NaN
def handle_data(self, content):
if not self.record_font_tags and "Futures at 6" in content:
self.record_font_tags = True
if self.record_font_tags and self.in_font_tag:
if self.counter == 0:
self.data['DOW'] = self.safe_float(content)
elif self.counter == 1:
self.data['S&P'] = self.safe_float(content)
elif self.counter == 2:
self.data['NASDAQ'] = self.safe_float(content)
elif self.counter == 3:
self.data['Crude'] = self.safe_float(content)
elif self.counter == 4:
self.data['Gold'] = self.safe_float(content)
self.counter += 1
def handle_endtag(self, tag):
self.in_font_tag = False
def retrieve_data(url):
sa = "http://seekingalpha.com"
article_html = requests.get(sa + url,
headers={"User-Agent": "Wget/1.13.4"})
parser = ArticleReturnParser()
parser.feed(article_html.text)
parser.data.update({"url": url})
parser.data.update({"text": article_html.text})
return parser.data
# This copy **MUST** be in place. I'm not sure why,
# as you'd think that the data being returned would already
# represent a different memory location. Even so, it blows up
# if you don't do this.
article_list = list(set(global_articles))
article_data = [copy(retrieve_data(url)) for url in article_list]
# If there's an issue downloading the article, drop it.
article_df = pd.DataFrame.from_dict(article_data).dropna()
```
## Fetching the Returns data
Now that we have the futures data, we're going to compare across 4 different indices - the S&P 500 index, Dow Jones Industrial, Russell 2000, and NASDAQ 100. Let's get the data off of Quandl to make things easier!
```python
# article_df is sorted by date, so we get the first row.
start_date = article_df.sort_values(by='date').iloc[0]['date'] - relativedelta(days=1)
SPY = Quandl.get("GOOG/NYSE_SPY", trim_start=start_date)
DJIA = Quandl.get("GOOG/AMS_DIA", trim_start=start_date)
RUSS = Quandl.get("GOOG/AMEX_IWM", trim_start=start_date)
NASDAQ = Quandl.get("GOOG/EPA_QQQ", trim_start=start_date)
```
## Running the Comparison
There are two things I want to test: how accurate each futures category is at predicting the index's opening change relative to the prior close, and how accurate it is at predicting the index's daily return.
Let's first calculate how good each future is at predicting the opening return over the previous day's close. I expect that the futures will be more than 50% accurate, since the information is recorded 3 hours before the markets open.
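As a condensed sketch of the accuracy metric used below (an assumption about intent - the actual `signs_match` function below also has to align article dates with trading dates), a prediction counts as a hit when the futures return and the index return share the same sign:
```python
import numpy as np

def sign_hit_rate(futures_returns, index_returns):
    """Fraction of observations where the two return series agree in sign."""
    futures_returns = np.asarray(futures_returns, dtype=float)
    index_returns = np.asarray(index_returns, dtype=float)
    return np.mean(np.sign(futures_returns) == np.sign(index_returns))

# Example: three of the four days agree in sign, so the hit rate is 0.75
print(sign_hit_rate([0.01, -0.02, 0.005, 0.01], [0.02, -0.01, -0.003, 0.015]))
```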
```python
def calculate_opening_ret(frame):
# I'm not a huge fan of the appending for loop,
# but it's a bit verbose for a comprehension
data = {}
for i in range(1, len(frame)):
date = frame.iloc[i].name
prior_close = frame.iloc[i-1]['Close']
open_val = frame.iloc[i]['Open']
data[date] = (open_val - prior_close) / prior_close
return data
SPY_open_ret = calculate_opening_ret(SPY)
DJIA_open_ret = calculate_opening_ret(DJIA)
RUSS_open_ret = calculate_opening_ret(RUSS)
NASDAQ_open_ret = calculate_opening_ret(NASDAQ)
def signs_match(list_1, list_2):
# This is a surprisingly difficult task - we have to match
# up the dates in order to check if opening returns actually match
index_dict_dt = {key.to_datetime(): list_2[key] for key in list_2.keys()}
matches = []
for row in list_1.iterrows():
row_dt = row[1][1]
row_value = row[1][0]
index_dt = datetime(row_dt.year, row_dt.month, row_dt.day)
if index_dt in list_2:
index_value = list_2[index_dt]
if (row_value > 0 and index_value > 0) or \
(row_value < 0 and index_value < 0) or \
(row_value == 0 and index_value == 0):
matches += [1]
else:
matches += [0]
#print("{}".format(list_2[index_dt]))
return matches
prediction_dict = {}
matches_dict = {}
count_dict = {}
index_dict = {"SPY": SPY_open_ret, "DJIA": DJIA_open_ret, "RUSS": RUSS_open_ret, "NASDAQ": NASDAQ_open_ret}
indices = ["SPY", "DJIA", "RUSS", "NASDAQ"]
futures = ["Crude", "Gold", "DOW", "NASDAQ", "S&P"]
for index in indices:
matches_dict[index] = {future: signs_match(article_df[[future, 'date']],
index_dict[index]) for future in futures}
count_dict[index] = {future: len(matches_dict[index][future]) for future in futures}
prediction_dict[index] = {future: np.mean(matches_dict[index][future])
for future in futures}
print("Articles Checked: ")
print(pd.DataFrame.from_dict(count_dict))
print()
print("Prediction Accuracy:")
print(pd.DataFrame.from_dict(prediction_dict))
```
```
Articles Checked:
DJIA NASDAQ RUSS SPY
Crude 268 268 271 271
DOW 268 268 271 271
Gold 268 268 271 271
NASDAQ 268 268 271 271
S&P 268 268 271 271
Prediction Accuracy:
DJIA NASDAQ RUSS SPY
Crude 0.544776 0.522388 0.601476 0.590406
DOW 0.611940 0.604478 0.804428 0.841328
Gold 0.462687 0.455224 0.464945 0.476015
NASDAQ 0.615672 0.608209 0.797048 0.830258
S&P 0.604478 0.597015 0.811808 0.848708
```
This data is very interesting. Some insights:
- Both DOW and NASDAQ futures are pretty bad at predicting their actual market openings
- NASDAQ and Dow are fairly unpredictable; Russell 2000 and S&P are very predictable
- Gold is a poor predictor in general - intuitively, gold should move inversely to the market, but it appears to be about as accurate as a coin flip.
All said, though, it appears that futures data is important for determining market direction for both the S&P 500 and Russell 2000. Cramer is half-right: futures data isn't very helpful for the Dow and NASDAQ indices, but is great for the S&P and Russell indices.
## The next step - Predicting the close
Given the code we currently have, I'd like to predict the close of the market as well. We can re-use most of the code, so let's see what happens:
```python
def calculate_closing_ret(frame):
# I'm not a huge fan of the appending for loop,
# but it's a bit verbose for a comprehension
data = {}
for i in range(0, len(frame)):
date = frame.iloc[i].name
open_val = frame.iloc[i]['Open']
close_val = frame.iloc[i]['Close']
data[date] = (close_val - open_val) / open_val
return data
SPY_close_ret = calculate_closing_ret(SPY)
DJIA_close_ret = calculate_closing_ret(DJIA)
RUSS_close_ret = calculate_closing_ret(RUSS)
NASDAQ_close_ret = calculate_closing_ret(NASDAQ)
def signs_match(list_1, list_2):
# This is a surprisingly difficult task - we have to match
# up the dates in order to check if opening returns actually match
index_dict_dt = {key.to_datetime(): list_2[key] for key in list_2.keys()}
matches = []
for row in list_1.iterrows():
row_dt = row[1][1]
row_value = row[1][0]
index_dt = datetime(row_dt.year, row_dt.month, row_dt.day)
if index_dt in list_2:
index_value = list_2[index_dt]
if (row_value > 0 and index_value > 0) or \
(row_value < 0 and index_value < 0) or \
(row_value == 0 and index_value == 0):
matches += [1]
else:
matches += [0]
#print("{}".format(list_2[index_dt]))
return matches
matches_dict = {}
count_dict = {}
prediction_dict = {}
index_dict = {"SPY": SPY_close_ret, "DJIA": DJIA_close_ret,
"RUSS": RUSS_close_ret, "NASDAQ": NASDAQ_close_ret}
indices = ["SPY", "DJIA", "RUSS", "NASDAQ"]
futures = ["Crude", "Gold", "DOW", "NASDAQ", "S&P"]
for index in indices:
matches_dict[index] = {future: signs_match(article_df[[future, 'date']],
index_dict[index]) for future in futures}
count_dict[index] = {future: len(matches_dict[index][future]) for future in futures}
prediction_dict[index] = {future: np.mean(matches_dict[index][future])
for future in futures}
print("Articles Checked:")
print(pd.DataFrame.from_dict(count_dict))
print()
print("Prediction Accuracy:")
print(pd.DataFrame.from_dict(prediction_dict))
```
```
Articles Checked:
DJIA NASDAQ RUSS SPY
Crude 268 268 271 271
DOW 268 268 271 271
Gold 268 268 271 271
NASDAQ 268 268 271 271
S&P 268 268 271 271
Prediction Accuracy:
DJIA NASDAQ RUSS SPY
Crude 0.533582 0.529851 0.501845 0.542435
DOW 0.589552 0.608209 0.535055 0.535055
Gold 0.455224 0.451493 0.483395 0.512915
NASDAQ 0.582090 0.626866 0.531365 0.538745
S&P 0.585821 0.608209 0.535055 0.535055
```
Well, it appears that the futures data is terrible at predicting market close. NASDAQ predicting NASDAQ is the most interesting data point, but 63% accuracy isn't accurate enough to make money consistently.
## Final sentiments
The data comes out very close to what I expected:
- Futures data is more accurate than a coin flip for predicting openings, which makes sense since it is recorded only 3 hours before the actual opening
- Futures data is about as accurate as a coin flip for predicting closings, which means there is no money to be made in trying to predict the market direction for the day given the futures data.
In summary:
- Cramer is half right: Futures data is not good for predicting the market open of the Dow and NASDAQ indices. Contrary to Cramer though, it is very good for predicting the S&P and Russell indices - we can achieve an accuracy slightly over 80% for each.
- Making money in the market is hard. We can't just go to the futures and treat them as an oracle for where the market will close.
I hope you've enjoyed this, I quite enjoyed taking a deep dive in the analytics this way. I'll be posting more soon!

Binary file not shown.

After

Width:  |  Height:  |  Size: 117 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 136 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 27 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 160 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 136 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 18 KiB

View File

@ -0,0 +1,16 @@
Title: Complaining about the Weather
Date: 2016-01-01
Category: Blog
Tags: weather
Authors: Bradlee Speice
Summary: Figuring out whether people should be complaining about the recent weather in NC.
[//]: <> "Modified: "
{% notebook 2016-1-1-complaining-about-weather.ipynb %}
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,258 @@
---
slug: 2016/01/complaining-about-the-weather
title: Complaining about the weather
date: 2016-01-01 12:00:00
authors: [bspeice]
tags: []
---
Figuring out whether people should be complaining about the recent weather in North Carolina.
<!-- truncate -->
```python
from bokeh.plotting import figure, output_notebook, show
from bokeh.palettes import PuBuGn9 as Palette
import pandas as pd
import numpy as np
from datetime import datetime
import pickle
output_notebook()
```
```
BokehJS successfully loaded.
```
I'm originally from North Carolina, and I've been hearing a lot of people talking about how often it's been raining recently. They're excited for any day that has sun.
So I got a bit curious: has North Carolina actually had more cloudy and rainy days over the past few months than in previous years? This shouldn't be a particularly challenging task, but I'm interested to know whether people's perceptions actually reflect reality.
The data we'll use comes from [forecast.io](https://forecast.io), since they can give us a cloud cover percentage. I've gone ahead and retrieved the data into a pickle file, and included the [code that was used to generate it](#generating-the-forecast-file). First up: what was the average cloud cover in North Carolina during July - November, and how many days were cloudy? We're going to assume that a "cloudy" day is any day in which the cloud cover is above 50%.
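As a minimal sketch of that rule (for illustration only; the analysis below works directly with the forecast entries):
```python
def is_cloudy(cloud_cover):
    """A day counts as cloudy when the cloud cover fraction is above 50%."""
    return cloud_cover is not None and cloud_cover > 0.5

# Example: 62% cloud cover counts as cloudy, 35% does not
print(is_cloudy(0.62), is_cloudy(0.35))
```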
```python
city_forecasts = pickle.load(open('city_forecasts.p', 'rb'))
forecast_df = pd.DataFrame.from_dict(city_forecasts)
```
```python
cary_forecast = forecast_df['cary']
years = range(1990, 2016)
months = range(7, 12)
months_str = ['July', 'August', 'September', 'October', 'November']
def safe_cover(frame):
if frame and 'cloudCover' in frame:
return frame['cloudCover']
else:
return np.NaN
def monthly_avg_cloudcover(year, month):
dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
end=datetime(year, month + 1, 1, 12),
freq='D', closed='left')
cloud_cover_vals = list(map(lambda x: safe_cover(cary_forecast[x]['currently']), dates))
cloud_cover_samples = len(list(filter(lambda x: x is not np.NaN, cloud_cover_vals)))
return np.nanmean(cloud_cover_vals), cloud_cover_samples
monthly_cover_vals = [[monthly_avg_cloudcover(y, m)[0] for y in years] for m in months]
f = figure(title='Monthly Average Cloud Cover',
x_range=(1990, 2015),
x_axis_label='Year')
for x in range(0, len(months)):
f.line(years, monthly_cover_vals[x], legend=months_str[x], color=Palette[x])
show(f)
```
![Monthly average cloud cover chart](./1.png)
As we can see from the chart above, the monthly average cloud cover has generally been trending down over time. The average cloud cover is also lower than it was last year - it seems people are mostly just complaining. There are some data issues starting in 2012 that we need to be aware of - the cloud cover percentage doesn't exist for all days. Even so, the data that we have seems to reflect the wider trend, so we'll assume for now that the missing data doesn't skew our results.
There's one more metric we want to check though - how many cloudy days were there? This is probably a better gauge of sentiment than the average monthly cover.
```python
def monthly_cloudy_days(year, month):
dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
end=datetime(year, month + 1, 1, 12),
freq='D', closed='left')
cloud_cover_vals = list(map(lambda x: safe_cover(cary_forecast[x]['currently']), dates))
cloud_cover_samples = len(list(filter(lambda x: x is not np.NaN, cloud_cover_vals)))
cloudy_days = [cover > .5 for cover in cloud_cover_vals]
return np.count_nonzero(cloudy_days), cloud_cover_samples
monthly_days_vals = [[monthly_cloudy_days(y, m)[0] for y in years] for m in months]
monthly_cover_samples = [[monthly_cloudy_days(y, m)[1] for y in years] for m in months]
f = figure(title='Monthly Cloudy Days',
x_range=(1990, 2015),
x_axis_label='Year')
for x in range(0, len(months)):
f.line(years, monthly_days_vals[x], legend=months_str[x], color=Palette[x])
show(f)
f = figure(title='Monthly Cloud Cover Samples',
x_range=(1990, 2015),
x_axis_label='Year',
height=300)
for x in range(0, len(months)):
f.line(years, monthly_cover_samples[x], legend=months_str[x], color=Palette[x])
show(f)
```
![Monthly cloudy days chart](./2.png)
![Monthly cloud cover samples chart](./3.png)
On the whole, the number of cloudy days seems to reflect the trend with average cloud cover - it's actually becoming more sunny as time progresses. That said, we need to be careful in how we view this number - because there weren't as many samples in 2015 as previous years, the number of days can get thrown off. In context though, even if most days not recorded were in fact cloudy, the overall count for 2015 would still be lower than previous years.
# Tracking Precipitation Chances
In addition to checking cloud cover, I wanted to check precipitation data as well - what is the average precipitation chance over a month, and how many days during a month is rain likely? The thinking is that days with a high precipitation chance will also be days in which it is cloudy or depressing.
```python
def safe_precip(frame):
if frame and 'precipProbability' in frame:
return frame['precipProbability']
else:
return np.NaN
def monthly_avg_precip(year, month):
dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
end=datetime(year, month + 1, 1, 12),
freq='D', closed='left')
precip_vals = list(map(lambda x: safe_precip(cary_forecast[x]['currently']), dates))
precip_samples = len(list(filter(lambda x: x is not np.NaN, precip_vals)))
return np.nanmean(precip_vals), precip_samples
monthly_avg_precip_vals = [[monthly_avg_precip(y, m)[0] for y in years] for m in months]
f = figure(title='Monthly Average Precipitation Chance',
x_range=(1990, 2015),
x_axis_label='Year')
for x in range(0, len(months)):
f.line(years, monthly_avg_precip_vals[x], legend=months_str[x], color=Palette[x])
show(f)
```
![Monthly average precipitation chance chart](./4.png)
As we can see from the chart, the average chance of precipitation over a month more or less stays within a band of 0 - .1 for all months over all years. This is further evidence that the past few months are no more cloudy or rainy than previous years. Like the cloud cover though, we still want to get a count of all the rainy days, in addition to the average chance. We'll define a "rainy day" as any day in which the chance of rain is greater than 25%.
```python
def monthly_rainy_days(year, month):
dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
end=datetime(year, month + 1, 1, 12),
freq='D', closed='left')
precip_prob_vals = list(map(lambda x: safe_precip(cary_forecast[x]['currently']), dates))
precip_prob_samples = len(list(filter(lambda x: x is not np.NaN, precip_prob_vals)))
precip_days = [prob > .25 for prob in precip_prob_vals]
return np.count_nonzero(precip_days), precip_prob_samples
monthly_precip_days_vals = [[monthly_rainy_days(y, m)[0] for y in years] for m in months]
monthly_precip_samples = [[monthly_rainy_days(y, m)[1] for y in years] for m in months]
f = figure(title='Monthly Rainy Days',
x_range=(1990, 2015),
x_axis_label='Year')
for x in range(0, len(months)):
f.line(years, monthly_precip_days_vals[x], legend=months_str[x], color=Palette[x])
show(f)
f = figure(title='Monthly Rainy Days Samples',
x_range=(1990, 2015),
x_axis_label='Year',
height=300)
for x in range(0, len(months)):
f.line(years, monthly_precip_samples[x], legend=months_str[x], color=Palette[x])
show(f)
```
![Monthly rainy days chart](./5.png)
![Monthly rainy days samples chart](./6.png)
Counting the number of rainy days, we can see that November hit its maximum in 2015. However, that value is 6, compared to a previous maximum of 5. While it is a new record, the value isn't actually all that different. And for the other months, the values are mostly in line with the averages.
# Summary and Conclusions
After looking at forecast data for Cary, it appears that the weather from July through November this year was at worst on par with prior years, if not slightly sunnier. This seems to be a case of confirmation bias: someone complains about a string of cloudy or rainy days, and suddenly you start noticing them more.
While this analysis doesn't take into account other areas of North Carolina, my initial guess is that other areas will show similar results: nothing interesting is happening. Maybe that will be a topic for a later blog post!
Coming soon: I'll compare rain/cloud conditions in North Carolina to some other places in the U.S.!
# Generating the Forecast file
The following code generates the file used throughout this blog post. Please note that while I'm retrieving data for other cities to use in a future blog post, only Cary data was used here.
```python
import pandas as pd
from functools import reduce
import requests
from datetime import datetime
# Coordinate data from http://itouchmap.com/latlong.html
cary_loc = (35.79154,-78.781117)
nyc_loc = (40.78306,-73.971249)
seattle_loc = (47.60621,-122.332071)
binghamton_loc = (42.098687,-75.917974)
cities = {
'cary': cary_loc,
'nyc': nyc_loc,
'seattle': seattle_loc,
'binghamton': binghamton_loc
}
apikey = '' # My super-secret API Key
def get_forecast(lat, long, date=None):
forecast_base = "https://api.forecast.io/forecast/"
if date is None:
url = forecast_base + apikey + '/{},{}'.format(lat, long)
else:
epoch = int(date.timestamp())
url = forecast_base + apikey + '/{},{},{}'.format(lat, long, epoch)
return requests.get(url).json()
years = range(1990,2016)
# For datetimes, the 12 is for getting the weather at noon.
# We're doing this over midnight because we're more concerned
# with what people see, and people don't typically see the weather
# at midnight.
dt_indices = [pd.date_range(start=datetime(year, 7, 1, 12),
end=datetime(year, 11, 30, 12))
for year in years]
dt_merge = reduce(lambda x, y: x.union(y), dt_indices)
# Because we have to pay a little bit to use the API, we use for loops here
# instead of a comprehension - if something breaks, we want to preserve the
# data already retrieved
city_forecasts = {}
for city, loc in cities.items():
print("Retrieving data for {} starting at {}".format(city,
datetime.now().strftime("%I:%M:%S %p")))
for dt in dt_merge:
try:
city_forecasts[(city, dt)] = get_forecast(*loc, dt)
except Exception as e:
print(e)
city_forecasts[(city, dt)] = None
print("End forecast retrieval: {}".format(datetime.now().strftime("%I:%M:%S %p")))
import pickle
pickle.dump(city_forecasts, open('city_forecasts.p', 'wb'))
### Output:
# Retrieving data for binghamton starting at 05:13:42 PM
# Retrieving data for seattle starting at 05:30:51 PM
# Retrieving data for nyc starting at 05:48:30 PM
# Retrieving data for cary starting at 06:08:32 PM
# End forecast retrieval: 06:25:21 PM
```

Binary file not shown.

After

Width:  |  Height:  |  Size: 98 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 100 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 94 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 110 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 100 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 97 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 101 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 97 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 113 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 103 KiB

View File

@ -0,0 +1,15 @@
Title: Cloudy in Seattle
Date: 2016-01-23
Category: Blog
Tags: weather, data science
Authors: Bradlee Speice
Summary: Building on prior analysis, is Seattle's reputation as a depressing city actually well-earned?
[//]: <> "Modified: "
{% notebook 2016-1-23-cloudy-in-seattle.ipynb %}
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,154 @@
---
slug: 2016/01/cloudy-in-seattle
title: Cloudy in Seattle
date: 2016-01-23 12:00:00
authors: [bspeice]
tags: []
---
Building on prior analysis, is Seattle's reputation as a depressing city actually well-earned?
<!-- truncate -->
```python
import pickle
import pandas as pd
import numpy as np
from bokeh.plotting import output_notebook, figure, show
from bokeh.palettes import RdBu4 as Palette
from datetime import datetime
import warnings
output_notebook()
```
```
BokehJS successfully loaded.
```
## Examining other cities
After taking some time to explore how the weather in North Carolina stacked up over the past years, I was interested in doing the same analysis for other cities. Growing up with family from Binghamton, NY, I was always told it was very cloudy there. And Seattle has a nasty reputation for being very depressing and cloudy. All said, the cities I want to examine are:
- Binghamton, NY
- Cary, NC
- Seattle, WA
- New York City, NY
I'd be interested to try this analysis worldwide at some point - comparing London and Seattle might be interesting. For now though, we'll stick with trying out the US data.
There will be plenty of charts. I want to know: **How has average cloud cover and precipitation chance changed over the years for each city mentioned?** This will hopefully tell us whether Seattle has actually earned its reputation for being a depressing city.
```python
city_forecasts = pickle.load(open('city_forecasts.p', 'rb'))
forecasts_df = pd.DataFrame.from_dict(city_forecasts)
```
```python
cities = ['binghamton', 'cary', 'nyc', 'seattle']
city_colors = {cities[i]: Palette[i] for i in range(0, 4)}
def safe_cover(frame):
if frame and 'cloudCover' in frame:
return frame['cloudCover']
else:
return np.NaN
def monthly_avg_cloudcover(city, year, month):
dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
end=datetime(year, month + 1, 1, 12),
freq='D', closed='left')
cloud_cover_vals = list(map(lambda x: safe_cover(forecasts_df[city][x]['currently']), dates))
cloud_cover_samples = len(list(filter(lambda x: x is not np.NaN, cloud_cover_vals)))
# Ignore an issue with nanmean having all NaN values. We'll discuss the data issues below.
with warnings.catch_warnings():
warnings.simplefilter('ignore')
return np.nanmean(cloud_cover_vals), cloud_cover_samples
```
```python
years = range(1990, 2016)
def city_avg_cc(city, month):
return [monthly_avg_cloudcover(city, y, month) for y in years]
months = [
('July', 7),
('August', 8),
('September', 9),
('October', 10),
('November', 11)
]
for month, month_id in months:
month_averages = {city: city_avg_cc(city, month_id) for city in cities}
f = figure(title="{} Average Cloud Cover".format(month),
x_axis_label='Year',
y_axis_label='Cloud Cover Percentage')
for city in cities:
f.line(years, [x[0] for x in month_averages[city]],
legend=city, color=city_colors[city])
show(f)
```
![July average cloud cover chart](./1.png)
![August average cloud cover chart](./2.png)
![September average cloud cover chart](./3.png)
![October average cloud cover chart](./4.png)
![November average cloud cover chart](./5.png)
Well, as it happens, there are some data issues. July's data is a bit sporadic, and 2013 seems to be missing from most months as well. I think only two things can really be confirmed here:
- Seattle, specifically for the months of October and November, is in fact significantly cloudier on average than the other cities
- All cities surveyed have seen average cloud cover decline over the months studied. There are data issues, but the trend seems clear.
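To get a rough sense of how bad those gaps are, here's a minimal sketch that tallies the non-missing sample counts `monthly_avg_cloudcover` already returns, per city and year:

```python
# Sketch: quantify the data issues by counting how many days actually had
# cloud cover data, per city and year, across the months studied.
for city in cities:
    yearly_samples = {
        year: sum(monthly_avg_cloudcover(city, year, month_id)[1]
                  for _, month_id in months)
        for year in years
    }
    print(city, yearly_samples)
```

Years with unusually low counts (like 2013) are the ones to treat with caution.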
Let's now move from cloud cover data to looking at average rainfall chance.
```python
def safe_precip(frame):
if frame and 'precipProbability' in frame:
return frame['precipProbability']
else:
return np.NaN
def monthly_avg_precip(city, year, month):
dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
end=datetime(year, month + 1, 1, 12),
freq='D', closed='left')
precip_vals = list(map(lambda x: safe_precip(forecasts_df[city][x]['currently']), dates))
precip_samples = len(list(filter(lambda x: x is not np.NaN, precip_vals)))
# Ignore an issue with nanmean having all NaN values. We'll discuss the data issues below.
with warnings.catch_warnings():
warnings.simplefilter('ignore')
return np.nanmean(precip_vals), precip_samples
def city_avg_precip(city, month):
return [monthly_avg_precip(city, y, month) for y in years]
for month, month_id in months:
    month_averages = {city: city_avg_precip(city, month_id) for city in cities}
f = figure(title="{} Average Precipitation Chance".format(month),
x_axis_label='Year',
y_axis_label='Precipitation Chance Percentage')
for city in cities:
f.line(years, [x[0] for x in month_averages[city]],
legend=city, color=city_colors[city])
show(f)
```
![July average precipitation chance chart](./6.png)
![August average precipitation chance chart](./7.png)
![September average precipitation chance chart](./8.png)
![October average precipitation chance chart](./9.png)
![November average precipitation chance chart](./10.png)
The same data issue caveats apply here: 2013 seems to be missing some data, and July has some issues as well. However, this seems to confirm the trends we saw with cloud cover:
- Seattle, specifically for the months of August, October, and November has had a consistently higher chance of rain than other cities surveyed.
- Average precipitation chance, just like average cloud cover, has been trending down over time.
## Conclusion
I have to admit I was a bit surprised after doing this analysis. Seattle showed a higher average cloud cover and average precipitation chance than did the other cities surveyed. Maybe Seattle is actually an objectively more depressing city to live in.
Well that's all for weather data at the moment. It's been a great experiment, but I think this is about as far as I'll be able to get with weather data without some domain knowledge. Talk again soon!

View File

@ -0,0 +1,15 @@
Title: Guaranteed Money Maker
Date: 2016-02-03
Category: Blog
Tags: martingale, strategy
Authors: Bradlee Speice
Summary: Developing an investment strategy based on the Martingale betting strategy
[//]: <> "Modified: "
{% notebook 2016-2-3-guaranteed-money-maker.ipynb %}
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

View File

@ -0,0 +1,260 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### If you can see into the future, that is.\n",
"\n",
"My previous class in Stochastic Calculus covered a lot of interesting topics, and the important one for today\n",
"is the [Gambler's Ruin][1] problem. If you're interested in some of the theory behind it, also make sure to check out\n",
"[random walks][2]. The important bit is that we studied the [Martingale Betting Strategy][3], which describes for us\n",
"a **guaranteed way** to <span style='font-size: x-small'>eventually</span> make money.\n",
"\n",
"The strategy goes like this: You are going to toss a fair coin with a friend. If you guess heads or tails correctly, you get back double the money you bet. If you guess incorrectly, you lose money. How should you bet?\n",
"\n",
"The correct answer is that you should double your bet each time you lose. Then when you finally win, you'll be guaranteed to make back everything you lost and then &#36;1 extra! Consider the scenario:\n",
"\n",
"1. You bet &#36;1, and guess incorrectly. You're 1 dollar in the hole.\n",
"2. You bet &#36;2, and guess incorrectly. You're 3 dollars in the hole now.\n",
"3. You bet &#36;4, and guess incorrectly. You're 7 dollars in the hole.\n",
"4. You bet &#36;8, and guess correctly! You now get back those 8 dollars you bet, plus 8 extra for winning, for a **total profit of one dollar**!\n",
"\n",
"Mathematically, we can prove that as long as you have unlimited money to bet, you are guaranteed to make money.\n",
"\n",
"# Applying the Martingale Strategy\n",
"\n",
"But we're all realistic people, and once you start talking about \"unlimited money\" eyebrows should be raised. Even still, this is an interesting strategy to investigate, and I want to apply it to the stock market. As long as we can guarantee there's a single day in which the stock goes up, we should be able to make money right? The question is just how much we have to invest to guarantee this.\n",
"\n",
"Now it's time for the math. We'll use the following definitions:\n",
"\n",
"- $o_i$ = the share price at the opening of day $i$\n",
"- $c_i$ = the share price at the close of day $i$\n",
"- $d_i$ = the amount of money we want to invest at the beginning of day $i$\n",
"\n",
"With those definitions in place, I'd like to present the formula that is **guaranteed to make you money**. I call it *Bradlee's Investment Formula*:\n",
"\n",
"$c_n \\sum_{i=1}^n \\frac{d_i}{o_i} > \\sum_{i=1}^{n} d_i$\n",
"\n",
"It might not look like much, but if you can manage to make it so that this formula holds true, you will be guaranteed to make money. The intuition behind the formula is this: The closing share price times the number of shares you have purchased ends up greater than the amount of money you invested.\n",
"\n",
"That is, on day $n$, <span style='font-size: x-small'>if you know what the closing price will be</span> you can set up the amount of money you invest that day to **guarantee you make money**. I'll even teach you to figure out how much money that is! Take a look:\n",
"\n",
"$\n",
"\\begin{align}\n",
"c_n \\sum_{i=1}^{n-1} \\frac{d_i}{o_i} + \\frac{c_nd_n}{o_n} &> \\sum_{i=1}^{n-1}d_i + d_n\\\\\n",
"\\frac{c_nd_n}{o_n} - d_n &> \\sum_{i=1}^{n-1}(d_i - \\frac{c_nd_i}{o_i})\\\\\n",
"d_n (\\frac{c_n - o_n}{o_n}) &> \\sum_{i=1}^{n-1} d_i(1 - \\frac{c_n}{o_i})\\\\\n",
"d_n &> \\frac{o_n}{c_n - o_n} \\sum_{i=1}^{n-1} d_i(1 - \\frac{1}{o_i})\n",
"\\end{align}$\n",
"\n",
"If you invest exactly $d_n$ that day, you'll break even. But if you can make sure the money you invest is greater than that quantity on the right <span style='font-size: x-small'>(which requires that you have a crystal ball tell you the stock's closing price)</span> you are **guaranteed to make money!**\n",
"\n",
"# Interesting Implications\n",
"\n",
"On a more serious note though, the formula above tells us a couple of interesting things:\n",
"\n",
"1. It's impossible to make money without the closing price at some point being greater than the opening price (or vice-versa if you are short selling) - there is no amount of money you can invest that will turn things in your favor.\n",
"2. Close prices of the past aren't important if you're concerned about the bottom line. While chart technicians use price history to make judgment calls, in the end, the closing price on anything other than the last day is irrelevant.\n",
"3. It's possible to make money as long as there is a single day where the closing price is greater than the opening price! You might have to invest a lot to do so, but it's possible.\n",
"4. You must make a prediction about where the stock will close at if you want to know how much to invest. That is, we can set up our investment for the day to make money if the stock goes up 1%, but if it only goes up .5% we'll still lose money.\n",
"5. It's possible the winning move is to scale back your position. Consider the scenario:\n",
" - You invest money and the stock closes down the day .5%\n",
" - You invest tomorrow expecting the stock to go up 1%\n",
" - The winning investment to break even (assuming a 1% increase) is to scale back the position, since the shares you purchased at the beginning would then be profitable\n",
"\n",
"# Running the simulation\n",
"\n",
"So now that we've defined our investment formula,we need to tweak a couple things in order to make an investment strategy we can actually work with. There are two issues we need to address:\n",
"\n",
"1. The formula only tells us how much to invest if we want to break even ($d_n$). If we actually want to turn a profit, we need to invest more than that, which we will refer to as the **bias**.\n",
"2. The formula assumes we know what the closing price will be on any given day. If we don't know this, we can still invest assuming the stock price will close at a level we choose. If the price doesn't meet this objective, we try again tomorrow! This predetermined closing price will be referred to as the **expectation**.\n",
"\n",
"Now that we've defined our *bias* and *expectation*, we can actually build a strategy we can simulate. Much like the martingale strategy told you to bet twice your previous bet in order to make money, we've designed a system that tells us how much to bet in order to make money as well.\n",
"\n",
"Now, let's get to the code!\n",
"\n",
"[1]: https://en.wikipedia.org/wiki/Gambler's_ruin\n",
"[2]: https://en.wikipedia.org/wiki/Random_walk\n",
"[3]: https://en.wikipedia.org/wiki/Martingale_%28betting_system%29"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"using Quandl\n",
"api_key = \"\"\n",
"daily_investment = function(current_open, current_close, purchase_history, open_history)\n",
" # We're not going to safeguard against divide by 0 - that's the user's responsibility\n",
" t1 = current_close / current_open - 1\n",
" t2 = sum(purchase_history - purchase_history*current_close ./ open_history)\n",
" return t2 / t1\n",
"end;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And let's code a way to run simulations quickly:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"is_profitable = function(current_price, purchase_history, open_history)\n",
" shares = sum(purchase_history ./ open_history)\n",
" return current_price*shares > sum(purchase_history)\n",
"end\n",
"\n",
"simulate = function(name, start, init, expected, bias)\n",
" ticker_info = quandlget(name, from=start, api_key=api_key)\n",
" open_vals = ticker_info[\"Open\"].values\n",
" close_vals = ticker_info[\"Close\"].values\n",
" invested = [init]\n",
" \n",
" # The simulation stops once we've made a profit\n",
" day = 1\n",
" profitable = is_profitable(close_vals[day], invested, open_vals[1:length(invested)]) ||\n",
" is_profitable(open_vals[day+1], invested, open_vals[1:length(invested)])\n",
" while !profitable\n",
" expected_close = open_vals[day+1] * expected\n",
" todays_purchase = daily_investment(open_vals[day+1], expected_close, invested, open_vals[1:day])\n",
" invested = [invested; todays_purchase + bias]\n",
" # expected_profit = expected_close * sum(invested ./ open_vals[1:length(invested)]) - sum(invested)\n",
" day += 1\n",
" profitable = is_profitable(close_vals[day], invested, open_vals[1:length(invested)]) ||\n",
" is_profitable(open_vals[day+1], invested, open_vals[1:length(invested)])\n",
" end\n",
" \n",
" shares = sum(invested ./ open_vals[1:length(invested)])\n",
" max_profit = max(close_vals[day], open_vals[day+1])\n",
" profit = shares * max_profit - sum(invested)\n",
" return (invested, profit)\n",
"end\n",
"\n",
"sim_summary = function(investments, profit)\n",
" leverages = [sum(investments[1:i]) for i=1:length(investments)]\n",
" max_leverage = maximum(leverages) / investments[1]\n",
" println(\"Max leverage: $(max_leverage)\")\n",
" println(\"Days invested: $(length(investments))\")\n",
" println(\"Profit: $profit\")\n",
"end;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's get some data and run a simulation! Our first test:\n",
"\n",
"- We'll invest 100 dollars in LMT, and expect that the stock will close up 1% every day. We'll invest $d_n$ + 10 dollars every day that we haven't turned a profit, and end the simulation once we've made a profit."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Max leverage: 5.590373200042106\n",
"Days invested: 5\n",
"Profit: 0.6894803101560001\n"
]
}
],
"source": [
"investments, profit = simulate(\"YAHOO/LMT\", Date(2015, 11, 29), 100, 1.01, 10)\n",
"sim_summary(investments, profit)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result: We need to invest 5.6x our initial position over a period of 5 days to make approximately .69&#162;\n",
"\n",
"- Now let's try the same thing, but we'll assume the stock closes up 2% instead."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Max leverage: 1.854949900247809\n",
"Days invested: 25\n",
"Profit: 0.08304813163696423\n"
]
}
],
"source": [
"investments, profit = simulate(\"YAHOO/LMT\", Date(2015, 11, 29), 100, 1.02, 10)\n",
"sim_summary(investments, profit)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, we only get up to a 1.85x leveraged position, but it takes 25 days to turn a profit of 8&#162;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Summary\n",
"\n",
"We've defined an investment strategy that can tell us how much to invest when we know what the closing position of a stock will be. We can tweak the strategy to actually make money, but plenty of work needs to be done so that we can optimize the money invested.\n",
"\n",
"In the next post I'm going to post more information about some backtests and strategy tests on this strategy (unless of course this experiment actually produces a significant profit potential, and then I'm keeping it for myself)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Side note and disclaimer\n",
"\n",
"The claims made in this presentation about being able to guarantee making money are intended as a joke and do not constitute investment advice of any sort."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 0.4.2",
"language": "julia",
"name": "julia-0.4"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "0.4.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@ -0,0 +1,171 @@
### If you can see into the future, that is.
My previous class in Stochastic Calculus covered a lot of interesting topics, and the important one for today
is the [Gambler's Ruin][1] problem. If you're interested in some of the theory behind it, also make sure to check out
[random walks][2]. The important bit is that we studied the [Martingale Betting Strategy][3], which describes for us
a **guaranteed way** to <span style='font-size: x-small'>eventually</span> make money.
The strategy goes like this: You are going to toss a fair coin with a friend. If you guess heads or tails correctly, you get back double the money you bet. If you guess incorrectly, you lose money. How should you bet?
The correct answer is that you should double your bet each time you lose. Then when you finally win, you'll be guaranteed to make back everything you lost and then &#36;1 extra! Consider the scenario:
1. You bet &#36;1, and guess incorrectly. You're 1 dollar in the hole.
2. You bet &#36;2, and guess incorrectly. You're 3 dollars in the hole now.
3. You bet &#36;4, and guess incorrectly. You're 7 dollars in the hole.
4. You bet &#36;8, and guess correctly! You now get back those 8 dollars you bet, plus 8 extra for winning, for a **total profit of one dollar**!
Mathematically, we can prove that as long as you have unlimited money to bet, you are guaranteed to make money.
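To make the doubling concrete, here's a minimal Python sketch of that betting sequence - whatever the length of the losing streak, the first win nets exactly one dollar:

```python
# Sketch: double the bet after every loss; the first win recovers all prior
# losses plus $1, regardless of how long the losing streak was.
def martingale_profit(first_win):
    """Net profit if tosses 1..first_win-1 lose and toss `first_win` wins."""
    bet, losses = 1, 0
    for _ in range(first_win - 1):
        losses += bet  # lose this round's bet
        bet *= 2       # double the next bet
    return bet - losses  # the winning toss pays a profit equal to the bet

print([martingale_profit(n) for n in range(1, 8)])  # [1, 1, 1, 1, 1, 1, 1]
```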
# Applying the Martingale Strategy
But we're all realistic people, and once you start talking about "unlimited money" eyebrows should be raised. Even still, this is an interesting strategy to investigate, and I want to apply it to the stock market. As long as we can guarantee there's a single day in which the stock goes up, we should be able to make money right? The question is just how much we have to invest to guarantee this.
Now it's time for the math. We'll use the following definitions:
- $o_i$ = the share price at the opening of day $i$
- $c_i$ = the share price at the close of day $i$
- $d_i$ = the amount of money we want to invest at the beginning of day $i$
With those definitions in place, I'd like to present the formula that is **guaranteed to make you money**. I call it *Bradlee's Investment Formula*:
$c_n \sum_{i=1}^n \frac{d_i}{o_i} > \sum_{i=1}^{n} d_i$
It might not look like much, but if you can manage to make it so that this formula holds true, you will be guaranteed to make money. The intuition behind the formula is this: The closing share price times the number of shares you have purchased ends up greater than the amount of money you invested.
That is, on day $n$, <span style='font-size: x-small'>if you know what the closing price will be</span> you can set up the amount of money you invest that day to **guarantee you make money**. I'll even teach you to figure out how much money that is! Take a look:
$
\begin{align}
c_n \sum_{i=1}^{n-1} \frac{d_i}{o_i} + \frac{c_nd_n}{o_n} &> \sum_{i=1}^{n-1}d_i + d_n\\
\frac{c_nd_n}{o_n} - d_n &> \sum_{i=1}^{n-1}(d_i - \frac{c_nd_i}{o_i})\\
d_n (\frac{c_n - o_n}{o_n}) &> \sum_{i=1}^{n-1} d_i(1 - \frac{c_n}{o_i})\\
d_n &> \frac{o_n}{c_n - o_n} \sum_{i=1}^{n-1} d_i(1 - \frac{c_n}{o_i})
\end{align}$
If you invest exactly $d_n$ that day, you'll break even. But if you can make sure the money you invest is greater than that quantity on the right <span style='font-size: x-small'>(which requires that you have a crystal ball tell you the stock's closing price)</span> you are **guaranteed to make money!**
# Interesting Implications
On a more serious note though, the formula above tells us a couple of interesting things:
1. It's impossible to make money without the closing price at some point being greater than the opening price (or vice-versa if you are short selling) - there is no amount of money you can invest that will turn things in your favor.
2. Close prices of the past aren't important if you're concerned about the bottom line. While chart technicians use price history to make judgment calls, in the end, the closing price on anything other than the last day is irrelevant.
3. It's possible to make money as long as there is a single day where the closing price is greater than the opening price! You might have to invest a lot to do so, but it's possible.
4. You must make a prediction about where the stock will close at if you want to know how much to invest. That is, we can set up our investment for the day to make money if the stock goes up 1%, but if it only goes up .5% we'll still lose money.
5. It's possible the winning move is to scale back your position. Consider the scenario:
- You invest money and the stock closes down the day .5%
- You invest tomorrow expecting the stock to go up 1%
- The winning investment to break even (assuming a 1% increase) is to scale back the position, since the shares you purchased at the beginning would then be profitable
# Running the simulation
So now that we've defined our investment formula, we need to tweak a couple of things in order to make an investment strategy we can actually work with. There are two issues we need to address:
1. The formula only tells us how much to invest if we want to break even ($d_n$). If we actually want to turn a profit, we need to invest more than that, which we will refer to as the **bias**.
2. The formula assumes we know what the closing price will be on any given day. If we don't know this, we can still invest assuming the stock price will close at a level we choose. If the price doesn't meet this objective, we try again tomorrow! This predetermined closing price will be referred to as the **expectation**.
Now that we've defined our *bias* and *expectation*, we can actually build a strategy we can simulate. Much like the martingale strategy told you to bet twice your previous bet in order to make money, we've designed a system that tells us how much to bet in order to make money as well.
Now, let's get to the code!
[1]: https://en.wikipedia.org/wiki/Gambler's_ruin
[2]: https://en.wikipedia.org/wiki/Random_walk
[3]: https://en.wikipedia.org/wiki/Martingale_%28betting_system%29
```julia
using Quandl
api_key = ""
daily_investment = function(current_open, current_close, purchase_history, open_history)
# We're not going to safeguard against divide by 0 - that's the user's responsibility
t1 = current_close / current_open - 1
t2 = sum(purchase_history - purchase_history*current_close ./ open_history)
return t2 / t1
end;
```
And let's code a way to run simulations quickly:
```julia
is_profitable = function(current_price, purchase_history, open_history)
shares = sum(purchase_history ./ open_history)
return current_price*shares > sum(purchase_history)
end
simulate = function(name, start, init, expected, bias)
ticker_info = quandlget(name, from=start, api_key=api_key)
open_vals = ticker_info["Open"].values
close_vals = ticker_info["Close"].values
invested = [init]
# The simulation stops once we've made a profit
day = 1
profitable = is_profitable(close_vals[day], invested, open_vals[1:length(invested)]) ||
is_profitable(open_vals[day+1], invested, open_vals[1:length(invested)])
while !profitable
expected_close = open_vals[day+1] * expected
todays_purchase = daily_investment(open_vals[day+1], expected_close, invested, open_vals[1:day])
invested = [invested; todays_purchase + bias]
# expected_profit = expected_close * sum(invested ./ open_vals[1:length(invested)]) - sum(invested)
day += 1
profitable = is_profitable(close_vals[day], invested, open_vals[1:length(invested)]) ||
is_profitable(open_vals[day+1], invested, open_vals[1:length(invested)])
end
shares = sum(invested ./ open_vals[1:length(invested)])
max_profit = max(close_vals[day], open_vals[day+1])
profit = shares * max_profit - sum(invested)
return (invested, profit)
end
sim_summary = function(investments, profit)
leverages = [sum(investments[1:i]) for i=1:length(investments)]
max_leverage = maximum(leverages) / investments[1]
println("Max leverage: $(max_leverage)")
println("Days invested: $(length(investments))")
println("Profit: $profit")
end;
```
Now, let's get some data and run a simulation! Our first test:
- We'll invest 100 dollars in LMT, and expect that the stock will close up 1% every day. We'll invest $d_n$ + 10 dollars every day that we haven't turned a profit, and end the simulation once we've made a profit.
```julia
investments, profit = simulate("YAHOO/LMT", Date(2015, 11, 29), 100, 1.01, 10)
sim_summary(investments, profit)
```
Max leverage: 5.590373200042106
Days invested: 5
Profit: 0.6894803101560001
The result: We need to invest 5.6x our initial position over a period of 5 days to make approximately 69&#162;
- Now let's try the same thing, but we'll assume the stock closes up 2% instead.
```julia
investments, profit = simulate("YAHOO/LMT", Date(2015, 11, 29), 100, 1.02, 10)
sim_summary(investments, profit)
```
Max leverage: 1.854949900247809
Days invested: 25
Profit: 0.08304813163696423
In this example, we only get up to a 1.85x leveraged position, but it takes 25 days to turn a profit of 8&#162;
# Summary
We've defined an investment strategy that can tell us how much to invest when we know what the closing position of a stock will be. We can tweak the strategy to actually make money, but plenty of work needs to be done so that we can optimize the money invested.
In the next post I'm going to post more information about some backtests and strategy tests on this strategy (unless of course this experiment actually produces a significant profit potential, and then I'm keeping it for myself).
# Side note and disclaimer
The claims made in this presentation about being able to guarantee making money are intended as a joke and do not constitute investment advice of any sort.

View File

@ -0,0 +1,183 @@
---
slug: 2016/02/guaranteed-money-maker
title: Guaranteed money maker
date: 2016-02-03 12:00:00
authors: [bspeice]
tags: []
---
Developing an investment strategy based on the Martingale betting strategy
If you can see into the future, that is.
<!-- truncate -->
My previous class in Stochastic Calculus covered a lot of interesting topics, and the important one for today is the [Gambler's Ruin][1] problem. If you're interested in some of the theory behind it, also make sure to check out [random walks][2]. The important bit is that we studied the [Martingale Betting Strategy][3], which describes for us a **guaranteed way** to <small>eventually</small> make money.
The strategy goes like this: You are going to toss a fair coin with a friend. If you guess heads or tails correctly, you get back double the money you bet. If you guess incorrectly, you lose money. How should you bet?
The correct answer is that you should double your bet each time you lose. Then when you finally win, you'll be guaranteed to make back everything you lost and then &#36;1 extra! Consider the scenario:
1. You bet &#36;1, and guess incorrectly. You're 1 dollar in the hole.
2. You bet &#36;2, and guess incorrectly. You're 3 dollars in the hole now.
3. You bet &#36;4, and guess incorrectly. You're 7 dollars in the hole.
4. You bet &#36;8, and guess correctly! You now get back those 8 dollars you bet, plus 8 extra for winning, for a **total profit of one dollar**!
Mathematically, we can prove that as long as you have unlimited money to bet, you are guaranteed to make money.
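Here's a minimal Python sketch of that doubling sequence - whatever the length of the losing streak, the first win nets exactly one dollar:

```python
# Sketch: double the bet after every loss; the first win recovers all prior
# losses plus $1, regardless of how long the losing streak was.
def martingale_profit(first_win):
    """Net profit if tosses 1..first_win-1 lose and toss `first_win` wins."""
    bet, losses = 1, 0
    for _ in range(first_win - 1):
        losses += bet  # lose this round's bet
        bet *= 2       # double the next bet
    return bet - losses  # the winning toss pays a profit equal to the bet

print([martingale_profit(n) for n in range(1, 8)])  # [1, 1, 1, 1, 1, 1, 1]
```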
## Applying the Martingale Strategy
But we're all realistic people, and once you start talking about "unlimited money" eyebrows should be raised. Even still, this is an interesting strategy to investigate, and I want to apply it to the stock market. As long as we can guarantee there's a single day in which the stock goes up, we should be able to make money right? The question is just how much we have to invest to guarantee this.
Now it's time for the math. We'll use the following definitions:
- $o_i$ = the share price at the opening of day $i$
- $c_i$ = the share price at the close of day $i$
- $d_i$ = the amount of money we want to invest at the beginning of day $i$
With those definitions in place, I'd like to present the formula that is **guaranteed to make you money**. I call it *Bradlee's Investment Formula*:
$c_n \sum_{i=1}^n \frac{d_i}{o_i} > \sum_{i=1}^{n} d_i$
It might not look like much, but if you can manage to make it so that this formula holds true, you will be guaranteed to make money. The intuition behind the formula is this: The closing share price times the number of shares you have purchased ends up greater than the amount of money you invested.
That is, on day $n$, <small>if you know what the closing price will be</small> you can set up the amount of money you invest that day to **guarantee you make money**. I'll even teach you to figure out how much money that is! Take a look:
$$
\begin{align*}
c_n \sum_{i=1}^{n-1} \frac{d_i}{o_i} + \frac{c_nd_n}{o_n} &> \sum_{i=1}^{n-1}d_i + d_n\\
\frac{c_nd_n}{o_n} - d_n &> \sum_{i=1}^{n-1}(d_i - \frac{c_nd_i}{o_i})\\
d_n (\frac{c_n - o_n}{o_n}) &> \sum_{i=1}^{n-1} d_i(1 - \frac{c_n}{o_i})\\
d_n &> \frac{o_n}{c_n - o_n} \sum_{i=1}^{n-1} d_i(1 - \frac{c_n}{o_i})
\end{align*}
$$
If you invest exactly $d_n$ that day, you'll break even. But if you can make sure the money you invest is greater than that quantity on the right <small>(which requires that you have a crystal ball tell you the stock's closing price)</small> you are **guaranteed to make money!**
## Interesting Implications
On a more serious note though, the formula above tells us a couple of interesting things:
1. It's impossible to make money without the closing price at some point being greater than the opening price (or vice-versa if you are short selling) - there is no amount of money you can invest that will turn things in your favor.
2. Close prices of the past aren't important if you're concerned about the bottom line. While chart technicians use price history to make judgment calls, in the end, the closing price on anything other than the last day is irrelevant.
3. It's possible to make money as long as there is a single day where the closing price is greater than the opening price! You might have to invest a lot to do so, but it's possible.
4. You must make a prediction about where the stock will close at if you want to know how much to invest. That is, we can set up our investment for the day to make money if the stock goes up 1%, but if it only goes up .5% we'll still lose money.
5. It's possible the winning move is to scale back your position. Consider the scenario:
- You invest money and the stock closes down the day .5%
- You invest tomorrow expecting the stock to go up 1%
- The winning investment to break even (assuming a 1% increase) is to scale back the position, since the shares you purchased at the beginning would then be profitable
## Running the simulation
So now that we've defined our investment formula, we need to tweak a couple of things in order to make an investment strategy we can actually work with. There are two issues we need to address:
1. The formula only tells us how much to invest if we want to break even ($d_n$). If we actually want to turn a profit, we need to invest more than that, which we will refer to as the **bias**.
2. The formula assumes we know what the closing price will be on any given day. If we don't know this, we can still invest assuming the stock price will close at a level we choose. If the price doesn't meet this objective, we try again tomorrow! This predetermined closing price will be referred to as the **expectation**.
Now that we've defined our *bias* and *expectation*, we can actually build a strategy we can simulate. Much like the martingale strategy told you to bet twice your previous bet in order to make money, we've designed a system that tells us how much to bet in order to make money as well.
Now, let's get to the code!
[1]: https://en.wikipedia.org/wiki/Gambler's_ruin
[2]: https://en.wikipedia.org/wiki/Random_walk
[3]: https://en.wikipedia.org/wiki/Martingale_%28betting_system%29
```julia
using Quandl
api_key = ""
daily_investment = function(current_open, current_close, purchase_history, open_history)
# We're not going to safeguard against divide by 0 - that's the user's responsibility
t1 = current_close / current_open - 1
t2 = sum(purchase_history - purchase_history*current_close ./ open_history)
return t2 / t1
end;
```
And let's code a way to run simulations quickly:
```julia
is_profitable = function(current_price, purchase_history, open_history)
shares = sum(purchase_history ./ open_history)
return current_price*shares > sum(purchase_history)
end
simulate = function(name, start, init, expected, bias)
ticker_info = quandlget(name, from=start, api_key=api_key)
open_vals = ticker_info["Open"].values
close_vals = ticker_info["Close"].values
invested = [init]
# The simulation stops once we've made a profit
day = 1
profitable = is_profitable(close_vals[day], invested, open_vals[1:length(invested)]) ||
is_profitable(open_vals[day+1], invested, open_vals[1:length(invested)])
while !profitable
expected_close = open_vals[day+1] * expected
todays_purchase = daily_investment(open_vals[day+1], expected_close, invested, open_vals[1:day])
invested = [invested; todays_purchase + bias]
# expected_profit = expected_close * sum(invested ./ open_vals[1:length(invested)]) - sum(invested)
day += 1
profitable = is_profitable(close_vals[day], invested, open_vals[1:length(invested)]) ||
is_profitable(open_vals[day+1], invested, open_vals[1:length(invested)])
end
shares = sum(invested ./ open_vals[1:length(invested)])
max_profit = max(close_vals[day], open_vals[day+1])
profit = shares * max_profit - sum(invested)
return (invested, profit)
end
sim_summary = function(investments, profit)
leverages = [sum(investments[1:i]) for i=1:length(investments)]
max_leverage = maximum(leverages) / investments[1]
println("Max leverage: $(max_leverage)")
println("Days invested: $(length(investments))")
println("Profit: $profit")
end;
```
Now, let's get some data and run a simulation! Our first test:
- We'll invest 100 dollars in LMT, and expect that the stock will close up 1% every day. We'll invest $d_n$ + 10 dollars every day that we haven't turned a profit, and end the simulation once we've made a profit.
```julia
investments, profit = simulate("YAHOO/LMT", Date(2015, 11, 29), 100, 1.01, 10)
sim_summary(investments, profit)
```
```
Max leverage: 5.590373200042106
Days invested: 5
Profit: 0.6894803101560001
```
The result: We need to invest 5.6x our initial position over a period of 5 days to make approximately 69&#162;
- Now let's try the same thing, but we'll assume the stock closes up 2% instead.
```julia
investments, profit = simulate("YAHOO/LMT", Date(2015, 11, 29), 100, 1.02, 10)
sim_summary(investments, profit)
```
```
Max leverage: 1.854949900247809
Days invested: 25
Profit: 0.08304813163696423
```
In this example, we only get up to a 1.85x leveraged position, but it takes 25 days to turn a profit of 8&#162;
## Summary
We've defined an investment strategy that can tell us how much to invest when we know what the closing position of a stock will be. We can tweak the strategy to actually make money, but plenty of work needs to be done so that we can optimize the money invested.
In the next post I'm going to post more information about some backtests and strategy tests on this strategy (unless of course this experiment actually produces a significant profit potential, and then I'm keeping it for myself).
### Side note and disclaimer
The claims made in this presentation about being able to guarantee making money are intended as a joke and do not constitute investment advice of any sort.

View File

@ -0,0 +1,15 @@
Title: Profitability using the Investment Formula
Date: 2016-02-26
Category: Blog
Tags: algorithmic-trading, python
Authors: Bradlee Speice
Summary: After developing a formula to guide our investing, how do we actually evaluate its performance in the real world?
[//]: <> "Modified: "
{% notebook 2016-2-26-profitability-using-the-investment-formula.ipynb %}
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,237 @@
# Profitability using the Investment Formula
I've previously talked about crafting an [Investment Formula][1] that would guarantee making money if you could predict which direction the stock market was going to go. This is going to be the first in a series of posts trying to flesh out what an actual investment strategy based on this formula would look like.
But first, the formula doesn't take into account two very important things: **leverage**, and the **number of days invested**. That's why I want to set up what I'm going to call the **Profitability Score**.
The definition is going to be very simple:
- $p$: Profit made once you exit the investment
- $i$: Initial investment into the asset
- $m$: Maximum investment in the asset
- $l = m / i$: The maximum leverage of an investment, as the ratio of maximum invested to initial investment
- $d$: The number of days it takes to turn a profit
$s = \frac{1000 p}{i(l + d)} = \frac{1000 p}{m + i\cdot d}$
Crazy, right? The score is simply the (normalized) profit you made divided by the leverage plus days invested. The $\cdot 1000$ is just to turn the number into something more reasonable - people don't like hearing something with a profitability score of .001 for example.
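As a rough worked example, take the two LMT simulations from the previous post (figures rounded): the 1% expectation run made about $0.69 of profit on a $100 initial investment, with roughly $559 invested at the peak, over 5 days; the 2% run made about $0.08 with roughly $185 invested at the peak, over 25 days. Plugging those into the same helper used in the simulation code below:

```python
# Sketch: profitability score for the two LMT runs from the previous post.
# s = 1000 * p / (m + i * d)
profitability = lambda p, i, m, d: 1000 * p / (m + i * d)

print(profitability(0.69, 100, 559, 5))   # ~0.65  (1% expectation, 5 days)
print(profitability(0.08, 100, 185, 25))  # ~0.03  (2% expectation, 25 days)
```

The first run scores much higher: it made more profit while tying up capital for far fewer days.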
# Theoretical Justification
The formula itself is designed to be simple in principle: I like making a profit, and I want to penalize the leverage you incur and days you have to invest. Ideally, we want to have a stock that goes up all the time. However, the investment formula takes advantage of a different case: trying to profit from highly volatile assets. If we can make money when the investment only has one day up, let's do it!
Even so, there are two potential issues: First, stocks that trend upward will have a higher profitability score - both leverage and days invested will be 1. To protect against only investing in this trend, I can do things like taking $\log(d)$. I don't want to start biasing the scoring function until I have a practical reason to do so, so right now I'll leave it standing.
The second issue is how to penalize leverage and days invested relative to each other. As it currently stands, a leverage of 6x with only 1 day invested scores the same as leveraging 2x with 5 days invested. In the future, I'd again want to look at making the impact of days invested smaller - I can get over a few extra days in the market if it means that I don't have to incur a highly leveraged position.
So there could be things about the scoring function we change in the future, but I want to run some actual tests before we start worrying about things like that!
# Running a simulation
This won't be an incredibly rigorous backtest; I just want to see some results from the work so far. Let's set up the simulation code again, and start looking into some random stocks. **If you've read the last blog post, you can skip over the code.** The only difference is that it's been ported to Python to make the data-wrangling easier. Julia doesn't yet support some of the multi-index things I'm trying to do.
[1]: https://bspeice.github.io/guaranteed-money-maker.html
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from Quandl import get as qget
%matplotlib inline
api_key = ''
profitability = lambda p, i, m, d: 1000*p / (m + i*d)
def is_profitable(current_price, purchase_history, open_history):
shares = (purchase_history / open_history).sum()
return current_price * shares > sum(purchase_history)
def daily_investment(current_open, current_close, purchase_history, open_history):
t1 = current_close / current_open - 1
t2 = (purchase_history - purchase_history * current_close / open_history).sum()
return t2 / t1
def simulate_day(open_vals, close_vals, init, expected, bias):
invested = np.array([init])
day = 1
profitable = is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]) \
or is_profitable(open_vals[day], invested, open_vals[0:len(invested)])
while not profitable:
expected_close = open_vals[day] * expected
todays_purchase = daily_investment(open_vals[day], expected_close, invested, open_vals[0:day])
invested = np.append(invested, todays_purchase + bias)
# expected_profit = expected_close * (invested / open_vals[0:len(invested)]).sum() - invested.sum()
day += 1
profitable = is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]) \
or is_profitable(open_vals[day], invested, open_vals[0:len(invested)])
shares = (invested / open_vals[0:len(invested)]).sum()
# Make sure we can't see into the future - we know either today's close or tomorrow's open
# will be profitable, but we need to check which one.
if is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]):
ending_price = close_vals[day-1]
else:
ending_price = open_vals[day]
profit = shares * ending_price - sum(invested)
return invested, profit
def simulate_ts(name, start, end, initial, expected, bias):
ticker_info = qget(name, trim_start=start, api_key=api_key)
evaluation_times = ticker_info[:end].index
# Handle Google vs. YFinance data
if "Adjusted Close" in ticker_info.columns:
close_column = "Adjusted Close"
else:
close_column = "Close"
sim = {d: simulate_day(ticker_info[d:]["Open"], ticker_info[d:][close_column],
100, 1.02, 10) for d in evaluation_times}
sim_series = pd.Series(sim)
result = pd.DataFrame()
result["profit"] = sim_series.apply(lambda x: x[1])
result["max"] = sim_series.apply(lambda x: max(x[0]))
result["days"] = sim_series.apply(lambda x: len(x[0]))
result["score"] = sim_series.apply(lambda x: profitability(x[1], x[0][0], max(x[0]), len(x[0])))
result["investments"] = sim_series.apply(lambda x: x[0])
return result
def simulate_tickers(tickers):
from datetime import datetime
results = {}
for ticker in tickers:
start = datetime(2015, 1, 1)
results_df = simulate_ts(ticker, start, datetime(2016, 1, 1), 100, 1.01, 10)
results[ticker] = results_df
return pd.concat(list(results.values()), keys=list(results.keys()), axis=1)
```
# And now the interesting part
Let's start looking into the data! FANG stocks have been big over the past year, let's see how they look:
```python
fang_df = simulate_tickers(["YAHOO/FB", "YAHOO/AAPL", "YAHOO/NFLX", "YAHOO/GOOG"])
```
```python
fang_df.xs('days', axis=1, level=1).hist()
plt.gcf().set_size_inches(18, 8);
plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18);
```
![png](_notebook_files/_notebook_4_0.png)
```python
fang_df.xs('score', axis=1, level=1).plot()
plt.gcf().set_size_inches(18, 6)
plt.gcf().suptitle("Profitability score over time", fontsize=18);
```
![png](_notebook_files/_notebook_5_0.png)
Let's think about these graphs. First, the histogram. What we like seeing is a lot of 1's - that means there were a lot of days that the stock went up and we didn't have to worry about actually implementing the strategy - we were able to close the trade at a profit.
Looking at the profitability score over time though is a bit more interesting. First off, stocks that are more volatile will tend to have a higher profitability score, no two ways about that. However, Netflix consistently outperformed on this metric. We know that 2015 was a good year for Netflix, so that's a (small) sign the strategy is performing as expected.
The final interesting note happens around the end of August 2015. Around this period, the markets were selling off in a big way due to issues in China (not unlike what's happening now). Even so, all of the FANG stocks saw an uptick in profitability around this time. This is another sign that the strategy being developed performs better during periods of volatility, rather than from riding markets up or down.
What about FANG vs. some cyclicals?
```python
cyclic_df = simulate_tickers(["YAHOO/X", "YAHOO/CAT", "YAHOO/NFLX", "YAHOO/GOOG"])
```
```python
cyclic_df.xs('days', axis=1, level=1).hist()
plt.gcf().set_size_inches(18, 8);
plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18);
```
![png](_notebook_files/_notebook_8_0.png)
```python
cyclic_df.xs('score', axis=1, level=1).plot()
plt.gcf().set_size_inches(18, 6)
plt.gcf().suptitle("Profitability score over time", fontsize=18);
```
![png](_notebook_files/_notebook_9_0.png)
Some more interesting results come from this as well. First off, US Steel (X) has a much smoother distribution of days until profitability - it doesn't have a huge number of values at 1 that then drop off. Intuitively, we're not terribly big fans of this - we want a stock to go up! However, on the profitability score it is the only serious contender to Netflix.
Second, we see the same trend around August - the algorithm performs well in volatile markets.
For a final test, let's try some biotech and ETFs!
```python
biotech_df = simulate_tickers(['YAHOO/REGN', 'YAHOO/CELG', 'GOOG/NASDAQ_BIB', 'GOOG/NASDAQ_IBB'])
```
```python
biotech_df.xs('days', axis=1, level=1).hist()
plt.gcf().set_size_inches(18, 8);
plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18);
```
![png](_notebook_files/_notebook_12_0.png)
```python
biotech_df.xs('score', axis=1, level=1).plot()
plt.gcf().set_size_inches(18, 6)
plt.gcf().suptitle("Profitability score over time", fontsize=18);
```
![png](_notebook_files/_notebook_13_0.png)
In this example, we don't see a whole lot of interesting things: the scores are all fairly close together with notable exceptions in late August, and mid-October.
What is interesting is that during the volatile period, the ETFs performed significantly better than the stocks did in terms of profitability. The leveraged ETF (BIB) performed far above everything else, and it appears that it is indeed most profitable during volatile periods. Even so, it was far more likely to take multiple days to give a return. Its count of 1-day investments trails the other ETF and both stocks by a decent margin.
And consider me an OCD freak, but I just really like Celgene's distribution - it looks nice and smooth.
# Summary and plans for the next post
So far I'm really enjoying playing with this strategy - there's a lot of depth here to understand, though the preliminary results seem to indicate that it profits mostly from taking the other side of a volatile trade. I'd be interested to run the analysis again later on January's data - it's been a particularly volatile start to the year, so it would be neat to see whether this strategy would work then.
For the next post, I want to start playing with some of the parameters: how do the bias and expected close influence the process? The values have been fairly conservative so far; it will be interesting to see how the simulations respond afterward.

Binary file not shown.

After

Width:  |  Height:  |  Size: 29 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 95 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 25 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 89 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 25 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 97 KiB

View File

@ -0,0 +1,214 @@
---
slug: 2016/02/profitability-using-the-investment-formula
title: Profitability using the investment formula
date: 2016-02-26 12:00:00
authors: [bspeice]
tags: []
---
After developing a formula to guide our investing, how do we actually evaluate its performance in the real world?
<!-- truncate -->
I've previously talked about crafting an [Investment Formula](../2016-02-03-guaranteed-money-maker/index.mdx) that would guarantee making money if you could predict which direction the stock market was going to go. This is going to be the first in a series of posts trying to flesh out what an actual investment strategy based on this formula would look like.
But first, the formula doesn't take into account two very important things: **leverage**, and the **number of days invested**. That's why I want to set up what I'm going to call the **Profitability Score**.
The definition is going to be very simple:
- $p$: Profit made once you exit the investment
- $i$: Initial investment into the asset
- $m$: Maximum investment in the asset
- $l = m / i$: The maximum leverage of an investment, as the ratio of maximum invested to initial investment
- $d$: The number of days it takes to turn a profit
$s = \frac{1000 p}{i(l + d)} = \frac{1000 p}{m + i\cdot d}$
Crazy, right? The score is simply the (normalized) profit you made divided by the leverage plus days invested. The factor of $1000$ is just to turn the number into something more reasonable - people don't like hearing that something has a profitability score of .001, for example.
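To make the formula concrete, here's a minimal sketch with made-up numbers (not output from any of the simulations below; the `profitability_score` helper is just for illustration), comparing a stock that closed profitably the next day against one that needed extra capital and a few more days:
```python
def profitability_score(profit, initial, max_invested, days):
    """s = 1000 * p / (i * (l + d)), with l = max_invested / initial."""
    leverage = max_invested / initial
    return 1000 * profit / (initial * (leverage + days))

# $2 profit on a $100 purchase, no extra purchases, done in one day
print(profitability_score(profit=2, initial=100, max_invested=100, days=1))  # 10.0

# The same $2 profit, but we averaged down to $400 invested over three days
print(profitability_score(profit=2, initial=100, max_invested=400, days=3))  # ~2.86
```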
## Theoretical Justification
The formula itself is designed to be simple in principle: I like making a profit, and I want to penalize the leverage you incur and days you have to invest. Ideally, we want to have a stock that goes up all the time. However, the investment formula takes advantage of a different case: trying to profit from highly volatile assets. If we can make money when the investment only has one day up, let's do it!
Even so, there are two potential issues: First, stocks that trend upward will have a higher profitability score - both leverage and days invested will be 1. To protect against only investing in this trend, I could do things like use $\log(d)$ in place of $d$. I don't want to start biasing the scoring function until I have a practical reason to do so, so for now I'll leave it as it stands.
The second issue is how to penalize leverage and days invested relative to each other. As it currently stands, a leverage of 4x with only 1 day invested is penalized the same as leveraging 2x with 3 days invested. In the future, I'd again want to look at making the impact of days invested smaller - I can get over an extra 3 days in the market if it means that I don't have to incur a highly leveraged position.
So there could be things about the scoring function we change in the future, but I want to run some actual tests before we start worrying about things like that!
## Running a simulation
This won't be an incredibly rigorous backtest; I just want to see some results from the work so far. Let's set up the simulation code again and start looking into some random stocks. **If you've read the last blog post, you can skip over the code.** The only difference is that it's been ported to Python to make the data-wrangling easier, since Julia doesn't yet support some of the multi-index things I'm trying to do.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from Quandl import get as qget
%matplotlib inline
api_key = ''
profitability = lambda p, i, m, d: 1000*p / (m + i*d)
def is_profitable(current_price, purchase_history, open_history):
shares = (purchase_history / open_history).sum()
return current_price * shares > sum(purchase_history)
def daily_investment(current_open, current_close, purchase_history, open_history):
t1 = current_close / current_open - 1
t2 = (purchase_history - purchase_history * current_close / open_history).sum()
return t2 / t1
def simulate_day(open_vals, close_vals, init, expected, bias):
invested = np.array([init])
day = 1
profitable = is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]) \
or is_profitable(open_vals[day], invested, open_vals[0:len(invested)])
while not profitable:
expected_close = open_vals[day] * expected
todays_purchase = daily_investment(open_vals[day], expected_close, invested, open_vals[0:day])
invested = np.append(invested, todays_purchase + bias)
# expected_profit = expected_close * (invested / open_vals[0:len(invested)]).sum() - invested.sum()
day += 1
profitable = is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]) \
or is_profitable(open_vals[day], invested, open_vals[0:len(invested)])
shares = (invested / open_vals[0:len(invested)]).sum()
# Make sure we can't see into the future - we know either today's close or tomorrow's open
# will be profitable, but we need to check which one.
if is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]):
ending_price = close_vals[day-1]
else:
ending_price = open_vals[day]
profit = shares * ending_price - sum(invested)
return invested, profit
def simulate_ts(name, start, end, initial, expected, bias):
ticker_info = qget(name, trim_start=start, api_key=api_key)
evaluation_times = ticker_info[:end].index
# Handle Google vs. YFinance data
if "Adjusted Close" in ticker_info.columns:
close_column = "Adjusted Close"
else:
close_column = "Close"
sim = {d: simulate_day(ticker_info[d:]["Open"], ticker_info[d:][close_column],
initial, expected, bias) for d in evaluation_times}
sim_series = pd.Series(sim)
result = pd.DataFrame()
result["profit"] = sim_series.apply(lambda x: x[1])
result["max"] = sim_series.apply(lambda x: max(x[0]))
result["days"] = sim_series.apply(lambda x: len(x[0]))
result["score"] = sim_series.apply(lambda x: profitability(x[1], x[0][0], max(x[0]), len(x[0])))
result["investments"] = sim_series.apply(lambda x: x[0])
return result
def simulate_tickers(tickers):
from datetime import datetime
results = {}
for ticker in tickers:
start = datetime(2015, 1, 1)
results_df = simulate_ts(ticker, start, datetime(2016, 1, 1), 100, 1.01, 10)
results[ticker] = results_df
return pd.concat(list(results.values()), keys=list(results.keys()), axis=1)
```
## And now the interesting part
Let's start looking into the data! FANG stocks have been big over the past year, let's see how they look:
```python
fang_df = simulate_tickers(["YAHOO/FB", "YAHOO/AAPL", "YAHOO/NFLX", "YAHOO/GOOG"])
```
```python
fang_df.xs('days', axis=1, level=1).hist()
plt.gcf().set_size_inches(18, 8);
plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18);
```
![png](_notebook_files/_notebook_4_0.png)
```python
fang_df.xs('score', axis=1, level=1).plot()
plt.gcf().set_size_inches(18, 6)
plt.gcf().suptitle("Profitability score over time", fontsize=18);
```
![png](_notebook_files/_notebook_5_0.png)
Let's think about these graphs. First, the histogram. What we like seeing is a lot of 1's - that means there were a lot of days that the stock went up and we didn't have to worry about actually implementing the strategy - we were able to close the trade at a profit.
Looking at the profitability score over time though is a bit more interesting. First off, stocks that are more volatile will tend to have a higher profitability score, no two ways about that. However, Netflix consistently outperformed on this metric. We know that 2015 was a good year for Netflix, so that's a (small) sign the strategy is performing as expected.
The final interesting note happens around the end of August 2015. Around this period, the markets were selling off in a big way due to issues in China (not unlike what's happening now). Even so, all of the FANG stocks saw an uptick in profitability around this time. This is another sign that the strategy being developed performs better during periods of volatility, rather than from riding markets up or down.
What about FANG vs. some cyclicals?
```python
cyclic_df = simulate_tickers(["YAHOO/X", "YAHOO/CAT", "YAHOO/NFLX", "YAHOO/GOOG"])
```
```python
cyclic_df.xs('days', axis=1, level=1).hist()
plt.gcf().set_size_inches(18, 8);
plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18);
```
![png](_notebook_files/_notebook_8_0.png)
```python
cyclic_df.xs('score', axis=1, level=1).plot()
plt.gcf().set_size_inches(18, 6)
plt.gcf().suptitle("Profitability score over time", fontsize=18);
```
![png](_notebook_files/_notebook_9_0.png)
Some more interesting results come from this as well. First off, US Steel (X) has a much smoother distribution of days until profitability - it doesn't have a huge spike at 1 that then drops off. Intuitively, we're not big fans of this; we want a stock to go up! However, on the profitability score it is the only serious contender to Netflix.
Second, we see the same trend around August - the algorithm performs well in volatile markets.
For a final test, let's try some biotech and ETFs!
```python
biotech_df = simulate_tickers(['YAHOO/REGN', 'YAHOO/CELG', 'GOOG/NASDAQ_BIB', 'GOOG/NASDAQ_IBB'])
```
```python
biotech_df.xs('days', axis=1, level=1).hist()
plt.gcf().set_size_inches(18, 8);
plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18);
```
![png](_notebook_files/_notebook_12_0.png)
```python
biotech_df.xs('score', axis=1, level=1).plot()
plt.gcf().set_size_inches(18, 6)
plt.gcf().suptitle("Profitability score over time", fontsize=18);
```
![png](_notebook_files/_notebook_13_0.png)
In this example, we don't see a whole lot of interesting things: the scores are all fairly close together, with notable exceptions in late August and mid-October.
What is interesting is that during the volatile period, the ETFs performed significantly better than the stocks did in terms of profitability. The leveraged ETF (BIB) performed far above everything else, and it appears that it is indeed most profitable during volatile periods. Even so, it was far more likely to take multiple days to give a return. Its count of 1-day investments trails the other ETF and both stocks by a decent margin.
And consider me an OCD freak, but I just really like Celgene's distribution - it looks nice and smooth.
## Summary and plans for the next post
So far I'm really enjoying playing with this strategy - there's a lot of depth here to understand, though the preliminary results seem to indicate that it profits mostly from taking the other side of a volatile trade. I'd also be interested to run it on data from January - it's been a particularly volatile start to the year, so it would be neat to see whether this strategy would work then.
For the next post, I want to start playing with some of the parameters: How do the bias and expected close influence the process? The values have been fairly conservative so far, so it will be interesting to see how the simulations respond as they change.

View File

@ -0,0 +1,16 @@
Title: Predicting Santander Customer Happiness
Date: 2016-03-05
Category: Blog
Tags: machine learning, data science, kaggle
Authors: Bradlee Speice
Summary: My first real-world data challenge: predicting whether a bank's customers will be happy.
[//]: <> "Modified: "
{% notebook 2016-3-5-predicting-santander-customer-happiness.ipynb %}
<script type="text/x-mathjax-config">
# MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
// MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,269 @@
### My first Kaggle competition
It's time! After embarking on a Machine Learning class this semester, and with a Saturday in which I don't have much planned, I wanted to put this class and training to work. It's my first competition submission. I want to walk you guys through how I'm approaching this problem, because I thought it would be really neat. The competition is Banco Santander's [Santander Customer Satisfaction][1] competition. It seemed like an easy enough problem that I could actually make decent progress on it.
# Data Exploration
First up: we need to load our data and do some exploratory work. Because we're going to be using this data for model selection prior to testing, we need to make a further split. I've already gone ahead and done this work; please see the code in the [appendix below](#Appendix).
[1]: https://www.kaggle.com/c/santander-customer-satisfaction
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Record how long it takes to run the notebook - I'm curious.
from datetime import datetime
start = datetime.now()
dataset = pd.read_csv('split_train.csv')
dataset.index = dataset.ID
X = dataset.drop(['TARGET', 'ID', 'ID.1'], 1)
y = dataset.TARGET
```
```python
y.unique()
```
array([0, 1], dtype=int64)
```python
len(X.columns)
```
369
Okay, so there are only [two classes we're predicting][2]: 1 for unsatisfied customers, 0 for satisfied customers. I would have preferred this to be something more like a regression, or predicting multiple classes: maybe the customer isn't the most happy, but is nowhere near closing their accounts. For now though, that's just the data we're working with.
Now, I'd like to make a scatter matrix of everything going on. Unfortunately as noted above, we have 369 different features. There's no way I can graphically make sense of that much data to start with.
We're also not told what the data actually represents: Are these survey results? Average time between contact with a customer care person? Frequency of contacting a customer care person? The idea is that I need to reduce the number of dimensions we're predicting across.
## Dimensionality Reduction pt. 1 - Binary Classifiers
My first attempt to reduce the data dimensionality is to find all the binary classifiers in the dataset \(i.e. 0 or 1 values\) and see if any of those are good \(or anti-good\) predictors of the final data.
[2]: https://www.kaggle.com/c/santander-customer-satisfaction/data
```python
cols = X.columns
b_class = []
for c in cols:
if len(X[c].unique()) == 2:
b_class.append(c)
len(b_class)
```
111
So there are 111 features in the dataset that are a binary label. Let's see if any of them are good at predicting the user's satisfaction!
```python
# First we need to `binarize` the data to 0-1; some of the labels are {0, 1},
# some are {0, 3}, etc.
from sklearn.preprocessing import binarize
X_bin = binarize(X[b_class])
accuracy = [np.mean(X_bin[:,i] == y) for i in range(0, len(b_class))]
acc_df = pd.DataFrame({"Accuracy": accuracy}, index=b_class)
acc_df.describe()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>111.000000</td>
</tr>
<tr>
<th>mean</th>
<td>0.905159</td>
</tr>
<tr>
<th>std</th>
<td>0.180602</td>
</tr>
<tr>
<th>min</th>
<td>0.043598</td>
</tr>
<tr>
<th>25%</th>
<td>0.937329</td>
</tr>
<tr>
<th>50%</th>
<td>0.959372</td>
</tr>
<tr>
<th>75%</th>
<td>0.960837</td>
</tr>
<tr>
<th>max</th>
<td>0.960837</td>
</tr>
</tbody>
</table>
</div>
Wow! Looks like we've got some incredibly predictive features! So much so that we should be a bit concerned. My initial guess for what's happening is that we have a sparsity issue: so many of the values are 0, and these likely happen to line up with satisfied customers.
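One quick way to check that hunch - this wasn't part of the original analysis, just a sanity check you could run against the same `X_bin` and `y`:
```python
# If most of the binarized values are 0, a column full of zeros "predicts"
# satisfied customers almost by accident - compare the two fractions.
print("Fraction of zeros in X_bin: {:.3f}".format(np.mean(X_bin == 0)))
print("Fraction of satisfied customers: {:.3f}".format(np.mean(y == 0)))
```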
So the question we must now answer, which I likely should have asked long before now: What exactly is the distribution of un/satisfied customers?
```python
unsat = y[y == 1].count()
print("Satisfied customers: {}; Unsatisfied customers: {}".format(len(y) - unsat, unsat))
naive_guess = np.mean(y == np.zeros(len(y)))
print("Naive guess accuracy: {}".format(naive_guess))
```
Satisfied customers: 51131; Unsatisfied customers: 2083
Naive guess accuracy: 0.9608561656706882
This is a bit discouraging. A naive guess of "always satisfied" performs as well as our best individual binary classifier. What this tells me, then, is that these data columns aren't incredibly helpful in prediction. I'd be interested in a polynomial expansion of this dataset, but for now, that's more computation than I want to take on.
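For reference, that expansion would look something like the sketch below, using scikit-learn's `PolynomialFeatures` - I'm not running it in this post, which is exactly the point: at degree 2 the 111 binary columns alone would become 6,216 columns (the originals plus every pairwise product).
```python
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 interaction terms between the binary columns; the column count
# grows quadratically, which is where the extra computation comes from.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X[b_class])
print(X_poly.shape)
```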
# Dimensionality Reduction pt. 2 - LDA
Knowing that our naive guess performs so well is a blessing and a curse:
- Curse: The threshold for performance is incredibly high: We can only "improve" over the naive guess by 4%
- Blessing: All the binary classification features we just discovered are worthless on their own. We can throw them out and reduce the data dimensionality from 369 to 111.
Now, in removing these features from the dataset, I'm not saying that there is no "information" contained within them. There might be. But the only way we'd know is through a polynomial expansion, and I'm not going to take that on within this post.
My initial thought for a "next guess" is to use the [LDA][3] model for dimensionality reduction. However, it can only reduce the data to $p - 1$ dimensions, with $p$ being the number of classes. Since this is a binary classification, every LDA model that I try will have dimensionality one; when I actually try this, the predictor ends up being slightly less accurate than the naive guess.
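For reference, a sketch (not from the original notebook) of what that attempt might look like, using the `sklearn.discriminant_analysis` import path from newer scikit-learn releases:
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

# With two classes, LDA can project onto at most one discriminant axis
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)

gnb_lda = GaussianNB()
gnb_lda.fit(X_lda, y)
print("Accuracy on the 1-D LDA projection: {:.4f}".format(gnb_lda.score(X_lda, y)))
```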
Instead, let's take a different approach to dimensionality reduction: [principal components analysis][4]. This allows us to perform the dimensionality reduction without worrying about the number of classes. Then, we'll use a [Gaussian Naive Bayes][5] model to actually do the prediction. This model is chosen simply because it doesn't take a long time to fit and compute; since PCA will take so long, I just want a prediction at the end of this. We can worry about using a more sophisticated LDA/QDA/SVM model later.
Now into the actual process: We're going to test out PCA dimensionality reduction from 1 to 20 dimensions, and then predict using a Gaussian Naive Bayes model. The upper limit of 20 dimensions was selected because the accuracy never improves after you get beyond that \(I found this out by running it myself\). Hopefully, we'll find that we can create a model better than the naive guess.
[3]:http://scikit-learn.org/stable/modules/lda_qda.html
[4]:http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
[5]:http://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes
```python
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
X_no_bin = X.drop(b_class, 1)
def evaluate_gnb(dims):
pca = PCA(n_components=dims)
X_xform = pca.fit_transform(X_no_bin)
gnb = GaussianNB()
gnb.fit(X_xform, y)
return gnb.score(X_xform, y)
dim_range = np.arange(1, 21)
plt.plot(dim_range, [evaluate_gnb(dim) for dim in dim_range], label="Gaussian NB Accuracy")
plt.axhline(naive_guess, label="Naive Guess", c='k')
plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k')
plt.gcf().set_size_inches(12, 6)
plt.legend();
```
![png](_notebook_files/_notebook_11_0.png)
\*\*sigh...\*\* After all the effort and computational power, we're still at square one: we have yet to beat the naive guess threshold. With PCA in play we end up performing terribly, but not so terribly that we could do better by just guessing against ourselves.
Let's try one last-ditch attempt using the entire data set:
```python
def evaluate_gnb_full(dims):
pca = PCA(n_components=dims)
X_xform = pca.fit_transform(X)
gnb = GaussianNB()
gnb.fit(X_xform, y)
return gnb.score(X_xform, y)
dim_range = np.arange(1, 21)
plt.plot(dim_range, [evaluate_gnb_full(dim) for dim in dim_range], label="Gaussian NB Accuracy")
plt.axhline(naive_guess, label="Naive Guess", c='k')
plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k')
plt.gcf().set_size_inches(12, 6)
plt.legend();
```
![png](_notebook_files/_notebook_13_0.png)
Nothing. It is interesting to note that the graphs are almost exactly the same: This would imply again that the variables we removed earlier (all the binary classifiers) indeed have almost no predictive power. It seems this problem is high-dimensional, but with almost no data that can actually inform our decisions.
# Summary for Day 1
After spending a couple hours with this dataset, there seems to be a fundamental issue in play: We have very high-dimensional data, and it has no bearing on our ability to actually predict customer satisfaction. This can be a huge issue: it implies that **no matter what model we use, we fundamentally can't perform well.** I'm sure most of this is because I'm not an experienced data scientist. Even so, we have yet to develop a strategy that can actually beat out the village idiot; **so far, the bank is best off just assuming all its customers are satisfied.** Hopefully more to come soon.
```python
end = datetime.now()
print("Running time: {}".format(end - start))
```
Running time: 0:00:58.715714
# Appendix
Code used to split the initial training data:
```python
from sklearn.cross_validation import train_test_split
data = pd.read_csv('train.csv')
data.index = data.ID
data_train, data_validate = train_test_split(
data, train_size=.7)
data_train.to_csv('split_train.csv')
data_validate.to_csv('split_validate.csv')
```


View File

@ -0,0 +1,256 @@
---
slug: 2016/03/predicting-santander-customer-happiness
title: Predicting Santander customer happiness
date: 2016-03-05 12:00:00
authors: [bspeice]
tags: []
---
My first Kaggle competition.
<!-- truncate -->
It's time! After embarking on a Machine Learning class this semester, and with a Saturday in which I don't have much planned, I wanted to put this class and training to work. It's my first competition submission. I want to walk you guys through how I'm approaching this problem, because I thought it would be really neat. The competition is Banco Santander's [Santander Customer Satisfaction][1] competition. It seemed like an easy enough problem that I could actually make decent progress on it.
## Data Exploration
First up: we need to load our data and do some exploratory work. Because we're going to be using this data for model selection prior to testing, we need to make a further split. I've already gone ahead and done this work; please see the code in the [appendix below](#appendix).
[1]: https://www.kaggle.com/c/santander-customer-satisfaction
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Record how long it takes to run the notebook - I'm curious.
from datetime import datetime
start = datetime.now()
dataset = pd.read_csv('split_train.csv')
dataset.index = dataset.ID
X = dataset.drop(['TARGET', 'ID', 'ID.1'], 1)
y = dataset.TARGET
```
```python
y.unique()
```
```
array([0, 1], dtype=int64)
```
```python
len(X.columns)
```
```
369
```
Okay, so there are only [two classes we're predicting][2]: 1 for unsatisfied customers, 0 for satisfied customers. I would have preferred this to be something more like a regression, or predicting multiple classes: maybe the customer isn't the most happy, but is nowhere near closing their accounts. For now though, that's just the data we're working with.
Now, I'd like to make a scatter matrix of everything going on. Unfortunately as noted above, we have 369 different features. There's no way I can graphically make sense of that much data to start with.
We're also not told what the data actually represents: Are these survey results? Average time between contact with a customer care person? Frequency of contacting a customer care person? The idea is that I need to reduce the number of dimensions we're predicting across.
### Dimensionality Reduction pt. 1 - Binary Classifiers
My first attempt to reduce the data dimensionality is to find all the binary classifiers in the dataset \(i.e. 0 or 1 values\) and see if any of those are good \(or anti-good\) predictors of the final data.
[2]: https://www.kaggle.com/c/santander-customer-satisfaction/data
```python
cols = X.columns
b_class = []
for c in cols:
if len(X[c].unique()) == 2:
b_class.append(c)
len(b_class)
```
```
111
```
So there are 111 features in the dataset that are a binary label. Let's see if any of them are good at predicting the user's satisfaction!
```python
# First we need to `binarize` the data to 0-1; some of the labels are {0, 1},
# some are {0, 3}, etc.
from sklearn.preprocessing import binarize
X_bin = binarize(X[b_class])
accuracy = [np.mean(X_bin[:,i] == y) for i in range(0, len(b_class))]
acc_df = pd.DataFrame({"Accuracy": accuracy}, index=b_class)
acc_df.describe()
```
<div>
<table>
<thead>
<tr>
<th></th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>111.000000</td>
</tr>
<tr>
<th>mean</th>
<td>0.905159</td>
</tr>
<tr>
<th>std</th>
<td>0.180602</td>
</tr>
<tr>
<th>min</th>
<td>0.043598</td>
</tr>
<tr>
<th>25%</th>
<td>0.937329</td>
</tr>
<tr>
<th>50%</th>
<td>0.959372</td>
</tr>
<tr>
<th>75%</th>
<td>0.960837</td>
</tr>
<tr>
<th>max</th>
<td>0.960837</td>
</tr>
</tbody>
</table>
</div>
Wow! Looks like we've got some incredibly predictive features! So much so that we should be a bit concerned. My initial guess for what's happening is that we have a sparsity issue: so many of the values are 0, and these likely happen to line up with satisfied customers.
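One quick way to check that hunch - this wasn't part of the original analysis, just a sanity check you could run against the same `X_bin` and `y`:
```python
# If most of the binarized values are 0, a column full of zeros "predicts"
# satisfied customers almost by accident - compare the two fractions.
print("Fraction of zeros in X_bin: {:.3f}".format(np.mean(X_bin == 0)))
print("Fraction of satisfied customers: {:.3f}".format(np.mean(y == 0)))
```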
So the question we must now answer, which I likely should have asked long before now: What exactly is the distribution of un/satisfied customers?
```python
unsat = y[y == 1].count()
print("Satisfied customers: {}; Unsatisfied customers: {}".format(len(y) - unsat, unsat))
naive_guess = np.mean(y == np.zeros(len(y)))
print("Naive guess accuracy: {}".format(naive_guess))
```
```
Satisfied customers: 51131; Unsatisfied customers: 2083
Naive guess accuracy: 0.9608561656706882
```
This is a bit discouraging. A naive guess of "always satisfied" performs as well as our best individual binary classifier. What this tells me, then, is that these data columns aren't incredibly helpful in prediction. I'd be interested in a polynomial expansion of this dataset, but for now, that's more computation than I want to take on.
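For reference, that expansion would look something like the sketch below, using scikit-learn's `PolynomialFeatures` - I'm not running it in this post, which is exactly the point: at degree 2 the 111 binary columns alone would become 6,216 columns (the originals plus every pairwise product).
```python
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 interaction terms between the binary columns; the column count
# grows quadratically, which is where the extra computation comes from.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X[b_class])
print(X_poly.shape)
```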
### Dimensionality Reduction pt. 2 - LDA
Knowing that our naive guess performs so well is a blessing and a curse:
- Curse: The threshold for performance is incredibly high: We can only "improve" over the naive guess by 4%
- Blessing: All the binary classification features we just discovered are worthless on their own. We can throw them out and reduce the data dimensionality from 369 to 111.
Now, in removing these features from the dataset, I'm not saying that there is no "information" contained within them. There might be. But the only way we'd know is through a polynomial expansion, and I'm not going to take that on within this post.
My initial thought for a "next guess" is to use the [LDA][3] model for dimensionality reduction. However, it can only reduce the data to $p - 1$ dimensions, with $p$ being the number of classes. Since this is a binary classification, every LDA model that I try will have dimensionality one; when I actually try this, the predictor ends up being slightly less accurate than the naive guess.
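For reference, a sketch (not from the original notebook) of what that attempt might look like, using the `sklearn.discriminant_analysis` import path from newer scikit-learn releases:
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

# With two classes, LDA can project onto at most one discriminant axis
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)

gnb_lda = GaussianNB()
gnb_lda.fit(X_lda, y)
print("Accuracy on the 1-D LDA projection: {:.4f}".format(gnb_lda.score(X_lda, y)))
```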
Instead, let's take a different approach to dimensionality reduction: [principal components analysis][4]. This allows us to perform the dimensionality reduction without worrying about the number of classes. Then, we'll use a [Gaussian Naive Bayes][5] model to actually do the prediction. This model is chosen simply because it doesn't take a long time to fit and compute; since PCA will take so long, I just want a prediction at the end of this. We can worry about using a more sophisticated LDA/QDA/SVM model later.
Now into the actual process: We're going to test out PCA dimensionality reduction from 1 to 20 dimensions, and then predict using a Gaussian Naive Bayes model. The upper limit of 20 dimensions was selected because the accuracy never improves after you get beyond that \(I found this out by running it myself\). Hopefully, we'll find that we can create a model better than the naive guess.
[3]:http://scikit-learn.org/stable/modules/lda_qda.html
[4]:http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
[5]:http://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes
```python
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
X_no_bin = X.drop(b_class, 1)
def evaluate_gnb(dims):
pca = PCA(n_components=dims)
X_xform = pca.fit_transform(X_no_bin)
gnb = GaussianNB()
gnb.fit(X_xform, y)
return gnb.score(X_xform, y)
dim_range = np.arange(1, 21)
plt.plot(dim_range, [evaluate_gnb(dim) for dim in dim_range], label="Gaussian NB Accuracy")
plt.axhline(naive_guess, label="Naive Guess", c='k')
plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k')
plt.gcf().set_size_inches(12, 6)
plt.legend();
```
![png](_notebook_files/_notebook_11_0.png)
**sigh...** After all the effort and computational power, we're still at square one: we have yet to beat the naive guess threshold. With PCA in play we end up performing terribly, but not so terribly that we could do better by just guessing against ourselves.
Let's try one last-ditch attempt using the entire data set:
```python
def evaluate_gnb_full(dims):
pca = PCA(n_components=dims)
X_xform = pca.fit_transform(X)
gnb = GaussianNB()
gnb.fit(X_xform, y)
return gnb.score(X_xform, y)
dim_range = np.arange(1, 21)
plt.plot(dim_range, [evaluate_gnb_full(dim) for dim in dim_range], label="Gaussian NB Accuracy")
plt.axhline(naive_guess, label="Naive Guess", c='k')
plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k')
plt.gcf().set_size_inches(12, 6)
plt.legend();
```
![png](_notebook_files/_notebook_13_0.png)
Nothing. It is interesting to note that the graphs are almost exactly the same: This would imply again that the variables we removed earlier (all the binary classifiers) indeed have almost no predictive power. It seems this problem is high-dimensional, but with almost no data that can actually inform our decisions.
## Summary for Day 1
After spending a couple hours with this dataset, there seems to be a fundamental issue in play: We have very high-dimensional data, and it has no bearing on our ability to actually predict customer satisfaction. This can be a huge issue: it implies that **no matter what model we use, we fundamentally can't perform well.** I'm sure most of this is because I'm not an experienced data scientist. Even so, we have yet to develop a strategy that can actually beat out the village idiot; **so far, the bank is best off just assuming all its customers are satisfied.** Hopefully more to come soon.
```python
end = datetime.now()
print("Running time: {}".format(end - start))
```
```
Running time: 0:00:58.715714
```
## Appendix
Code used to split the initial training data:
```python
from sklearn.cross_validation import train_test_split
data = pd.read_csv('train.csv')
data.index = data.ID
data_train, data_validate = train_test_split(
data, train_size=.7)
data_train.to_csv('split_train.csv')
data_validate.to_csv('split_validate.csv')
```

View File

@ -0,0 +1,14 @@
Title: Tweet Like Me
Date: 2016-03-28
Category: Blog
Tags: twitter, MCMC
Authors: Bradlee Speice
Summary: In which I try to create a robot that will tweet like I tweet.
[//]: <> "Modified: "
{% notebook 2016-3-28-tweet-like-me.ipynb %}
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,295 @@
An experiment in creating a robot that will imitate me on Twitter.
---
So, I'm taking a Machine Learning course this semester in school, and one of the topics we keep coming back to is natural language processing and the 'bag of words' data structure. That is, given a sentence:
`How much wood would a woodchuck chuck if a woodchuck could chuck wood?`
We can represent that sentence as the following list:
`{
How: 1
much: 1
wood: 2
would: 1
a: 2
woodchuck: 2
chuck: 2
if: 1
could: 1
}`
Ignoring *where* the words appeared, we're just interested in how *often* they occurred. That got me thinking: I wonder what would happen if I built a robot that just imitated how often I said things? It's dangerous territory when computer scientists ask "what if," but I got curious enough that I wanted to follow through.
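In Python terms that structure is just a word-count dictionary; here's a minimal sketch (not part of the tweet analysis below) using only the standard library:
```python
from collections import Counter

sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood"
bag_of_words = Counter(sentence.lower().split())
print(bag_of_words["wood"], bag_of_words["woodchuck"], bag_of_words["would"])  # 2 2 1
```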
## The Objective
Given an input list of Tweets, build up the following things:
1. The distribution of starting words; since there are no "prior" words to go from, we need to treat this as a special case.
2. The distribution of words given a previous word; for example, every time I use the word `woodchuck` in the example sentence, there is a 50% chance it is followed by `chuck` and a 50% chance it is followed by `could`. I need this distribution for all words.
3. The distribution of quantity of hashtags; Do I most often use just one? Two? Do they follow something like a Poisson distribution?
4. Distribution of hashtags; Given a number of hashtags, what is the actual content? I'll treat hashtags as separate from the content of a tweet.
## The Data
I'm using as input my tweet history. I don't really use Twitter anymore, but it seems like a fun use of the dataset. I'd like to eventually build this to a point where I can imitate anyone on Twitter using their last 100 tweets or so, but I'll start with this as example code.
## The Algorithm
I'll be using the [NLTK](http://www.nltk.org/) library for doing a lot of the heavy lifting. First, let's import the data:
```python
import pandas as pd
tweets = pd.read_csv('tweets.csv')
text = tweets.text
# Don't include tweets in reply to or mentioning people
replies = text.str.contains('@')
text_norep = text.loc[~replies]
```
And now that we've got data, let's start crunching. First, tokenize and build out the distribution of first word:
```python
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
tokens = text_norep.map(tknzr.tokenize)
first_words = tokens.map(lambda x: x[0])
first_words_alpha = first_words[first_words.str.isalpha()]
first_word_dist = first_words_alpha.value_counts() / len(first_words_alpha)
```
Next, we need to build out the conditional distributions. That is, what is the probability of the next word given the current word is $X$? This one is a bit more involved. First, find all unique words, and then find what words follow them. This can probably be done in a more efficient manner than I'm currently doing here, but we'll ignore that for the moment.
```python
from functools import reduce
# Get all possible words
all_words = reduce(lambda x, y: x+y, tokens, [])
unique_words = set(all_words)
actual_words = set([x if x[0] != '.' else None for x in unique_words])
word_dist = {}
for word in iter(actual_words):
indices = [i for i, j in enumerate(all_words) if j == word]
proceeding = [all_words[i+1] for i in indices]
word_dist[word] = proceeding
```
Now that we've got the tweet analysis done, it's time for the fun part: hashtags! Let's count how many hashtags are in each tweet; I want to get a sense of the distribution.
```python
import matplotlib.pyplot as plt
%matplotlib inline
hashtags = text_norep.str.count('#')
bins = hashtags.unique().max()
hashtags.plot(kind='hist', bins=bins)
```
<matplotlib.axes._subplots.AxesSubplot at 0x18e59dc28d0>
![png](_notebook_files/_notebook_7_1.png)
That looks like a Poisson distribution, kind of as I expected. I'm guessing my number of hashtags per tweet is $\sim Poi(1)$, but let's actually find the [maximum likelihood estimator](https://en.wikipedia.org/wiki/Poisson_distribution#Maximum_likelihood), which in this case is just the sample mean $\hat{\lambda} = \bar{x}$:
```python
mle = hashtags.mean()
mle
```
0.870236869207003
Pretty close! So we can now simulate how many hashtags are in a tweet. Let's also find what hashtags are actually used:
```python
hashtags = [x for x in all_words if x[0] == '#']
n_hashtags = len(hashtags)
unique_hashtags = list(set([x for x in unique_words if x[0] == '#']))
hashtag_dist = pd.DataFrame({'hashtags': unique_hashtags,
'prob': [all_words.count(h) / n_hashtags
for h in unique_hashtags]})
len(hashtag_dist)
```
603
Turns out I have used 603 different hashtags during my time on Twitter. That means I was using a unique hashtag for about every third tweet.
In better news though, we now have all the data we need to go about actually constructing tweets! The process will happen in a few steps:
1. Randomly select what the first word will be.
2. Randomly select the number of hashtags for this tweet, and then select the actual hashtags.
3. Fill in the remaining space of 140 characters with random words taken from my tweets.
And hopefully, we won't have anything too crazy come out the other end. The way we do the selection follows a [Multinomial Distribution](https://en.wikipedia.org/wiki/Multinomial_distribution): given a lot of different values with specific probability, pick one. Let's give a quick example:
```
x: .33
y: .5
z: .17
```
That is, I pick `x` with probability 33%, `y` with probability 50%, and so on. In context of our sentence construction, I've built out the probabilities of specific words already - now I just need to simulate that distribution. Time for the engine to actually be developed!
```python
import numpy as np
def multinom_sim(n, vals, probs):
occurrences = np.random.multinomial(n, probs)
results = occurrences * vals
return ' '.join(results[results != ''])
def sim_n_hashtags(hashtag_freq):
return np.random.poisson(hashtag_freq)
def sim_hashtags(n, hashtag_dist):
return multinom_sim(n, hashtag_dist.hashtags, hashtag_dist.prob)
def sim_first_word(first_word_dist):
probs = np.float64(first_word_dist.values)
return multinom_sim(1, first_word_dist.reset_index()['index'], probs)
def sim_next_word(current, word_dist):
dist = pd.Series(word_dist[current])
probs = np.ones(len(dist)) / len(dist)
return multinom_sim(1, dist, probs)
```
## Pulling it all together
I've now built out all the code I need to actually simulate a sentence written by me. Let's try doing an example with five words and a single hashtag:
```python
first = sim_first_word(first_word_dist)
second = sim_next_word(first, word_dist)
third = sim_next_word(second, word_dist)
fourth = sim_next_word(third, word_dist)
fifth = sim_next_word(fourth, word_dist)
hashtag = sim_hashtags(1, hashtag_dist)
' '.join((first, second, third, fourth, fifth, hashtag))
```
'My first all-nighter of friends #oldschool'
Let's go ahead and put everything together! We're going to simulate a first word, simulate the hashtags, and then keep simulating words to fill the remaining space until we've either used up all the characters or reached a period.
```python
def simulate_tweet():
chars_remaining = 140
first = sim_first_word(first_word_dist)
n_hash = sim_n_hashtags(mle)
hashtags = sim_hashtags(n_hash, hashtag_dist)
chars_remaining -= len(first) + len(hashtags)
tweet = first
current = first
while chars_remaining > len(tweet) + len(hashtags) and current[0] != '.' and current[0] != '!':
current = sim_next_word(current, word_dist)
tweet += ' ' + current
tweet = tweet[:-2] + tweet[-1]  # drop the space before the final punctuation mark
return ' '.join((tweet, hashtags)).strip()
```
## The results
And now for something completely different: twenty random tweets dreamed up by my computer and my Twitter data. Here you go:
```python
for i in range(0, 20):
print(simulate_tweet())
print()
```
Also , I'm at 8 this morning. #thursdaysgohard #ornot
Turns out of us breathe the code will want to my undergraduate career is becoming more night trying ? Religion is now as a chane #HYPE
You know what recursion is to review the UNCC. #ornot
There are really sore 3 bonfires in my first writing the library ground floor if awesome. #realtalk #impressed
So we can make it out there's nothing but I'm not let us so hot I could think I may be good. #SwingDance
Happy Christmas , at Harris Teeter to be be godly or Roman Catholic ). #4b392b#4b392b #Isaiah26
For context , I in the most decisive factor of the same for homework. #accomplishment
Freaking done. #loveyouall
New blog post : Don't jump in a quiz in with a knife fight. #haskell #earlybirthday
God shows me legitimately want to get some food and one day.
Stormed the queen city. #mindblown
The day of a cold at least outside right before the semester ..
Finished with the way back. #winners
Waking up , OJ , I feel like Nick Jonas today.
First draft of so hard drive. #humansvszombies
Eric Whitacre is the wise creation.
Ethics paper first , music in close to everyone who just be posting up with my sin , and Jerry Springr #TheLittleThings
Love that you know enough time I've eaten at 8 PM. #deepthoughts #stillblownaway
Lead. #ThinkingTooMuch #Christmas
Aamazing conference when you married #DepartmentOfRedundancyDepartment Yep , but there's a legitimate challenge.
...Which all ended up being a whole lot more nonsensical than I had hoped for. There are some good ones, so I'll call that an accomplishment! I was banking on grammar not being an issue: since my tweets use impeccable grammar, the program modeled on them should have pretty good grammar as well. There are going to be some hilarious edge cases (I'm looking at you, `Ethics paper first, music in close to everyone`) that make no sense, and others (`Waking up, OJ, I feel like Nick Jonas today`) that make me feel like I should have a Twitter rap career. On the whole though, the structure came out alright.
## Moving on from here
During class we also talked about an interesting idea: trying to analyze corporate documents and corporate speech. I'd be interested to know what this kind of analysis could do when applied to something like a couple of bank press releases. In any case, the code needs some work to clean it up before I get that far.
## For further reading
I'm pretty confident I re-invented a couple wheels along the way - what I'm doing feels a lot like what [Markov Chain Monte Carlo](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) is intended to do. But I've never worked explicitly with that before, so more research is needed.


View File

@ -0,0 +1,291 @@
---
slug: 2016/03/tweet-like-me
title: Tweet like me
date: 2016-03-28 12:00:00
authors: [bspeice]
tags: []
---
In which I try to create a robot that will tweet like I tweet.
<!-- truncate -->
So, I'm taking a Machine Learning course this semester in school, and one of the topics we keep coming back to is natural language processing and the 'bag of words' data structure. That is, given a sentence:
`How much wood would a woodchuck chuck if a woodchuck could chuck wood?`
We can represent that sentence as the following list:
`{
How: 1
much: 1
wood: 2
would: 1
a: 2
woodchuck: 2
chuck: 2
if: 1
could: 1
}`
Ignoring *where* the words appeared, we're just interested in how *often* they occurred. That got me thinking: I wonder what would happen if I built a robot that just imitated how often I said things? It's dangerous territory when computer scientists ask "what if," but I got curious enough that I wanted to follow through.
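In Python terms that structure is just a word-count dictionary; here's a minimal sketch (not part of the tweet analysis below) using only the standard library:
```python
from collections import Counter

sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood"
bag_of_words = Counter(sentence.lower().split())
print(bag_of_words["wood"], bag_of_words["woodchuck"], bag_of_words["would"])  # 2 2 1
```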
## The Objective
Given an input list of Tweets, build up the following things:
1. The distribution of starting words; since there are no "prior" words to go from, we need to treat this as a special case.
2. The distribution of words given a previous word; for example, every time I use the word `woodchuck` in the example sentence, there is a 50% chance it is followed by `chuck` and a 50% chance it is followed by `could`. I need this distribution for all words.
3. The distribution of quantity of hashtags; Do I most often use just one? Two? Do they follow something like a Poisson distribution?
4. Distribution of hashtags; Given a number of hashtags, what is the actual content? I'll treat hashtags as separate from the content of a tweet.
## The Data
I'm using as input my tweet history. I don't really use Twitter anymore, but it seems like a fun use of the dataset. I'd like to eventually build this to a point where I can imitate anyone on Twitter using their last 100 tweets or so, but I'll start with this as example code.
## The Algorithm
I'll be using the [NLTK](http://www.nltk.org/) library for doing a lot of the heavy lifting. First, let's import the data:
```python
import pandas as pd
tweets = pd.read_csv('tweets.csv')
text = tweets.text
# Don't include tweets in reply to or mentioning people
replies = text.str.contains('@')
text_norep = text.loc[~replies]
```
And now that we've got data, let's start crunching. First, tokenize and build out the distribution of first word:
```python
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
tokens = text_norep.map(tknzr.tokenize)
first_words = tokens.map(lambda x: x[0])
first_words_alpha = first_words[first_words.str.isalpha()]
first_word_dist = first_words_alpha.value_counts() / len(first_words_alpha)
```
Next, we need to build out the conditional distributions. That is, what is the probability of the next word given the current word is $X$? This one is a bit more involved. First, find all unique words, and then find what words follow them. This can probably be done in a more efficient manner than I'm currently doing here, but we'll ignore that for the moment.
```python
from functools import reduce
# Get all possible words
all_words = reduce(lambda x, y: x+y, tokens, [])
unique_words = set(all_words)
actual_words = set([x if x[0] != '.' else None for x in unique_words])
word_dist = {}
for word in iter(actual_words):
indices = [i for i, j in enumerate(all_words) if j == word]
proceeding = [all_words[i+1] for i in indices]
word_dist[word] = proceeding
```
Now that we've got the tweet analysis done, it's time for the fun part: hashtags! Let's count how many hashtags are in each tweet; I want to get a sense of the distribution.
```python
import matplotlib.pyplot as plt
%matplotlib inline
hashtags = text_norep.str.count('#')
bins = hashtags.unique().max()
hashtags.plot(kind='hist', bins=bins)
```
```
<matplotlib.axes._subplots.AxesSubplot at 0x18e59dc28d0>
```
![png](_notebook_files/_notebook_7_1.png)
That looks like a Poisson distribution, kind of as I expected. I'm guessing my number of hashtags per tweet is $\sim Poi(1)$, but let's actually find the [maximum likelihood estimator](https://en.wikipedia.org/wiki/Poisson_distribution#Maximum_likelihood), which in this case is just the sample mean $\hat{\lambda} = \bar{x}$:
```python
mle = hashtags.mean()
mle
```
```
0.870236869207003
```
Pretty close! So we can now simulate how many hashtags are in a tweet. Let's also find what hashtags are actually used:
```python
hashtags = [x for x in all_words if x[0] == '#']
n_hashtags = len(hashtags)
unique_hashtags = list(set([x for x in unique_words if x[0] == '#']))
hashtag_dist = pd.DataFrame({'hashtags': unique_hashtags,
'prob': [all_words.count(h) / n_hashtags
for h in unique_hashtags]})
len(hashtag_dist)
```
```
603
```
Turns out I have used 603 different hashtags during my time on Twitter. That means I was using a unique hashtag for about every third tweet.
In better news though, we now have all the data we need to go about actually constructing tweets! The process will happen in a few steps:
1. Randomly select what the first word will be.
2. Randomly select the number of hashtags for this tweet, and then select the actual hashtags.
3. Fill in the remaining space of 140 characters with random words taken from my tweets.
And hopefully, we won't have anything too crazy come out the other end. The way we do the selection follows a [Multinomial Distribution](https://en.wikipedia.org/wiki/Multinomial_distribution): given a lot of different values with specific probability, pick one. Let's give a quick example:
```
x: .33
y: .5
z: .17
```
That is, I pick `x` with probability 33%, `y` with probability 50%, and so on. In context of our sentence construction, I've built out the probabilities of specific words already - now I just need to simulate that distribution. Time for the engine to actually be developed!
```python
import numpy as np
def multinom_sim(n, vals, probs):
occurrences = np.random.multinomial(n, probs)
results = occurrences * vals
return ' '.join(results[results != ''])
def sim_n_hashtags(hashtag_freq):
return np.random.poisson(hashtag_freq)
def sim_hashtags(n, hashtag_dist):
return multinom_sim(n, hashtag_dist.hashtags, hashtag_dist.prob)
def sim_first_word(first_word_dist):
probs = np.float64(first_word_dist.values)
return multinom_sim(1, first_word_dist.reset_index()['index'], probs)
def sim_next_word(current, word_dist):
dist = pd.Series(word_dist[current])
probs = np.ones(len(dist)) / len(dist)
return multinom_sim(1, dist, probs)
```
## Pulling it all together
I've now built out all the code I need to actually simulate a sentence written by me. Let's try doing an example with five words and a single hashtag:
```python
first = sim_first_word(first_word_dist)
second = sim_next_word(first, word_dist)
third = sim_next_word(second, word_dist)
fourth = sim_next_word(third, word_dist)
fifth = sim_next_word(fourth, word_dist)
hashtag = sim_hashtags(1, hashtag_dist)
' '.join((first, second, third, fourth, fifth, hashtag))
```
```
'My first all-nighter of friends #oldschool'
```
Let's go ahead and put everything together! We're going to simulate a first word, simulate the hashtags, and then keep simulating words to fill the remaining space until we've either used up all the characters or reached a period.
```python
def simulate_tweet():
chars_remaining = 140
first = sim_first_word(first_word_dist)
n_hash = sim_n_hashtags(mle)
hashtags = sim_hashtags(n_hash, hashtag_dist)
chars_remaining -= len(first) + len(hashtags)
tweet = first
current = first
while chars_remaining > len(tweet) + len(hashtags) and current[0] != '.' and current[0] != '!':
current = sim_next_word(current, word_dist)
tweet += ' ' + current
tweet = tweet[:-2] + tweet[-1]  # drop the space before the final punctuation mark
return ' '.join((tweet, hashtags)).strip()
```
## The results
And now for something completely different: twenty random tweets dreamed up by my computer and my Twitter data. Here you go:
```python
for i in range(0, 20):
print(simulate_tweet())
print()
```
```
Also , I'm at 8 this morning. #thursdaysgohard #ornot
Turns out of us breathe the code will want to my undergraduate career is becoming more night trying ? Religion is now as a chane #HYPE
You know what recursion is to review the UNCC. #ornot
There are really sore 3 bonfires in my first writing the library ground floor if awesome. #realtalk #impressed
So we can make it out there's nothing but I'm not let us so hot I could think I may be good. #SwingDance
Happy Christmas , at Harris Teeter to be be godly or Roman Catholic ). #4b392b#4b392b #Isaiah26
For context , I in the most decisive factor of the same for homework. #accomplishment
Freaking done. #loveyouall
New blog post : Don't jump in a quiz in with a knife fight. #haskell #earlybirthday
God shows me legitimately want to get some food and one day.
Stormed the queen city. #mindblown
The day of a cold at least outside right before the semester ..
Finished with the way back. #winners
Waking up , OJ , I feel like Nick Jonas today.
First draft of so hard drive. #humansvszombies
Eric Whitacre is the wise creation.
Ethics paper first , music in close to everyone who just be posting up with my sin , and Jerry Springr #TheLittleThings
Love that you know enough time I've eaten at 8 PM. #deepthoughts #stillblownaway
Lead. #ThinkingTooMuch #Christmas
Aamazing conference when you married #DepartmentOfRedundancyDepartment Yep , but there's a legitimate challenge.
```
...Which all ended up being a whole lot more nonsensical than I had hoped for. There are some good ones, so I'll call that an accomplishment! I was banking on grammar not being an issue: since my tweets use impeccable grammar, the program modeled on them should have pretty good grammar as well. There are going to be some hilarious edge cases (I'm looking at you, `Ethics paper first, music in close to everyone`) that make no sense, and others (`Waking up, OJ, I feel like Nick Jonas today`) that make me feel like I should have a Twitter rap career. On the whole though, the structure came out alright.
## Moving on from here
During class we also talked about an interesting idea: trying to analyze corporate documents and corporate speech. I'd be interested to know what this kind of analysis could do when applied to something like a couple of bank press releases. In any case, the code needs some work to clean it up before I get that far.
## For further reading
I'm pretty confident I re-invented a couple wheels along the way - what I'm doing feels a lot like what [Markov Chain Monte Carlo](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) is intended to do. But I've never worked explicitly with that before, so more research is needed.

View File

@ -0,0 +1,15 @@
Title: Tick Tock...
Date: 2016-04-06
Category: Blog
Tags: fitbit, heartrate
Authors: Bradlee Speice
Summary: In which I try to predict the rest of my life using 3 months' worth of data.
[//]: <> "Modified: "
{% notebook 2016-4-6-tick-tock....ipynb %}
<script type="text/x-mathjax-config">
//MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
MathJax.Hub.Config({tex2jax: {inlineMath: [['\$','\$']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

View File

@ -0,0 +1,490 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If all we have is a finite number of heartbeats left, what about me?\n",
"\n",
"---\n",
"\n",
"Warning: this one is a bit creepier. But that's what you get when you come up with data science ideas as you're drifting off to sleep.\n",
"\n",
"# 2.5 Billion\n",
"\n",
"If [PBS][1] is right, that's the total number of heartbeats we get. Approximately once every second that number goes down, and down, and down again...\n",
"[1]: http://www.pbs.org/wgbh/nova/heart/heartfacts.html\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"total_heartbeats = 2500000000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I got a Fitbit this past Christmas season, mostly because I was interested in the data and trying to work on some data science projects with it. This is going to be the first project, but there will likely be more (and not nearly as morbid). My idea was: If this is the final number that I'm running up against, how far have I come, and how far am I likely to go? I've currently had about 3 months' time to estimate what my data will look like, so let's go ahead and see: given a lifetime 2.5 billion heart beats, how much time do I have left?\n",
"\n",
"# Statistical Considerations\n",
"\n",
"Since I'm starting to work with health data, there are a few considerations I think are important before I start digging through my data.\n",
"\n",
"1. The concept of 2.5 billion as an agreed-upon number is tenuous at best. I've seen anywhere from [2.21 billion][2] to [3.4 billion][3] so even if I knew exactly how many times my heart had beaten so far, the ending result is suspect at best. I'm using 2.5 billion because that seems to be about the midpoint of the estimates I've seen so far.\n",
"2. Most of the numbers I've seen so far are based on extrapolating number of heart beats from life expectancy. As life expectancy goes up, the number of expected heart beats goes up too.\n",
"3. My estimation of the number of heartbeats in my life so far is based on 3 months worth of data, and I'm extrapolating an entire lifetime based on this.\n",
"\n",
"So while the ending number is **not useful in any medical context**, it is still an interesting project to work with the data I have on hand.\n",
"\n",
"# Getting the data\n",
"\n",
"[Fitbit](https://www.fitbit.com/) has an [API available](https://dev.fitbit.com/) for people to pull their personal data off the system. It requires registering an application, authentication with OAuth, and some other complicated things. **If you're not interested in how I fetch the data, skip [here](#Wild-Extrapolations-from-Small-Data)**.\n",
"\n",
"## Registering an application\n",
"\n",
"I've already [registered a personal application](https://dev.fitbit.com/apps/new) with Fitbit, so I can go ahead and retrieve things like the client secret from a file.\n",
"\n",
"[1]: http://www.pbs.org/wgbh/nova/heart/heartfacts.html\n",
"[2]: http://gizmodo.com/5982977/how-many-heartbeats-does-each-species-get-in-a-lifetime\n",
"[3]: http://wonderopolis.org/wonder/how-many-times-does-your-heart-beat-in-a-lifetime/"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Import all the OAuth secret information from a local file\n",
"from secrets import CLIENT_SECRET, CLIENT_ID, CALLBACK_URL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Handling OAuth 2\n",
"\n",
"So, all the people that know what OAuth 2 is know what's coming next. For those who don't: OAuth is how people allow applications to access other data without having to know your password. Essentially the dialog goes like this:\n",
"\n",
"```\n",
"Application: I've got a user here who wants to use my application, but I need their data.\n",
"Fitbit: OK, what data do you need access to, and for how long?\n",
"Application: I need all of these scopes, and for this amount of time.\n",
"Fitbit: OK, let me check with the user to make sure they really want to do this.\n",
"\n",
"Fitbit: User, do you really want to let this application have your data?\n",
"User: I do! And to prove it, here's my password.\n",
"Fitbit: OK, everything checks out. I'll let the application access your data.\n",
"\n",
"Fitbit: Application, you can access the user's data. Use this special value whenever you need to request data from me.\n",
"Application: Thank you, now give me all the data.\n",
"```\n",
"\n",
"Effectively, this allows an application to gain access to a user's data without ever needing to know the user's password. That way, even if the other application is hacked, the user's original data remains safe. Plus, the user can let the data service know to stop providing the application access any time they want. All in all, very secure.\n",
"\n",
"It does make handling small requests a bit challenging, but I'll go through the steps here. We'll be using the [Implicit Grant](https://dev.fitbit.com/docs/oauth2/) workflow, as it requires fewer steps in processing.\n",
"\n",
"First, we need to set up the URL the user would visit to authenticate:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import urllib\n",
"\n",
"FITBIT_URI = 'https://www.fitbit.com/oauth2/authorize'\n",
"params = {\n",
" # If we need more than one scope, must be a CSV string\n",
" 'scope': 'heartrate',\n",
" 'response_type': 'token',\n",
" 'expires_in': 86400, # 1 day\n",
" 'redirect_uri': CALLBACK_URL,\n",
" 'client_id': CLIENT_ID\n",
"}\n",
"\n",
"request_url = FITBIT_URI + '?' + urllib.parse.urlencode(params)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, here you would print out the request URL, go visit it, and get the full URL that it sends you back to. Because that is very sensitive information (specifically containing my `CLIENT_ID` that I'd really rather not share on the internet), I've skipped that step in the code here, but it happens in the background."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# The `response_url` variable contains the full URL that\n",
"# FitBit sent back to us, but most importantly,\n",
"# contains the token we need for authorization.\n",
"access_token = dict(urllib.parse.parse_qsl(response_url))['access_token']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Requesting the data\n",
"\n",
"Now that we've actually set up our access via the `access_token`, it's time to get the actual [heart rate data](https://dev.fitbit.com/docs/heart-rate/). I'll be using data from January 1, 2016 through March 31, 2016, and extrapolating wildly from that.\n",
"\n",
"Fitbit only lets us fetch intraday data one day at a time, so I'll create a date range using pandas and iterate through that to pull down all the data."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Heartbeats from 2016-01-01 00:00:00 to 2016-03-31 23:59:00: 8139060\n"
]
}
],
"source": [
"from requests_oauthlib import OAuth2Session\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"\n",
"session = OAuth2Session(token={\n",
" 'access_token': access_token,\n",
" 'token_type': 'Bearer'\n",
" })\n",
"\n",
"format_str = '%Y-%m-%d'\n",
"start_date = datetime(2016, 1, 1)\n",
"end_date = datetime(2016, 3, 31)\n",
"dr = pd.date_range(start_date, end_date)\n",
"\n",
"url = 'https://api.fitbit.com/1/user/-/activities/heart/date/{0}/1d/1min.json'\n",
"hr_responses = [session.get(url.format(d.strftime(format_str))) for d in dr]\n",
"\n",
"def record_to_df(record):\n",
" if 'activities-heart' not in record:\n",
" return None\n",
" date_str = record['activities-heart'][0]['dateTime']\n",
" df = pd.DataFrame(record['activities-heart-intraday']['dataset'])\n",
" \n",
" df.index = df['time'].apply(\n",
" lambda x: datetime.strptime(date_str + ' ' + x, '%Y-%m-%d %H:%M:%S'))\n",
" return df\n",
"\n",
"hr_dataframes = [record_to_df(record.json()) for record in hr_responses]\n",
"hr_df_concat = pd.concat(hr_dataframes)\n",
"\n",
"\n",
"# There are some minutes with missing data, so we need to correct that\n",
"full_daterange = pd.date_range(hr_df_concat.index[0],\n",
" hr_df_concat.index[-1],\n",
" freq='min')\n",
"hr_df_full = hr_df_concat.reindex(full_daterange, method='nearest')\n",
"\n",
"print(\"Heartbeats from {} to {}: {}\".format(hr_df_full.index[0],\n",
" hr_df_full.index[-1],\n",
" hr_df_full['value'].sum()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we've retrieved all the available heart rate data for January 1<sup>st</sup> through March 31<sup>st</sup>! Let's get to the actual analysis.\n",
"\n",
"# Wild Extrapolations from Small Data\n",
"\n",
"A fundamental issue of this data is that it's pretty small. I'm using 3 months of data to make predictions about my entire life. But, purely as an exercise, I'll move forward.\n",
"\n",
"## How many heartbeats so far?\n",
"\n",
"The first step is figuring out how many of the 2.5 billion heartbeats I've used so far. We're going to try and work backward from the present day to when I was born to get that number. The easy part comes first: going back to January 1<sup>st</sup>, 1992. That's because I can generalize how many 3-month increments there were between now and then, account for leap years, and call that section done.\n",
"\n",
"Between January 1992 and January 2016 there were 96 quarters, and 6 leap days. The number we're looking for is:\n",
"\n",
"\\begin{equation}\n",
"hr_q \\cdot n - hr_d \\cdot (n-m)\n",
"\\end{equation}\n",
"\n",
"- $hr_q$: Number of heartbeats per quarter\n",
"- $hr_d$: Number of heartbeats on leap day\n",
"- $n$: Number of quarters, in this case 96\n",
"- $m$: Number of leap days, in this case 6"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"773609400"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quarterly_count = hr_df_full['value'].sum()\n",
"leap_day_count = hr_df_full[(hr_df_full.index.month == 2) &\n",
" (hr_df_full.index.day == 29)]['value'].sum()\n",
"num_quarters = 96\n",
"leap_days = 6\n",
"\n",
"jan_92_jan_16 = quarterly_count * num_quarters - leap_day_count * (num_quarters - leap_days)\n",
"jan_92_jan_16"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So between January 1992 and January 2016 I've used $\\approx$ 774 million heartbeats. Now, I need to go back to my exact birthday. I'm going to first find on average how many heartbeats I use in a minute, and multiply that by the number of minutes between my birthday and January 1992.\n",
"\n",
"For privacy purposes I'll put the code here that I'm using, but without any identifying information:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Heartbeats so far: 775804660\n",
"Remaining heartbeats: 1724195340\n"
]
}
],
"source": [
"minute_mean = hr_df_full['value'].mean()\n",
"# Don't you wish you knew?\n",
"# birthday_minutes = ???\n",
"\n",
"birthday_heartbeats = birthday_minutes * minute_mean\n",
"\n",
"heartbeats_until_2016 = int(birthday_heartbeats + jan_92_jan_16)\n",
"remaining_2016 = total_heartbeats - heartbeats_until_2016\n",
"\n",
"print(\"Heartbeats so far: {}\".format(heartbeats_until_2016))\n",
"print(\"Remaining heartbeats: {}\".format(remaining_2016))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It would appear that my heart has beaten 775,804,660 times between my moment of birth and January 1<sup>st</sup> 2016, and that I have 1.72 billion left.\n",
"\n",
"## How many heartbeats longer?\n",
"\n",
"Now comes the tricky bit. I know how many heart beats I've used so far, and how many I have remaining, so I'd like to come up with a (relatively) accurate estimate of when exactly my heart should give out. We'll do this in a few steps, increasing in granularity.\n",
"\n",
"First step, how many heartbeats do I use in a 4-year period? I have data for a single quarter including leap day, so I want to know:\n",
"\n",
"\\begin{equation}\n",
"hr_q \\cdot n - hr_d \\cdot (n - m)\n",
"\\end{equation}\n",
"\n",
"- $hr_q$: Heartbeats per quarter\n",
"- $hr_d$: Heartbeats per leap day\n",
"- $n$: Number of quarters = 16\n",
"- $m$: Number of leap days = 1"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"128934900"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"heartbeats_4year = quarterly_count * 16 - leap_day_count * (16 - 1)\n",
"heartbeats_4year"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, I can fast forward from 2016 the number of periods of 4 years I have left."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Four year periods remaining: 13\n",
"Remaining heartbeats after 4 year periods: 48041640\n"
]
}
],
"source": [
"four_year_periods = remaining_2016 // heartbeats_4year\n",
"remaining_4y = remaining_2016 - four_year_periods * heartbeats_4year\n",
"\n",
"print(\"Four year periods remaining: {}\".format(four_year_periods))\n",
"print(\"Remaining heartbeats after 4 year periods: {}\".format(remaining_4y))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given that there are 13 four-year periods left, I can move from 2016 all the way to 2068, and find that I will have 48 million heart beats left. Let's drop down to figuring out how many quarters that is. I know that 2068 will have a leap day (unless someone finally decides to get rid of them), so I'll subtract that out first. Then, I'm left to figure out how many quarters exactly are left."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Quarters left starting 2068: 8\n",
"Remaining heartbeats after that: 4760716\n"
]
}
],
"source": [
"remaining_leap = remaining_4y - leap_day_count\n",
"# Ignore leap day in the data set\n",
"heartbeats_quarter = hr_df_full[(hr_df_full.index.month != 2) &\n",
" (hr_df_full.index.day != 29)]['value'].sum()\n",
"quarters_left = remaining_leap // heartbeats_quarter\n",
"remaining_year = remaining_leap - quarters_left * heartbeats_quarter\n",
"\n",
"print(\"Quarters left starting 2068: {}\".format(quarters_left))\n",
"print(\"Remaining heartbeats after that: {}\".format(remaining_year))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, that analysis gets me through until January 1<sup>st</sup> 2070. Final step, using that minute estimate to figure out how many minutes past that I'm predicted to have:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"datetime.datetime(2070, 2, 23, 5, 28)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from datetime import timedelta\n",
"\n",
"base = datetime(2070, 1, 1)\n",
"minutes_left = remaining_year // minute_mean\n",
"\n",
"kaput = timedelta(minutes=minutes_left)\n",
"base + kaput"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"According to this, I've got until February 23<sup>rd</sup>, 2070 at 5:28 PM in the evening before my heart gives out.\n",
"\n",
"# Summary\n",
"\n",
"Well, that's kind of a creepy date to know. As I said at the top though, **this number is totally useless in any medical context**. It ignores the rate at which we continue to get better at making people live longer, and is extrapolating from 3 months' worth of data the rest of my life. Additionally, throughout my time developing this post I made many minor mistakes. I think they're all fixed now, but it's easy to mix a number up here or there and the analysis gets thrown off by a couple years.\n",
"\n",
"Even still, I think philosophically humans have a desire to know how much time we have left in the world. [Man is but a breath](https://www.biblegateway.com/passage/?search=psalm+144&version=ESV), and it's scary to think just how quickly that date may be coming up. This analysis asks an important question though: what are you going to do with the time you have left?\n",
"\n",
"Thanks for sticking with me on this one, I promise it will be much less depressing next time!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@ -0,0 +1,307 @@
If all we have is a finite number of heartbeats left, what about me?
---
Warning: this one is a bit creepier. But that's what you get when you come up with data science ideas as you're drifting off to sleep.
# 2.5 Billion
If [PBS][1] is right, that's the total number of heartbeats we get. Approximately once every second that number goes down, and down, and down again...
[1]: http://www.pbs.org/wgbh/nova/heart/heartfacts.html
```python
total_heartbeats = 2500000000
```
I got a Fitbit this past Christmas season, mostly because I was interested in the data and trying to work on some data science projects with it. This is going to be the first project, but there will likely be more (and not nearly as morbid). My idea was: If this is the final number that I'm running up against, how far have I come, and how far am I likely to go? I've currently had about 3 months' time to estimate what my data will look like, so let's go ahead and see: given a lifetime of 2.5 billion heartbeats, how much time do I have left?
# Statistical Considerations
Since I'm starting to work with health data, there are a few considerations I think are important before I start digging through my data.
1. The concept of 2.5 billion as an agreed-upon number is tenuous at best. I've seen anywhere from [2.21 billion][2] to [3.4 billion][3], so even if I knew exactly how many times my heart had beaten so far, the ending result would still be suspect. I'm using 2.5 billion because that seems to be about the midpoint of the estimates I've seen so far.
2. Most of the numbers I've seen so far are based on extrapolating the number of heartbeats from life expectancy. As life expectancy goes up, the number of expected heartbeats goes up too.
3. My estimation of the number of heartbeats in my life so far is based on 3 months' worth of data, and I'm extrapolating an entire lifetime from it.
So while the ending number is **not useful in any medical context**, it is still an interesting project to work with the data I have on hand.
# Getting the data
[Fitbit](https://www.fitbit.com/) has an [API available](https://dev.fitbit.com/) for people to pull their personal data off the system. It requires registering an application, authentication with OAuth, and some other complicated things. **If you're not interested in how I fetch the data, skip [here](#Wild-Extrapolations-from-Small-Data)**.
## Registering an application
I've already [registered a personal application](https://dev.fitbit.com/apps/new) with Fitbit, so I can go ahead and retrieve things like the client secret from a file.
[1]: http://www.pbs.org/wgbh/nova/heart/heartfacts.html
[2]: http://gizmodo.com/5982977/how-many-heartbeats-does-each-species-get-in-a-lifetime
[3]: http://wonderopolis.org/wonder/how-many-times-does-your-heart-beat-in-a-lifetime/
```python
# Import all the OAuth secret information from a local file
from secrets import CLIENT_SECRET, CLIENT_ID, CALLBACK_URL
```
## Handling OAuth 2
So, all the people that know what OAuth 2 is know what's coming next. For those who don't: OAuth is how people allow applications to access other data without having to know your password. Essentially the dialog goes like this:
```
Application: I've got a user here who wants to use my application, but I need their data.
Fitbit: OK, what data do you need access to, and for how long?
Application: I need all of these scopes, and for this amount of time.
Fitbit: OK, let me check with the user to make sure they really want to do this.
Fitbit: User, do you really want to let this application have your data?
User: I do! And to prove it, here's my password.
Fitbit: OK, everything checks out. I'll let the application access your data.
Fitbit: Application, you can access the user's data. Use this special value whenever you need to request data from me.
Application: Thank you, now give me all the data.
```
Effectively, this allows an application to gain access to a user's data without ever needing to know the user's password. That way, even if the other application is hacked, the user's original data remains safe. Plus, the user can let the data service know to stop providing the application access any time they want. All in all, very secure.
It does make handling small requests a bit challenging, but I'll go through the steps here. We'll be using the [Implicit Grant](https://dev.fitbit.com/docs/oauth2/) workflow, as it requires fewer steps in processing.
First, we need to set up the URL the user would visit to authenticate:
```python
import urllib
FITBIT_URI = 'https://www.fitbit.com/oauth2/authorize'
params = {
# If we need more than one scope, must be a CSV string
'scope': 'heartrate',
'response_type': 'token',
'expires_in': 86400, # 1 day
'redirect_uri': CALLBACK_URL,
'client_id': CLIENT_ID
}
request_url = FITBIT_URI + '?' + urllib.parse.urlencode(params)
```
Now, here you would print out the request URL, go visit it, and get the full URL that it sends you back to. Because that is very sensitive information (specifically containing my `CLIENT_ID` that I'd really rather not share on the internet), I've skipped that step in the code here, but it happens in the background.
```python
# The `response_url` variable contains the full URL that
# FitBit sent back to us, but most importantly,
# contains the token we need for authorization.
access_token = dict(urllib.parse.parse_qsl(response_url))['access_token']
```
## Requesting the data
Now that we've actually set up our access via the `access_token`, it's time to get the actual [heart rate data](https://dev.fitbit.com/docs/heart-rate/). I'll be using data from January 1, 2016 through March 31, 2016, and extrapolating wildly from that.
Fitbit only lets us fetch intraday data one day at a time, so I'll create a date range using pandas and iterate through that to pull down all the data.
```python
from requests_oauthlib import OAuth2Session
import pandas as pd
from datetime import datetime
session = OAuth2Session(token={
'access_token': access_token,
'token_type': 'Bearer'
})
format_str = '%Y-%m-%d'
start_date = datetime(2016, 1, 1)
end_date = datetime(2016, 3, 31)
dr = pd.date_range(start_date, end_date)
url = 'https://api.fitbit.com/1/user/-/activities/heart/date/{0}/1d/1min.json'
hr_responses = [session.get(url.format(d.strftime(format_str))) for d in dr]
def record_to_df(record):
if 'activities-heart' not in record:
return None
date_str = record['activities-heart'][0]['dateTime']
df = pd.DataFrame(record['activities-heart-intraday']['dataset'])
df.index = df['time'].apply(
lambda x: datetime.strptime(date_str + ' ' + x, '%Y-%m-%d %H:%M:%S'))
return df
hr_dataframes = [record_to_df(record.json()) for record in hr_responses]
hr_df_concat = pd.concat(hr_dataframes)
# There are some minutes with missing data, so we need to correct that
full_daterange = pd.date_range(hr_df_concat.index[0],
hr_df_concat.index[-1],
freq='min')
hr_df_full = hr_df_concat.reindex(full_daterange, method='nearest')
print("Heartbeats from {} to {}: {}".format(hr_df_full.index[0],
hr_df_full.index[-1],
hr_df_full['value'].sum()))
```
Heartbeats from 2016-01-01 00:00:00 to 2016-03-31 23:59:00: 8139060
And now we've retrieved all the available heart rate data for January 1<sup>st</sup> through March 31<sup>st</sup>! Let's get to the actual analysis.
# Wild Extrapolations from Small Data
A fundamental issue of this data is that it's pretty small. I'm using 3 months of data to make predictions about my entire life. But, purely as an exercise, I'll move forward.
## How many heartbeats so far?
The first step is figuring out how many of the 2.5 billion heartbeats I've used so far. We're going to try and work backward from the present day to when I was born to get that number. The easy part comes first: going back to January 1<sup>st</sup>, 1992. That's because I can generalize how many 3-month increments there were between now and then, account for leap years, and call that section done.
Between January 1992 and January 2016 there were 96 quarters, and 6 leap days. The number we're looking for is:
\begin{equation}
hr_q \cdot n - hr_d \cdot (n-m)
\end{equation}
- $hr_q$: Number of heartbeats per quarter
- $hr_d$: Number of heartbeats on leap day
- $n$: Number of quarters, in this case 96
- $m$: Number of leap days, in this case 6
```python
quarterly_count = hr_df_full['value'].sum()
leap_day_count = hr_df_full[(hr_df_full.index.month == 2) &
(hr_df_full.index.day == 29)]['value'].sum()
num_quarters = 96
leap_days = 6
jan_92_jan_16 = quarterly_count * num_quarters - leap_day_count * (num_quarters - leap_days)
jan_92_jan_16
```
773609400
So between January 1992 and January 2016 I've used $\approx$ 774 million heartbeats. Now, I need to go back to my exact birthday. I'm going to first find on average how many heartbeats I use in a minute, and multiply that by the number of minutes between my birthday and January 1992.
For privacy purposes I'll put the code here that I'm using, but without any identifying information:
```python
minute_mean = hr_df_full['value'].mean()
# Don't you wish you knew?
# birthday_minutes = ???
birthday_heartbeats = birthday_minutes * minute_mean
heartbeats_until_2016 = int(birthday_heartbeats + jan_92_jan_16)
remaining_2016 = total_heartbeats - heartbeats_until_2016
print("Heartbeats so far: {}".format(heartbeats_until_2016))
print("Remaining heartbeats: {}".format(remaining_2016))
```
Heartbeats so far: 775804660
Remaining heartbeats: 1724195340
It would appear that my heart has beaten 775,804,660 times between my moment of birth and January 1<sup>st</sup> 2016, and that I have 1.72 billion left.
## How many heartbeats longer?
Now comes the tricky bit. I know how many heart beats I've used so far, and how many I have remaining, so I'd like to come up with a (relatively) accurate estimate of when exactly my heart should give out. We'll do this in a few steps, increasing in granularity.
First step, how many heartbeats do I use in a 4-year period? I have data for a single quarter including leap day, so I want to know:
\begin{equation}
hr_q \cdot n - hr_d \cdot (n - m)
\end{equation}
- $hr_q$: Heartbeats per quarter
- $hr_d$: Heartbeats per leap day
- $n$: Number of quarters = 16
- $m$: Number of leap days = 1
```python
heartbeats_4year = quarterly_count * 16 - leap_day_count * (16 - 1)
heartbeats_4year
```
128934900
Now, I can fast forward from 2016 the number of periods of 4 years I have left.
```python
four_year_periods = remaining_2016 // heartbeats_4year
remaining_4y = remaining_2016 - four_year_periods * heartbeats_4year
print("Four year periods remaining: {}".format(four_year_periods))
print("Remaining heartbeats after 4 year periods: {}".format(remaining_4y))
```
Four year periods remaining: 13
Remaining heartbeats after 4 year periods: 48041640
Given that there are 13 four-year periods left, I can move from 2016 all the way to 2068, and find that I will have 48 million heart beats left. Let's drop down to figuring out how many quarters that is. I know that 2068 will have a leap day (unless someone finally decides to get rid of them), so I'll subtract that out first. Then, I'm left to figure out how many quarters exactly are left.
```python
remaining_leap = remaining_4y - leap_day_count
# Ignore leap day in the data set
heartbeats_quarter = hr_df_full[(hr_df_full.index.month != 2) &
(hr_df_full.index.day != 29)]['value'].sum()
quarters_left = remaining_leap // heartbeats_quarter
remaining_year = remaining_leap - quarters_left * heartbeats_quarter
print("Quarters left starting 2068: {}".format(quarters_left))
print("Remaining heartbeats after that: {}".format(remaining_year))
```
Quarters left starting 2068: 8
Remaining heartbeats after that: 4760716
So, that analysis gets me through until January 1<sup>st</sup> 2070. Final step, using that minute estimate to figure out how many minutes past that I'm predicted to have:
```python
from datetime import timedelta
base = datetime(2070, 1, 1)
minutes_left = remaining_year // minute_mean
kaput = timedelta(minutes=minutes_left)
base + kaput
```
datetime.datetime(2070, 2, 23, 5, 28)
According to this, I've got until February 23<sup>rd</sup>, 2070 at 5:28 AM before my heart gives out.
# Summary
Well, that's kind of a creepy date to know. As I said at the top though, **this number is totally useless in any medical context**. It ignores the rate at which we continue to get better at making people live longer, and it extrapolates the rest of my life from 3 months' worth of data. Additionally, throughout my time developing this post I made many minor mistakes. I think they're all fixed now, but it's easy to mix up a number here or there and throw the analysis off by a couple of years.
Even still, I think philosophically humans have a desire to know how much time we have left in the world. [Man is but a breath](https://www.biblegateway.com/passage/?search=psalm+144&version=ESV), and it's scary to think just how quickly that date may be coming up. This analysis asks an important question though: what are you going to do with the time you have left?
Thanks for sticking with me on this one, I promise it will be much less depressing next time!

View File

@ -0,0 +1,311 @@
---
slug: 2016/04/tick-tock
title: Tick tock...
date: 2016-04-06 12:00:00
authors: [bspeice]
tags: []
---
If all we have is a finite number of heartbeats left, what about me?
<!-- truncate -->
Warning: this one is a bit creepier. But that's what you get when you come up with data science ideas as you're drifting off to sleep.
## 2.5 Billion
If [PBS][1] is right, that's the total number of heartbeats we get. Approximately once every second that number goes down, and down, and down again...
[1]: http://www.pbs.org/wgbh/nova/heart/heartfacts.html
```python
total_heartbeats = 2500000000
```
I got a Fitbit this past Christmas season, mostly because I was interested in the data and trying to work on some data science projects with it. This is going to be the first project, but there will likely be more (and not nearly as morbid). My idea was: If this is the final number that I'm running up against, how far have I come, and how far am I likely to go? I've currently had about 3 months' time to estimate what my data will look like, so let's go ahead and see: given a lifetime of 2.5 billion heartbeats, how much time do I have left?
## Statistical Considerations
Since I'm starting to work with health data, there are a few considerations I think are important before I start digging through my data.
1. The concept of 2.5 billion as an agreed-upon number is tenuous at best. I've seen anywhere from [2.21 billion][2] to [3.4 billion][3], so even if I knew exactly how many times my heart had beaten so far, the ending result would still be suspect. I'm using 2.5 billion because that seems to be about the midpoint of the estimates I've seen so far.
2. Most of the numbers I've seen so far are based on extrapolating the number of heartbeats from life expectancy. As life expectancy goes up, the number of expected heartbeats goes up too.
3. My estimation of the number of heartbeats in my life so far is based on 3 months' worth of data, and I'm extrapolating an entire lifetime from it.
So while the ending number is **not useful in any medical context**, it is still an interesting project to work with the data I have on hand.
## Getting the data
[Fitbit](https://www.fitbit.com/) has an [API available](https://dev.fitbit.com/) for people to pull their personal data off the system. It requires registering an application, authentication with OAuth, and some other complicated things. **If you're not interested in how I fetch the data, skip [here](#Wild-Extrapolations-from-Small-Data)**.
### Registering an application
I've already [registered a personal application](https://dev.fitbit.com/apps/new) with Fitbit, so I can go ahead and retrieve things like the client secret from a file.
[1]: http://www.pbs.org/wgbh/nova/heart/heartfacts.html
[2]: http://gizmodo.com/5982977/how-many-heartbeats-does-each-species-get-in-a-lifetime
[3]: http://wonderopolis.org/wonder/how-many-times-does-your-heart-beat-in-a-lifetime/
```python
# Import all the OAuth secret information from a local file
from secrets import CLIENT_SECRET, CLIENT_ID, CALLBACK_URL
```
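That `secrets` file is just a small local Python module holding the OAuth credentials. As a rough sketch (every value below is a placeholder, not a real credential), it might look something like this:
```python
# secrets.py -- placeholder values only, never commit real credentials
CLIENT_ID = '123ABC'
CLIENT_SECRET = '0123456789abcdef0123456789abcdef'
CALLBACK_URL = 'https://localhost:8080/auth_callback'
```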
### Handling OAuth 2
So, all the people that know what OAuth 2 is know what's coming next. For those who don't: OAuth is how people allow applications to access other data without having to know your password. Essentially the dialog goes like this:
```
Application: I've got a user here who wants to use my application, but I need their data.
Fitbit: OK, what data do you need access to, and for how long?
Application: I need all of these scopes, and for this amount of time.
Fitbit: OK, let me check with the user to make sure they really want to do this.
Fitbit: User, do you really want to let this application have your data?
User: I do! And to prove it, here's my password.
Fitbit: OK, everything checks out. I'll let the application access your data.
Fitbit: Application, you can access the user's data. Use this special value whenever you need to request data from me.
Application: Thank you, now give me all the data.
```
Effectively, this allows an application to gain access to a user's data without ever needing to know the user's password. That way, even if the other application is hacked, the user's original data remains safe. Plus, the user can let the data service know to stop providing the application access any time they want. All in all, very secure.
It does make handling small requests a bit challenging, but I'll go through the steps here. We'll be using the [Implicit Grant](https://dev.fitbit.com/docs/oauth2/) workflow, as it requires fewer steps in processing.
First, we need to set up the URL the user would visit to authenticate:
```python
import urllib
FITBIT_URI = 'https://www.fitbit.com/oauth2/authorize'
params = {
# If we need more than one scope, must be a CSV string
'scope': 'heartrate',
'response_type': 'token',
'expires_in': 86400, # 1 day
'redirect_uri': CALLBACK_URL,
'client_id': CLIENT_ID
}
request_url = FITBIT_URI + '?' + urllib.parse.urlencode(params)
```
Now, here you would print out the request URL, go visit it, and get the full URL that it sends you back to. Because that is very sensitive information (specifically containing my `CLIENT_ID` that I'd really rather not share on the internet), I've skipped that step in the code here, but it happens in the background.
```python
# The `response_url` variable contains the full URL that
# FitBit sent back to us, but most importantly,
# contains the token we need for authorization.
access_token = dict(urllib.parse.parse_qsl(response_url))['access_token']
```
### Requesting the data
Now that we've actually set up our access via the `access_token`, it's time to get the actual [heart rate data](https://dev.fitbit.com/docs/heart-rate/). I'll be using data from January 1, 2016 through March 31, 2016, and extrapolating wildly from that.
Fitbit only lets us fetch intraday data one day at a time, so I'll create a date range using pandas and iterate through that to pull down all the data.
```python
from requests_oauthlib import OAuth2Session
import pandas as pd
from datetime import datetime
session = OAuth2Session(token={
'access_token': access_token,
'token_type': 'Bearer'
})
format_str = '%Y-%m-%d'
start_date = datetime(2016, 1, 1)
end_date = datetime(2016, 3, 31)
dr = pd.date_range(start_date, end_date)
url = 'https://api.fitbit.com/1/user/-/activities/heart/date/{0}/1d/1min.json'
hr_responses = [session.get(url.format(d.strftime(format_str))) for d in dr]
def record_to_df(record):
if 'activities-heart' not in record:
return None
date_str = record['activities-heart'][0]['dateTime']
df = pd.DataFrame(record['activities-heart-intraday']['dataset'])
df.index = df['time'].apply(
lambda x: datetime.strptime(date_str + ' ' + x, '%Y-%m-%d %H:%M:%S'))
return df
hr_dataframes = [record_to_df(record.json()) for record in hr_responses]
hr_df_concat = pd.concat(hr_dataframes)
# There are some minutes with missing data, so we need to correct that
full_daterange = pd.date_range(hr_df_concat.index[0],
hr_df_concat.index[-1],
freq='min')
hr_df_full = hr_df_concat.reindex(full_daterange, method='nearest')
print("Heartbeats from {} to {}: {}".format(hr_df_full.index[0],
hr_df_full.index[-1],
hr_df_full['value'].sum()))
```
```
Heartbeats from 2016-01-01 00:00:00 to 2016-03-31 23:59:00: 8139060
```
And now we've retrieved all the available heart rate data for January 1<sup>st</sup> through March 31<sup>st</sup>! Let's get to the actual analysis.
## Wild Extrapolations from Small Data
A fundamental issue of this data is that it's pretty small. I'm using 3 months of data to make predictions about my entire life. But, purely as an exercise, I'll move forward.
### How many heartbeats so far?
The first step is figuring out how many of the 2.5 billion heartbeats I've used so far. We're going to try and work backward from the present day to when I was born to get that number. The easy part comes first: going back to January 1<sup>st</sup>, 1992. That's because I can generalize how many 3-month increments there were between now and then, account for leap years, and call that section done.
Between January 1992 and January 2016 there were 96 quarters, and 6 leap days. The number we're looking for is:
$$
\begin{equation*}
hr_q \cdot n - hr_d \cdot (n-m)
\end{equation*}
$$
- $hr_q$: Number of heartbeats per quarter
- $hr_d$: Number of heartbeats on leap day
- $n$: Number of quarters, in this case 96
- $m$: Number of leap days, in this case 6
```python
quarterly_count = hr_df_full['value'].sum()
leap_day_count = hr_df_full[(hr_df_full.index.month == 2) &
(hr_df_full.index.day == 29)]['value'].sum()
num_quarters = 96
leap_days = 6
jan_92_jan_16 = quarterly_count * num_quarters - leap_day_count * (num_quarters - leap_days)
jan_92_jan_16
```
```
773609400
```
So between January 1992 and January 2016 I've used $\approx$ 774 million heartbeats. Now, I need to go back to my exact birthday. I'm going to first find on average how many heartbeats I use in a minute, and multiply that by the number of minutes between my birthday and January 1992.
For privacy purposes I'll put the code here that I'm using, but without any identifying information:
```python
minute_mean = hr_df_full['value'].mean()
# Don't you wish you knew?
# birthday_minutes = ???
birthday_heartbeats = birthday_minutes * minute_mean
heartbeats_until_2016 = int(birthday_heartbeats + jan_92_jan_16)
remaining_2016 = total_heartbeats - heartbeats_until_2016
print("Heartbeats so far: {}".format(heartbeats_until_2016))
print("Remaining heartbeats: {}".format(remaining_2016))
```
```
Heartbeats so far: 775804660
Remaining heartbeats: 1724195340
```
It would appear that my heart has beaten 775,804,660 times between my moment of birth and January 1<sup>st</sup> 2016, and that I have 1.72 billion left.
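For the curious: the hidden `birthday_minutes` value above is nothing more than the number of minutes between a birth timestamp and January 1<sup>st</sup>, 1992. A minimal sketch with a made-up birthdate (the real one stays private):
```python
from datetime import datetime

# Hypothetical birthdate -- not the real one
birth = datetime(1991, 7, 4, 12, 30)
jan_1992 = datetime(1992, 1, 1)

# Minutes between birth and January 1st, 1992
birthday_minutes = (jan_1992 - birth).total_seconds() / 60
```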
### How many heartbeats longer?
Now comes the tricky bit. I know how many heart beats I've used so far, and how many I have remaining, so I'd like to come up with a (relatively) accurate estimate of when exactly my heart should give out. We'll do this in a few steps, increasing in granularity.
First step, how many heartbeats do I use in a 4-year period? I have data for a single quarter including leap day, so I want to know:
$$
\begin{equation*}
hr_q \cdot n - hr_d \cdot (n - m)
\end{equation*}
$$
- $hr_q$: Heartbeats per quarter
- $hr_d$: Heartbeats per leap day
- $n$: Number of quarters = 16
- $m$: Number of leap days = 1
```python
heartbeats_4year = quarterly_count * 16 - leap_day_count * (16 - 1)
heartbeats_4year
```
```
128934900
```
Now, I can fast forward from 2016 the number of periods of 4 years I have left.
```python
four_year_periods = remaining_2016 // heartbeats_4year
remaining_4y = remaining_2016 - four_year_periods * heartbeats_4year
print("Four year periods remaining: {}".format(four_year_periods))
print("Remaining heartbeats after 4 year periods: {}".format(remaining_4y))
```
```
Four year periods remaining: 13
Remaining heartbeats after 4 year periods: 48041640
```
Given that there are 13 four-year periods left, I can move from 2016 all the way to 2068, and find that I will have 48 million heart beats left. Let's drop down to figuring out how many quarters that is. I know that 2068 will have a leap day (unless someone finally decides to get rid of them), so I'll subtract that out first. Then, I'm left to figure out how many quarters exactly are left.
```python
remaining_leap = remaining_4y - leap_day_count
# Ignore leap day in the data set
heartbeats_quarter = hr_df_full[(hr_df_full.index.month != 2) &
(hr_df_full.index.day != 29)]['value'].sum()
quarters_left = remaining_leap // heartbeats_quarter
remaining_year = remaining_leap - quarters_left * heartbeats_quarter
print("Quarters left starting 2068: {}".format(quarters_left))
print("Remaining heartbeats after that: {}".format(remaining_year))
```
```
Quarters left starting 2068: 8
Remaining heartbeats after that: 4760716
```
So, that analysis gets me through until January 1<sup>st</sup> 2070. Final step, using that minute estimate to figure out how many minutes past that I'm predicted to have:
```python
from datetime import timedelta
base = datetime(2070, 1, 1)
minutes_left = remaining_year // minute_mean
kaput = timedelta(minutes=minutes_left)
base + kaput
```
```
datetime.datetime(2070, 2, 23, 5, 28)
```
According to this, I've got until February 23<sup>rd</sup>, 2070 at 5:28 AM before my heart gives out.
## Summary
Well, that's kind of a creepy date to know. As I said at the top though, **this number is totally useless in any medical context**. It ignores the rate at which we continue to get better at making people live longer, and it extrapolates the rest of my life from 3 months' worth of data. Additionally, throughout my time developing this post I made many minor mistakes. I think they're all fixed now, but it's easy to mix up a number here or there and throw the analysis off by a couple of years.
Even still, I think philosophically humans have a desire to know how much time we have left in the world. [Man is but a breath](https://www.biblegateway.com/passage/?search=psalm+144&version=ESV), and it's scary to think just how quickly that date may be coming up. This analysis asks an important question though: what are you going to do with the time you have left?
Thanks for sticking with me on this one, I promise it will be much less depressing next time!

View File

@ -0,0 +1,15 @@
Title: The Unfair Casino
Date: 2016-05-15
Category: Blog
Tags: casino, probability, em, machine learning, simulated annealing
Authors: Bradlee Speice
Summary: Trying to figure out how exactly two dice are loaded in a cheating casino
[//]: <> "Modified: "
{% notebook 2016-5-15-the-unfair-casino.ipynb %}
<script type="text/x-mathjax-config">
//MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
MathJax.Hub.Config({tex2jax: {inlineMath: [['\$','\$']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,497 @@
Or, how to get thrown out of a casino because you're a mathematician.
---
In the ongoing eternal cycle of mathematicians asking generally useless questions about probability, I dreamt up another one. The scenario is as follows:
**You're playing a game with two dice, and you do not get to see what the outcome of each die is on any given roll. All you get to see is their sum. Given an arbitrarily long list of sums of two rolls, can you determine whether one or both dice are loaded, and what those loadings are?**
# Proving we can detect cheating
My first question is simple: is this possible? There are a lot of trivial cases that make it obvious cheating is going on, but some edge cases might give us more difficulty. First though, let's get a picture of what the fair distribution looks like. In principle, we can only detect cheating if the distribution produced by the loaded dice differs from the distribution produced by fair dice.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
fair_1 = np.random.randint(1, 7, 10000)
fair_2 = np.random.randint(1, 7, 10000)
pd.Series(fair_1 + fair_2).plot(kind='hist', bins=11);
plt.title('Fair Distribution');
```
![png](_notebook_files/_notebook_1_0.png)
This distribution makes sense: there are many ways to make a 7 (the most frequent observed value) and very few ways to make a 12 or 2; an important symmetry. As a special note, you can notice that the sum of two fair dice is a discrete case of the [Triangle Distribution][1], which is itself a special case of the [Irwin-Hall Distribution][2].
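As a quick sanity check on that claim (a sketch of mine, not part of the original analysis), the exact fair-sum probabilities can be computed by convolving the two uniform face densities, and they trace out exactly that triangle:
```python
import numpy as np

fair = np.repeat(1/6, 6)
# Convolving the two face densities gives P(sum = 2), ..., P(sum = 12)
sum_probs = np.convolve(fair, fair)
for total, p in zip(range(2, 13), sum_probs):
    print('P(sum = {}) = {:.4f}'.format(total, p))
```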
# The Edge Cases
Given that we understand how the results of two fair dice are distributed, let's see some of the interesting edge cases that come up. This will give us assurance that when a casino is cheating, it is detectable (given sufficient data). To make this as hard as possible, we will think of scenarios where the expected value of the sum of loaded dice is the same as the expected value of the sum of fair dice.
### Edge Case 1
What happens when one die is biased low, and one die is biased high? That is, where:
\begin{align}
\begin{array}{cc}
D_1 = \left\{
\begin{array}{lr}
1 & w.p. 1/3\\
2 & w.p. 1/3\\
3 & w.p. 1/12\\
4 & w.p. 1/12\\
5 & w.p. 1/12\\
6 & w.p. 1/12
\end{array}
\right. &
D_2 = \left\{
\begin{array}{lr}
1 & w.p. 1/12\\
2 & w.p. 1/12\\
3 & w.p. 1/12\\
4 & w.p. 1/12\\
5 & w.p. 1/3\\
6 & w.p. 1/3
\end{array}
\right. \\
\mathbb{E}[D_1] = 2.5 & \mathbb{E}[D_2] = 4.5
\end{array}\\
\mathbb{E}[D_1 + D_2] = 7 = \mathbb{E}[D_{fair} + D_{fair}]
\end{align}
[1]: https://en.wikipedia.org/wiki/Triangular_distribution
[2]: https://en.wikipedia.org/wiki/Irwin%E2%80%93Hall_distribution
```python
def unfair_die(p_vals, n):
x = np.random.multinomial(1, p_vals, n)
return x.nonzero()[1] + 1
d1 = [1/3, 1/3, 1/12, 1/12, 1/12, 1/12]
d2 = [1/12, 1/12, 1/12, 1/12, 1/3, 1/3]
x1 = unfair_die(d1, 10000)
x2 = unfair_die(d2, 10000)
pd.Series(x1 + x2).plot(kind='hist', bins=11);
plt.title('$D_1$ biased low, $D_2$ biased high');
```
![png](_notebook_files/_notebook_3_0.png)
We can see that while the 7 value remains the most likely (as expected), the distribution is not so nicely shaped any more.
### Edge Case 2
When one die is loaded low and one is loaded high, we've seen that we can detect it. How about when both dice are loaded toward both the low and high values? That is, we have the following distribution:
\begin{align}
\begin{array}{cc}
D_1 = \left\{
\begin{array}{lr}
1 & w.p. 1/3\\
2 & w.p. 1/12\\
3 & w.p. 1/12\\
4 & w.p. 1/12\\
5 & w.p. 1/12\\
6 & w.p. 1/3
\end{array}
\right. &
D_2 = \left\{
\begin{array}{lr}
1 & w.p. 1/3\\
2 & w.p. 1/12\\
3 & w.p. 1/12\\
4 & w.p. 1/12\\
5 & w.p. 1/12\\
6 & w.p. 1/3
\end{array}
\right. \\
\mathbb{E}[D_1] = 3.5 & \mathbb{E}[D_2] = 3.5
\end{array}\\
\mathbb{E}[D_1 + D_2] = 7 = \mathbb{E}[D_{fair} + D_{fair}]
\end{align}
We can see that even the expected value of each individual die is the same as that of a fair die! However, the distribution (if we are doing this correctly) should still differ from the fair one:
```python
d1 = [1/3, 1/12, 1/12, 1/12, 1/12, 1/3]
d2 = d1
x1 = unfair_die(d1, 10000)
x2 = unfair_die(d2, 10000)
pd.Series(x1 + x2).plot(kind='hist', bins=11)
plt.title("$D_1$ and $D_2$ biased to 1 and 6");
```
![png](_notebook_files/_notebook_5_0.png)
In a very un-subtle way, we have of course made the values 2 and 12 far more likely.
# Detection Conclusion
There are some trivial examples of cheating that are easy to detect: whenever the expected value of the observed sums deviates from the expected value for the sum of two fair dice, we can immediately conclude that there is cheating going on.
The interesting edge cases occur when the expected value of the sum of loaded dice matches the expected value of the sum of fair dice. Considering the above examples (and a couple more I ran through in developing this), we have seen that in every circumstance having two unfair dice leads to a distribution of results different from the fair results.
We can thus finally state: **just by looking at the distribution of results from this game, we can immediately conclude whether there is cheating.**
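One concrete way to turn "looking at the distribution" into a yes/no answer (an aside of mine; the rest of the post doesn't rely on it) is a chi-square goodness-of-fit test of the observed sums against the fair-sum probabilities:
```python
import numpy as np
from scipy import stats

def looks_loaded(rolls, alpha=0.01):
    # Observed counts of each sum 2..12
    observed = np.bincount(rolls, minlength=13)[2:]
    # Expected counts under two fair dice
    fair = np.repeat(1/6, 6)
    expected = np.convolve(fair, fair) * len(rolls)
    _, p_value = stats.chisquare(observed, expected)
    return p_value < alpha

# Example using the `unfair_die` helper defined above
d1 = [1/3, 1/12, 1/12, 1/12, 1/12, 1/3]
print(looks_loaded(unfair_die(d1, 10000) + unfair_die(d1, 10000)))
# Should print True for a loading this heavy and this much data
```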
# Simulated Annealing
What we really would like to do though, is see if there is any way to determine how exactly the dice are loaded. This is significantly more complicated, but we can borrow some algorithms from Machine Learning to figure out exactly how to perform this process. I'm using the Simulated Annealing algorithm, and I discuss why this works and why I chose it over some of the alternatives in the [justification](#Justification-of-Simulated-Annealing). If you don't care about how I set up the model and just want to see the code, check out [the actual code](#The-actual-code).
[Simulated Annealing][3] is a variation of the [Metropolis-Hastings Algorithm][4], but the important thing for us is: Simulated Annealing allows us to quickly optimize high-dimensional problems. But what exactly are we trying to optimize? Ideally, we want a function that can tell us whether one distribution for the dice better explains the results than another distribution. This is known as the **likelihood** function.
## Deriving the Likelihood function
To derive our likelihood function, we want to know: **what is the probability of seeing a specific result given those hidden parameters?** This is actually a surprisingly difficult problem. While we can do a lot of calculations by hand, we need a more general solution since we will be working with some very interesting die distributions.
We first note that the sum of two dice can take on 11 different values - 2 through 12. This implies that each individual sum follows a [Categorical distribution](https://en.wikipedia.org/wiki/Categorical_distribution). That is:
\begin{align}
\mathcal{L}(x) = \left\{
\begin{array}{lr}
p_2 & x = 2\\
p_3 & x = 3\\
\ldots & \\
p_{11} & x = 11\\
p_{12} & x = 12
\end{array}
\right.
\end{align}
Where each $p_i$ is the probability of seeing that specific result. However, we need to calculate what each probability is! I'll save you the details, but [this author](http://math.stackexchange.com/a/1646360/320784) explains how to do it.
Now, we would like to know the likelihood of our entire data-set. This is trivial:
\begin{align}
\mathcal{L}(\mathbf{X}) &= \prod_{i=1}^n \mathcal{L}(x_i)
\end{align}
However, it's typically much easier to work with the $\log(\mathcal{L})$ function instead. This is critically important from a computational perspective: when you multiply so many small numbers together (i.e. the product of $L(x)$ terms) the computer suffers from rounding error; if we don't control for this, we will find that no matter the distributions we choose for each die, the "likelihood" will be close to zero because the computer is not precise enough.
\begin{align}
\log \mathcal{L}(\mathbf{X}) &= \sum_{i=1}^n \log \mathcal{L}(x_i)
\end{align}
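To make the rounding-error point concrete, here's a tiny illustration (numbers chosen only for demonstration): the product of 10,000 likelihood terms around $1/6$ underflows to zero in double precision, while the sum of their logs stays perfectly usable.
```python
import numpy as np

probs = np.repeat(1/6, 10000)

print(np.prod(probs))       # 0.0 -- underflows double precision
print(np.log(probs).sum())  # roughly -17918 -- still usable
```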
## The process of Simulated Annealing
The means by which we optimize our likelihood function is the simulated annealing algorithm. The way it works is as follows:
1. Start with a random guess for the parameters we are trying to optimize. In our case we are trying to guess the distribution of two dice, and so we "optimize" until we have a distribution that matches the data.
2. For each iteration of the algorithm:
1. Generate a new "proposed" set of parameters based on the current parameters -
i.e. slightly modify the current parameters to get a new set of parameters.
2. Calculate the value of $\log(\mathcal{L})$ for each set of parameters. If the function value for the
proposed parameter set is higher than for the current, automatically switch to the new parameter set
and continue the next iteration.
3. Given the new parameter set performs worse, determine a probability of switching to the new parameter set anyways: $\mathcal{P}(p_{current}, p_{proposed})$
4. Switch to the new parameter set with probability $\mathcal{P}$. If you fail to switch, begin the next iteration.
3. The algorithm is complete after we fail to make a transition $n$ times in a row.
If everything goes according to plan, we will have a value that is close to the true distribution of each die.
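The switching probability $\mathcal{P}$ in the list above is left abstract; the `optimize` code below implements it as the standard Metropolis acceptance rule on the log-likelihoods (with no explicit temperature schedule). A minimal sketch of just that rule:
```python
import numpy as np

def switch_probability(cur_score, new_score):
    # Always accept improvements; otherwise accept with probability
    # exp(new_score - cur_score), which shrinks as the proposal gets worse
    return min(1.0, np.exp(new_score - cur_score))
```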
# The actual code
We start by defining the score function. This will tell us how well the proposed die densities actually explain the results.
[3]:https://en.wikipedia.org/wiki/Simulated_annealing
[4]:https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
```python
import numpy as np
from numpy import polynomial
def density_coef(d1_density, d2_density):
# Calculating the probabilities of each outcome was taken
# from this author: http://math.stackexchange.com/a/1710392/320784
d1_p = polynomial.Polynomial(d1_density)
d2_p = polynomial.Polynomial(d2_density)
coefs = (d1_p * d2_p).coef
return coefs
def score(x, d1_density, d2_density):
# We've now got the probabilities of each event, but we need
# to shift the array a bit so we can use the x values to actually
# index into it. This will allow us to do all the calculations
# incredibly quickly
coefs = density_coef(d1_density, d2_density)
coefs = np.hstack((0, 0, coefs))
return np.log(coefs[x]).sum()
```
Afterward, we need to write something to permute the proposal densities. We make random modifications, and eventually the best one survives.
```python
def permute(d1_density, d2_density):
# To ensure we have legitimate densities, we will randomly
# increase one die face probability by `change`,
# and decrease one by `change`.
# This means there are something less than (1/`change`)^12 possibilities
# we are trying to search over.
change = .01
d1_index1, d1_index2 = np.random.randint(0, 6, 2)
d2_index1, d2_index2 = np.random.randint(0, 6, 2)
# Also make sure to copy. I've had some weird aliasing issues
# in the past that made everything blow up.
new_d1 = np.float64(np.copy(d1_density))
new_d2 = np.float64(np.copy(d2_density))
# While this doesn't account for the possibility that some
# values go negative, in practice this never happens
new_d1[d1_index1] += change
new_d1[d1_index2] -= change
new_d2[d2_index1] += change
new_d2[d2_index2] -= change
return new_d1, new_d2
```
Now we come to the main algorithm code. This is what brings all the pieces together.
```python
def optimize(data, conv_count=10, max_iter=1e4):
switch_failures = 0
iter_count = 0
# Start with guessing fair dice
cur_d1 = np.repeat(1/6, 6)
cur_d2 = np.repeat(1/6, 6)
cur_score = score(data, cur_d1, cur_d2)
# Keep track of our best guesses - may not be
# what we end with
max_score = cur_score
max_d1 = cur_d1
max_d2 = cur_d2
# Optimization stops when we have failed to switch `conv_count`
# times (presumably because we have a great guess), or we reach
# the maximum number of iterations.
while switch_failures < conv_count and iter_count < max_iter:
iter_count += 1
if iter_count % (max_iter / 10) == 0:
print('Iteration: {}; Current score (higher is better): {}'.format(
iter_count, cur_score))
new_d1, new_d2 = permute(cur_d1, cur_d2)
new_score = score(data, new_d1, new_d2)
if new_score > max_score:
max_score = new_score
max_d1 = new_d1
max_d2 = new_d2
if new_score > cur_score:
# If the new permutation beats the old one,
# automatically select it.
cur_score = new_score
cur_d1 = new_d1
cur_d2 = new_d2
switch_failures = 0
else:
# We didn't beat the current score, but allow
# for possibly switching anyways.
accept_prob = np.exp(new_score - cur_score)
coin_toss = np.random.rand()
if coin_toss < accept_prob:
# We randomly switch to the new distribution
cur_score = new_score
cur_d1 = new_d1
cur_d2 = new_d2
switch_failures = 0
else:
switch_failures += 1
# Return both our best guess, and the ending guess
return max_d1, max_d2, cur_d1, cur_d2
```
And now we have finished the hard work!
# Catching the Casino
Let's go through a couple of scenarios and see if we can catch the casino cheating with some loaded dice. **In every scenario we start with an assumption of fair dice**, and then try our hand to figure out what the *actual* distribution was.
## Attempt 1
The casino is using two dice that are both biased low. How well can we recover the distribution?
```python
import time
def simulate_casino(d1_dist, d2_dist, n=10000):
d1_vals = unfair_die(d1_dist, n)
d2_vals = unfair_die(d2_dist, n)
start = time.perf_counter()
max_d1, max_d2, final_d1, final_d2 = optimize(d1_vals + d2_vals)
end = time.perf_counter()
print("Simulated Annealing time: {:.02f}s".format(end - start))
coef_range = np.arange(2, 13) - .5
plt.subplot(221)
plt.bar(coef_range, density_coef(d1_dist, d2_dist), width=1)
plt.title('True Distribution')
plt.subplot(222)
plt.hist(d1_vals + d2_vals, bins=11)
plt.title('Empirical Distribution')
plt.subplot(223)
plt.bar(coef_range, density_coef(max_d1, max_d2), width=1)
plt.title('Recovered Distribution')
plt.gcf().set_size_inches(10, 10)
simulate_casino([2/9, 2/9, 2/9, 1/9, 1/9, 1/9],
[2/9, 2/9, 2/9, 1/9, 1/9, 1/9])
```
Iteration: 1000; Current score (higher is better): -22147.004400281654
Simulated Annealing time: 0.30s
![png](_notebook_files/_notebook_14_1.png)
## Attempt 2
The casino now uses dice that are both biased towards 1 and 6.
```python
simulate_casino([1/3, 1/12, 1/12, 1/12, 1/12, 1/3],
[1/3, 1/12, 1/12, 1/12, 1/12, 1/3])
```
Simulated Annealing time: 0.08s
![png](_notebook_files/_notebook_16_1.png)
## Attempt 3
The casino will now use one die biased towards 1 and 6, and one die towards 3 and 4.
```python
simulate_casino([1/3, 1/12, 1/12, 1/12, 1/12, 1/3],
[1/12, 1/12, 1/3, 1/3, 1/12, 1/12])
```
Simulated Annealing time: 0.09s
![png](_notebook_files/_notebook_18_1.png)
## Attempt 4
We'll now finally go to a fair casino to make sure that we can still recognize a positive result.
```python
simulate_casino(np.repeat(1/6, 6), np.repeat(1/6, 6))
```
Simulated Annealing time: 0.02s
![png](_notebook_files/_notebook_20_1.png)
## Attempt 5
We've so far been working with a large amount of data - 10,000 data points. Can we now scale things back to only 250 throws? We'll start with two dice biased high.
```python
simulate_casino([1/9, 1/9, 1/9, 2/9, 2/9, 2/9],
[1/9, 1/9, 1/9, 2/9, 2/9, 2/9],
n=250)
```
Iteration: 1000; Current score (higher is better): -551.6995384525453
Iteration: 2000; Current score (higher is better): -547.7803673440676
Iteration: 3000; Current score (higher is better): -547.9805613193807
Iteration: 4000; Current score (higher is better): -546.7574874775273
Iteration: 5000; Current score (higher is better): -549.5798007672656
Iteration: 6000; Current score (higher is better): -545.0354060154496
Iteration: 7000; Current score (higher is better): -550.1134504086606
Iteration: 8000; Current score (higher is better): -549.9306537114975
Iteration: 9000; Current score (higher is better): -550.7075182119111
Iteration: 10000; Current score (higher is better): -549.400679551826
Simulated Annealing time: 1.94s
![png](_notebook_files/_notebook_22_1.png)
The results are surprisingly good. While the actual optimization process took much longer to finish than in the other examples, we still have a very good guess. As a caveat though: the recovered distribution tends to overfit the data. That is, if the sample doesn't represent the underlying distribution well, the recovered loadings will be off as well.
# Conclusion
Given the results above, we can see that we have indeed come up with a very good algorithm to determine the distribution of two dice given their results. As a benefit, we have even seen that results come back very quickly; it's not uncommon for the optimization to converge within a tenth of a second.
Additionally, we have seen that the algorithm can intuit the distribution even when there is not much data. While the final example shows that we can 'overfit' on the dataset, we can still get valuable information from a relatively small dataset.
We can declare at long last: **the mathematicians have again triumphed over the casino**.
---
# Justification of Simulated Annealing
## Why Simulated Annealing?
So why even use an algorithm with a fancy title like Simulated Annealing? First of all, because the title is sexy. Second of all, because this is a reasonably complicated problem to try and solve. We have a parameter space where each value $p_{ij} \in (0, 1)$ for $i \in \{1, 2\}$ and $j \in \{1, \ldots, 6\}$, for a total of 12 different variables we are trying to optimize over. Additionally, when optimizing a 12-dimensional function, simulated annealing helps ensure we don't get stuck in a merely local maximum.
## Why not something else?
This is a fair question. There are two classes of algorithms that can also be used to solve this problem: [Non-linear optimization](https://en.wikipedia.org/wiki/Nonlinear_programming) methods, and the [EM algorithm](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm).
1. I chose not to use non-linear optimization simply because I'm a bit concerned that it will trap me in a local maximum. Instead of running multiple different optimizations from different starting points, I can just use simulated annealing to take that into account. In addition, throughout the course of testing the simulated annealing code converged *incredibly* quickly - far more quickly than any non-linear solver would be able to accomplish.
2. The EM Algorithm was originally what I intended to write this blog post with. Indeed, the post was inspired by the [crooked casino](http://web.stanford.edu/class/stats366/hmmR2.html) example which uses the EM algorithm to solve it. However, after modeling the likelihood function I realized that the algebra would very quickly get out of hand. Trying to compute all the polynomial terms would not be fun, which would be needed to actually optimize for each parameter. So while the EM algorithm would likely be much faster in raw speed terms, the amount of time needed to program and verify it meant that I was far better off using a different method for optimization.

Binary file not shown.

After

Width:  |  Height:  |  Size: 18 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 16 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 18 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 7.4 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 19 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 18 KiB

Some files were not shown because too many files have changed in this diff