diff --git a/a-rustic-re-podcasting-server-part-1.html b/a-rustic-re-podcasting-server-part-1.html index a327a81..8aca65b 100644 --- a/a-rustic-re-podcasting-server-part-1.html +++ b/a-rustic-re-podcasting-server-part-1.html @@ -4,20 +4,22 @@ - + - + A Rustic Re-Podcasting Server (Part 1) - Bradlee Speice - - - - + + + + + + @@ -26,6 +28,17 @@ + + + @@ -39,7 +52,7 @@
@@ -54,12 +67,12 @@

A Rustic Re-Podcasting Server (Part 1)

-

Bradlee Speice, Sat 22 October 2016, Blog

+

Bradlee Speice, Sat 22 October 2016, Blog

-nutone, Rust

+nutone, Rust

@@ -115,15 +128,15 @@ on some bad cases, str <-> bytes specifically), but Rust is h should be incredibly simple: All I want is to echo back Didn't find URL: <url>. Shouldn't be that hard right? In Python I'd just do something like:

-
def echo_handler(request):
+
def echo_handler(request):
     return "You're visiting: {}".format(request.uri)
 

And we'd call it a day. Rust isn't so simple. Let's start with the trivial examples people post online:

-
fn hello_world(req: &mut Request) -> IronResult<Response> {
-    Ok(Response::with((status::Ok, "You found the server!")))
+
fn hello_world(req: &mut Request) -> IronResult<Response> {
+    Ok(Response::with((status::Ok, "You found the server!")))
 }
 
@@ -132,22 +145,22 @@ examples people post online:

version! All we need to do is just send back a string of some form. So, we look up the documentation for Request and see a url field that will contain what we want. Let's try the first iteration:

-
fn hello_world(req: &mut Request) -> IronResult<Response> {
-    Ok(Response::with((status::Ok, "You found the URL: " + req.url)))
+
fn hello_world(req: &mut Request) -> IronResult<Response> {
+    Ok(Response::with((status::Ok, "You found the URL: " + req.url)))
 }
 

Which yields the error:

-
error[E0369]: binary operation `+` cannot be applied to type `&'static str`
+
error[E0369]: binary operation `+` cannot be applied to type `&'static str`
 

OK, what's going on here? Time to start Googling for "concatenate strings in Rust". That's what we want to do right? Concatenate a static string and the URL.

After Googling, we come across a helpful concat! macro that looks really nice! Let's try that one:

-
fn hello_world(req: &mut Request) -> IronResult<Response> {
-    Ok(Response::with((status::Ok, concat!("You found the URL: ", req.url))))
+
fn hello_world(req: &mut Request) -> IronResult<Response> {
+    Ok(Response::with((status::Ok, concat!("You found the URL: ", req.url))))
 }
 
@@ -159,8 +172,8 @@ at compile time what req.url is. Which, in my outsider opinion, is strange. println! and format!, etc., all handle values they don't know at compile time. Why can't concat!? In any case, we need a new plan of attack. How about we try formatting strings?

-
fn hello_world(req: &mut Request) -> IronResult<Response> {
-    Ok(Response::with((status::Ok, format!("You found the URL: {}", req.url))))
+
fn hello_world(req: &mut Request) -> IronResult<Response> {
+    Ok(Response::with((status::Ok, format!("You found the URL: {}", req.url))))
 }
 
@@ -191,9 +204,9 @@ working on things that are a bit more complex?

We're going to cover that here. Our first try: creating a function which returns other functions. This is a principle called currying. We set up a function that allows us to keep some data in scope for another function to come later.

-
fn build_handler(message: String) -> Fn(&mut Request) -> IronResult<Response> {
-    move |_: &mut Request| {
-        Ok(Response::with((status::Ok, message)))
+
fn build_handler(message: String) -> Fn(&mut Request) -> IronResult<Response> {
+    move |_: &mut Request| {
+        Ok(Response::with((status::Ok, message)))
     }
 }
 
@@ -202,7 +215,7 @@ for another function to come later.

We've simply set up a function that returns another anonymous function with the message parameter scoped in. If you compile this, you get not 1, not 2, but 5 new errors. 4 of them are the same though:

-
error[E0277]: the trait bound `for<'r, 'r, 'r> std::ops::Fn(&'r mut iron::Request<'r, 'r>) -> std::result::Result<iron::Response, iron::IronError> + 'static: std::marker::Sized` is not satisfied
+
error[E0277]: the trait bound `for<'r, 'r, 'r> std::ops::Fn(&'r mut iron::Request<'r, 'r>) -> std::result::Result<iron::Response, iron::IronError> + 'static: std::marker::Sized` is not satisfied
 
@@ -230,19 +243,19 @@ we've been working with so far.

The principle is that we need to define a new struct to hold our data, then implement that handle() method to return the result. Something that looks like this might do:

-
struct EchoHandler {
-    message: String
+
struct EchoHandler {
+    message: String
 }
 
 impl Handler for EchoHandler {
-    fn handle(&self, _: &mut Request) -> IronResult<Response> {
-        Ok(Response::with((status::Ok, self.message)))
+    fn handle(&self, _: &mut Request) -> IronResult<Response> {
+        Ok(Response::with((status::Ok, self.message)))
     }
 }
 
 // Later in the code when we set up the router...
 let echo = EchoHandler {
-    message: "Is it working yet?"
+    message: "Is it working yet?"
 }
 router.get("/", echo.handle, "index");
 
@@ -251,38 +264,38 @@ like this might do:

We attempt to build a struct, and hand its handle method off to the router so the router knows what to do.

You guessed it, more errors:

-
error: attempted to take value of method `handle` on type `EchoHandler`
+
error: attempted to take value of method `handle` on type `EchoHandler`
 

Now, the Rust compiler is actually a really nice fellow, and offers us help:

-
help: maybe a `()` to call it is missing? If not, try an anonymous function
+
help: maybe a `()` to call it is missing? If not, try an anonymous function
 

We definitely don't want to call that function, so maybe try an anonymous function as it recommends?

-
router.get("/", |req: &mut Request| echo.handle(req), "index");
+
router.get("/", |req: &mut Request| echo.handle(req), "index");
 

Another error:

-
error[E0373]: closure may outlive the current function, but it borrows `echo`, which is owned by the current function
+
error[E0373]: closure may outlive the current function, but it borrows `echo`, which is owned by the current function
 

Another helpful message:

-
help: to force the closure to take ownership of `echo` (and any other referenced variables), use the `move` keyword
+
help: to force the closure to take ownership of `echo` (and any other referenced variables), use the `move` keyword
 

We're getting closer though! Let's implement this change:

-
router.get("/", move |req: &mut Request| echo.handle(req), "index");
+
router.get("/", move |req: &mut Request| echo.handle(req), "index");
 

And here's where things get strange:

-
error[E0507]: cannot move out of borrowed content
+
error[E0507]: cannot move out of borrowed content
   --> src/main.rs:18:40
    |
 18 |         Ok(Response::with((status::Ok, self.message)))
@@ -307,7 +320,7 @@ instead of transferring ownership
 audience out. Because iron won't accept a reference, we are forced into the
 second option: making a copy. To do so, we just need to change the function
 to look like this:

-
Ok(Response::with((status::Ok, self.message.clone())))
+
Ok(Response::with((status::Ok, self.message.clone())))
 
@@ -343,6 +356,20 @@ incredibly precise about how I use it.

going to take me a lot longer to do this than I originally thought.

+
+
+ + +
@@ -354,6 +381,7 @@ going to take me a lot longer to do this than I originally thought.

    +
diff --git a/audio-compression-using-pca.html b/audio-compression-using-pca.html index 99c2a1a..3e61982 100644 --- a/audio-compression-using-pca.html +++ b/audio-compression-using-pca.html @@ -4,20 +4,22 @@ - + - + Audio Compression using PCA - Bradlee Speice - - - - + + + + + + @@ -26,6 +28,17 @@ + + + @@ -39,7 +52,7 @@
@@ -54,12 +67,12 @@

Audio Compression using PCA

-

Bradlee Speice, Tue 01 November 2016, Blog

+

Bradlee Speice, Tue 01 November 2016, Blog

-Digital Signal Processing, Machine Learning, PCA

+Digital Signal Processing, Machine Learning, PCA

@@ -73,9 +86,592 @@
-

{% notebook 2016-11-01-PCA-audio-compression.ipynb %}

+

+

+
+
+
+
+

Towards a new (and pretty poor) compression scheme

I'm going to be working with some audio data for a while as I get prepared for a term project this semester. I'll be working (with a partner) to design a system for separating voices from music. Given my total lack of experience with Digital Signal Processing I figured that now was as good a time as ever to work on a couple of fun projects that would get me back up to speed.

+

The first project I want to work on: Designing a new compression scheme for audio data.

+

A Brief Introduction to Audio Compression

Audio files when uncompressed (files ending with .wav) are huge. Like, 10.5 Megabytes per minute huge. Storage is cheap these days, but that's still an incredible amount of data that we don't really need. Instead, we'd like to compress that data so that it's not taking up so much space. There are broadly two ways to accomplish this:

+
1. Lossless compression - Formats like FLAC, ALAC, and Monkey's Audio (.ape) all go down this route. The idea is that when you compress and uncompress a file, you get exactly the same as what you started with.
2. Lossy compression - Formats like MP3, Ogg, and AAC (.m4a) are far more popular, but make a crucial tradeoff: we can reduce the file size even more during compression, but the decompressed file won't be the same.

There is a fundamental tradeoff at stake: Using lossy compression sacrifices some of the integrity of the resulting file to save on storage space. Most people (I personally believe it's everybody) can't hear the difference, so this is an acceptable tradeoff. You have files that take up a 10th of the space, and nobody can tell there's a difference in audio quality.

+

A PCA-based Compression Scheme

What I want to try out is a PCA approach to encoding audio. The PCA technique comes from Machine Learning, where it is used for a process called Dimensionality Reduction. Put simply, the idea is the same as lossy compression: if we can find a way that represents the data well enough, we can save on space. There are a lot of theoretical concerns that lead me to believe this compression style will not end well, but I'm interested to try it nonetheless.

+

PCA works as follows: Given a dataset with a number of features, I find a way to approximate those original features using some "new features" that are statistically as close as possible to the original ones. This is comparable to a scheme like MP3: Given an original signal, I want to find a way of representing it that gets approximately close to what the original was. The difference is that PCA is designed for statistical data, and not signal data. But we won't let that stop us.

+

The idea is as follows: Given a signal, reshape it into 1024 columns by however many rows are needed (zero-padded if necessary). Run the PCA algorithm, and do dimensionality reduction with a couple different settings. The number of components I choose determines the quality: If I use 1024 components, I will essentially be using the original signal. If I use a smaller number of components, I start losing some of the data that was in the original file. This will give me an idea of whether it's possible to actually build an encoding scheme off of this, or whether I'm wasting my time.

+

Running the Algorithm

The audio I will be using comes from the song Tabulasa, by Broke for Free. I'll be loading the audio signal into Python and using Scikit-Learn to actually run the PCA algorithm.

+

We first need to convert the FLAC file I have to a WAV:

+ +
+
+
+
+
+
In [1]:
+
+
+
!ffmpeg -hide_banner -loglevel panic -i "Broke For Free/XXVII/01 Tabulasa.flac" "Tabulasa.wav" -c wav
+
+ +
+
+
+ +
+
+
+
+
+
+

Then, let's go ahead and load a small sample so you can hear what is going on.

+ +
+
+
+
+
+
In [2]:
+
+
+
from IPython.display import Audio
+from scipy.io import wavfile
+
+samplerate, tabulasa = wavfile.read('Tabulasa.wav')
+
+start = samplerate * 14 # start 14 seconds in
+end = start + samplerate * 10 # 10 second duration
+Audio(data=tabulasa[start:end, 0], rate=samplerate)
+
+ +
+
+
+ +
+
+
Out[2]:
+ +
+ + + +
+ +
+ +
+
+ +
+
+
+
+
+
+

Next, we'll define the code we will be using to do PCA. It's very short, as the PCA algorithm is very simple.

+ +
+
+
+
+
+
In [3]:
+
+
+
from sklearn.decomposition import PCA
+import numpy as np
+
+def pca_reduce(signal, n_components, block_size=1024):
+    
+    # First, zero-pad the signal so that it is divisible by the block_size
+    samples = len(signal)
+    hanging = block_size - np.mod(samples, block_size)
+    padded = np.lib.pad(signal, (0, hanging), 'constant', constant_values=0)
+    
+    # Reshape the signal to have 1024 dimensions
+    reshaped = padded.reshape((len(padded) // block_size, block_size))
+    
+    # Second, do the actual PCA process
+    pca = PCA(n_components=n_components)
+    pca.fit(reshaped)
+    
+    transformed = pca.transform(reshaped)
+    reconstructed = pca.inverse_transform(transformed).reshape((len(padded)))
+    return pca, transformed, reconstructed
+
+ +
+
+
+ +
+
+
+
+
+
+

Now that we've got our functions set up, let's try actually running something. First, we'll use n_components == block_size, which implies that we should end up with the same signal we started with.

+ +
+
+
+
+
+
In [4]:
+
+
+
tabulasa_left = tabulasa[:,0]
+
+_, _, reconstructed = pca_reduce(tabulasa_left, 1024, 1024)
+
+Audio(data=reconstructed[start:end], rate=samplerate)
+
+ +
+
+
+ +
+
+ + +
Out[4]:
+ +
+ + + +
+ +
+ +
+
+ +
+
+
+
+
+
+

OK, that does indeed sound like what we originally had. Let's drastically cut down the number of components we're doing this with as a sanity check: the audio quality should become incredibly poor.

+ +
+
+
+
+
+
In [5]:
+
+
+
_, _, reconstructed = pca_reduce(tabulasa_left, 32, 1024)
+
+Audio(data=reconstructed[start:end], rate=samplerate)
+
+ +
+
+
+ +
+
+ + +
Out[5]:
+ +
+ + + +
+ +
+ +
+
+ +
+
+
+
+
+
+

As expected, our reconstructed audio does sound incredibly poor! But there's something else very interesting going on here under the hood. Did you notice that the bassline comes across very well, but that there's no midrange or treble? The drums are almost entirely gone.

+

Drop the (Treble)

It will help to understand PCA more fully before reading this part, but I'll do my best to break it down. PCA tries to find a way to best represent the dataset using "components." Think of each "component" as containing some of the information you need in order to reconstruct the full audio. For example, you might have a "low frequency" component that contains all the information you need in order to hear the bassline. There might be other components that explain the high frequency things like singers, or melodies, that you also need.

+

What makes PCA interesting is that it attempts to find the "most important" components in explaining the signal. In a signal processing world, this means that PCA is trying to find the signal amongst the noise in your data. In our case, this means that PCA, when forced to work with small numbers of components, will chuck out the noisy components first. It's doing its best to reconstruct the signal, but it has to make sacrifices somewhere.

+

So I've mentioned that PCA identifies the "noisy" components in our dataset. This is equivalent to saying that PCA removes the "high frequency" components in this case: it's very easy to represent a low-frequency signal like a bassline. It's far more difficult to represent a high-frequency signal because it's changing all the time. When you force PCA to make a tradeoff by using a small number of components, the best it can hope to do is replicate the low-frequency sections and skip the high-frequency things.

+

This is a very interesting insight, and it also has echoes (pardon the pun) of how humans understand music in general. Other encoding schemes (like MP3, etc.) typically chop off a lot of the high-frequency range as well. There is typically a lot of high-frequency noise in audio that is nearly impossible to hear, so it's easy to remove it without anyone noticing. PCA ends up doing something similar, and while that certainly wasn't the intention, it is an interesting effect.
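One way to sanity-check the low-pass story inside this notebook (a sketch that reuses tabulasa_left, samplerate, and pca_reduce from above; the 2 kHz cutoff is an arbitrary choice):

import numpy as np

def high_freq_energy_fraction(signal, samplerate, cutoff_hz=2000):
    # Fraction of the magnitude spectrum that sits above cutoff_hz
    spectrum = np.abs(np.fft.rfft(signal.astype(np.float64)))
    freqs = np.fft.rfftfreq(len(signal), d=1 / samplerate)
    return spectrum[freqs > cutoff_hz].sum() / spectrum.sum()

_, _, lowpass = pca_reduce(tabulasa_left, 32, 1024)
print(high_freq_energy_fraction(tabulasa_left, samplerate))                 # original signal
print(high_freq_energy_fraction(lowpass[:len(tabulasa_left)], samplerate))  # 32-component reconstruction

If the reasoning above holds, the second number should come out much smaller than the first.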

+

A More Realistic Example

So we've seen the edge cases so far: Using a large number of components results in audio very close to the original, and using a small number of components acts as a low-pass filter. How about we develop something that sounds "good enough" in practice, that we can use as a benchmark for size? We'll use ourselves as judges of audio quality, and build another function to help us estimate how much space we need to store everything in.

+ +
+
+
+
+
+
In [6]:
+
+
+
from bz2 import compress
+import pandas as pd
+
+def raw_estimate(transformed, pca):
+    # We assume that we'll be storing things as 16-bit WAV,
+    # meaning two bytes per sample
+    signal_bytes = transformed.tobytes()
+    # PCA stores the components as floating point, we'll assume
+    # that means 32-bit floats, so 4 bytes per element
+    component_bytes = transformed.tobytes()
+    
+    # Return a result in megabytes
+    return (len(signal_bytes) + len(component_bytes)) / (2**20)
+
+# Do an estimate for lossless compression applied on top of our
+# PCA reduction
+def bz2_estimate(transformed, pca):
+    bytestring = transformed.tobytes() + b';' + pca.components_.tobytes()
+    compressed = compress(bytestring)
+    return len(compressed) / (2**20)
+
+compression_attempts = [
+    (1, 1),
+    (1, 2),
+    (1, 4),
+    (4, 32),
+    (16, 256),
+    (32, 256),
+    (64, 256),
+    (128, 1024),
+    (256, 1024),
+    (512, 1024),
+    (128, 2048),
+    (256, 2048),
+    (512, 2048),
+    (1024, 2048)
+]
+
+def build_estimates(signal, n_components, block_size):
+    pca, transformed, recon = pca_reduce(tabulasa_left, n_components, block_size)
+    raw_pca_estimate = raw_estimate(transformed, pca)
+    bz2_pca_estimate = bz2_estimate(transformed, pca)
+    raw_size = len(recon.tobytes()) / (2**20)
+    return raw_size, raw_pca_estimate, bz2_pca_estimate
+
+pca_compression_results = pd.DataFrame([
+        build_estimates(tabulasa_left, n, bs)
+        for n, bs in compression_attempts
+    ])
+
+pca_compression_results.columns = ["Raw", "PCA", "PCA w/ BZ2"]
+pca_compression_results.index = compression_attempts
+pca_compression_results
+
+ +
+
+
+ +
+
+ + +
Out[6]:
+ +
+
(n_components, block_size)    Raw (MB)     PCA (MB)     PCA w/ BZ2 (MB)
(1, 1)                        69.054298    138.108597   16.431797
(1, 2)                        69.054306    69.054306    32.981380
(1, 4)                        69.054321    34.527161    16.715032
(4, 32)                       69.054443    17.263611    8.481735
(16, 256)                     69.054688    8.631836     4.274846
(32, 256)                     69.054688    17.263672    8.542909
(64, 256)                     69.054688    34.527344    17.097543
(128, 1024)                   69.054688    17.263672    9.430644
(256, 1024)                   69.054688    34.527344    18.870387
(512, 1024)                   69.054688    69.054688    37.800940
(128, 2048)                   69.062500    8.632812     6.185015
(256, 2048)                   69.062500    17.265625    12.366942
(512, 2048)                   69.062500    34.531250    24.736506
(1024, 2048)                  69.062500    69.062500    49.517493
+
+
+ +
+ +
+
+ +
+
+
+
+
+
+

As we can see, there are a couple of instances where we do nearly 20 times better on storage space than the uncompressed file. Let's hear what that sounds like:

+ +
+
+
+
+
+
In [7]:
+
+
+
_, _, reconstructed = pca_reduce(tabulasa_left, 16, 256)
+Audio(data=reconstructed[start:end], rate=samplerate)
+
+ +
+
+
+ +
+
+ + +
Out[7]:
+ +
+ + + +
+ +
+ +
+
+ +
+
+
+
+
+
+

It sounds incredibly poor though. Let's try something that's a bit more realistic:

+ +
+
+
+
+
+
In [8]:
+
+
+
_, _, reconstructed = pca_reduce(tabulasa_left, 1, 4)
+Audio(data=reconstructed[start:end], rate=samplerate)
+
+ +
+
+
+ +
+
+ + +
Out[8]:
+ +
+ + + +
+ +
+ +
+
+ +
+
+
+
+
+
+

And just out of curiosity, we can try something that has the same ratio of components to block size. This should be close to an apples-to-apples comparison.

+ +
+
+
+
+
+
In [9]:
+
+
+
_, _, reconstructed = pca_reduce(tabulasa_left, 64, 256)
+Audio(data=reconstructed[start:end], rate=samplerate)
+
+ +
+
+
+ +
+
+ + +
Out[9]:
+ +
+ + + +
+ +
+ +
+
+ +
+
+
+
+
+
+

The smaller block size definitely has better high-end response, but I personally think the larger block size sounds better overall.

+ +
+
+
+
+
+
+
+
+

Conclusions

So, what do I think about audio compression using PCA?

+

Strangely enough, it actually works pretty well relative to what I expected. That said, it's a terrible idea in general.

+

First off, you don't really save any space. The component matrix needed to actually run the PCA algorithm takes up a lot of space on its own, so it's very difficult to save space without sacrificing a huge amount of audio quality. And even then, codecs like AAC sound very nice even at bitrates that this PCA method could only dream of.

+

Second, there's the issue of audio streaming. PCA relies on two components: the datastream, and a matrix used to reconstruct the original signal. While it is easy to stream the data, you can't stream that matrix. And even if you divided the stream up into small blocks to give you a small matrix, you must guarantee that the matrix arrives; if you don't have that matrix, the data stream will make no sense whatsoever.

+

All said, this was an interesting experiment. It's really cool seeing PCA used for signal analysis where I haven't seen it applied before, but I don't think it will lead to any practical results. Look forward to more signal processing stuff in the future!

+ +
+
+

+ + +
+
+ + +
@@ -87,6 +683,7 @@
    +
diff --git a/autocallable-bonds.html b/autocallable-bonds.html index 7650197..152d4b9 100644 --- a/autocallable-bonds.html +++ b/autocallable-bonds.html @@ -4,20 +4,22 @@ - + - + Autocallable Bonds - Bradlee Speice - - - - + + + + + + @@ -26,6 +28,17 @@ + + + @@ -39,7 +52,7 @@
@@ -54,12 +67,12 @@

Autocallable Bonds

-

Bradlee Speice, Fri 27 November 2015, Blog

+

Bradlee Speice, Fri 27 November 2015, Blog

-finance, monte carlo, simulation

+finance, monte carlo, simulation

@@ -76,7 +89,3778 @@

My only non-core class this semester has been in Structured Products. We've been surveying a wide variety of products, and the final project was to pick one to report on. Because these are all very similar, we decided to demonstrate all 3 products at once.

What follows below is a notebook demonstrating the usage of Julia for Monte-Carlo simulation of some exotic products.

-

{% notebook 2015-11-27-autocallable.ipynb language[julia] %}

+

+

+
+
In [1]:
+
+
+
using Gadfly
+
+ +
+
+
+ +
+
+
+
+
+
+

Athena/Phoenix Simulation

Underlying simulation

In order to price the autocallable bonds, we need to simulate the underlying assets. Let's go ahead and set up the simulation first, as this lays the foundation for what we're trying to do. We're going to use JNJ as the basis for our simulation. This implies the following parameters:

+
    +
  • $S_0$ = \$102.2 (as of time of writing)
  • +
  • $q$ = 2.84%
  • +
  • $r$ = [.49, .9, 1.21, 1.45, 1.69] (term structure as of time of writing, linear interpolation)
  • +
  • $\mu$ = $r - q$ (note that this implies a negative drift because of current low rates)
  • +
  • $\sigma$ = $\sigma_{imp}$ = 15.62% (from VIX implied volatility)
  • +
+

We additionally define some parameters for simulation:

+
    +
  • T: The number of years to simulate
  • +
  • m: The number of paths to simulate
  • +
  • n: The number of steps to simulate in a year
  • +
+ +
+
+
+
+
+
In [2]:
+
+
+
S0 = 102.2
+nominal = 100
+q = 2.84 / 100
+Ļƒ = 15.37 / 100
+term = [0, .49, .9, 1.21, 1.45, 1.69] / 100 + 1
+
+###
+# Potential: Based on PEP
+# S0 = 100.6
+# Ļƒ = 14.86
+# q = 2.7
+###
+
+# Simulation parameters
+T = 5 # Using years as the unit of time
+n = 250 # simulations per year
+m = 100000 # paths
+num_simulations = 5; # simulation rounds per price
+
+ +
+
+
+ +
+
+ + +
Out[2]:
+ + +
+
5
+
+ +
+ +
+
+ +
+
+
+
+
+
+

Defining the simulation

To make things simpler, we simulate a single year at a time. This allows us to easily add in a dividend policy without too much difficulty, and update the simulation every year to match the term structure. The underlying uses GBM for simulation between years.
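Written out, the per-step update in simulate_gbm below is the usual discretized geometric Brownian motion, with $Z \sim N(0,1)$ drawn independently for each path and $t = T/n$ the step size:

$S_{i+1} = S_i \, \exp\left((\mu - \sigma^2/2)\,t\right) \exp\left(\sigma \sqrt{t}\, Z\right)$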

+ +
+
+
+
+
+
In [3]:
+
+
+
simulate_gbm = function(S0, Ī¼, Ļƒ, T, n)
+    # Set the initial state
+    m = length(S0)
+    t = T / n
+    motion = zeros(m, n)
+    motion[:,1] = S0
+    
+    # Build out all states
+    for i=1:(n-1)
+        motion[:,i+1] = motion[:,i] .* exp((Ī¼ - Ļƒ^2/2)*t) .* exp(sqrt(t) * Ļƒ .* randn(m))
+    end
+    
+    return motion
+end
+
+function display_motion(motion, T)
+    # Given a matrix of paths, display the motion
+    n = length(motion[1,:])
+    m = length(motion[:,1])
+    x = repmat(1:n, m)
+    
+    # Calculate the ticks we're going to use. We'd like to
+    # have an xtick every month, so calculate where those
+    # ticks will actually be at.
+    if (T > 3)
+        num_ticks = T
+        xlabel = "Years"
+    else
+        num_ticks = T * 12
+        xlabel = "Months"
+    end
+    tick_width = n / num_ticks
+    x_ticks = []
+    for i=1:round(num_ticks)
+        x_ticks = vcat(x_ticks, i*tick_width)
+    end
+    
+    # Use one color for each path. I'm not sure if there's
+    # a better way to do this without going through DataFrames
+    colors = []
+    for i = 1:m
+        colors = vcat(colors, ones(n)*i)
+    end
+    
+    plot(x=x, y=motion', color=colors, Geom.line,
+    Guide.xticks(ticks=x_ticks, label=false),
+    Guide.xlabel(xlabel),
+    Guide.ylabel("Value"))
+end;
+
+ +
+
+
+ +
+
+
+
+
+
+

Example simulation

Let's go ahead and run a sample simulation to see what the functions got us!

+ +
+
+
+
+
+
In [4]:
+
+
+
initial = ones(5) * S0
+# Using Ī¼=0, T=.25 for now, we'll use the proper values later
+motion = simulate_gbm(initial, 0, Ļƒ, .25, 200) 
+
+display_motion(motion, .25)
+
+ +
+
+
+ +
+
+ + +
Out[4]:
+ +
+[Gadfly plot: five simulated GBM price paths over roughly three months; x-axis "Months", y-axis "Value", one colored line per path]
+ +
+ +
+
+ +
+
+
+
+
+
+

Computing the term structure

Now that we've got the basic motion set up, let's start making things a bit more sophisticated for the model. We're going to assume that the drift of the stock is the difference between the implied forward rate and the quarterly dividend rate.

+

We're given the yearly term structure, and need to calculate the quarterly forward rate to match this structure. The term structure is assumed to follow:

+

$d(0, t) = d(0,t-1)\cdot f_{i-1, i}$

+

Where $f_{i-1, i}$ is the quarterly forward rate.
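Rearranging that relation gives the forward factor directly from adjacent points on the term structure, which is all the forward_term function below computes:

$f_{i-1, i} = d(0, i) \, / \, d(0, i-1)$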

+ +
+
+
+
+
+
In [5]:
+
+
+
forward_term = function(yearly_term)
+    # It is assumed that we have a yearly term structure passed in, and starts at year 0
+    # This implies a nominal rate above 0 for the first year!
+    years = length(term)-1 # because we start at 0
+    structure = [(term[i+1] / term[i]) for i=1:years]
+end;
+
+ +
+
+
+ +
+
+
+
+
+
+

Illustrating the term structure

Now that we've got our term structure, let's validate that we're getting the correct results! If we've done this correctly, then:

+ +
term[2] == term[1] * structure[1]
+ +
+
+
+
+
+
In [6]:
+
+
+
# Example term structure taken from:
+# http://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield
+# Linear interpolation used years in-between periods, assuming real-dollar
+# interest rates
+forward_yield = forward_term(term)
+calculated_term2 = term[1] * forward_yield[1]
+
+println("Actual term[2]: $(term[2]); Calculated term[2]: $(calculated_term2)")
+
+ +
+
+
+ +
+
+ + +
+
+
Actual term[2]: 1.0049; Calculated term[2]: 1.0049
+
+
+
+ +
+
+ +
+
+
+
+
+
+

The full underlying simulation

Now that we have the term structure set up, we can actually start doing some real simulation! Let's construct some paths through the full 5-year time frame. In order to do this, we will simulate 1 year at a time, and use the forward rates at those times to compute the drift. Thus, there will be 5 total simulations batched together.

+ +
+
+
+
+
+
In [7]:
+
+
+
full_motion = ones(5) * S0
+full_term = vcat(term[1], forward_yield)
+for i=1:T
+    Ī¼ = (full_term[i] - 1 - q)
+    year_motion = simulate_gbm(full_motion[:,end], Ī¼, Ļƒ, 1, n)
+    full_motion = hcat(full_motion, year_motion)
+end
+
+display_motion(full_motion, T)
+
+ +
+
+
+ +
+
+ + +
Out[7]:
+ +
+[Gadfly plot: the five simulated paths over the full five-year horizon; x-axis "Years", y-axis "Value", one colored line per path]
+ +
+ +
+
+ +
+
+
+
+
+
+

Final simulation

We're now going to actually build out the full motion that we'll use for computing the pricing of our autocallable products. It will be largely the same, but we will use far more sample paths for the simulation.

+ +
+
+
+
+
+
In [8]:
+
+
+
full_simulation = function(S0, T, n, m, term)
+    forward = vcat(term[1], forward_term(term))
+
+    # And an S0 to kick things off.
+    final_motion = ones(m) * S0
+    for i=1:T
+        Ī¼ = (forward[i] - 1 - q)
+        year_motion = simulate_gbm(final_motion[:,end], Ī¼, Ļƒ, 1, n)
+        final_motion = hcat(final_motion, year_motion)
+    end
+    return final_motion
+end
+
+tic()
+full_simulation(S0, T, n, m, term)
+time = toq()
+@printf("Time to run simulation: %.2fs", time)
+
+ +
+
+
+ +
+
+ + +
+
+
Time to run simulation: 5.34s
+
+
+ +
+
+ +
+
+
+
+
+
+

Athena Simulation

Now that we've defined our underlying simulation, let's actually try and price an Athena note. Athena has the following characteristics:

+
• Automatically called if the underlying is above the call barrier at observation
• Accelerated coupon paid if the underlying is above the call barrier at observation
  • The coupon paid is $c \cdot i$ with $i$ as the current year, and $c$ the coupon rate
• Principal protection up until a protection barrier at observation; all principal is at risk if this barrier is not met
• Observed yearly
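In symbols, the payoff the pricing code below implements (with $N$ the nominal, $c$ the yearly coupon amount, $f_j$ the forward factors, and $B_{call}$, $B_{prot}$ the call and protection barriers) is:

• Called at the first observation $i$ with $S_i \ge B_{call}$: pay $(N + c\,i)\,e^{(\prod_{j=i}^{T} f_j - 1)(T-i)}$
• Never called and $S_T > B_{prot}$: pay $N$
• Never called and $S_T \le B_{prot}$: pay $N \cdot S_T / S_0$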
+
+
+
+
+
In [9]:
+
+
+
call_barrier = S0
+strike = S0
+protection_barrier = S0 * .6
+coupon = nominal * .07
+
+price_athena = function(initial_price, year_prices, call_barrier,
+        protection_barrier, coupon, forward_structure)
+
+    total_coupons = 0
+    
+    t = length(year_prices)
+
+    for i=1:t
+        price = year_prices[i]
+        if price ā‰„ call_barrier
+            return (nominal + coupon*i) * exp((prod(forward_structure[i:end])-1)*(t-i))
+        end
+    end
+
+    # We've reached maturity, time to check capital protection
+    if year_prices[end] > protection_barrier
+        return nominal
+    else
+        put = (strike - year_prices[end]) / strike
+        return nominal*(1-put)
+    end
+end
+
+forward_structure = forward_term(term)
+price_function = (year_prices) -> price_athena(S0, year_prices,
+    call_barrier, protection_barrier, coupon, forward_structure)
+
+athena = function()
+    year_indexes = [n*i for i=1:T]
+    motion = full_simulation(S0, T, n, m, term)
+    payoffs = [price_function(motion[i, year_indexes]) for i=1:m]
+    return mean(payoffs)
+end
+
+mean_payoffs = zeros(num_simulations)
+for i=1:num_simulations
+    tic()
+    mean_payoffs[i] = athena()
+    time = toq()
+    @printf("Mean of simulation %i: \$%.4f; Simulation time: %.2fs\n", i, mean_payoffs[i], time)
+end
+
+final_mean = mean(mean_payoffs)
+println("Mean over $num_simulations simulations: $(mean(mean_payoffs))")
+pv = final_mean * (exp(-(prod(forward_structure)-1)*T))
+@printf("Present value of Athena note: \$%.2f, notional: \$%.2f", pv, nominal)
+
+ +
+
+
+ +
+
+ + +
+
+
Mean of simulation 1: $103.2805; Simulation time: 5.59s
+Mean of simulation 2: $103.3796; Simulation time: 5.05s
+Mean of simulation 3: $103.4752; Simulation time: 5.18s
+Mean of simulation 4: $103.4099; Simulation time: 5.37s
+Mean of simulation 5: $103.3260; Simulation time: 5.32s
+Mean over 5 simulations: 103.37421610015554
+Present value of Athena note: $95.00, notional: $100.00
+
+
+ +
+
+ +
+
+
+
+
+
+

Phoenix without Memory Simulation

Let's move into pricing a Phoenix without memory. It's very similar to the Athena product, with the exception that we introduce a coupon barrier so coupons are paid even when the underlying is below the initial price.

+

The Phoenix product has the following characteristics (example here):

+
• Automatically called if the underlying is above the call barrier at observation
• Coupon paid if the underlying is above a coupon barrier at observation
• Principal protection up until a protection barrier at observation; all principal is at risk if this barrier is not met
• Observed yearly
+

Some example paths (all assume a call barrier at the current price, and a coupon barrier some level below that):

+
    +
  • At the end of year 1, the stock is above the call barrier; the note is called and you receive the value of the stock plus the coupon being paid.
  • +
  • At the end of year 1, the stock is above the coupon barrier, but not the call barrier; you receive the coupon. At the end of year 2, the stock is below the coupon barrier; you receive nothing. At the end of year 3, the stock is above the call barrier; the note is called and you receive the value of the stock plus a coupon for year 3.
  • +
+

We're going to re-use the same simulation, with the following parameters:

+
    +
  • Call barrier: 100%
  • +
  • Coupon barrier: 70%
  • +
  • Coupon: 6%
  • +
  • Capital protection until 70% (at maturity)
  • +
+ +
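The pricing code below differs from Athena only in how coupons accrue. As a sketch, with $C_i$ the coupons accrued after observation $i$ (starting from $C_0 = 0$) and $B_{coupon}$ the coupon barrier:

• $S_i \ge B_{call}$: pay $(N + c + C_{i-1})\,e^{(\prod_{j=i}^{T} f_j - 1)(T-i)}$ and stop
• $B_{coupon} \le S_i < B_{call}$: $C_i = C_{i-1}\,e^{f_i - 1} + c$
• $S_i < B_{coupon}$: $C_i = C_{i-1}\,e^{f_i - 1}$
• At maturity, if never called: pay $N + C_T$ when $S_T > B_{prot}$, otherwise $N \cdot S_T / S_0$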
+
+
+
+
+
In [10]:
+
+
+
call_barrier = S0
+coupon_barrier = S0 * .8
+protection_barrier = S0 * .6
+coupon = nominal * .06
+
+price_phoenix_no_memory = function(initial_price, year_prices, call_barrier, coupon_barrier,
+        protection_barrier, coupon, forward_structure)
+
+    total_coupons = 0
+    t = length(year_prices)
+
+    for i=1:t
+        price = year_prices[i]
+        if price ā‰„ call_barrier
+            return (nominal + coupon + total_coupons)*exp((prod(forward_structure[i:end])-1)*(t-i))
+        elseif price ā‰„ coupon_barrier
+            total_coupons = total_coupons * exp(forward_structure[i]-1) + coupon
+        else
+            total_coupons *= exp(forward_structure[i]-1)
+        end
+    end
+
+    # We've reached maturity, time to check capital protection
+    if year_prices[end] > protection_barrier
+        return nominal + total_coupons
+    else
+        put = (strike - year_prices[end]) / strike
+        return nominal*(1-put)
+    end
+end
+
+forward_structure = forward_term(term)
+price_function = (year_prices) -> price_phoenix_no_memory(S0, year_prices,
+    call_barrier, coupon_barrier, protection_barrier, coupon, forward_structure)
+
+phoenix_no_memory = function()
+    year_indexes = [n*i for i=1:T]
+    motion = full_simulation(S0, T, n, m, term)
+    payoffs = [price_function(motion[i, year_indexes]) for i=1:m]
+    return mean(payoffs)
+end
+
+mean_payoffs = zeros(num_simulations)
+for i=1:num_simulations
+    tic()
+    mean_payoffs[i] = phoenix_no_memory()
+    time = toq()
+    @printf("Mean of simulation %i: \$%.4f; Simulation time: %.2fs\n", i, mean_payoffs[i], time)
+end
+
+final_mean = mean(mean_payoffs)
+println("Mean over $num_simulations simulations: $(mean(mean_payoffs))")
+pv = final_mean * exp(-(prod(forward_structure)-1)*(T))
+@printf("Present value of Phoenix without memory note: \$%.2f", pv)
+
+ +
+
+
+ +
+
+ + +
+
+
Mean of simulation 1: $106.0562; Simulation time: 5.72s
+Mean of simulation 2: $106.0071; Simulation time: 5.85s
+Mean of simulation 3: $105.9959; Simulation time: 5.87s
+Mean of simulation 4: $106.0665; Simulation time: 5.93s
+Mean of simulation 5: $106.0168; Simulation time: 5.81s
+Mean over 5 simulations: 106.02850857209883
+Present value of Phoenix without memory note: $97.44
+
+
+ +
+
+ +
+
+
+
+
+
+

Phoenix with Memory Simulation

The Phoenix with Memory structure is very similar to the Phoenix, but as the name implies, has a special "memory" property: It remembers any coupons that haven't been paid at prior observation times, and pays them all if the underlying crosses the coupon barrier. For example:

+
    +
  • Note issued with 100% call barrier, 70% coupon barrier. At year 1, the underlying is at 50%, so no coupons are paid. At year 2, the underlying is at 80%, so coupons for both year 1 and 2 are paid, resulting in a double coupon.
  • +
+

You can also find an example here.

+

Let's go ahead and set up the simulation! The parameters will be the same, but we can expect that the value will go up because of the memory attribute.
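The only new ingredient in the code below is the coupon term: at an observation $i$ where the coupon barrier is crossed, the payment is $(i - \ell)\,c$ instead of $c$, where $\ell$ is the last observation at which a coupon was paid (zero if there hasn't been one).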

+ +
+
+
+
+
+
In [11]:
+
+
+
call_barrier = S0
+coupon_barrier = S0 * .8
+protection_barrier = S0 * .6
+coupon = nominal * .07
+
+price_phoenix_with_memory = function(initial_price, year_prices, call_barrier,
+    coupon_barrier, protection_barrier, coupon, forward_structure)
+
+    last_coupon = 0
+    total_coupons = 0
+    
+    t = length(year_prices)
+
+    for i=1:t
+        price = year_prices[i]
+        if price > call_barrier
+            return (nominal + coupon + total_coupons)*exp((prod(forward_structure[i:end])-1)*(t-i))
+        elseif price > coupon_barrier
+            ####################################################################
+            # The only difference between with/without memory is the below lines
+            memory_coupons = (i - last_coupon) * coupon
+            last_coupon = i
+            total_coupons = total_coupons * exp(forward_structure[i]-1) + memory_coupons
+            ####################################################################
+        else
+            total_coupons *= exp(forward_structure[i]-1)
+        end
+    end
+
+    # We've reached maturity, time to check capital protection
+    if year_prices[end] > protection_barrier
+        return nominal + total_coupons
+    else
+        put = (strike - year_prices[end]) / strike
+        return nominal*(1-put)
+    end
+end
+
+forward_structure = forward_term(term)
+price_function = (year_prices) -> price_phoenix_with_memory(S0, year_prices,
+    call_barrier, coupon_barrier, protection_barrier, coupon, forward_structure)
+
+phoenix_with_memory = function()
+    year_indexes = [n*i for i=1:T]
+    motion = full_simulation(S0, T, n, m, term)
+    payoffs = [price_function(motion[i, year_indexes]) for i=1:m]
+    return mean(payoffs)
+end
+
+mean_payoffs = zeros(num_simulations)
+for i=1:num_simulations
+    tic()
+    mean_payoffs[i] = phoenix_with_memory()
+    time = toq()
+    @printf("Mean of simulation %i: \$%.4f; Simulation time: %.2fs\n",
+        i, mean_payoffs[i], time)
+end
+
+final_mean = mean(mean_payoffs)
+println("Mean over $num_simulations simulations: $(mean(mean_payoffs))")
+pv = final_mean * exp(-(prod(forward_structure)-1)*(T))
+@printf("Present value of Phoenix with memory note: \$%.2f", pv)
+
+ +
+
+
+ +
+
+ + +
+
+
Mean of simulation 1: $108.8612; Simulation time: 5.89s
+Mean of simulation 2: $109.0226; Simulation time: 5.90s
+Mean of simulation 3: $108.9175; Simulation time: 5.92s
+Mean of simulation 4: $108.9426; Simulation time: 5.94s
+Mean of simulation 5: $108.8087; Simulation time: 6.06s
+Mean over 5 simulations: 108.91052564051816
+Present value of Phoenix with memory note: $100.09
+
+
+ +
+
+ +

@@ -84,6 +3868,20 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}}); +
+
+ + +
@@ -95,6 +3893,7 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
    +
diff --git a/cloudy-in-seattle.html b/cloudy-in-seattle.html index 2bc7459..6f61324 100644 --- a/cloudy-in-seattle.html +++ b/cloudy-in-seattle.html @@ -4,20 +4,22 @@ - + - + Cloudy in Seattle - Bradlee Speice - - - - + + + + + + @@ -26,6 +28,17 @@ + + + @@ -39,7 +52,7 @@
@@ -54,12 +67,12 @@

Cloudy in Seattle

-

Bradlee Speice, Sat 23 January 2016, Blog

+

Bradlee Speice, Sat 23 January 2016, Blog

-data science, weather

+data science, weather

@@ -73,7 +86,738 @@
-

{% notebook 2016-1-23-cloudy-in-seattle.ipynb %}

+

+

+
+
In [1]:
+
+
+
import pickle
+import pandas as pd
+import numpy as np
+from bokeh.plotting import output_notebook, figure, show
+from bokeh.palettes import RdBu4 as Palette
+from datetime import datetime
+import warnings
+
+output_notebook()
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + + + +
+ + BokehJS successfully loaded. +
+
+ +
+ +
+
+ +
+
+
+
+
+
+

After taking some time to explore how the weather in North Carolina stacked up over the past years, I was interested in doing the same analysis for other cities. Growing up with family from Binghamton, NY I was always told it was very cloudy there. And Seattle has a nasty reputation for being very depressing and cloudy. All said, the cities I want to examine are:

+
    +
  • Binghamton, NY
  • +
  • Cary, NC
  • +
  • Seattle, WA
  • +
  • New York City, NY
  • +
+

I'd be interested to try this analysis worldwide at some point - comparing London and Seattle might be an interesting follow-up. For now though, we'll stick with trying out the US data.

+

There will be plenty of charts. I want to know: How has average cloud cover and precipitation chance changed over the years for each city mentioned? This will hopefully tell us whether Seattle has actually earned its reputation for being a depressing city.

+ +
+
+
+
+
+
In [2]:
+
+
+
city_forecasts = pickle.load(open('city_forecasts.p', 'rb'))
+forecasts_df = pd.DataFrame.from_dict(city_forecasts)
+
+ +
+
+
+ +
+
+
+
In [3]:
+
+
+
cities = ['binghamton', 'cary', 'nyc', 'seattle']
+city_colors = {cities[i]: Palette[i] for i in range(0, 4)}
+
+def safe_cover(frame):
+    if frame and 'cloudCover' in frame:
+        return frame['cloudCover']
+    else:
+        return np.NaN
+
+def monthly_avg_cloudcover(city, year, month):
+    dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
+                             end=datetime(year, month + 1, 1, 12),
+                             freq='D', closed='left')
+    cloud_cover_vals = list(map(lambda x: safe_cover(forecasts_df[city][x]['currently']), dates))
+    cloud_cover_samples = len(list(filter(lambda x: x is not np.NaN, cloud_cover_vals)))
+    # Ignore an issue with nanmean having all NaN values. We'll discuss the data issues below.
+    with warnings.catch_warnings():
+        warnings.simplefilter('ignore')
+        return np.nanmean(cloud_cover_vals), cloud_cover_samples
+
+ +
+
+
+ +
+
+
+
In [4]:
+
+
+
years = range(1990, 2016)
+def city_avg_cc(city, month):
+    return [monthly_avg_cloudcover(city, y, month) for y in years]
+
+months = [
+    ('July', 7),
+    ('August', 8),
+    ('September', 9),
+    ('October', 10),
+    ('November', 11)
+]
+
+for month, month_id in months:
+    month_averages = {city: city_avg_cc(city, month_id) for city in cities}
+    f = figure(title="{} Average Cloud Cover".format(month),
+               x_axis_label='Year',
+               y_axis_label='Cloud Cover Percentage')
+    for city in cities:
+        f.line(years, [x[0] for x in month_averages[city]],
+               legend=city, color=city_colors[city])
+    show(f)
+
+ +
+
+
+ +
+
+ + +
+ +
+ +
+ + +
+ +
+ +
+ +
+ +
+ + +
+ +
+ +
+ +
+ +
+ + +
+ +
+ +
+ +
+ +
+ + +
+ +
+ +
+ +
+ +
+ + +
+ +
+ +
+
+ +
+
+
+
+
+
+

Well, as it so happens it looks like there are some data issues. July's data is a bit sporadic, and 2013 seems to be missing from most months as well. I think only two things can really be confirmed here:

+
    +
  • Seattle, specifically for the months of October and November, is in fact significantly more cloudy on average than are other cities
  • +
  • All cities surveyed have seen average cloud cover decline over the months studied. There are data issues, but the trend seems clear.
  • +
+

Let's now move from cloud cover data to looking at average rainfall chance.

+ +
+
+
+
+
+
In [5]:
+
+
+
def safe_precip(frame):
+    if frame and 'precipProbability' in frame:
+        return frame['precipProbability']
+    else:
+        return np.NaN
+
+def monthly_avg_precip(city, year, month):
+    dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
+                             end=datetime(year, month + 1, 1, 12),
+                             freq='D', closed='left')
+    precip_vals = list(map(lambda x: safe_precip(forecasts_df[city][x]['currently']), dates))
+    precip_samples = len(list(filter(lambda x: x is not np.NaN, precip_vals)))
+    # Ignore an issue with nanmean having all NaN values. We'll discuss the data issues below.
+    with warnings.catch_warnings():
+        warnings.simplefilter('ignore')
+        return np.nanmean(precip_vals), precip_samples
+
+def city_avg_precip(city, month):
+    return [monthly_avg_precip(city, y, month) for y in years]
+
+for month, month_id in months:
+    month_averages = {city: city_avg_precip(city, month_id) for city in cities}
+    f = figure(title="{} Average Precipitation Chance".format(month),
+               x_axis_label='Year',
+               y_axis_label='Precipitation Chance Percentage')
+    for city in cities:
+        f.line(years, [x[0] for x in month_averages[city]],
+               legend=city, color=city_colors[city])
+    show(f)
+
+ +
+
+
+ +
+
+ + +
+ +
+ +
+ + +
+ +
+ +
+ +
+ +
+ + +
+ +
+ +
+ +
+ +
+ + +
+ +
+ +
+ +
+ +
+ + +
+ +
+ +
+ +
+ +
+ + +
+ +
+ +
+
+ +
+
+
+
+
+
+

The same data issue caveats apply here: 2013 seems to be missing some data, and July has some issues as well. However, this seems to confirm the trends we saw with cloud cover:

+
    +
  • Seattle, specifically for the months of August, October, and November has had a consistently higher chance of rain than other cities surveyed.
  • +
  • Average precipitation chance, just like average cloud cover, has been trending down over time.
  • +
+

Conclusion

I have to admit I was a bit surprised after doing this analysis. Seattle showed a higher average cloud cover and average precipitation chance than did the other cities surveyed. Maybe Seattle is actually an objectively more depressing city to live in.

+

Well that's all for weather data at the moment. It's been a great experiment, but I think this is about as far as I'll be able to get with weather data without some domain knowledge. Talk again soon!

+ +
+
+

@@ -81,6 +825,20 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}}); +
+
+ + +
@@ -92,6 +850,7 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
    +
diff --git a/complaining-about-the-weather.html b/complaining-about-the-weather.html index 76a8ef6..5b2c289 100644 --- a/complaining-about-the-weather.html +++ b/complaining-about-the-weather.html @@ -4,20 +4,22 @@ - + - + Complaining about the Weather - Bradlee Speice - - - - + + + + + + @@ -26,6 +28,17 @@ + + + @@ -39,7 +52,7 @@
@@ -54,12 +67,12 @@

Complaining about the Weather

-

Bradlee Speice, Fri 01 January 2016, Blog

+

Bradlee Speice, Fri 01 January 2016, Blog

-weather

+weather

@@ -73,7 +86,758 @@
-

{% notebook 2016-1-1-complaining-about-weather.ipynb %}

+

+

+
+
In [1]:
+
+
+
from bokeh.plotting import figure, output_notebook, show
+from bokeh.palettes import PuBuGn9 as Palette
+import pandas as pd
+import numpy as np
+from datetime import datetime
+import pickle
+
+output_notebook()
+
+ +
+
+
+ +
+
+ + +
+ +
+ + + + + +
+ + BokehJS successfully loaded. +
+
+ +
+ +
+
+ +
+
+
+
+
+
+

I'm originally from North Carolina, and I've been hearing a lot of people talking about how often it's been raining recently. They're excited for any day that has sun.

+

So I got a bit curious: Has North Carolina over the past few months actually had more cloudy and rainy days recently than in previous years? This shouldn't be a particularly challenging task, but I'm interested to know if people's perceptions actually reflect reality.

+

The data we'll use comes from forecast.io, since they can give us a cloud cover percentage. I've gone ahead and retrieved the data to a pickle file, and included the code that was used to generate it. First up: What was the average cloud cover in North Carolina during August - November, and how many days were cloudy? We're going to assume that a "cloudy" day is defined as any day in which the cloud cover is above 50%.

+ +
+
+
+
+
+
In [2]:
+
+
+
city_forecasts = pickle.load(open('city_forecasts.p', 'rb'))
+forecast_df = pd.DataFrame.from_dict(city_forecasts)
+
+ +
+
+
+ +
+
+
+
In [3]:
+
+
+
cary_forecast = forecast_df['cary']
+years = range(1990, 2016)
+months = range(7, 12)
+months_str = ['July', 'August', 'September', 'October', 'November']
+
+def safe_cover(frame):
+    if frame and 'cloudCover' in frame:
+        return frame['cloudCover']
+    else:
+        return np.NaN
+
+def monthly_avg_cloudcover(year, month):
+    dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
+                             end=datetime(year, month + 1, 1, 12),
+                             freq='D', closed='left')
+    cloud_cover_vals = list(map(lambda x: safe_cover(cary_forecast[x]['currently']), dates))
+    cloud_cover_samples = len(list(filter(lambda x: x is not np.NaN, cloud_cover_vals)))
+    return np.nanmean(cloud_cover_vals), cloud_cover_samples
+
+
+monthly_cover_vals = [[monthly_avg_cloudcover(y, m)[0] for y in years] for m in months]
+
+f = figure(title='Monthly Average Cloud Cover',
+           x_range=(1990, 2015),
+          x_axis_label='Year')
+for x in range(0, len(months)):
+    f.line(years, monthly_cover_vals[x], legend=months_str[x], color=Palette[x])
+show(f)
+
+ +
+
+
+ +
+
+ + +
+ +
+ +
+ + +
+ +
+ +
+
+ +
+
+
+
+
+
+

As we can see from the chart above, on the whole the monthly average cloud cover has been generally trending down over time. The average cloud cover is also lower than it was last year - it seems people are mostly just complaining. There are some data issues that start in 2012 that we need to be aware of - the cloud cover percentage doesn't exist for all days. Even so, the data that we have seems to reflect the wider trend, so we'll assume for now that the missing data doesn't skew our results.

+

There's one more metric we want to check though - how many cloudy days were there? This is probably a better gauge of sentiment than the average monthly cover.

+ +
+
+
+
+
+
In [4]:
+
+
+
def monthly_cloudy_days(year, month):
+    dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
+                             end=datetime(year, month + 1, 1, 12),
+                             freq='D', closed='left')
+    cloud_cover_vals = list(map(lambda x: safe_cover(cary_forecast[x]['currently']), dates))
+    cloud_cover_samples = len(list(filter(lambda x: x is not np.NaN, cloud_cover_vals)))
+    cloudy_days = [cover > .5 for cover in cloud_cover_vals]
+    return np.count_nonzero(cloudy_days), cloud_cover_samples
+
+monthly_days_vals = [[monthly_cloudy_days(y, m)[0] for y in years] for m in months]
+monthly_cover_samples = [[monthly_cloudy_days(y, m)[1] for y in years] for m in months]
+
+f = figure(title='Monthly Cloudy Days',
+           x_range=(1990, 2015),
+          x_axis_label='Year')
+for x in range(0, len(months)):
+    f.line(years, monthly_days_vals[x], legend=months_str[x], color=Palette[x])
+show(f)
+
+f = figure(title='Monthly Cloud Cover Samples',
+          x_range=(1990, 2015),
+          x_axis_label='Year',
+          height=300)
+for x in range(0, len(months)):
+    f.line(years, monthly_cover_samples[x], legend=months_str[x], color=Palette[x])
+show(f)
+
+ +
+
+
+ +
+
+ + +
+ +
+ +
+ + +
+ +
+ +
+ +
+ +
+ + +
+ +
+ +
+
+ +
+
+
+
+
+
+

On the whole, the number of cloudy days seems to reflect the trend with average cloud cover - it's actually becoming more sunny as time progresses. That said, we need to be careful in how we view this number - because there weren't as many samples in 2015 as previous years, the number of days can get thrown off. In context though, even if most days not recorded were in fact cloudy, the overall count for 2015 would still be lower than previous years.

+

Tracking Precipitation Chances

In addition to checking cloud cover, I wanted to check precipitation data as well - what is the average precipitation chance over a month, and how many days during a month is rain likely? The thinking is that days with a high-precipitation chance will also be days in which it is cloudy or depressing.

+ +
+
+
+
+
+
In [5]:
+
+
+
def safe_precip(frame):
+    if frame and 'precipProbability' in frame:
+        return frame['precipProbability']
+    else:
+        return np.NaN
+
+def monthly_avg_precip(year, month):
+    dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
+                             end=datetime(year, month + 1, 1, 12),
+                             freq='D', closed='left')
+    precip_vals = list(map(lambda x: safe_precip(cary_forecast[x]['currently']), dates))
+    precip_samples = len(list(filter(lambda x: x is not np.NaN, precip_vals)))
+    return np.nanmean(precip_vals), precip_samples
+
+monthly_avg_precip_vals = [[monthly_avg_precip(y, m)[0] for y in years] for m in months]
+
+f = figure(title='Monthly Average Precipitation Chance',
+           x_range=(1990, 2015),
+          x_axis_label='Year')
+for x in range(0, len(months)):
+    f.line(years, monthly_avg_precip_vals[x], legend=months_str[x], color=Palette[x])
+show(f)
+
+ +
+
+
+ +
+
+ + +
+ +
+ +
+ + +
+ +
+ +
+
+ +
+
+
+
+
+
+

As we can see from the chart, the average chance of precipitation over a month more or less stays within a band of 0 - .1 for all months over all years. This is further evidence that the past few months are no more cloudy or rainy than previous years. Like the cloud cover though, we still want to get a count of all the rainy days, in addition to the average chance. We'll define a "rainy day" as any day in which the chance of rain is greater than 25%.

+ +
+
+
+
+
+
In [6]:
+
+
+
def monthly_rainy_days(year, month):
+    dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12),
+                             end=datetime(year, month + 1, 1, 12),
+                             freq='D', closed='left')
+    precip_prob_vals = list(map(lambda x: safe_precip(cary_forecast[x]['currently']), dates))
+    precip_prob_samples = len(list(filter(lambda x: x is not np.NaN, precip_prob_vals)))
+    precip_days = [prob > .25 for prob in precip_prob_vals]
+    return np.count_nonzero(precip_days), precip_prob_samples
+
+monthly_precip_days_vals = [[monthly_rainy_days(y, m)[0] for y in years] for m in months]
+monthly_precip_samples = [[monthly_rainy_days(y, m)[1] for y in years] for m in months]
+
+f = figure(title='Monthly Rainy Days',
+           x_range=(1990, 2015),
+          x_axis_label='Year')
+for x in range(0, len(months)):
+    f.line(years, monthly_precip_days_vals[x], legend=months_str[x], color=Palette[x])
+show(f)
+
+f = figure(title='Monthly Rainy Days Samples',
+          x_range=(1990, 2015),
+          x_axis_label='Year',
+          height=300)
+for x in range(0, len(months)):
+    f.line(years, monthly_precip_samples[x], legend=months_str[x], color=Palette[x])
+show(f)
+

After trying to find the number of days that are rainy, we can see that November hit its max value for rainy days in 2015. However, that value is 6, as compared to a previous maximum of 5. While it is a new record, the value isn't actually all that different. And for other months, the values are mostly in-line with the averages.

+

Summary and Conclusions

After having looked at forecast data for Cary, it appears that the weather from July through November this year was at worst on par with prior years, if not slightly sunnier. This seems to be a case of confirmation bias: someone complains about a string of cloudy or rainy days, and suddenly you start noticing them more.

+

While this analysis doesn't take into account other areas of North Carolina, my initial guess would be to assume that other areas also will show similar results: nothing interesting is happening. Maybe that will be for another blog post later!

+

Coming soon: I'll compare rain/cloud conditions in North Carolina to some other places in the U.S.!


Generating the Forecast file

The following code generates the file that was used throughout the blog post. Please note that while I'm retrieving data for other cities to use in a future blog post, only Cary data was used for this post.

+
import pandas as pd
+from functools import reduce
+import requests
+from datetime import datetime
+
+# Coordinate data from http://itouchmap.com/latlong.html
+cary_loc = (35.79154,-78.781117)
+nyc_loc = (40.78306,-73.971249)
+seattle_loc = (47.60621,-122.332071)
+binghamton_loc = (42.098687,-75.917974)
+cities = {
+    'cary': cary_loc,
+    'nyc': nyc_loc,
+    'seattle': seattle_loc,
+    'binghamton': binghamton_loc
+}
+
+apikey = '' # My super-secret API Key
+
+def get_forecast(lat, long, date=None):
+    forecast_base = "https://api.forecast.io/forecast/"
+    if date is None:
+        url = forecast_base + apikey + '/{},{}'.format(lat, long)
+    else:
+        epoch = int(date.timestamp())
+        url = forecast_base + apikey + '/{},{},{}'.format(lat, long, epoch)
+
+    return requests.get(url).json()
+
+years = range(1990,2016)
+# For datetimes, the 12 is for getting the weather at noon.
+# We're doing this over midnight because we're more concerned
+# with what people see, and people don't typically see the weather
+# at midnight.
+dt_indices = [pd.date_range(start=datetime(year, 7, 1, 12),
+                          end=datetime(year, 11, 30, 12))
+              for year in years]
+dt_merge = reduce(lambda x, y: x.union(y), dt_indices)
+
+# Because we have to pay a little bit to use the API, we use for loops here
+# instead of a comprehension - if something breaks, we want to preserve the
+# data already retrieved
+city_forecasts = {}
+for city, loc in cities.items():
+    print("Retrieving data for {} starting at {}".format(city,
+                                                         datetime.now().strftime("%I:%M:%S %p")))
+    for dt in dt_merge:
+        try:
+            city_forecasts[(city, dt)] = get_forecast(*loc, dt)
+        except Exception as e:
+            print(e)
+            city_forecasts[(city, dt)] = None
+print("End forecast retrieval: {}".format(datetime.now().strftime("%I:%M:%S %p")))
+
+import pickle
+pickle.dump(city_forecasts, open('city_forecasts.p', 'wb'))
+
+### Output:
+# Retrieving data for binghamton starting at 05:13:42 PM
+# Retrieving data for seattle starting at 05:30:51 PM
+# Retrieving data for nyc starting at 05:48:30 PM
+# Retrieving data for cary starting at 06:08:32 PM
+# End forecast retrieval: 06:25:21 PM
+

@@ -81,6 +845,20 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}}); +
+
+ + +
@@ -92,6 +870,7 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
    +
diff --git a/event-studies-and-earnings-releases.html b/event-studies-and-earnings-releases.html index 243a1ea..bf9b6fc 100644 --- a/event-studies-and-earnings-releases.html +++ b/event-studies-and-earnings-releases.html @@ -4,20 +4,22 @@ - + - + Event Studies and Earnings Releases - Bradlee Speice - - - - + + + + + + @@ -26,6 +28,17 @@ + + + @@ -39,7 +52,7 @@
@@ -54,12 +67,12 @@

Event Studies and Earnings Releases

-

Bradlee Speice, Wed 08 June 2016, Blog

+

Bradlee Speice, Wed 08 June 2016, Blog

-earnings, event study

+earnings, event study

@@ -75,7 +88,5235 @@
-

{% notebook 2016-6-8-event-studies-and-earnings-releases.ipynb %}

+

+


Or, being suspicious of market insiders.

+
+

Use the button below to show the code I've used to generate this article. Because there is a significant amount more code involved than most other posts I've written, it's hidden by default to allow people to concentrate on the important bits.

In [1]:
from IPython.display import HTML
+
+HTML('''<script>
+code_show=true; 
+function code_toggle() {
+ if (code_show){
+ $('div.input').hide();
+ } else {
+ $('div.input').show();
+ }
+ code_show = !code_show
+} 
+$( document ).ready(code_toggle);
+</script>
+<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
+
Out[1]:

The Market Just Knew

I recently saw two examples of stock charts that have kept me thinking for a while. And now that the semester is complete, I finally have enough time to really look at them and give them the treatment they deserve. The first is good old Apple:

In [2]:
from secrets import QUANDL_KEY
+import matplotlib.pyplot as plt
+from matplotlib.dates import date2num
+from matplotlib.finance import candlestick_ohlc
+from matplotlib.dates import DateFormatter, WeekdayLocator,\
+    DayLocator, MONDAY
+import quandl
+from datetime import datetime
+import pandas as pd
+%matplotlib inline
+
+def fetch_ticker(ticker, start, end):
+    # Quandl is currently giving me issues with returning
+    # the entire dataset and not slicing server-side.
+    # So instead, we'll do it client-side!
+    q_format = '%Y-%m-%d'
+    ticker_data = quandl.get('YAHOO/' + ticker,
+                             start_date=start.strftime(q_format),
+                             end_date=end.strftime(q_format),
+                             authtoken=QUANDL_KEY)
+    return ticker_data
+
+def ohlc_dataframe(data, ax=None):
+    # Much of this code re-used from:
+    # http://matplotlib.org/examples/pylab_examples/finance_demo.html
+    if ax is None:
+        f, ax = plt.subplots()
+    
+    vals = [(date2num(date), *(data.loc[date]))
+            for date in data.index]
+    candlestick_ohlc(ax, vals)
+    
+    mondays = WeekdayLocator(MONDAY)
+    alldays = DayLocator()
+    weekFormatter = DateFormatter('%b %d')
+    ax.xaxis.set_major_locator(mondays)
+    ax.xaxis.set_minor_locator(alldays)
+    ax.xaxis.set_major_formatter(weekFormatter)
+    return ax
+
+AAPL = fetch_ticker('AAPL', datetime(2016, 3, 1), datetime(2016, 5, 1))
+ax = ohlc_dataframe(AAPL)
+plt.vlines(date2num(datetime(2016, 4, 26, 12)),
+           ax.get_ylim()[0], ax.get_ylim()[1],
+           color='b',
+          label='Earnings Release')
+plt.legend(loc=3)
+plt.title("Apple Price 3/1/2016 - 5/1/2016");
+

The second chart is from Facebook:

In [3]:
FB = fetch_ticker('FB', datetime(2016, 3, 1), datetime(2016, 5, 5))
+ax = ohlc_dataframe(FB)
+plt.vlines(date2num(datetime(2016, 4, 27, 12)),
+           ax.get_ylim()[0], ax.get_ylim()[1],
+           color='b', label='Earnings Release')
+plt.title('Facebook Price 3/5/2016 - 5/5/2016')
+plt.legend(loc=2);
+

These two charts demonstrate a very specific phenomenon: how the market prepares for earnings releases. Let's look at those charts again, but with some extra information. As we're about to see, the market "knew" in advance that Apple was going to perform poorly. The market also expected that Facebook was going to perform poorly, and instead Facebook shot the lights out. Let's see that trend in action:

In [4]:
def plot_hilo(ax, start, end, data):
+    ax.plot([date2num(start), date2num(end)],
+            [data.loc[start]['High'], data.loc[end]['High']],
+            color='b')
+    ax.plot([date2num(start), date2num(end)],
+            [data.loc[start]['Low'], data.loc[end]['Low']],
+            color='b')
+
+f, axarr = plt.subplots(1, 2)
+
+ax_aapl = axarr[0]
+ax_fb = axarr[1]
+
+# Plot the AAPL trend up and down
+ohlc_dataframe(AAPL, ax=ax_aapl)
+plot_hilo(ax_aapl, datetime(2016, 3, 1), datetime(2016, 4, 15), AAPL)
+plot_hilo(ax_aapl, datetime(2016, 4, 18), datetime(2016, 4, 26), AAPL)
+ax_aapl.vlines(date2num(datetime(2016, 4, 26, 12)),
+               ax_aapl.get_ylim()[0], ax_aapl.get_ylim()[1],
+               color='g', label='Earnings Release')
+ax_aapl.legend(loc=2)
+ax_aapl.set_title('AAPL Price History')
+
+# Plot the FB trend down and up
+ohlc_dataframe(FB, ax=ax_fb)
+plot_hilo(ax_fb, datetime(2016, 3, 30), datetime(2016, 4, 27), FB)
+plot_hilo(ax_fb, datetime(2016, 4, 28), datetime(2016, 5, 5), FB)
+ax_fb.vlines(date2num(datetime(2016, 4, 27, 12)),
+             ax_fb.get_ylim()[0], ax_fb.get_ylim()[1],
+             color='g', label='Earnings Release')
+ax_fb.legend(loc=2)
+ax_fb.set_title('FB Price History')
+
+f.set_size_inches(18, 6)
+

As we can see above, the market broke Apple's prevailing trend in order to move lower, ultimately anticipating the earnings release. For Facebook, the opposite happened: while the trend was down, the earnings were fantastic and the market corrected itself much higher.


Formulating the Question

While these are two specific examples, there are plenty of other examples you could cite one way or another. Even if the preponderance of evidence shows that the market correctly predicts earnings releases, we need not accuse people of collusion; for a company like Apple with many suppliers we can generally forecast how Apple has done based on those same suppliers.

+

The question then, is this: how well does the market predict the earnings releases? It's an incredibly broad question that I want to dissect in a couple of different ways:

  1. Given a stock that has been trending down over the past N days before an earnings release, how likely is it to continue downward after the release?
  2. Given a stock trending up, how likely is it to continue up?
  3. Is there a difference in accuracy between large- and small-cap stocks?
  4. How often, and for how long, do markets trend before an earnings release?

I want to especially thank Alejandro Saltiel for helping me retrieve the data. He's great. And now for all of the interesting bits.


Event Studies

Before we go too much further, I want to introduce the actual event study. Each chart intends to capture a lot of information and present an easy-to-understand pattern:

In [5]:
import numpy as np
+import pandas as pd
+from pandas.tseries.holiday import USFederalHolidayCalendar
+from pandas.tseries.offsets import CustomBusinessDay
+from datetime import datetime, timedelta
+
+# If you remove rules, it removes them from *all* calendars
+# To ensure we don't pop rules we don't want to, first make
+# sure to fully copy the object
+trade_calendar = USFederalHolidayCalendar()
+trade_calendar.rules.pop(6) # Remove Columbus day
+trade_calendar.rules.pop(7) # Remove Veteran's day
+TradeDay = lambda days: CustomBusinessDay(days, calendar=trade_calendar)
+
+def plot_study(array):
+    # Given a 2-d array, we assume the event happens at index `lookback`,
+    # and create all of our summary statistics from there.
+    lookback = int((array.shape[1] - 1) / 2)
+    norm_factor = np.repeat(array[:,lookback].reshape(-1, 1), array.shape[1], axis=1)
+    centered_data = array / norm_factor - 1
+    lookforward = centered_data.shape[1] - lookback
+    means = centered_data.mean(axis=0)
+    lookforward_data = centered_data[:,lookforward:]
+    std_dev = np.hstack([0, lookforward_data.std(axis=0)])
+    maxes = lookforward_data.max(axis=0)
+    mins = lookforward_data.min(axis=0)
+    
+    f, axarr = plt.subplots(1, 2)
+    range_begin = -lookback
+    range_end = lookforward
+    axarr[0].plot(range(range_begin, range_end), means)
+    axarr[1].plot(range(range_begin, range_end), means)
+    axarr[0].fill_between(range(0, range_end),
+                     means[-lookforward:] + std_dev,
+                     means[-lookforward:] - std_dev,
+                    alpha=.5, label="$\pm$ 1 s.d.")
+    axarr[1].fill_between(range(0, range_end),
+                     means[-lookforward:] + std_dev,
+                     means[-lookforward:] - std_dev,
+                    alpha=.5, label="$\pm$ 1 s.d.")
+    
+    max_err = maxes - means[-lookforward+1:]
+    min_err = means[-lookforward+1:] - mins
+    axarr[0].errorbar(range(1, range_end),
+                  means[-lookforward+1:],
+                  yerr=[min_err, max_err], label='Max & Min')
+    axarr[0].legend(loc=2)
+    axarr[1].legend(loc=2)
+    
+    axarr[0].set_xlim((-lookback-1, lookback+1))
+    axarr[1].set_xlim((-lookback-1, lookback+1))
+    
+def plot_study_small(array):
+    # Given a 2-d array, we assume the event happens at index `lookback`,
+    # and create all of our summary statistics from there.
+    lookback = int((array.shape[1] - 1) / 2)
+    norm_factor = np.repeat(array[:,lookback].reshape(-1, 1), array.shape[1], axis=1)
+    centered_data = array / norm_factor - 1
+    lookforward = centered_data.shape[1] - lookback
+    means = centered_data.mean(axis=0)
+    lookforward_data = centered_data[:,lookforward:]
+    std_dev = np.hstack([0, lookforward_data.std(axis=0)])
+    maxes = lookforward_data.max(axis=0)
+    mins = lookforward_data.min(axis=0)
+    
+    range_begin = -lookback
+    range_end = lookforward
+    plt.plot(range(range_begin, range_end), means)
+    plt.fill_between(range(0, range_end),
+                     means[-lookforward:] + std_dev,
+                     means[-lookforward:] - std_dev,
+                    alpha=.5, label="$\pm$ 1 s.d.")
+    
+    max_err = maxes - means[-lookforward+1:]
+    min_err = means[-lookforward+1:] - mins
+    plt.errorbar(range(1, range_end),
+                  means[-lookforward+1:],
+                  yerr=[min_err, max_err], label='Max & Min')
+    plt.legend(loc=2)
+    plt.xlim((-lookback-1, lookback+1))
+    
+def fetch_event_data(ticker, events, horizon=5):
+    # Use horizon+1 to account for including the day of the event,
+    # and half-open interval - that is, for a horizon of 5,
+    # we should be including 11 events. Additionally, using the
+    # CustomBusinessDay means we automatically handle issues if
+    # for example a company reports Friday afternoon - the date
+    # calculator will turn this into a "Saturday" release, but
+    # we effectively shift that to Monday with the logic below.
+    td_back = TradeDay(horizon+1)
+    td_forward = TradeDay(horizon+1)
+    
+    start_date = min(events) - td_back
+    end_date = max(events) + td_forward
+    total_data = fetch_ticker(ticker, start_date, end_date)
+    event_data = [total_data.ix[event-td_back:event+td_forward]\
+                      [0:horizon*2+1]\
+                      ['Adjusted Close']
+                  for event in events]
+    return np.array(event_data)
+
+# Generate a couple of random events
+
+event_dates = [datetime(2016, 5, 27) - timedelta(days=1) - TradeDay(x*20) for x in range(1, 40)]
+data = fetch_event_data('CELG', event_dates)
+plot_study_small(data)
+plt.legend(loc=3)
+plt.gcf().set_size_inches(12, 6);
+
+
+plt.annotate('Mean price for days leading up to each event',
+             (-5, -.01), (-4.5, .025),
+             arrowprops=dict(facecolor='black', shrink=0.05))
+plt.annotate('', (-.1, .005), (-.5, .02),
+             arrowprops={'facecolor': 'black', 'shrink': .05})
+plt.annotate('$\pm$ 1 std. dev. each day', (5, .055), (2.5, .085),
+            arrowprops={'facecolor': 'black', 'shrink': .05})
+plt.annotate('Min/Max each day', (.9, -.07), (-1, -.1),
+            arrowprops={'facecolor': 'black', 'shrink': .05});
+

And as a quick textual explanation as well (a small numeric sketch of the normalization follows this list):

  • The blue line represents the mean price for each day, represented as a percentage of the price on the '0-day'. For example, if we defined an 'event' as whenever the stock price dropped for three days, we would see a decreasing blue line to the left of the 0-day.

  • The blue shaded area represents one standard deviation above and below the mean price for each day following an event. This is intended to give us an idea of what the stock price does in general following an event.

  • The green bars are the minimum and maximum price for each day following an event. This instructs us as to how much it's possible for the stock to move.

Event Type 1: Trending down over the past N days

The first type of event I want to study is how stocks perform when they've been trending down over the past couple of days prior to a release. However, we need to clarify what exactly is meant by "trending down." To do so, we'll use the following metric: the midpoint between each day's opening and closing price goes down over a period of N days.

+

It's probably helpful to have an example:

In [6]:
f, axarr = plt.subplots(1, 2)
+f.set_size_inches(18, 6)
+
+FB_plot = axarr[0]
+ohlc_dataframe(FB[datetime(2016, 4, 18):], FB_plot)
+
+FB_truncated = FB[datetime(2016, 4, 18):datetime(2016, 4, 27)]
+midpoint = FB_truncated['Open']/2 + FB_truncated['Close']/2
+FB_plot.plot(FB_truncated.index, midpoint, label='Midpoint')
+FB_plot.vlines(date2num(datetime(2016, 4, 27, 12)),
+               ax_fb.get_ylim()[0], ax_fb.get_ylim()[1],
+               color='g', label='Earnings Release')
+FB_plot.legend(loc=2)
+FB_plot.set_title('FB Midpoint Plot')
+
+AAPL_plot = axarr[1]
+ohlc_dataframe(AAPL[datetime(2016, 4, 10):], AAPL_plot)
+AAPL_truncated = AAPL[datetime(2016, 4, 10):datetime(2016, 4, 26)]
+midpoint = AAPL_truncated['Open']/2 + AAPL_truncated['Close']/2
+AAPL_plot.plot(AAPL_truncated.index, midpoint, label='Midpoint')
+AAPL_plot.vlines(date2num(datetime(2016, 4, 26, 12)),
+                 ax_aapl.get_ylim()[0], ax_aapl.get_ylim()[1],
+                 color='g', label='Earnings Release')
+AAPL_plot.legend(loc=3)
+AAPL_plot.set_title('AAPL Midpoint Plot');
+

Given these charts, we can see that FB was trending down for the four days preceding the earnings release, and AAPL was trending down for a whopping 8 days (we don't count the peak day). This will define the methodology that we will use for the study.
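
As a standalone illustration of that methodology (a sketch with made-up prices, not the notebook's own data), the downtrend check boils down to requiring that the open/close midpoint falls on every day in the window:

import pandas as pd

# Hypothetical open/close prices for the days leading up to a release
toy = pd.DataFrame({'Open':  [110.0, 108.0, 107.0, 105.0],
                    'Close': [108.5, 107.5, 105.5, 104.0]})
midpoints = toy['Open'] / 2 + toy['Close'] / 2

# Day-over-day change in the midpoint must never be positive,
# mirroring the shift-and-subtract check in does_trend_down below
diffs = midpoints - midpoints.shift(1)
print((diffs.dropna() <= 0).all())  # True for this toy series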

+

So what are the results? For a given horizon, how well does the market actually perform?

In [7]:
# Read in the events for each stock;
+# The file was created using the first code block in the Appendix
+import yaml
+from dateutil.parser import parse
+from progressbar import ProgressBar
+
+data_str = open('earnings_dates.yaml', 'r').read()
+# Need to remove invalid lines
+filtered = filter(lambda x: '{' not in x, data_str.split('\n'))
+earnings_data = yaml.load('\n'.join(filtered))
+
+# Convert our earnings data into a list of (ticker, date) pairs
+# to make it easy to work with.
+# This is horribly inefficient, but should get us what we need
+ticker_dates = []
+for ticker, date_list in earnings_data.items():
+    for iso_str in date_list:
+        ticker_dates.append((ticker, parse(iso_str)))
+
+def does_trend_down(ticker, event, horizon):
+    # Figure out if the `event` has a downtrend for
+    # the `horizon` days preceding it
+    # As an interpretation note: it is assumed that
+    # the closing price of day `event` is the reference
+    # point, and we want `horizon` days before that.
+    # The price_data.hdf was created in the second appendix code block
+    try:
+        ticker_data = pd.read_hdf('price_data.hdf', ticker)
+        data = ticker_data[event-TradeDay(horizon):event]
+        midpoints = data['Open']/2 + data['Close']/2
+
+        # Shift dates one forward into the future and subtract
+        # Effectively: do we trend down over all days?
+        elems = midpoints - midpoints.shift(1)
+        return len(elems)-1 == len(elems.dropna()[elems <= 0])
+    except KeyError:
+        # If the stock doesn't exist, it doesn't qualify as trending down
+        # Mostly this is here to make sure the entire analysis doesn't 
+        # blow up if there were issues in data retrieval
+        return False
+
+def study_trend(horizon, trend_function):
+    five_day_events = np.zeros((1, horizon*2 + 1))
+    invalid_events = []
+    for ticker, event in ProgressBar()(ticker_dates):
+        if trend_function(ticker, event, horizon):
+            ticker_data = pd.read_hdf('price_data.hdf', ticker)
+            event_data = ticker_data[event-TradeDay(horizon):event+TradeDay(horizon)]['Close']
+
+            try:
+                five_day_events = np.vstack([five_day_events, event_data])
+            except ValueError:
+                # Sometimes we don't get exactly the right number of values due to calendar
+                # issues. I've fixed most everything I can, and the few issues that are left
+                # I assume don't systemically bias the results (i.e. data could be missing
+                # because it doesn't exist, etc.). After running through, ~1% of events get
+                # discarded this way
+                invalid_events.append((ticker, event))
+            
+
+    # Remove our initial zero row
+    five_day_events = five_day_events[1:,:]
+    plot_study(five_day_events)
+    plt.gcf().suptitle('Action over {} days: {} events'
+                       .format(horizon,five_day_events.shape[0]))
+    plt.gcf().set_size_inches(18, 6)
+    
+# Start with a 5 day study
+study_trend(5, does_trend_down)
+
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:21:38 Time: 0:21:38

When a stock has been trending down for 5 days, once the earnings are announced it really doesn't move on average. However, the variability is incredible. This implies two important things:

  1. The market is just as often wrong about an earnings announcement before it happens as it is correct.
  2. The incredible width of the min/max bars and standard deviation area tells us that the market reacts violently after the earnings are released.

Let's repeat the same study, but over a time horizon of 8 days and 3 days. Presumably if a stock has been going down for 8 days at a time before the earnings, the market should be more accurate.

In [8]:
# 8 day study next
+study_trend(8, does_trend_down)
+
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:20:29 Time: 0:20:29

However, looking only at stocks that trended down for 8 days prior to a release, the same pattern emerges: on average, the stock doesn't move, but the market reaction is often incredibly violent.

In [9]:
# 3 day study after that
+study_trend(3, does_trend_down)
+
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:26:26 Time: 0:26:26

Finally, when we look at a 3-day horizon, we start getting some incredible outliers. Stocks have a potential to move over ~300% up, and the standard deviation width is again, incredible. The results for a 3-day horizon follow the same pattern we've seen in the 5- and 8-day horizons.


Event Type 2: Trending up for N days

We're now going to repeat the analysis, but do it for uptrends instead. That is, instead of looking at stocks that have been trending down over the past number of days, we focus only on stocks that have been trending up.

In [10]:
def does_trend_up(ticker, event, horizon):
+    # Figure out if the `event` has an uptrend for
+    # the `horizon` days preceding it
+    # As an interpretation note: it is assumed that
+    # the closing price of day `event` is the reference
+    # point, and we want `horizon` days before that.
+    # The price_data.hdf was created in the second appendix code block
+    try:
+        ticker_data = pd.read_hdf('price_data.hdf', ticker)
+        data = ticker_data[event-TradeDay(horizon):event]
+        midpoints = data['Open']/2 + data['Close']/2
+
+        # Shift dates one forward into the future and subtract
+        # Effectively: do we trend down over all days?
+        elems = midpoints - midpoints.shift(1)
+        return len(elems)-1 == len(elems.dropna()[elems >= 0])
+    except KeyError:
+        # If the stock doesn't exist, it doesn't qualify as trending down
+        # Mostly this is here to make sure the entire analysis doesn't 
+        # blow up if there were issues in data retrieval
+        return False
+
+study_trend(5, does_trend_up)
+
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:22:51 Time: 0:22:51

The patterns here are very similar. With the exception of noting that stocks can go to nearly 400% after an earnings announcement (most likely this included a takeover announcement, etc.), we still see large min/max bars and wide standard deviation of returns.

+

We'll repeat the pattern for stocks going up for both 8 and 3 days straight, but at this point, the results should be very predictable:

In [11]:
study_trend(8, does_trend_up)
+
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:20:51 Time: 0:20:51
In [12]:
study_trend(3, does_trend_up)
+
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:26:56 Time: 0:26:56

Conclusion and Summary

The most important thing to summarize is this: looking at the entire market, stock performance prior to an earnings release has no bearing on how the stock performs after the release. Honestly: given the huge variability of returns after an earnings release, even when the stock has been trending for a long time, you're best off divesting before an earnings release and letting the market sort itself out.

+

However, there is a big caveat. These results are taken when we look at the entire market. So while we can say that the market as a whole knows nothing and just reacts violently, I want to take a closer look into this data. Does the market typically perform poorly on large-cap/high liquidity stocks? Do smaller companies have investors that know them better and can thus predict performance better? Are specific market sectors better at prediction? Presumably technology stocks are more volatile than the industrials.

+

So there are some more interesting questions I still want to ask with this data. Knowing that the hard work of data processing is largely already done, it should be fairly simple to continue this analysis and get much more refined with it. Until next time.


Appendix

Export event data for Russell 3000 companies:

+
import pandas as pd
+from html.parser import HTMLParser
+from datetime import datetime, timedelta
+import requests
+import re
+from dateutil import parser
+import progressbar
+from concurrent import futures
+import yaml
+
+class EarningsParser(HTMLParser):
+    store_dates = False
+    earnings_offset = None
+    dates = []
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.dates = []
+
+    def handle_starttag(self, tag, attrs):
+        if tag == 'table':
+            self.store_dates = True
+
+    def handle_data(self, data):
+        if self.store_dates:
+            match = re.match(r'\d+/\d+/\d+', data)
+            if match:
+                self.dates.append(match.group(0))
+
+        # If a company reports before the bell, record the earnings date
+        # being at midnight the day before. Ex: WMT reports 5/19/2016,
+        # but we want the reference point to be the closing price on 5/18/2016
+        if 'After Close' in data:
+            self.earnings_offset = timedelta(days=0)
+        elif 'Before Open' in data:
+            self.earnings_offset = timedelta(days=-1)
+
+    def handle_endtag(self, tag):
+        if tag == 'table':
+            self.store_dates = False
+
+def earnings_releases(ticker):
+    #print("Looking up ticker {}".format(ticker))
+    user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) '\
+        'Gecko/20100101 Firefox/46.0'
+    headers = {'user-agent': user_agent}
+    base_url = 'http://www.streetinsider.com/ec_earnings.php?q={}'\
+        .format(ticker)
+    e = EarningsParser()
+    s = requests.Session()
+    a = requests.adapters.HTTPAdapter(max_retries=0)
+    s.mount('http://', a)
+    e.feed(str(s.get(base_url, headers=headers).content))
+
+    if e.earnings_offset is not None:
+        dates = map(lambda x: parser.parse(x) + e.earnings_offset, e.dates)
+        past = filter(lambda x: x < datetime.now(), dates)
+        return list(map(lambda d: d.isoformat(), past))
+
+# Use a Russell-3000 ETF tracker (ticker IWV) to get a list of holdings
+r3000 = pd.read_csv('https://www.ishares.com/us/products/239714/'
+                    'ishares-russell-3000-etf/1449138789749.ajax?'
+                    'fileType=csv&fileName=IWV_holdings&dataType=fund',
+                    header=10)
+r3000_equities = r3000[(r3000['Exchange'] == 'NASDAQ') |
+                       (r3000['Exchange'] == 'New York Stock Exchange Inc.')]
+
+dates_file = open('earnings_dates.yaml', 'w')
+
+with futures.ThreadPoolExecutor(max_workers=8) as pool:
+    fs = {pool.submit(earnings_releases, r3000_equities.ix[t]['Ticker']): t
+          for t in r3000_equities.index}
+    pbar = progressbar.ProgressBar(term_width=80,
+                                   max_value=r3000_equities.index.max())
+
+    for future in futures.as_completed(fs):
+        i = fs[future]
+        pbar.update(i)
+        dates_file.write(yaml.dump({r3000_equities.ix[i]['Ticker']:
+            future.result()}))
+
+

Downloading stock price data needed for the event studies:

+
from secrets import QUANDL_KEY
+import pandas as pd
+import yaml
+from dateutil.parser import parse
+from datetime import timedelta
+import quandl
+from progressbar import ProgressBar
+
+def fetch_ticker(ticker, start, end):
+    # Quandl is currently giving me issues with returning
+    # the entire dataset and not slicing server-side.
+    # So instead, we'll do it client-side!
+    q_format = '%Y-%m-%d'
+    ticker_data = quandl.get('YAHOO/' + ticker,
+                             start_date=start.strftime(q_format),
+                             end_date=end.strftime(q_format),
+                             authtoken=QUANDL_KEY)
+    return ticker_data
+
+data_str = open('earnings_dates.yaml', 'r').read()
+# Need to remove invalid lines
+filtered = filter(lambda x: '{' not in x, data_str.split('\n'))
+earnings_data = yaml.load('\n'.join(filtered))
+
+# Get the first 1500 keys - split up into two statements
+# because of Quandl rate limits
+tickers = list(earnings_data.keys())
+
+price_dict = {}
+invalid_tickers = []
+for ticker in ProgressBar()(tickers[0:1500]):
+    try:
+        # Replace '.' with '-' in name for some tickers
+        fixed = ticker.replace('.', '-')
+        event_strs = earnings_data[ticker]
+        events = [parse(event) for event in event_strs]
+        td = timedelta(days=20)
+        price_dict[ticker] = fetch_ticker(fixed,
+            min(events)-td, max(events)+td)
+    except quandl.NotFoundError:
+        invalid_tickers.append(ticker)
+
+# Execute this after 10 minutes have passed
+for ticker in ProgressBar()(tickers[1500:]):
+    try:
+        # Replace '.' with '-' in name for some tickers
+        fixed = ticker.replace('.', '-')
+        event_strs = earnings_data[ticker]
+        events = [parse(event) for event in event_strs]
+        td = timedelta(days=20)
+        price_dict[ticker] = fetch_ticker(fixed,
+            min(events)-td, max(events)+td)
+    except quandl.NotFoundError:
+        invalid_tickers.append(ticker)
+
+prices_store = pd.HDFStore('price_data.hdf')
+for ticker, prices in price_dict.items():
+    prices_store[ticker] = prices
+
@@ -95,6 +5350,7 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['\$','\$']]}});
    +
diff --git a/guaranteed-money-maker.html b/guaranteed-money-maker.html index f6fa3c1..c28b38c 100644 --- a/guaranteed-money-maker.html +++ b/guaranteed-money-maker.html @@ -4,20 +4,22 @@ - + - + Guaranteed Money Maker - Bradlee Speice - - - - + + + + + + @@ -26,6 +28,17 @@ + + + @@ -39,7 +52,7 @@
@@ -54,12 +67,12 @@

Guaranteed Money Maker

-

Bradlee Speice, Wed 03 February 2016, Blog

+

Bradlee Speice, Wed 03 February 2016, Blog

-martingale, strategy

+martingale, strategy

@@ -73,7 +86,265 @@
-

{% notebook 2016-2-3-guaranteed-money-maker.ipynb %}

+

+


If you can see into the future, that is.

My previous class in Stochastic Calculus covered a lot of interesting topics, and the important one for today is the Gambler's Ruin problem. If you're interested in some of the theory behind it, also make sure to check out random walks. The important bit is that we studied the Martingale Betting Strategy, which describes for us a guaranteed way to eventually make money.

+

The strategy goes like this: You are going to toss a fair coin with a friend. If you guess heads or tails correctly, you get back double the money you bet. If you guess incorrectly, you lose money. How should you bet?

+

The correct answer is that you should double your bet each time you lose. Then when you finally win, you'll be guaranteed to make back everything you lost and then $1 extra! Consider the scenario:

+
  1. You bet $1, and guess incorrectly. You're 1 dollar in the hole.
  2. You bet $2, and guess incorrectly. You're 3 dollars in the hole now.
  3. You bet $4, and guess incorrectly. You're 7 dollars in the hole.
  4. You bet $8, and guess correctly! You now get back those 8 dollars you bet, plus 8 extra for winning, for a total profit of one dollar!

Mathematically, we can prove that as long as you have unlimited money to bet, you are guaranteed to make money.
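
As a quick sanity check of that claim, here's a small simulation sketch of my own (not from the original post) that plays the double-on-loss strategy against a fair coin and reports the profit along with how much money had to be put at risk:

import random

def martingale(max_rounds=100, seed=42):
    # Double the bet after every loss; stop at the first win
    random.seed(seed)
    bet, total_lost = 1, 0
    for _ in range(max_rounds):
        if random.random() < 0.5:        # win: we gain the bet, recouping all losses
            return bet - total_lost, total_lost + bet
        total_lost += bet                # loss: double the bet and try again
        bet *= 2
    return -total_lost, total_lost       # never won within max_rounds

profit, at_risk = martingale()
print(profit, at_risk)  # profit is $1 whenever we eventually win; at_risk grows exponentially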

+

Applying the Martingale Strategy

But we're all realistic people, and once you start talking about "unlimited money" eyebrows should be raised. Even still, this is an interesting strategy to investigate, and I want to apply it to the stock market. As long as we can guarantee there's a single day in which the stock goes up, we should be able to make money right? The question is just how much we have to invest to guarantee this.

+

Now it's time for the math. We'll use the following definitions:

+
  • $o_i$ = the share price at the opening of day $i$
  • $c_i$ = the share price at the close of day $i$
  • $d_i$ = the amount of money we want to invest at the beginning of day $i$

With those definitions in place, I'd like to present the formula that is guaranteed to make you money. I call it Bradlee's Investment Formula:

+

$c_n \sum_{i=1}^n \frac{d_i}{o_i} > \sum_{i=1}^{n} d_i$

+

It might not look like much, but if you can manage to make it so that this formula holds true, you will be guaranteed to make money. The intuition behind the formula is this: The closing share price times the number of shares you have purchased ends up greater than the amount of money you invested.

+

That is, on day $n$, if you know what the closing price will be you can set up the amount of money you invest that day to guarantee you make money. I'll even teach you to figure out how much money that is! Take a look:

+

$
\begin{align}
c_n \sum_{i=1}^{n-1} \frac{d_i}{o_i} + \frac{c_n d_n}{o_n} &> \sum_{i=1}^{n-1}d_i + d_n\\
\frac{c_n d_n}{o_n} - d_n &> \sum_{i=1}^{n-1}\left(d_i - \frac{c_n d_i}{o_i}\right)\\
d_n \left(\frac{c_n - o_n}{o_n}\right) &> \sum_{i=1}^{n-1} d_i\left(1 - \frac{c_n}{o_i}\right)\\
d_n &> \frac{o_n}{c_n - o_n} \sum_{i=1}^{n-1} d_i\left(1 - \frac{c_n}{o_i}\right)
\end{align}$

+

If you invest exactly $d_n$ that day, you'll break even. But if you can make sure the money you invest is greater than that quantity on the right (which requires that you have a crystal ball tell you the stock's closing price) you are guaranteed to make money!
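
To make that concrete, here's a small sketch (made-up prices, and no more investment advice than the rest of this post) that computes the break-even $d_n$ from the last line of the derivation:

def break_even_investment(opens, predicted_close, invested):
    # d_n from the derivation above: opens[-1] is today's open (o_n),
    # predicted_close is c_n, and invested holds d_1 .. d_{n-1}
    o_n, c_n = opens[-1], predicted_close
    rhs = sum(d_i * (1 - c_n / o_i) for d_i, o_i in zip(invested, opens[:-1]))
    return o_n / (c_n - o_n) * rhs

# Hypothetical: we bought $100 at an open of $100, the stock slid, today it opens
# at $95 and we believe it will close at $96
d_n = break_even_investment(opens=[100.0, 95.0], predicted_close=96.0, invested=[100.0])
print(d_n)  # ~380: invest more than this and the predicted close leaves us in the black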

+

Interesting Implications

On a more serious note though, the formula above tells us a couple of interesting things:

  1. It's impossible to make money without the closing price at some point being greater than the opening price (or vice-versa if you are short selling) - there is no amount of money you can invest that will turn things in your favor.
  2. Close prices of the past aren't important if you're concerned about the bottom line. While chart technicians use price history to make judgment calls, in the end, the closing price on anything other than the last day is irrelevant.
  3. It's possible to make money as long as there is a single day where the closing price is greater than the opening price! You might have to invest a lot to do so, but it's possible.
  4. You must make a prediction about where the stock will close if you want to know how much to invest. That is, we can set up our investment for the day to make money if the stock goes up 1%, but if it only goes up .5% we'll still lose money.
  5. It's possible the winning move is to scale back your position (a quick numeric check of this scenario follows the list). Consider the scenario:
    • You invest money and the stock closes down the day .5%
    • You invest tomorrow expecting the stock to go up 1%
    • The winning investment to break even (assuming a 1% increase) is to scale back the position, since the shares you purchased at the beginning would then be profitable
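
And a quick numeric check of that last scenario, using the break-even formula derived above (illustrative numbers only):

# Day 1: invest $100 at an open of $100; the stock closes down 0.5% at $99.50
# Day 2: it opens at $99.50 and we expect a 1% gain, i.e. a close of $100.495
o_1, d_1 = 100.0, 100.0
o_2 = 99.50
c_2 = o_2 * 1.01

d_2 = o_2 / (c_2 - o_2) * d_1 * (1 - c_2 / o_1)
print(d_2)  # about -49.5: a negative break-even "investment", i.e. sell part of the position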

Running the simulation

So now that we've defined our investment formula, we need to tweak a couple of things in order to make an investment strategy we can actually work with. There are two issues we need to address:

+
  1. The formula only tells us how much to invest if we want to break even ($d_n$). If we actually want to turn a profit, we need to invest more than that, which we will refer to as the bias.
  2. The formula assumes we know what the closing price will be on any given day. If we don't know this, we can still invest assuming the stock price will close at a level we choose. If the price doesn't meet this objective, we try again tomorrow! This predetermined closing price will be referred to as the expectation.

Now that we've defined our bias and expectation, we can actually build a strategy we can simulate. Much like the martingale strategy told you to bet twice your previous bet in order to make money, we've designed a system that tells us how much to bet in order to make money as well.

+

Now, let's get to the code!

In [1]:
using Quandl
+api_key = ""
+daily_investment = function(current_open, current_close, purchase_history, open_history)
+    # We're not going to safeguard against divide by 0 - that's the user's responsibility
+    t1 = current_close / current_open - 1
+    t2 = sum(purchase_history - purchase_history*current_close ./ open_history)
+    return t2 / t1
+end;
+

And let's code a way to run simulations quickly:

In [2]:
is_profitable = function(current_price, purchase_history, open_history)
+    shares = sum(purchase_history ./ open_history)
+    return current_price*shares > sum(purchase_history)
+end
+
+simulate = function(name, start, init, expected, bias)
+    ticker_info = quandlget(name, from=start, api_key=api_key)
+    open_vals = ticker_info["Open"].values
+    close_vals = ticker_info["Close"].values
+    invested = [init]
+    
+    # The simulation stops once we've made a profit
+    day = 1
+    profitable = is_profitable(close_vals[day], invested, open_vals[1:length(invested)]) ||
+        is_profitable(open_vals[day+1], invested, open_vals[1:length(invested)])
+    while !profitable
+        expected_close = open_vals[day+1] * expected
+        todays_purchase = daily_investment(open_vals[day+1], expected_close, invested, open_vals[1:day])
+        invested = [invested; todays_purchase + bias]
+        # expected_profit = expected_close * sum(invested ./ open_vals[1:length(invested)]) - sum(invested)
+        day += 1
+        profitable = is_profitable(close_vals[day], invested, open_vals[1:length(invested)]) ||
+            is_profitable(open_vals[day+1], invested, open_vals[1:length(invested)])
+    end
+    
+    shares = sum(invested ./ open_vals[1:length(invested)])
+    max_profit = max(close_vals[day], open_vals[day+1])
+    profit = shares * max_profit - sum(invested)
+    return (invested, profit)
+end
+
+sim_summary = function(investments, profit)
+    leverages = [sum(investments[1:i]) for i=1:length(investments)]
+    max_leverage = maximum(leverages) / investments[1]
+    println("Max leverage: $(max_leverage)")
+    println("Days invested: $(length(investments))")
+    println("Profit: $profit")
+end;
+

Now, let's get some data and run a simulation! Our first test:

  • We'll invest 100 dollars in LMT, and expect that the stock will close up 1% every day. We'll invest $d_n$ + 10 dollars every day that we haven't turned a profit, and end the simulation once we've made a profit.
In [3]:
investments, profit = simulate("YAHOO/LMT", Date(2015, 11, 29), 100, 1.01, 10)
+sim_summary(investments, profit)
+
Max leverage: 5.590373200042106
+Days invested: 5
+Profit: 0.6894803101560001

The result: We need to invest 5.6x our initial position over a period of 5 days to make approximately $0.69

  • Now let's try the same thing, but we'll assume the stock closes up 2% instead.
In [4]:
investments, profit = simulate("YAHOO/LMT", Date(2015, 11, 29), 100, 1.02, 10)
+sim_summary(investments, profit)
+
Max leverage: 1.854949900247809
+Days invested: 25
+Profit: 0.08304813163696423

In this example, we only get up to a 1.85x leveraged position, but it takes 25 days to turn a profit of 8¢


Summary

We've defined an investment strategy that can tell us how much to invest when we know what the closing position of a stock will be. We can tweak the strategy to actually make money, but plenty of work needs to be done so that we can optimize the money invested.

+

In the next post I'm going to post more information about some backtests and strategy tests on this strategy (unless of course this experiment actually produces a significant profit potential, and then I'm keeping it for myself).


Side note and disclaimer

The claims made in this presentation about being able to guarantee making money are intended as a joke and do not constitute investment advice of any sort.


@@ -81,6 +352,20 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}}); +
+
+ + +
@@ -92,6 +377,7 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
    +
diff --git a/predicting-santander-customer-happiness.html b/predicting-santander-customer-happiness.html index 383a6c5..0f8b0c7 100644 --- a/predicting-santander-customer-happiness.html +++ b/predicting-santander-customer-happiness.html @@ -4,20 +4,22 @@ - + - + Predicting Santander Customer Happiness - Bradlee Speice - - - - + + + + + + @@ -26,6 +28,17 @@ + + + @@ -39,7 +52,7 @@
@@ -54,12 +67,12 @@

Predicting Santander Customer Happiness

-

Bradlee Speice, Sat 05 March 2016, Blog

+

Bradlee Speice, Sat 05 March 2016, Blog

-data science, kaggle, machine learning

+data science, kaggle, machine learning

@@ -73,7 +86,923 @@
-

{% notebook 2016-3-5-predicting-santander-customer-happiness.ipynb %}

+

+


My first Kaggle competition

It's time! After embarking on a Machine Learning class this semester, and with a Saturday in which I don't have much planned, I wanted to put this class and training to work. It's my first competition submission. I want to walk you guys through how I'm approaching this problem, because I thought it would be really neat. The competition is Banco Santander's Santander Customer Satisfaction competition. It seemed like an easy enough problem that I could actually make decent progress on it.

+

Data Exploration

First up: we need to load our data and do some exploratory work. Because we're going to be using this data for model selection prior to testing, we need to make a further split. I've already gone ahead and done this work, please see the code in the appendix below.

In [1]:
import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+%matplotlib inline
+
+# Record how long it takes to run the notebook - I'm curious.
+from datetime import datetime
+start = datetime.now()
+
+dataset = pd.read_csv('split_train.csv')
+dataset.index = dataset.ID
+X = dataset.drop(['TARGET', 'ID', 'ID.1'], 1)
+y = dataset.TARGET
+
In [2]:
y.unique()
+
Out[2]:
array([0, 1], dtype=int64)
In [3]:
len(X.columns)
+
Out[3]:
369

Okay, so there are only two classes we're predicting: 1 for unsatisfied customers, 0 for satisfied customers. I would have preferred this to be something more like a regression, or predicting multiple classes: maybe the customer isn't the most happy, but is nowhere near closing their accounts. For now though, that's just the data we're working with.

+

Now, I'd like to make a scatter matrix of everything going on. Unfortunately as noted above, we have 369 different features. There's no way I can graphically make sense of that much data to start with.

+

We're also not told what the data actually represents: Are these survey results? Average time between contact with a customer care person? Frequency of contacting a customer care person? The idea is that I need to reduce the number of dimensions we're predicting across.

+

Dimensionality Reduction pt. 1 - Binary Classifiers

My first attempt to reduce the data dimensionality is to find all the binary classifiers in the dataset (i.e. 0 or 1 values) and see if any of those are good (or anti-good) predictors of the final data.

In [4]:
cols = X.columns
+b_class = []
+for c in cols:
+    if len(X[c].unique()) == 2:
+        b_class.append(c)
+        
+len(b_class)
+
Out[4]:
111

So there are 111 features in the dataset that are binary labels. Let's see if any of them are good at predicting user satisfaction!

In [5]:
# First we need to `binarize` the data to 0-1; some of the labels are {0, 1},
+# some are {0, 3}, etc.
+from sklearn.preprocessing import binarize
+X_bin = binarize(X[b_class])
+
+accuracy = [np.mean(X_bin[:,i] == y) for i in range(0, len(b_class))]
+acc_df = pd.DataFrame({"Accuracy": accuracy}, index=b_class)
+acc_df.describe()
+
Out[5]:
          Accuracy
count   111.000000
mean      0.905159
std       0.180602
min       0.043598
25%       0.937329
50%       0.959372
75%       0.960837
max       0.960837

Wow! Looks like we've got some incredibly predictive features! So much so that we should be a bit concerned. My initial guess for what's happening is that we have a sparsity issue: so many of the values are 0, and these likely happen to line up with satisfied customers.
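
A quick way to check that hunch is to look at how sparse the binarized features actually are (a sketch assuming X_bin and y from the cell above):

# Fraction of zero entries in the binarized features, compared with the
# fraction of satisfied (TARGET == 0) customers
print("Fraction of zeros in X_bin: {:.3f}".format((X_bin == 0).mean()))
print("Fraction of satisfied customers: {:.3f}".format((y == 0).mean()))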

+

So the question we must now answer, which I likely should have asked long before now: What exactly is the distribution of un/satisfied customers?

In [6]:
unsat = y[y == 1].count()
+print("Satisfied customers: {}; Unsatisfied customers: {}".format(len(y) - unsat, unsat))
+naive_guess = np.mean(y == np.zeros(len(y)))
+print("Naive guess accuracy: {}".format(naive_guess))
+
Satisfied customers: 51131; Unsatisfied customers: 2083
+Naive guess accuracy: 0.9608561656706882

This is a bit discouraging. A naive guess of "always satisfied" performs as well as our best individual binary classifier. What this tells me then, is that these data columns aren't incredibly helpful in prediction. I'd be interested in a polynomial expansion of this data-set, but for now, that's more computation than I want to take on.

+

Dimensionality Reduction pt. 2 - LDA

Knowing that our naive guess performs so well is a blessing and a curse:

  • Curse: The threshold for performance is incredibly high: We can only "improve" over the naive guess by 4%
  • Blessing: All the binary classification features we just discovered are worthless on their own. We can throw them out and reduce the data dimensionality from 369 to 258.

Now, in removing these features from the dataset, I'm not saying that there is no "information" contained within them. There might be. But the only way we'd know is through a polynomial expansion, and I'm not going to take that on within this post.

+

My initial thought for a "next guess" is to use the LDA model for dimensionality reduction. However, it can only reduce dimensions to $p - 1$, with $p$ being the number of classes. Since this is a binary classification, every LDA model that I try will have dimensionality one; when I actually try this, the predictor ends up being slightly less accurate than the naive guess.
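
For reference, a minimal sketch of that LDA attempt (my code, not the notebook's; it assumes X and y from the cells above):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# With two classes, LDA can project down to at most one dimension
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)      # (n_samples, 1)
print(lda.score(X, y))  # comes out slightly below the naive guess, as described above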

+

Instead, let's take a different approach to dimensionality reduction: principal components analysis. This allows us to perform the dimensionality reduction without worrying about the number of classes. Then, we'll use a Gaussian Naive Bayes model to actually do the prediction. This model is chosen simply because it doesn't take a long time to fit and compute; because PCA will take so long, I just want a prediction at the end of this. We can worry about using a more sophisticated LDA/QDA/SVM model later.

+

Now into the actual process: We're going to test out PCA dimensionality reduction from 1 - 20 dimensions, and then predict using a Gaussian Naive Bayes model. The 20 dimensions upper limit was selected because the accuracy never improves after you get beyond that (I found out by running it myself). Hopefully, we'll find that we can create a model better than the naive guess.

In [7]:
from sklearn.naive_bayes import GaussianNB
+from sklearn.decomposition import PCA
+
+X_no_bin = X.drop(b_class, 1)
+
+def evaluate_gnb(dims):
+    pca = PCA(n_components=dims)
+    X_xform = pca.fit_transform(X_no_bin)
+    
+    gnb = GaussianNB()
+    gnb.fit(X_xform, y)
+    return gnb.score(X_xform, y)
+
+dim_range = np.arange(1, 21)
+plt.plot(dim_range, [evaluate_gnb(dim) for dim in dim_range], label="Gaussian NB Accuracy")
+plt.axhline(naive_guess, label="Naive Guess", c='k')
+plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k')
+plt.gcf().set_size_inches(12, 6)
+plt.legend();
+

Sigh... After all the effort and computational power, we're still at square one: we have yet to beat the naive guess threshold. With PCA in play we end up performing terribly, but not so terribly that we could profit by simply betting against our own predictions.

+

Let's try one last-ditch attempt using the entire data set:

In [8]:
def evaluate_gnb_full(dims):
+    pca = PCA(n_components=dims)
+    X_xform = pca.fit_transform(X)
+    
+    gnb = GaussianNB()
+    gnb.fit(X_xform, y)
+    return gnb.score(X_xform, y)
+
+dim_range = np.arange(1, 21)
+plt.plot(dim_range, [evaluate_gnb_full(dim) for dim in dim_range], label="Gaussian NB Accuracy")
+plt.axhline(naive_guess, label="Naive Guess", c='k')
+plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k')
+plt.gcf().set_size_inches(12, 6)
+plt.legend();
+

Nothing. It is interesting to note that the graphs are almost exactly the same: This would imply again that the variables we removed earlier (all the binary classifiers) indeed have almost no predictive power. It seems this problem is high-dimensional, but with almost no data that can actually inform our decisions.

+

Summary for Day 1

After spending a couple hours with this dataset, there seems to be a fundamental issue in play: We have very high-dimensional data, and it has no bearing on our ability to actually predict customer satisfaction. This can be a huge issue: it implies that no matter what model we use, we fundamentally can't perform well. I'm sure most of this is because I'm not an experienced data scientist. Even so, we have yet to develop a strategy that can actually beat out the village idiot; so far, the bank is best off just assuming all its customers are satisfied. Hopefully more to come soon.

In [9]:
end = datetime.now()
+print("Running time: {}".format(end - start))
+
Running time: 0:00:58.715714

Appendix

Code used to split the initial training data:

+
from sklearn.cross_validation import train_test_split
+data = pd.read_csv('train.csv')
+data.index = data.ID
+
+data_train, data_validate = train_test_split(
+    data, train_size=.7)
+
+data_train.to_csv('split_train.csv')
+data_validate.to_csv('split_validate.csv')
+
@@ -93,6 +1036,7 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$']]}});
    +
diff --git a/profitability-using-the-investment-formula.html b/profitability-using-the-investment-formula.html index d81450b..66d695a 100644 --- a/profitability-using-the-investment-formula.html +++ b/profitability-using-the-investment-formula.html @@ -4,20 +4,22 @@ - + - + Profitability using the Investment Formula - Bradlee Speice - - - - + + + + + + @@ -26,6 +28,17 @@ + + + @@ -39,7 +52,7 @@
@@ -54,12 +67,12 @@

Profitability using the Investment Formula

-

Bradlee Speice, Fri 26 February 2016, Blog

+

Bradlee Speice, Fri 26 February 2016, Blog

-algorithmic-trading, python

+algorithmic-trading, python

@@ -73,7 +86,6875 @@
-

{% notebook 2016-2-26-profitability-using-the-investment-formula.ipynb %}

+

+


Profitability using the Investment Formula

I've previously talked about crafting an Investment Formula that would guarantee making money if you could predict which direction the stock market was going to go. This is going to be the first in a series of posts trying to flesh out what an actual investment strategy based on this formula would look like.

+

But first, the formula doesn't take into account two very important things: leverage, and the number of days invested. That's why I want to set up what I'm going to call the Profitability Score.

+

The definition is going to be very simple:

+
  • $p$: Profit made once you exit the investment
  • $i$: Initial investment into the asset
  • $m$: Maximum investment in the asset
  • $l = m / i$: The maximum leverage of an investment, as the ratio of maximum invested to initial investment
  • $d$: The number of days it takes to turn a profit

$s = \frac{1000 p}{i(l + d)} = \frac{1000 p}{m + i\cdot d}$


Crazy, right? The score is simply the (normalized) profit you made divided by the leverage plus days invested. The $\cdot 1000$ factor is just there to scale the number into something more reasonable - people don't like hearing that something has a profitability score of .001, for example.
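To make the definition concrete, here is a minimal Python sketch of the score. The profitability_score name and the sample numbers are my own for illustration; the simulation code later in this post defines an equivalent profitability lambda.

def profitability_score(profit, initial, max_invested, days):
    # s = 1000 * p / (i * (l + d)) = 1000 * p / (m + i * d)
    leverage = max_invested / initial
    return 1000 * profit / (initial * (leverage + days))

# Hypothetical trade: $5 profit on a $100 initial stake,
# $150 invested at the peak, closed out after 3 days
print(profitability_score(5, 100, 150, 3))  # ~11.1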


Theoretical Justification

The formula itself is designed to be simple in principle: I like making a profit, and I want to penalize the leverage you incur and days you have to invest. Ideally, we want to have a stock that goes up all the time. However, the investment formula takes advantage of a different case: trying to profit from highly volatile assets. If we can make money when the investment only has one day up, let's do it!


Even so, there are two potential issues. First, stocks that trend upward will have a higher profitability score - both leverage and days invested will be 1. To keep the score from favoring only stocks in a steady uptrend, I could do things like taking $\log(d)$ instead. I don't want to start biasing the scoring function until I have a practical reason to do so, though, so for now I'll leave it as it stands.


The second issue is how to penalize leverage and days invested relative to each other. As it currently stands, a leverage of 6x with only 1 day invested is penalized the same as leveraging 2x with 5 days invested (see the quick check below). In the future, I'd again want to look at making the impact of days invested smaller - I can get over an extra 3 days in the market if it means that I don't have to incur a highly leveraged position.
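As a quick sanity check of that claim, using the hypothetical profitability_score sketch from above - both positions produce the same denominator $i(l + d)$, and therefore the same score for a given profit:

# $5 profit on a $100 initial stake in both cases
print(profitability_score(5, 100, 600, 1))  # 6x leverage, 1 day:  i*(l + d) = 700, score ~7.14
print(profitability_score(5, 100, 200, 5))  # 2x leverage, 5 days: i*(l + d) = 700, score ~7.14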


So there could be things about the scoring function we change in the future, but I want to run some actual tests before we start worrying about things like that!


Running a simulation

This won't be an incredibly rigorous backtest; I just want to see some results from the work so far. Let's set up the simulation code again and start looking into some random stocks. If you've read the last blog post, you can skip over the code. The only difference is that it's been ported to Python to make the data wrangling easier; Julia doesn't yet support some of the multi-index things I'm trying to do.

import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+from Quandl import get as qget
+%matplotlib inline
+api_key = ''
+
+profitability = lambda p, i, m, d: 1000*p / (m + i*d)
+
+def is_profitable(current_price, purchase_history, open_history):
+    shares = (purchase_history / open_history).sum()
+    return current_price * shares > sum(purchase_history)
+
+def daily_investment(current_open, current_close, purchase_history, open_history):
+    t1 = current_close / current_open - 1
+    t2 = (purchase_history - purchase_history * current_close / open_history).sum()
+    return t2 / t1
+
+def simulate_day(open_vals, close_vals, init, expected, bias):
+    invested = np.array([init])
+    
+    day = 1
+    profitable = is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]) \
+        or is_profitable(open_vals[day], invested, open_vals[0:len(invested)])
+
+    while not profitable:
+        expected_close = open_vals[day] * expected
+        todays_purchase = daily_investment(open_vals[day], expected_close, invested, open_vals[0:day])
+        invested = np.append(invested, todays_purchase + bias)
+        # expected_profit = expected_close * (invested / open_vals[0:len(invested)]).sum() - invested.sum()
+        day += 1
+        profitable = is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]) \
+            or is_profitable(open_vals[day], invested, open_vals[0:len(invested)])
+            
+    shares = (invested / open_vals[0:len(invested)]).sum()
+    
+    # Make sure we can't see into the future - we know either today's close or tomorrow's open
+    # will be profitable, but we need to check which one.
+    if is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]):
+        ending_price = close_vals[day-1]
+    else:
+        ending_price = open_vals[day]
+        
+    profit = shares * ending_price - sum(invested)
+    return invested, profit
+
+def simulate_ts(name, start, end, initial, expected, bias):
+    ticker_info = qget(name, trim_start=start, api_key=api_key)
+    evaluation_times = ticker_info[:end].index
+    
+    # Handle Google vs. YFinance data
+    if "Adjusted Close" in ticker_info.columns:
+        close_column = "Adjusted Close"
+    else:
+        close_column = "Close"
+        
+    sim = {d: simulate_day(ticker_info[d:]["Open"], ticker_info[d:][close_column],
+                       100, 1.02, 10) for d in evaluation_times}
+    sim_series = pd.Series(sim)
+    result = pd.DataFrame()
+    result["profit"] = sim_series.apply(lambda x: x[1])
+    result["max"] = sim_series.apply(lambda x: max(x[0]))
+    result["days"] = sim_series.apply(lambda x: len(x[0]))
+    result["score"] = sim_series.apply(lambda x: profitability(x[1], x[0][0], max(x[0]), len(x[0])))
+    result["investments"] = sim_series.apply(lambda x: x[0])
+    
+    return result
+
+def simulate_tickers(tickers):
+    from datetime import datetime
+    results = {}
+    for ticker in tickers:
+        start = datetime(2015, 1, 1)
+        results_df = simulate_ts(ticker, start, datetime(2016, 1, 1), 100, 1.01, 10)
+        results[ticker] = results_df
+        
+    return pd.concat(list(results.values()), keys=list(results.keys()), axis=1)

And now the interesting part

Let's start looking into the data! FANG stocks have been big over the past year, let's see how they look:

fang_df = simulate_tickers(["YAHOO/FB", "YAHOO/AAPL", "YAHOO/NFLX", "YAHOO/GOOG"])
fang_df.xs('days', axis=1, level=1).hist()
+plt.gcf().set_size_inches(18, 8);
+plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18);
fang_df.xs('score', axis=1, level=1).plot()
+plt.gcf().set_size_inches(18, 6)
+plt.gcf().suptitle("Profitability score over time", fontsize=18);

Let's think about these graphs. First, the histogram. What we like to see is a lot of 1's - that means there were a lot of days where the stock went up and we didn't have to worry about actually implementing the strategy; we were able to close the trade at a profit.


Looking at the profitability score over time though is a bit more interesting. First off, stocks that are more volatile will tend to have a higher profitability score, no two ways about that. However, Netflix consistently outperformed on this metric. We know that 2015 was a good year for Netflix, so that's a (small) sign the strategy is performing as expected.


The final interesting note happens around the end of August 2015. Around this period, the markets were selling off in a big way due to issues in China (not unlike what's happening now). Even so, all of the FANG stocks saw an uptick in profitability around this time. This is another sign that the strategy being developed performs better during periods of volatility, rather than from riding markets up or down.


What about FANG vs. some cyclicals?

cyclic_df = simulate_tickers(["YAHOO/X", "YAHOO/CAT", "YAHOO/NFLX", "YAHOO/GOOG"])
cyclic_df.xs('days', axis=1, level=1).hist()
+plt.gcf().set_size_inches(18, 8);
+plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18);
cyclic_df.xs('score', axis=1, level=1).plot()
+plt.gcf().set_size_inches(18, 6)
+plt.gcf().suptitle("Profitability score over time", fontsize=18);

Some more interesting results come from this as well. First off, US Steel (X) has a much smoother distribution of days until profitability - it doesn't have a huge number of values at 1 followed by a steep drop-off. Intuitively we're not big fans of this - we want a stock to go up! However, on the profitability score it is the only serious contender to Netflix.


Second, we see the same trend around August - the algorithm performs well in volatile markets.


For a final test, let's try some biotech and ETFs!

biotech_df = simulate_tickers(['YAHOO/REGN', 'YAHOO/CELG', 'GOOG/NASDAQ_BIB', 'GOOG/NASDAQ_IBB'])
biotech_df.xs('days', axis=1, level=1).hist()
+plt.gcf().set_size_inches(18, 8);
+plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18);
biotech_df.xs('score', axis=1, level=1).plot()
+plt.gcf().set_size_inches(18, 6)
+plt.gcf().suptitle("Profitability score over time", fontsize=18);

In this example, we don't see a whole lot of interesting things: the scores are all fairly close together, with notable exceptions in late August and mid-October.


What is interesting is that during the volatile period, the ETFs performed significantly better than the stocks did in terms of profitability. The leveraged ETF (BIB) performed far above everyone else, and it appears that it is indeed most profitable during volatile periods. Even so, it was far more likely to take multiple days to give a return: its count of 1-day investments trails the other ETF and both stocks by a decent margin.


And consider me an OCD freak, but I just really like Celgene's distribution - it looks nice and smooth.


Summary and plans for the next post

So far I'm really enjoying playing with this strategy - there's a lot of depth here to understand, though the preliminary results seem to indicate that it profits mostly from taking the other side of a volatile trade. I'd be interested to run results later on data from January - it's been a particularly volatile start to the year, so it would be neat to see whether this strategy would work then.


For the next post, I want to start playing with some of the parameters: how do the bias and expected close influence the process? The values have been fairly conservative so far; it will be interesting to see how the simulations respond as they change.

diff --git a/testing-cramer.html b/testing-cramer.html
index 586f7cb..f1e75db 100644
--- a/testing-cramer.html
+++ b/testing-cramer.html
@@ -54,12 +67,12 @@

Testing Cramer


Bradlee Speice, Sat 26 December 2015, Blog


data science, futures

@@ -73,7 +86,425 @@

{% notebook 2015-12-26-testing_cramer.ipynb %}

import requests
+import pandas as pd
+import numpy as np
+from dateutil import parser as dtparser
+from dateutil.relativedelta import relativedelta
+from datetime import datetime
+from html.parser import HTMLParser
+from copy import copy
+import Quandl

Testing Cramer

Since I'm attending graduate school to study Financial Engineering, I've been a fan of the Mad Money TV show featuring the bombastic Jim Cramer. One of the things he's said is that you shouldn't use the futures to predict where the stock market is going to go. But he says it often enough that I've begun to wonder - who is he trying to convince?


It makes sense that because futures on things like the S&P 500 are traded continuously, they would price in market information before the stock market opens. So is Cramer right to be convinced that strategies based on the futures are a poor idea? I wanted to test it out.


The first question is where to get the futures data. I've been part of Seeking Alpha for a bit, and they publish the Wall Street Breakfast newsletter, which contains daily futures returns as of 6:20 AM EST. I'd be interested in using that data to see if we can actually make some money.


First though, let's get the data:


Downloading Futures data from Seeking Alpha

We're going to define two HTML parsing classes - one to get the article URLs from a page, and one to get the actual data from each article.

class ArticleListParser(HTMLParser):
+    """Given a web page with articles on it, parse out the article links"""
+    
+    articles = []
+    
+    def handle_starttag(self, tag, attrs):
+        #if tag == 'div' and ("id", "author_articles_wrapper") in attrs:
+        #    self.fetch_links = True
+        if tag == 'a' and ('class', 'dashboard_article_link') in attrs:
+            href = list(filter(lambda x: x[0] == 'href', attrs))[0][1]
+            self.articles.append(href)
+            
+base_url = "http://seekingalpha.com/author/wall-street-breakfast/articles"
+article_page_urls = [base_url] + [base_url + '/{}'.format(i) for i in range(2, 20)]
+
+global_articles = []
+for page in article_page_urls:
+    # We need to switch the user agent, as SA blocks the standard requests agent
+    articles_html = requests.get(page,
+                                headers={"User-Agent": "Wget/1.13.4"})
+    parser = ArticleListParser()
+    parser.feed(articles_html.text)
+    global_articles += (parser.articles)
class ArticleReturnParser(HTMLParser):
+    "Given an article, parse out the futures returns in it"
+    
+    record_font_tags = False
+    in_font_tag = False
+    counter = 0
+    # data = {} # See __init__
+    
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.data = {}
+    
+    def handle_starttag(self, tag, attrs):
+        if tag == 'span' and ('itemprop', 'datePublished') in attrs:
+            date_string = list(filter(lambda x: x[0] == 'content', attrs))[0][1]
+            date = dtparser.parse(date_string)
+            self.data['date'] = date
+            
+        self.in_font_tag = tag == 'font'
+        
+    def safe_float(self, string):
+        try:
+            return float(string[:-1]) / 100
+        except ValueError:
+            return np.NaN
+            
+    def handle_data(self, content):
+        if not self.record_font_tags and "Futures at 6" in content:
+            self.record_font_tags = True
+            
+        if self.record_font_tags and self.in_font_tag:
+            if self.counter == 0:
+                self.data['DOW'] = self.safe_float(content)
+            elif self.counter == 1:
+                self.data['S&P'] = self.safe_float(content)
+            elif self.counter == 2:
+                self.data['NASDAQ'] = self.safe_float(content)
+            elif self.counter == 3:
+                self.data['Crude'] = self.safe_float(content)
+            elif self.counter == 4:
+                self.data['Gold'] = self.safe_float(content)
+            
+            self.counter += 1
+            
+    def handle_endtag(self, tag):
+        self.in_font_tag = False
+
+def retrieve_data(url):
+    sa = "http://seekingalpha.com"
+    article_html = requests.get(sa + url,
+                               headers={"User-Agent": "Wget/1.13.4"})
+    parser = ArticleReturnParser()
+    parser.feed(article_html.text)
+    parser.data.update({"url": url})
+    parser.data.update({"text": article_html.text})
+    return parser.data
+
+# This copy **MUST** be in place. I'm not sure why,
+# as you'd think that the data being returned would already
+# represent a different memory location. Even so, it blows up
+# if you don't do this.
+article_list = list(set(global_articles))
+article_data = [copy(retrieve_data(url)) for url in article_list]
+# If there's an issue downloading the article, drop it.
+article_df = pd.DataFrame.from_dict(article_data).dropna()

Fetching the Returns data

Now that we have the futures data, we're going to compare across 4 different indices - the S&P 500 index, Dow Jones Industrial, Russell 2000, and NASDAQ 100. Let's get the data off of Quandl to make things easier!

# article_df is sorted by date, so we get the first row.
+start_date = article_df.sort_values(by='date').iloc[0]['date'] - relativedelta(days=1)
+SPY = Quandl.get("GOOG/NYSE_SPY", trim_start=start_date)
+DJIA = Quandl.get("GOOG/AMS_DIA", trim_start=start_date)
+RUSS = Quandl.get("GOOG/AMEX_IWM", trim_start=start_date)
+NASDAQ = Quandl.get("GOOG/EPA_QQQ", trim_start=start_date)
+

Running the Comparison

There are two tests I want to run: how accurate each futures category is at predicting the index's opening change over the prior day's close, and how accurate it is at predicting the index's daily return.


Let's first calculate how good each future is at predicting the opening return over the previous day. I expect that the futures will be more than 50% accurate, since the information is recorded 3 hours before the markets open.

def calculate_opening_ret(frame):
+    # I'm not a huge fan of the appending for loop,
+    # but it's a bit verbose for a comprehension
+    data = {}
+    for i in range(1, len(frame)):
+        date = frame.iloc[i].name
+        prior_close = frame.iloc[i-1]['Close']
+        open_val = frame.iloc[i]['Open']
+        data[date] = (open_val - prior_close) / prior_close
+        
+    return data
+
+SPY_open_ret = calculate_opening_ret(SPY)
+DJIA_open_ret = calculate_opening_ret(DJIA)
+RUSS_open_ret = calculate_opening_ret(RUSS)
+NASDAQ_open_ret = calculate_opening_ret(NASDAQ)
+
+def signs_match(list_1, list_2):
+    # This is a surprisingly difficult task - we have to match
+    # up the dates in order to check if opening returns actually match
+    index_dict_dt = {key.to_datetime(): list_2[key] for key in list_2.keys()}
+    
+    matches = []
+    for row in list_1.iterrows():
+        row_dt = row[1][1]
+        row_value = row[1][0]
+        index_dt = datetime(row_dt.year, row_dt.month, row_dt.day)
+        if index_dt in list_2:
+            index_value = list_2[index_dt]
+            if (row_value > 0 and index_value > 0) or \
+                (row_value < 0 and index_value < 0) or \
+                (row_value == 0 and index_value == 0):
+                    matches += [1]
+            else:
+                matches += [0]
+            #print("{}".format(list_2[index_dt]))
+    return matches
+    
+    
+prediction_dict = {}
+matches_dict = {}
+count_dict = {}
+index_dict = {"SPY": SPY_open_ret, "DJIA": DJIA_open_ret, "RUSS": RUSS_open_ret, "NASDAQ": NASDAQ_open_ret}
+indices = ["SPY", "DJIA", "RUSS", "NASDAQ"]
+futures = ["Crude", "Gold", "DOW", "NASDAQ", "S&P"]
+for index in indices:
+    matches_dict[index] = {future: signs_match(article_df[[future, 'date']],
+                                               index_dict[index]) for future in futures}
+    count_dict[index] = {future: len(matches_dict[index][future]) for future in futures}
+    prediction_dict[index] = {future: np.mean(matches_dict[index][future])
+                              for future in futures}
+print("Articles Checked: ")
+print(pd.DataFrame.from_dict(count_dict))
+print()
+print("Prediction Accuracy:")
+print(pd.DataFrame.from_dict(prediction_dict))
+
Articles Checked: 
+        DJIA  NASDAQ  RUSS  SPY
+Crude    268     268   271  271
+DOW      268     268   271  271
+Gold     268     268   271  271
+NASDAQ   268     268   271  271
+S&P      268     268   271  271
+
+Prediction Accuracy:
+            DJIA    NASDAQ      RUSS       SPY
+Crude   0.544776  0.522388  0.601476  0.590406
+DOW     0.611940  0.604478  0.804428  0.841328
+Gold    0.462687  0.455224  0.464945  0.476015
+NASDAQ  0.615672  0.608209  0.797048  0.830258
+S&P     0.604478  0.597015  0.811808  0.848708

This data is very interesting. Some insights:

  • Both DOW and NASDAQ futures are pretty bad at predicting their actual market openings
  • NASDAQ and Dow are fairly unpredictable; Russell 2000 and S&P are very predictable
  • Gold is a poor predictor in general - intuitively Gold should move inverse to the market, but it appears to be about as accurate as a coin flip.

All said though it appears that futures data is important for determining market direction for both the S&P 500 and Russell 2000. Cramer is half-right: futures data isn't very helpful for the Dow and NASDAQ indices, but is great for the S&P and Russell indices.


The next step - Predicting the close

Given the code we currently have, I'd like to predict the close of the market as well. We can re-use most of the code, so let's see what happens:

def calculate_closing_ret(frame):
+    # I'm not a huge fan of the appending for loop,
+    # but it's a bit verbose for a comprehension
+    data = {}
+    for i in range(0, len(frame)):
+        date = frame.iloc[i].name
+        open_val = frame.iloc[i]['Open']
+        close_val = frame.iloc[i]['Close']
+        data[date] = (close_val - open_val) / open_val
+        
+    return data
+
+SPY_close_ret = calculate_closing_ret(SPY)
+DJIA_close_ret = calculate_closing_ret(DJIA)
+RUSS_close_ret = calculate_closing_ret(RUSS)
+NASDAQ_close_ret = calculate_closing_ret(NASDAQ)
+
+def signs_match(list_1, list_2):
+    # This is a surprisingly difficult task - we have to match
+    # up the dates in order to check if opening returns actually match
+    index_dict_dt = {key.to_datetime(): list_2[key] for key in list_2.keys()}
+    
+    matches = []
+    for row in list_1.iterrows():
+        row_dt = row[1][1]
+        row_value = row[1][0]
+        index_dt = datetime(row_dt.year, row_dt.month, row_dt.day)
+        if index_dt in list_2:
+            index_value = list_2[index_dt]
+            if (row_value > 0 and index_value > 0) or \
+                (row_value < 0 and index_value < 0) or \
+                (row_value == 0 and index_value == 0):
+                    matches += [1]
+            else:
+                matches += [0]
+            #print("{}".format(list_2[index_dt]))
+    return matches
+    
+    
+matches_dict = {}
+count_dict = {}
+prediction_dict = {}
+index_dict = {"SPY": SPY_close_ret, "DJIA": DJIA_close_ret,
+              "RUSS": RUSS_close_ret, "NASDAQ": NASDAQ_close_ret}
+indices = ["SPY", "DJIA", "RUSS", "NASDAQ"]
+futures = ["Crude", "Gold", "DOW", "NASDAQ", "S&P"]
+for index in indices:
+    matches_dict[index] = {future: signs_match(article_df[[future, 'date']],
+                                               index_dict[index]) for future in futures}
+    count_dict[index] = {future: len(matches_dict[index][future]) for future in futures}
+    prediction_dict[index] = {future: np.mean(matches_dict[index][future])
+                              for future in futures}
+    
+print("Articles Checked:")
+print(pd.DataFrame.from_dict(count_dict))
+print()
+print("Prediction Accuracy:")
+print(pd.DataFrame.from_dict(prediction_dict))
+
Articles Checked:
+        DJIA  NASDAQ  RUSS  SPY
+Crude    268     268   271  271
+DOW      268     268   271  271
+Gold     268     268   271  271
+NASDAQ   268     268   271  271
+S&P      268     268   271  271
+
+Prediction Accuracy:
+            DJIA    NASDAQ      RUSS       SPY
+Crude   0.533582  0.529851  0.501845  0.542435
+DOW     0.589552  0.608209  0.535055  0.535055
+Gold    0.455224  0.451493  0.483395  0.512915
+NASDAQ  0.582090  0.626866  0.531365  0.538745
+S&P     0.585821  0.608209  0.535055  0.535055

Well, it appears that the futures data is terrible at predicting market close. NASDAQ predicting NASDAQ is the most interesting data point, but 63% accuracy isn't accurate enough to make money consistently.


Final sentiments

The data bears out pretty much what I expected would happen:

  • Futures data is more accurate than a coin flip for predicting openings, which makes sense since it is recorded only 3 hours before the actual opening
  • Futures data is about as accurate as a coin flip for predicting closings, which means there is no money to be made in trying to predict the market direction for the day given the futures data.

In summary:

  • Cramer is half right: Futures data is not good for predicting the market open of the Dow and NASDAQ indices. Contrary to Cramer though, it is very good for predicting the S&P and Russell indices - we can achieve an accuracy slightly over 80% for each.
  • Making money in the market is hard. We can't just go to the futures and treat them as an oracle for where the market will close.

I hope you've enjoyed this post - I had a lot of fun taking a deep dive into the analytics this way. I'll be posting more soon!

diff --git a/the-unfair-casino.html b/the-unfair-casino.html
index 6b53198..bfbc7f7 100644
--- a/the-unfair-casino.html
+++ b/the-unfair-casino.html
@@ -73,7 +86,2729 @@

{% notebook 2016-5-15-the-unfair-casino.ipynb %}


Or, how to get thrown out of a casino because you're a mathematician.


In the ongoing eternal cycle of mathematicians asking generally useless questions about probability, I dreamt up another one. The scenario is as follows:


You're playing a game with two dice, and you do not get to see what the outcome of each die is on any roll. All you get to see is their sum. Given an arbitrarily long list of the sums of two rolls, can you determine if one or both dice are loaded, and what those loadings are?


Proving we can detect cheating

My first question is simply: is this possible? There are a lot of trivial cases that make it obvious that there's cheating going on. But there are some edge cases that might give us more difficulty. First though, let's get a picture of what the fair distribution looks like. In principle, we can only detect cheating if the distribution of the fair dice differs from the distribution of the loaded dice.

import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+%matplotlib inline
+
+fair_1 = np.random.randint(1, 7, 10000)
+fair_2 = np.random.randint(1, 7, 10000)
+
+pd.Series(fair_1 + fair_2).plot(kind='hist', bins=11);
+plt.title('Fair Distribution');

This distribution makes sense: there are many ways to make a 7 (the most frequent observed value) and very few ways to make a 12 or a 2 - an important symmetry. As a special note, you can see that the sum of two fair dice is a discrete case of the Triangle Distribution, which is itself a special case of the Irwin-Hall Distribution.
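For reference, the fair distribution also has a simple closed form (a standard result, stated here for convenience rather than taken from the post): counting the combinations that produce each total gives

\begin{align}
P(D_{fair} + D_{fair} = k) &= \frac{6 - |k - 7|}{36}, \qquad k = 2, \ldots, 12
\end{align}

so a 7 appears with probability $6/36$, while a 2 or a 12 appears with probability $1/36$ - matching the shape of the histogram above.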


The Edge Cases

Given that we understand how the results of two fair dice are distributed, let's see some of the interesting edge cases that come up. This will give us assurance that when a casino is cheating, it is detectable (given sufficient data). To make this as hard as possible, we will think of scenarios where the expected value of the sum of loaded dice is the same as the expected value of the sum of fair dice.


Edge Case 1

What happens when one die is biased low, and one die is biased high? That is, where:

\begin{align}
\begin{array}{cc}
D_1 = \left\{
\begin{array}{lr}
1 & w.p. 1/3\\
2 & w.p. 1/3\\
3 & w.p. 1/12\\
4 & w.p. 1/12\\
5 & w.p. 1/12\\
6 & w.p. 1/12
\end{array}
\right. &
D_2 = \left\{
\begin{array}{lr}
1 & w.p. 1/12\\
2 & w.p. 1/12\\
3 & w.p. 1/12\\
4 & w.p. 1/12\\
5 & w.p. 1/3\\
6 & w.p. 1/3
\end{array}
\right. \\
\mathbb{E}[D_1] = 2.5 & \mathbb{E}[D_2] = 4.5
\end{array}\\
\mathbb{E}[D_1 + D_2] = 7 = \mathbb{E}[D_{fair} + D_{fair}]
\end{align}
def unfair_die(p_vals, n):
+    x = np.random.multinomial(1, p_vals, n)
+    return x.nonzero()[1] + 1
+
+d1 = [1/3, 1/3, 1/12, 1/12, 1/12, 1/12]
+d2 = [1/12, 1/12, 1/12, 1/12, 1/3, 1/3]
+
+x1 = unfair_die(d1, 10000)
+x2 = unfair_die(d2, 10000)
+
+pd.Series(x1 + x2).plot(kind='hist', bins=11);
+plt.title('$D_1$ biased low, $D_2$ biased high');

We can see that while the 7 value remains the most likely (as expected), the distribution is not so nicely shaped any more.


Edge Case 2: When one die is loaded low and one is loaded high, we've seen how we can detect it. How about when the two dice are each loaded toward both low and high values? That is, we have the following distribution:

\begin{align}
\begin{array}{cc}
D_1 = \left\{
\begin{array}{lr}
1 & w.p. 1/3\\
2 & w.p. 1/12\\
3 & w.p. 1/12\\
4 & w.p. 1/12\\
5 & w.p. 1/12\\
6 & w.p. 1/3
\end{array}
\right. &
D_2 = \left\{
\begin{array}{lr}
1 & w.p. 1/3\\
2 & w.p. 1/12\\
3 & w.p. 1/12\\
4 & w.p. 1/12\\
5 & w.p. 1/12\\
6 & w.p. 1/3
\end{array}
\right. \\
\mathbb{E}[D_1] = 3.5 & \mathbb{E}[D_2] = 3.5
\end{array}\\
\mathbb{E}[D_1 + D_2] = 7 = \mathbb{E}[D_{fair} + D_{fair}]
\end{align}

We can see that even the expected value of each individual die is the same as that of a fair die! However, the distribution (if we are doing this correctly) should still be skewed:

d1 = [1/3, 1/12, 1/12, 1/12, 1/12, 1/3]
+d2 = d1
+
+x1 = unfair_die(d1, 10000)
+x2 = unfair_die(d2, 10000)
+
+pd.Series(x1 + x2).plot(kind='hist', bins=11)
+plt.title("$D_1$ and $D_2$ biased to 1 and 6");

In a very un-subtle way, we have of course made the values 2 and 12 far more likely.


Detection Conclusion

There are some trivial examples of cheating that are easy to detect: whenever the expected value of the sum of the dice in play deviates from the expected value of the sum of two fair dice, we can immediately conclude that there is cheating at stake.


The interesting edge cases occur when the expected value of the sum of loaded dice matches the expected value of the sum of fair dice. Considering the above examples (and a couple more I ran through in developing this), we have seen that in every circumstance having two unfair dice leads to a distribution of results different from the fair results.


We can thus finally state: just by looking at the distribution of results from this game, we can immediately conclude whether there is cheating.


Simulated Annealing

What we really would like to do though, is see if there is any way to determine how exactly the dice are loaded. This is significantly more complicated, but we can borrow some algorithms from Machine Learning to figure out exactly how to perform this process. I'm using the Simulated Annealing algorithm, and I discuss why this works and why I chose it over some of the alternatives in the justification. If you don't care about how I set up the model and just want to see the code, check out the actual code.


Simulated Annealing is a variation of the Metropolis-Hastings Algorithm, but the important thing for us is: Simulated Annealing allows us to quickly optimize high-dimensional problems. But what exactly are we trying to optimize? Ideally, we want a function that can tell us whether one distribution for the dice better explains the results than another distribution. This is known as the likelihood function.


Deriving the Likelihood function

To derive our likelihood function, we want to know: what is the probability of seeing a specific result given those hidden parameters? This is actually a surprisingly difficult problem. While we can do a lot of the calculations by hand, we need a more general solution since we will be working with some very interesting dice distributions.


We first note that the sum of two dice can take on 11 different values - 2 through 12. This implies that each individual sum follows a Categorical distribution. That is:

\begin{align}
\mathcal{L}(x) = \left\{
\begin{array}{lr}
p_2 & x = 2\\
p_3 & x = 3\\
\ldots & \\
p_{11} & x = 11\\
p_{12} & x = 12
\end{array}
\right.
\end{align}

Where each $p_i$ is the probability of seeing that specific sum. However, we need to calculate what each probability is! I'll save you the full details, but this author explains how to do it: the distribution of the sum is the convolution of the two individual dice distributions, which you can compute by multiplying two polynomials whose coefficients are the face probabilities (this is exactly what the density_coef function below does).


Now, we would like to know the likelihood of our entire data-set. This is trivial:

\begin{align}
\mathcal{L}(\mathbf{X}) &= \prod_{i=1}^n \mathcal{L}(x_i)
\end{align}

However, it's typically much easier to work with the $\log(\mathcal{L})$ function instead. This is critically important from a computational perspective: when you multiply so many small numbers together (i.e. the product of $L(x)$ terms) the computer suffers from rounding error; if we don't control for this, we will find that no matter the distributions we choose for each die, the "likelihood" will be close to zero because the computer is not precise enough.

\begin{align}
\log(\mathcal{L}(\mathbf{X})) &= \sum_{i=1}^n \log(\mathcal{L}(x_i))
\end{align}
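As a quick numerical illustration of why the log matters (my own toy example, not from the original post): the product of a few hundred small probabilities underflows to zero in double precision, while the sum of their logs behaves perfectly well.

import numpy as np

probs = np.repeat(0.1, 400)   # 400 observations, each with probability 0.1
print(np.prod(probs))         # 0.0 - the raw product underflows and all information is lost
print(np.log(probs).sum())    # about -921.03 - the log-likelihood is perfectly representable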

The process of Simulated Annealing

The means by which we optimize our likelihood function is the simulated annealing algorithm. The way it works is as follows:

  1. Start with a random guess for the parameters we are trying to optimize. In our case we are trying to guess the distribution of two dice, and so we "optimize" until we have a distribution that matches the data.
  2. For each iteration of the algorithm:
    1. Generate a new "proposed" set of parameters based on the current parameters - i.e. slightly modify the current parameters to get a new set of parameters.
    2. Calculate the value of $\log(\mathcal{L})$ for each set of parameters. If the function value for the proposed parameter set is higher than for the current, automatically switch to the new parameter set and continue the next iteration.
    3. Given the new parameter set performs worse, determine a probability of switching to the new parameter set anyways: $\mathcal{P}(p_{current}, p_{proposed})$. In the code below this is $e^{\log\mathcal{L}_{proposed} - \log\mathcal{L}_{current}}$.
    4. Switch to the new parameter set with probability $\mathcal{P}$. If you fail to switch, begin the next iteration.
  3. The algorithm is complete after we fail to make a transition $n$ times in a row.

If everything goes according to plan, we will have a value that is close to the true distribution of each die.


The actual code

We start by defining the score function. This will tell us how well the proposed die densities actually explain the results.

import numpy as np
+from numpy import polynomial
+
+def density_coef(d1_density, d2_density):
+    # Calculating the probabilities of each outcome was taken
+    # from this author: http://math.stackexchange.com/a/1710392/320784
+    d1_p = polynomial.Polynomial(d1_density)
+    d2_p = polynomial.Polynomial(d2_density)
+    coefs = (d1_p * d2_p).coef
+    return coefs
+
+def score(x, d1_density, d2_density):
+    # We've now got the probabilities of each event, but we need
+    # to shift the array a bit so we can use the x values to actually
+    # index into it. This will allow us to do all the calculations
+    # incredibly quickly
+    coefs = density_coef(d1_density, d2_density)
+    coefs = np.hstack((0, 0, coefs))
+    return np.log(coefs[x]).sum()

Afterward, we need to write something to permute the proposal densities. We make random modifications, and eventually the best one survives.

def permute(d1_density, d2_density):
+    # To ensure we have legitimate densities, we will randomly
+    # increase one die face probability by `change`,
+    # and decrease one by `change`.
+    # This means there are something less than (1/`change`)^12 possibilities
+    # we are trying to search over.
+    change = .01
+    
+    d1_index1, d1_index2 = np.random.randint(0, 6, 2)
+    d2_index1, d2_index2 = np.random.randint(0, 6, 2)
+    
+    # Also make sure to copy. I've had some weird aliasing issues
+    # in the past that made everything blow up.
+    new_d1 = np.float64(np.copy(d1_density))
+    new_d2 = np.float64(np.copy(d2_density))
+    
+    # While this doesn't account for the possibility that some
+    # values go negative, in practice this never happens
+    new_d1[d1_index1] += change
+    new_d1[d1_index2] -= change
+    new_d2[d2_index1] += change
+    new_d2[d2_index2] -= change
+    
+    return new_d1, new_d2

Now for the main algorithm code. This is what brings all the pieces together.

def optimize(data, conv_count=10, max_iter=1e4):
+    switch_failures = 0
+    iter_count = 0
+    
+    # Start with guessing fair dice
+    cur_d1 = np.repeat(1/6, 6)
+    cur_d2 = np.repeat(1/6, 6)
+    cur_score = score(data, cur_d1, cur_d2)
+    
+    # Keep track of our best guesses - may not be
+    # what we end with
+    max_score = cur_score
+    max_d1 = cur_d1
+    max_d2 = cur_d2
+    
+    # Optimization stops when we have failed to switch `conv_count`
+    # times (presumably because we have a great guess), or we reach
+    # the maximum number of iterations.
+    while switch_failures < conv_count and iter_count < max_iter:
+        iter_count += 1
+        if iter_count % (max_iter / 10) == 0:
+            print('Iteration: {}; Current score (higher is better): {}'.format(
+                    iter_count, cur_score))
+                
+        new_d1, new_d2 = permute(cur_d1, cur_d2)
+        new_score = score(data, new_d1, new_d2)
+        
+        if new_score > max_score:
+            max_score = new_score
+            max_d1 = new_d1
+            max_d2 = new_d2
+            
+        if new_score > cur_score:
+            # If the new permutation beats the old one,
+            # automatically select it.
+            cur_score = new_score
+            cur_d1 = new_d1
+            cur_d2 = new_d2
+            switch_failures = 0
+        else:
+            # We didn't beat the current score, but allow
+            # for possibly switching anyways.
+            accept_prob = np.exp(new_score - cur_score)
+            coin_toss = np.random.rand()
+            if coin_toss < accept_prob:
+                # We randomly switch to the new distribution
+                cur_score = new_score
+                cur_d1 = new_d1
+                cur_d2 = new_d2
+                switch_failures = 0
+            else:
+                switch_failures += 1
+            
+    # Return both our best guess, and the ending guess
+    return max_d1, max_d2, cur_d1, cur_d2

And now we have finished the hard work!


Catching the Casino

Let's go through a couple of scenarios and see if we can catch the casino cheating with some loaded dice. In every scenario we start with an assumption of fair dice, and then try our hand to figure out what the actual distribution was.


Attempt 1

The casino is using two dice that are both biased low. How well can we recover the distribution?

import time
+def simulate_casino(d1_dist, d2_dist, n=10000):
+    d1_vals = unfair_die(d1_dist, n)
+    d2_vals = unfair_die(d2_dist, n)
+    
+    start = time.perf_counter()
+    max_d1, max_d2, final_d1, final_d2 = optimize(d1_vals + d2_vals)
+    end = time.perf_counter()
+    print("Simulated Annealing time: {:.02f}s".format(end - start))
+    
+    coef_range = np.arange(2, 13) - .5
+    plt.subplot(221)
+    plt.bar(coef_range, density_coef(d1_dist, d2_dist), width=1)
+    plt.title('True Distribution')
+    
+    plt.subplot(222)
+    plt.hist(d1_vals + d2_vals, bins=11)
+    plt.title('Empirical Distribution')
+    
+    plt.subplot(223)
+    plt.bar(coef_range, density_coef(max_d1, max_d2), width=1)
+    plt.title('Recovered Distribution')
+    
+    plt.gcf().set_size_inches(10, 10)
+
+
+simulate_casino([2/9, 2/9, 2/9, 1/9, 1/9, 1/9],
+                [2/9, 2/9, 2/9, 1/9, 1/9, 1/9])
Iteration: 1000; Current score (higher is better): -22147.004400281654
+Simulated Annealing time: 0.30s

Attempt 2

The casino now uses dice that are both biased towards 1 and 6.

simulate_casino([1/3, 1/12, 1/12, 1/12, 1/12, 1/3],
+                [1/3, 1/12, 1/12, 1/12, 1/12, 1/3])
Simulated Annealing time: 0.08s

Attempt 3

The casino will now use one die biased towards 1 and 6, and one die towards 3 and 4.

simulate_casino([1/3, 1/12, 1/12, 1/12, 1/12, 1/3],
+                [1/12, 1/12, 1/3, 1/3, 1/12, 1/12])
Simulated Annealing time: 0.09s

Attempt 4

We'll now finally go to a fair casino, to make sure we can still recognize when there's no cheating going on.

simulate_casino(np.repeat(1/6, 6), np.repeat(1/6, 6))
Simulated Annealing time: 0.02s

Attempt 5

We've so far been working with a large amount of data - 10,000 data points. Can we now scale things back to only 250 throws? We'll start with two dice biased high.

simulate_casino([1/9, 1/9, 1/9, 2/9, 2/9, 2/9],
+                [1/9, 1/9, 1/9, 2/9, 2/9, 2/9],
+                n=250)
Iteration: 1000; Current score (higher is better): -551.6995384525453
+Iteration: 2000; Current score (higher is better): -547.7803673440676
+Iteration: 3000; Current score (higher is better): -547.9805613193807
+Iteration: 4000; Current score (higher is better): -546.7574874775273
+Iteration: 5000; Current score (higher is better): -549.5798007672656
+Iteration: 6000; Current score (higher is better): -545.0354060154496
+Iteration: 7000; Current score (higher is better): -550.1134504086606
+Iteration: 8000; Current score (higher is better): -549.9306537114975
+Iteration: 9000; Current score (higher is better): -550.7075182119111
+Iteration: 10000; Current score (higher is better): -549.400679551826
+Simulated Annealing time: 1.94s

The results are surprisingly good. While the actual optimization process took much longer to finish than in the other examples, we still have a very good guess. As a caveat though: the recovered distribution tends to overfit the data. That is, if the data doesn't fit the underlying distribution well, the model will also fail.


Conclusion

Given the results above, we can see that we have indeed come up with a very good algorithm to determine the distribution of two dice given their results. As a benefit, we have even seen that results come back very quickly; it's not uncommon for the optimization to converge within a tenth of a second.


Additionally, we have seen that the algorithm can intuit the distribution even when there is not much data. While the final example shows that we can 'overfit' on the dataset, we can still get valuable information from a relatively small dataset.


We can declare at long last: the mathematicians have again triumphed over the casino.


Justification of Simulated Annealing

Why Simulated Annealing?

So why even use an algorithm with a fancy title like Simulated Annealing? First of all, because the title is sexy. Second of all, because this is a reasonably complicated problem to try and solve. We have a parameter space where each value $p_{ij} \in (0, 1)$, with $i \in \{1, 2\}$ indexing the die and $j \in \{1, \ldots, 6\}$ the face, for a total of 12 different variables we are trying to optimize over. Additionally, given a 12-dimensional function to optimize, simulated annealing helps make sure that we don't get stuck in a local maximum.


Why not something else?

This is a fair question. There are two classes of algorithms that can also be used to solve this problem: Non-linear optimization methods, and the EM algorithm.

  1. I chose not to use non-linear optimization simply because I'm a bit concerned that it will trap me in a local maximum. Instead of running multiple different optimizations from different starting points, I can just use simulated annealing to take that into account. In addition, throughout the course of testing, the simulated annealing code converged incredibly quickly - far more quickly than any non-linear solver would be able to accomplish.
  2. The EM Algorithm was originally what I intended to write this blog post with. Indeed, the post was inspired by the crooked casino example, which uses the EM algorithm to solve it. However, after modeling the likelihood function I realized that the algebra would very quickly get out of hand. Trying to compute all the polynomial terms needed to actually optimize each parameter would not be fun. So while the EM algorithm would likely be much faster in raw speed terms, the amount of time needed to program and verify it meant that I was far better off using a different method for optimization.
diff --git a/tick-tock.html b/tick-tock.html
index 4325cee..c3cc173 100644
--- a/tick-tock.html
+++ b/tick-tock.html
@@ -54,12 +67,12 @@

Tick Tock...


Bradlee Speice, Wed 06 April 2016, Blog


fitbit, heartrate

@@ -73,7 +86,521 @@

{% notebook 2016-4-6-tick-tock....ipynb %}


If all we have is a finite number of heartbeats left, what about me?


Warning: this one is a bit creepier. But that's what you get when you come up with data science ideas as you're drifting off to sleep.


2.5 Billion

If PBS is right, that's the total number of heartbeats we get. Approximately once every second that number goes down, and down, and down again...

total_heartbeats = 2500000000

I got a Fitbit this past Christmas season, mostly because I was interested in the data and trying to work on some data science projects with it. This is going to be the first project, but there will likely be more (and not nearly as morbid). My idea was: if this is the final number that I'm running up against, how far have I come, and how far am I likely to go? I've currently had about 3 months' time to estimate what my data will look like, so let's go ahead and see: given a lifetime of 2.5 billion heartbeats, how much time do I have left?


Statistical Considerations

Since I'm starting to work with health data, there are a few considerations I think are important before I start digging through my data.

  1. The concept of 2.5 billion as an agreed-upon number is tenuous at best. I've seen estimates anywhere from 2.21 billion to 3.4 billion, so even if I knew exactly how many times my heart had beaten so far, the ending result is suspect. I'm using 2.5 billion because that seems to be about the midpoint of the estimates I've seen so far.
  2. Most of the numbers I've seen so far are based on extrapolating the number of heartbeats from life expectancy. As life expectancy goes up, the number of expected heartbeats goes up too.
  3. My estimate of the number of heartbeats in my life so far is based on 3 months' worth of data, and I'm extrapolating an entire lifetime from it.

So while the ending number is not useful in any medical context, it is still an interesting project to work with the data I have on hand.


Getting the data

Fitbit has an API available for people to pull their personal data off the system. It requires registering an application, authenticating with OAuth, and some other complicated things. If you're not interested in how I fetch the data, skip ahead.


Registering an application

I've already registered a personal application with Fitbit, so I can go ahead and retrieve things like the client secret from a file.

# Import all the OAuth secret information from a local file
+from secrets import CLIENT_SECRET, CLIENT_ID, CALLBACK_URL

Handling OAuth 2

So, all the people who know what OAuth 2 is know what's coming next. For those who don't: OAuth is how you allow an application to access your data without having to give it your password. Essentially the dialog goes like this:

Application: I've got a user here who wants to use my application, but I need their data.
+Fitbit: OK, what data do you need access to, and for how long?
+Application: I need all of these scopes, and for this amount of time.
+Fitbit: OK, let me check with the user to make sure they really want to do this.
+
+Fitbit: User, do you really want to let this application have your data?
+User: I do! And to prove it, here's my password.
+Fitbit: OK, everything checks out. I'll let the application access your data.
+
+Fitbit: Application, you can access the user's data. Use this special value whenever you need to request data from me.
+Application: Thank you, now give me all the data.
+

Effectively, this allows an application to gain access to a user's data without ever needing to know the user's password. That way, even if the other application is hacked, the user's original data remains safe. Plus, the user can let the data service know to stop providing the application access any time they want. All in all, very secure.


It does make handling small requests a bit challenging, but I'll go through the steps here. We'll be using the Implicit Grant workflow, as it requires fewer steps in processing.


First, we need to set up the URL the user would visit to authenticate:

import urllib
+
+FITBIT_URI = 'https://www.fitbit.com/oauth2/authorize'
+params = {
+    # If we need more than one scope, must be a CSV string
+    'scope': 'heartrate',
+    'response_type': 'token',
+    'expires_in': 86400, # 1 day
+    'redirect_uri': CALLBACK_URL,
+    'client_id': CLIENT_ID
+}
+
+request_url = FITBIT_URI + '?' + urllib.parse.urlencode(params)

Now, here you would print out the request URL, go visit it, and get the full URL that it sends you back to. Because that is very sensitive information (specifically containing my CLIENT_ID that I'd really rather not share on the internet), I've skipped that step in the code here, but it happens in the background.

# The `response_url` variable contains the full URL that
+# FitBit sent back to us, but most importantly,
+# contains the token we need for authorization.
+access_token = dict(urllib.parse.parse_qsl(response_url))['access_token']

Requesting the data

Now that we've actually set up our access via the access_token, it's time to get the actual heart rate data. I'll be using data from January 1, 2016 through March 31, 2016, and extrapolating wildly from that.


Fitbit only lets us fetch intraday data one day at a time, so I'll create a date range using pandas and iterate through that to pull down all the data.

from requests_oauthlib import OAuth2Session
+import pandas as pd
+from datetime import datetime
+
+session = OAuth2Session(token={
+        'access_token': access_token,
+        'token_type': 'Bearer'
+    })
+
+format_str = '%Y-%m-%d'
+start_date = datetime(2016, 1, 1)
+end_date = datetime(2016, 3, 31)
+dr = pd.date_range(start_date, end_date)
+
+url = 'https://api.fitbit.com/1/user/-/activities/heart/date/{0}/1d/1min.json'
+hr_responses = [session.get(url.format(d.strftime(format_str))) for d in dr]
+
+def record_to_df(record):
+    if 'activities-heart' not in record:
+        return None
+    date_str = record['activities-heart'][0]['dateTime']
+    df = pd.DataFrame(record['activities-heart-intraday']['dataset'])
+        
+    df.index = df['time'].apply(
+        lambda x: datetime.strptime(date_str + ' ' + x, '%Y-%m-%d %H:%M:%S'))
+    return df
+
+hr_dataframes = [record_to_df(record.json()) for record in hr_responses]
+hr_df_concat = pd.concat(hr_dataframes)
+
+
+# There are some minutes with missing data, so we need to correct that
+full_daterange = pd.date_range(hr_df_concat.index[0],
+                              hr_df_concat.index[-1],
+                              freq='min')
+hr_df_full = hr_df_concat.reindex(full_daterange, method='nearest')
+
+print("Heartbeats from {} to {}: {}".format(hr_df_full.index[0],
+                                            hr_df_full.index[-1],
+                                            hr_df_full['value'].sum()))
Heartbeats from 2016-01-01 00:00:00 to 2016-03-31 23:59:00: 8139060

And now we've retrieved all the available heart rate data for January 1st through March 31st! Let's get to the actual analysis.


Wild Extrapolations from Small Data

A fundamental issue with this data is that there's not very much of it. I'm using 3 months of data to make predictions about my entire life. But, purely as an exercise, I'll move forward.


How many heartbeats so far?

The first step is figuring out how many of the 2.5 billion heartbeats I've used so far. We're going to try and work backward from the present day to when I was born to get that number. The easy part comes first: going back to January 1st, 1992. That's because I can generalize how many 3-month increments there were between now and then, account for leap years, and call that section done.


Between January 1992 and January 2016 there were 96 quarters, and 6 leap days. The number we're looking for is:

\begin{equation}
hr_q \cdot n - hr_d \cdot (n - m)
\end{equation}

  • $hr_q$: Number of heartbeats per quarter
  • $hr_d$: Number of heartbeats on leap day
  • $n$: Number of quarters, in this case 96
  • $m$: Number of leap days, in this case 6
quarterly_count = hr_df_full['value'].sum()
+leap_day_count = hr_df_full[(hr_df_full.index.month == 2) &
+                            (hr_df_full.index.day == 29)]['value'].sum()
+num_quarters = 96
+leap_days = 6
+
+jan_92_jan_16 = quarterly_count * num_quarters - leap_day_count * (num_quarters - leap_days)
+jan_92_jan_16
773609400

So between January 1992 and January 2016 I've used $\approx$ 774 million heartbeats. Now, I need to go back to my exact birthday. I'm going to first find on average how many heartbeats I use in a minute, and multiply that by the number of minutes between my birthday and January 1992.


For privacy purposes I'll put the code here that I'm using, but without any identifying information:

minute_mean = hr_df_full['value'].mean()
+# Don't you wish you knew?
+# birthday_minutes = ???
+
+birthday_heartbeats = birthday_minutes * minute_mean
+
+heartbeats_until_2016 = int(birthday_heartbeats + jan_92_jan_16)
+remaining_2016 = total_heartbeats - heartbeats_until_2016
+
+print("Heartbeats so far: {}".format(heartbeats_until_2016))
+print("Remaining heartbeats: {}".format(remaining_2016))
Heartbeats so far: 775804660
+Remaining heartbeats: 1724195340

It would appear that my heart has beaten 775,804,660 times between my moment of birth and January 1st 2016, and that I have 1.72 billion left.


How many heartbeats longer?

Now comes the tricky bit. I know how many heartbeats I've used so far, and how many I have remaining, so I'd like to come up with a (relatively) accurate estimate of when exactly my heart should give out. We'll do this in a few steps, increasing in granularity.


First step, how many heartbeats do I use in a 4-year period? I have data for a single quarter including leap day, so I want to know:

\begin{equation}
hr_q \cdot n - hr_d \cdot (n - m)
\end{equation}

  • $hr_q$: Heartbeats per quarter
  • $hr_d$: Heartbeats per leap day
  • $n$: Number of quarters = 16
  • $m$: Number of leap days = 1
In [10]:
heartbeats_4year = quarterly_count * 16 - leap_day_count * (16 - 1)
+heartbeats_4year
Out[10]:
128934900

Now, I can fast-forward from 2016 by the number of four-year periods I have left.

In [11]:
four_year_periods = remaining_2016 // heartbeats_4year
+remaining_4y = remaining_2016 - four_year_periods * heartbeats_4year
+
+print("Four year periods remaining: {}".format(four_year_periods))
+print("Remaining heartbeats after 4 year periods: {}".format(remaining_4y))
Four year periods remaining: 13
Remaining heartbeats after 4 year periods: 48041640

Given that there are 13 four-year periods left, I can move from 2016 all the way to 2068, and find that I will have 48 million heartbeats left. Let's drop down to figuring out how many quarters that is. I know that 2068 will have a leap day (unless someone finally decides to get rid of them), so I'll subtract that out first. Then, I'm left to figure out how many quarters exactly are left.
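To make the jump explicit:

\begin{equation}
2016 + 13 \cdot 4 = 2068
\end{equation}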

In [12]:
remaining_leap = remaining_4y - leap_day_count
+# Ignore leap day in the data set
+# (note: this filter also drops the rest of February and every 29th of any month,
+# not just February 29th; ~((month == 2) & (day == 29)) would target leap day alone)
+heartbeats_quarter = hr_df_full[(hr_df_full.index.month != 2) &
+                                (hr_df_full.index.day != 29)]['value'].sum()
+quarters_left = remaining_leap // heartbeats_quarter
+remaining_year = remaining_leap - quarters_left * heartbeats_quarter
+
+print("Quarters left starting 2068: {}".format(quarters_left))
+print("Remaining heartbeats after that: {}".format(remaining_year))
Quarters left starting 2068: 8
Remaining heartbeats after that: 4760716

So, that analysis gets me through until January 1st 2070. Final step, using that minute estimate to figure out how many minutes past that I'm predicted to have:

In [13]:
from datetime import datetime, timedelta
+
+base = datetime(2070, 1, 1)
+minutes_left = remaining_year // minute_mean
+
+kaput = timedelta(minutes=minutes_left)
+base + kaput
Out[13]:
datetime.datetime(2070, 2, 23, 5, 28)

According to this, I've got until February 23rd, 2070 at 5:28 in the morning before my heart gives out.


Summary

Well, that's kind of a creepy date to know. As I said at the top though, this number is totally useless in any medical context. It ignores the rate at which we keep getting better at making people live longer, and it extrapolates the rest of my life from just 3 months' worth of data. Additionally, throughout my time developing this post I made many minor mistakes. I think they're all fixed now, but it's easy to mix up a number here or there and throw the analysis off by a couple of years.


Even still, I think philosophically humans have a desire to know how much time we have left in the world. Man is but a breath, and it's scary to think just how quickly that date may be coming up. This analysis asks an important question though: what are you going to do with the time you have left?


Thanks for sticking with me on this one, I promise it will be much less depressing next time!

@@ -93,6 +634,7 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['\$','\$']]}});
diff --git a/tweet-like-me.html b/tweet-like-me.html index e5cd06e..f9fda20 100644 --- a/tweet-like-me.html +++ b/tweet-like-me.html @@ -4,20 +4,22 @@ - + - + Tweet Like Me - Bradlee Speice - - - - + + + + + + @@ -26,6 +28,17 @@ + + + @@ -39,7 +52,7 @@
@@ -54,12 +67,12 @@

Tweet Like Me

-

Bradlee Speice, Mon 28 March 2016, Blog

+

Bradlee Speice, Mon 28 March 2016, Blog

-MCMC, twitter

+MCMC, twitter

@@ -73,7 +86,577 @@
-

{% notebook 2016-3-28-tweet-like-me.ipynb %}


An experiment in creating a robot that will imitate me on Twitter.


So, I'm taking a Machine Learning course this semester in school, and one of the topics we keep coming back to is natural language processing and the 'bag of words' data structure. That is, given a sentence:


How much wood would a woodchuck chuck if a woodchuck could chuck wood?


We can represent that sentence as the following list:


{
  How: 1
  much: 1
  wood: 2
  would: 1
  a: 2
  woodchuck: 2
  chuck: 2
  if: 1
  could: 1
}


Ignoring where the words appear in the sentence, we're just interested in how often they occur. That got me thinking: I wonder what would happen if I built a robot that just imitated how often I said things? It's dangerous territory when computer scientists ask "what if," but I got curious enough that I wanted to follow through.


The Objective

Given an input list of Tweets, build up the following things:

1. The distribution of starting words; since there are no "prior" words to go from, we need to treat this as a special case.
2. The distribution of words given a previous word; for example, every time I use the word woodchuck in the example sentence, there is a 50% chance it is followed by chuck and a 50% chance it is followed by could. I need this distribution for all words (a toy sketch of this follows the list).
3. The distribution of the quantity of hashtags; do I most often use just one? Two? Do they follow something like a Poisson distribution?
4. The distribution of hashtags; given a number of hashtags, what is the actual content? I'll treat hashtags as separate from the content of a tweet.
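For item 2, here's a toy sketch of that conditional distribution, built from the woodchuck sentence alone rather than the real tweet data:

# Successor lists for the example sentence: "woodchuck" is followed by
# "chuck" once and "could" once, giving the 50/50 split described above.
sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood".split()

following = {}
for current, nxt in zip(sentence, sentence[1:]):
    following.setdefault(current, []).append(nxt)

following["woodchuck"]  # ['chuck', 'could']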

The Data

I'm using as input my tweet history. I don't really use Twitter anymore, but it seems like a fun use of the dataset. I'd like to eventually build this to a point where I can imitate anyone on Twitter using their last 100 tweets or so, but I'll start with this as example code.


The Algorithm

I'll be using the NLTK library for doing a lot of the heavy lifting. First, let's import the data:

In [1]:
import pandas as pd
+
+tweets = pd.read_csv('tweets.csv')
+text = tweets.text
+
+# Don't include tweets in reply to or mentioning people
+replies = text.str.contains('@')
+text_norep = text.loc[~replies]

And now that we've got data, let's start crunching. First, tokenize and build out the distribution of first word:

In [2]:
from nltk.tokenize import TweetTokenizer
+tknzr = TweetTokenizer()
+tokens = text_norep.map(tknzr.tokenize)
+
+first_words = tokens.map(lambda x: x[0])
+first_words_alpha = first_words[first_words.str.isalpha()]
+first_word_dist = first_words_alpha.value_counts() / len(first_words_alpha)

Next, we need to build out the conditional distributions. That is, what is the probability of the next word given that the current word is $X$? This one is a bit more involved. First, find all unique words, and then find what words follow them. This can probably be done in a more efficient manner than I'm currently doing here (see the sketch after this cell), but we'll ignore that for the moment.

In [3]:
from functools import reduce
+
+# Get all possible words
+all_words = reduce(lambda x, y: x+y, tokens, [])
+unique_words = set(all_words)
+actual_words = set([x if x[0] != '.' else None for x in unique_words])
+
+word_dist = {}
+for word in iter(actual_words):
+    indices = [i for i, j in enumerate(all_words) if j == word]
+    proceeding = [all_words[i+1] for i in indices]
+    word_dist[word] = proceeding
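
As a sketch of the efficiency point above: a single pass with a defaultdict builds the same kind of successor lists without re-scanning all_words once per word (it would still need the same '.'-prefix filtering to match the dictionary above exactly):

from collections import defaultdict

# One pass over the token list from the cell above: append each token's
# successor as we go. zip() stops before the final token, so nothing
# runs past the end of the list.
word_dist_fast = defaultdict(list)
for current, nxt in zip(all_words, all_words[1:]):
    word_dist_fast[current].append(nxt)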

Now that we've got the tweet analysis done, it's time for the fun part: hashtags! Let's count how many hashtags are in each tweet; I want to get a sense of the distribution.

In [4]:
import matplotlib.pyplot as plt
+%matplotlib inline
+
+hashtags = text_norep.str.count('#')
+bins = hashtags.unique().max()
+hashtags.plot(kind='hist', bins=bins)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x18e59dc28d0>
[Histogram: number of hashtags per tweet]

That looks like a Poisson distribution, kind of as I expected. I'm guessing my number of hashtags per tweet is $\sim Poi(1)$, but let's actually find the maximum likelihood estimator, which for a Poisson rate is just the sample mean $\bar{x}$:
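To spell out that step: for counts $x_1, \ldots, x_n \sim Poi(\lambda)$, maximizing the log-likelihood gives the sample mean as the estimate.

\begin{equation}
\frac{d}{d\lambda} \sum_{i=1}^{n} \left( x_i \log \lambda - \lambda - \log x_i! \right)
= \frac{\sum_i x_i}{\lambda} - n = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}
\end{equation}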

In [5]:
mle = hashtags.mean()
+mle
Out[5]:
0.870236869207003

Pretty close! So we can now simulate how many hashtags are in a tweet. Let's also find what hashtags are actually used:

In [6]:
hashtags = [x for x in all_words if x[0] == '#']
+n_hashtags = len(hashtags)
+
+unique_hashtags = list(set([x for x in unique_words if x[0] == '#']))
+hashtag_dist = pd.DataFrame({'hashtags': unique_hashtags,
+                             'prob': [all_words.count(h) / n_hashtags
+                                      for h in unique_hashtags]})
+len(hashtag_dist)
Out[6]:
603

Turns out I have used 603 different hashtags during my time on Twitter. That means I was using a unique hashtag for about every third tweet.


In better news though, we now have all the data we need to go about actually constructing tweets! The process will happen in a few steps:

1. Randomly select what the first word will be.
2. Randomly select the number of hashtags for this tweet, and then select the actual hashtags.
3. Fill in the remaining space of 140 characters with random words taken from my tweets.

And hopefully, we won't have anything too crazy come out the other end. The way we do the selection follows a multinomial distribution: given a set of different values, each with a specific probability, pick one. Let's give a quick example:

x: .33
y: .5
z: .17

That is, I pick x with probability 33%, y with probability 50%, and so on. In the context of our sentence construction, I've already built out the probabilities of specific words - now I just need to simulate from that distribution. Time to actually build the engine!
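As a quick illustration with NumPy (the same idea the simulation functions below are built on), a single multinomial draw returns a one-hot count vector that picks out the winning value:

import numpy as np

# One draw from {x: .33, y: .5, z: .17}
vals = np.array(['x', 'y', 'z'])
probs = [.33, .5, .17]
draw = np.random.multinomial(1, probs)  # e.g. array([0, 1, 0])
vals[draw.astype(bool)][0]              # 'y' for that example draw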

In [7]:
import numpy as np
+
+def multinom_sim(n, vals, probs):
+    occurrences = np.random.multinomial(n, probs)
+    results = occurrences * vals
+    return ' '.join(results[results != ''])
+
+def sim_n_hashtags(hashtag_freq):
+    return np.random.poisson(hashtag_freq)
+
+def sim_hashtags(n, hashtag_dist):
+    return multinom_sim(n, hashtag_dist.hashtags, hashtag_dist.prob)
+
+def sim_first_word(first_word_dist):
+    probs = np.float64(first_word_dist.values)
+    return multinom_sim(1, first_word_dist.reset_index()['index'], probs)
+
+def sim_next_word(current, word_dist):
+    dist = pd.Series(word_dist[current])
+    probs = np.ones(len(dist)) / len(dist)
+    return multinom_sim(1, dist, probs)

Pulling it all together

I've now built out all the code I need to actually simulate a sentence written by me. Let's try doing an example with five words and a single hashtag:

In [8]:
first = sim_first_word(first_word_dist)
+second = sim_next_word(first, word_dist)
+third = sim_next_word(second, word_dist)
+fourth = sim_next_word(third, word_dist)
+fifth = sim_next_word(fourth, word_dist)
+hashtag = sim_hashtags(1, hashtag_dist)
+
+' '.join((first, second, third, fourth, fifth, hashtag))
Out[8]:
'My first all-nighter of friends #oldschool'

Let's go ahead and put everything together! We're going to simulate a first word, simulate the hashtags, and then simulate words to fill the remaining space until we've either used it all up or hit a period.

In [9]:
def simulate_tweet():
+    chars_remaining = 140
+    first = sim_first_word(first_word_dist)
+    n_hash = sim_n_hashtags(mle)
+    hashtags = sim_hashtags(n_hash, hashtag_dist)
+    
+    chars_remaining -= len(first) + len(hashtags)
+    
+    tweet = first
+    current = first
+    while chars_remaining > len(tweet) + len(hashtags) and current[0] != '.' and current[0] != '!':
+        current = sim_next_word(current, word_dist)
+        tweet += ' ' + current
+        
+    tweet = tweet[:-2] + tweet[-1]  # drop the second-to-last character -- the space added before a final '.' or '!'
+        
+    return ' '.join((tweet, hashtags)).strip()

The results

And now for something completely different: twenty random tweets dreamed up by my computer and my Twitter data. Here you go:

In [12]:
for i in range(0, 20):
+    print(simulate_tweet())
+    print()
Also , I'm at 8 this morning. #thursdaysgohard #ornot
+
+Turns out of us breathe the code will want to my undergraduate career is becoming more night trying ? Religion is now as a chane #HYPE
+
+You know what recursion is to review the UNCC. #ornot
+
+There are really sore 3 bonfires in my first writing the library ground floor if awesome. #realtalk #impressed
+
+So we can make it out there's nothing but I'm not let us so hot I could think I may be good. #SwingDance
+
+Happy Christmas , at Harris Teeter to be be godly or Roman Catholic ). #4b392b#4b392b #Isaiah26
+
+For context , I in the most decisive factor of the same for homework. #accomplishment
+
+Freaking done. #loveyouall
+
+New blog post : Don't jump in a quiz in with a knife fight. #haskell #earlybirthday
+
+God shows me legitimately want to get some food and one day.
+
+Stormed the queen city. #mindblown
+
+The day of a cold at least outside right before the semester ..
+
+Finished with the way back. #winners
+
+Waking up , OJ , I feel like Nick Jonas today.
+
+First draft of so hard drive. #humansvszombies
+
+Eric Whitacre is the wise creation.
+
+Ethics paper first , music in close to everyone who just be posting up with my sin , and Jerry Springr #TheLittleThings
+
+Love that you know enough time I've eaten at 8 PM. #deepthoughts #stillblownaway
+
+Lead. #ThinkingTooMuch #Christmas
+
+Aamazing conference when you married #DepartmentOfRedundancyDepartment Yep , but there's a legitimate challenge.
+

...Which all ended up being a whole lot more nonsensical than I had hoped for. There are some good ones, so I'll call that an accomplishment! I was banking on grammar not being an issue: since my tweets use impeccable grammar, the program modeled off them should have pretty good grammar as well. There are going to be some hilarious edge cases (I'm looking at you, Ethics paper first, music in close to everyone) that make no sense, and some hilarious edge cases (Waking up, OJ, I feel like Nick Jonas today) that make me feel like I should have a Twitter rap career. On the whole though, the structure came out alright.


Moving on from here

During class we also talked about an interesting idea: trying to analyze corporate documents and corporate speech. I'd be interested to know what this analysis could do when applied to something like a couple of bank press releases. By any means, the code needs some work to clean it up before I get that far.


For further reading

I'm pretty confident I re-invented a couple wheels along the way - what I'm doing feels a lot like what Markov Chain Monte Carlo is intended to do. But I've never worked explicitly with that before, so more research is needed.


@@ -81,6 +664,20 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}}); +
@@ -92,6 +689,7 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
diff --git a/welcome-and-an-algorithm.html b/welcome-and-an-algorithm.html index 3d5c51a..5e25490 100644 --- a/welcome-and-an-algorithm.html +++ b/welcome-and-an-algorithm.html @@ -4,20 +4,22 @@ - + - + Welcome, and an algorithm - Bradlee Speice - - - - + + + + + + @@ -26,6 +28,17 @@ + + + @@ -39,7 +52,7 @@
@@ -54,12 +67,12 @@

Welcome, and an algorithm

-

Bradlee Speice, Thu 19 November 2015, Sat 05 December 2015, Blog

+

Bradlee Speice, Thu 19 November 2015, Sat 05 December 2015, Blog

-introduction, trading

+introduction, trading

@@ -92,7 +105,7 @@ days. Then, optimize for return subject to the drawdown being below a specific level. We didn't include the Sharpe ratio as a constraint, mostly because we were a bit late entering the competition.

I'll be updating this post with the results of our algorithm as they come along!

-
+

UPDATE 12/5/2015: Now that the competition has ended, I wanted to update how the algorithm performed. Unfortunately, it didn't do very well. I'm planning to make some tweaks over the coming weeks, and do another forward test in January.

@@ -128,8 +141,315 @@ to make some tweaks over the coming weeks, and do another forward test in Januar -
-

{% notebook 2015-11-14-welcome.ipynb %}


Trading Competition Optimization

Goal: Maximize return subject to constraints on the Sharpe ratio and drawdown

In [1]:
from IPython.display import display
+import Quandl
+from datetime import datetime, timedelta
+
+tickers = ['XOM', 'CVX', 'CLB', 'OXY', 'SLB']
+market_ticker = 'GOOG/NYSE_VOO'
+lookback = 30
+d_col = 'Close'
+
+data = {tick: Quandl.get('YAHOO/{}'.format(tick))[-lookback:] for tick in tickers}
+market = Quandl.get(market_ticker)

Calculating the Return

We first want to know how much each ticker returned over the prior period.

In [2]:
returns = {tick: data[tick][d_col].pct_change() for tick in tickers}
+
+display({tick: returns[tick].mean() for tick in tickers})
{'CLB': -0.0016320202164526894,
+ 'CVX': 0.0010319531629488911,
+ 'OXY': 0.00093418904454400551,
+ 'SLB': 0.00098431254720448159,
+ 'XOM': 0.00044165797556096868}

Calculating the Sharpe ratio

Sharpe: ${R - R_M \over \sigma}$


To calculate the Sharpe ratio, we take the ticker's average return over the lookback period, subtract the market's average return, and divide by the ticker's standard deviation. Shorting a stock turns a negative Sharpe positive.

In [3]:
market_returns = market.pct_change()
+
+sharpe = lambda ret: (ret.mean() - market_returns[d_col].mean()) / ret.std()
+sharpes = {tick: sharpe(returns[tick]) for tick in tickers}
+
+display(sharpes)
{'CLB': -0.10578734457846127,
+ 'CVX': 0.027303529817677398,
+ 'OXY': 0.022622210057414487,
+ 'SLB': 0.026950946344858676,
+ 'XOM': -0.0053519259698605499}

Calculating the drawdown

This one is easy - what is the maximum daily change over the lookback period? That is, because we will allow short positions, we are not concerned strictly with maximum downturn, but in general, what is the largest 1-day change?

In [4]:
drawdown = lambda ret: ret.abs().max()
+drawdowns = {tick: drawdown(returns[tick]) for tick in tickers}
+
+display(drawdowns)
{'CLB': 0.043551495607375035,
+ 'CVX': 0.044894389686214398,
+ 'OXY': 0.051424517867144637,
+ 'SLB': 0.034774627850375328,
+ 'XOM': 0.035851524605672758}

Performing the optimization

$\begin{align}
max\ \ & \mu \cdot \omega \\
s.t.\ \ & \vec{1} \omega = 1 \\
& \vec{S} \omega \ge s \\
& \vec{D} \cdot | \omega | \le d \\
& \left|\omega\right| \le l \\
\end{align}$


We want to maximize average return subject to having a full portfolio, Sharpe above a specific level, drawdown below a level, and leverage not too high - that is, don't have huge long/short positions.

In [5]:
import numpy as np
+from scipy.optimize import minimize
+
+#sharpe_limit = .1
+drawdown_limit = .05
+leverage = 250
+
+# Use the map so we can guarantee we maintain the correct order
+# sharpe_a = np.array(list(map(lambda tick: sharpes[tick], tickers))) * -1 # So we can write as upper-bound
+dd_a = np.array(list(map(lambda tick: drawdowns[tick], tickers)))
+returns_a = np.array(list(map(lambda tick: returns[tick].mean(), tickers))) # Because minimizing
+
+meets_sharpe = lambda x: sum(abs(x) * sharpe_a) - sharpe_limit  # unused while the Sharpe constraint below stays commented out
+def meets_dd(x):
+    portfolio = sum(abs(x))
+    if portfolio < .1:
+        # If there are no stocks in the portfolio,
+        # we can accidentally induce division by 0,
+        # or division by something small enough to cause infinity
+        return 0
+    
+    return drawdown_limit - sum(abs(x) * dd_a) / sum(abs(x))
+
+is_portfolio = lambda x: sum(x) - 1
+
+def within_leverage(x):
+    return leverage - sum(abs(x))
+
+objective = lambda x: sum(x * returns_a) * -1 # Because we're minimizing
+bounds = ((None, None),) * len(tickers)
+x = np.zeros(len(tickers))
+
+constraints = [
+    {
+        'type': 'eq',
+        'fun': is_portfolio
+    }, {
+        'type': 'ineq',
+        'fun': within_leverage
+    #}, {
+    #    'type': 'ineq',
+    #    'fun': meets_sharpe
+    }, {
+        'type': 'ineq',
+        'fun': meets_dd
+    }
+]
+
+optimal = minimize(objective, x, bounds=bounds, constraints=constraints,
+                  options={'maxiter': 500})
+
+# Optimization time!
+display(optimal.message)
+
+display("Holdings: {}".format(list(zip(tickers, optimal.x))))
+
+expected_return = optimal.fun * -100  # multiply by -100 to scale, and compensate for minimizing
+display("Expected Return: {:.3f}%".format(expected_return))
+
+expected_drawdown = sum(abs(optimal.x) * dd_a) / sum(abs(optimal.x)) * 100
+display("Expected Max Drawdown: {0:.2f}%".format(expected_drawdown))
+
+# TODO: Calculate expected Sharpe
'Optimization terminated successfully.'
"Holdings: [('XOM', 5.8337945679814904), ('CVX', 42.935064321851307), ('CLB', -124.5), ('OXY', 36.790387773552119), ('SLB', 39.940753336615096)]"
'Expected Return: 32.375%'
'Expected Max Drawdown: 4.34%'

@@ -137,6 +457,20 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}}); +
@@ -148,6 +482,7 @@ MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});