Finish up bspeice.github.com conversions

This commit is contained in:
Bradlee Speice 2024-11-06 03:32:56 +00:00
parent e7babcd8a1
commit a5b338431d
38 changed files with 6400 additions and 1 deletions

View File

@@ -0,0 +1,16 @@
Title: Predicting Santander Customer Happiness
Date: 2016-03-05
Category: Blog
Tags: machine learning, data science, kaggle
Authors: Bradlee Speice
Summary: My first real-world data challenge: predicting whether a bank's customers will be happy.
[//]: <> "Modified: "
{% notebook 2016-3-5-predicting-santander-customer-happiness.ipynb %}
<script type="text/x-mathjax-config">
// MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,269 @@
### My first Kaggle competition
It's time! After embarking on a Machine Learning class this semester, and with a Saturday in which I don't have much planned, I wanted to put this class and training to work. This is my first competition submission, and I want to walk you through how I'm approaching the problem, because I thought it would be really neat. The competition is Banco Santander's [Santander Customer Satisfaction][1] competition. It seemed like an easy enough problem that I could actually make decent progress on.
# Data Exploration
First up: we need to load our data and do some exploratory work. Because we're going to be using this data for model selection prior to testing, we need to make a further split. I've already gone ahead and done this work; please see the code in the [appendix below](#Appendix).
[1]: https://www.kaggle.com/c/santander-customer-satisfaction
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Record how long it takes to run the notebook - I'm curious.
from datetime import datetime
start = datetime.now()
dataset = pd.read_csv('split_train.csv')
dataset.index = dataset.ID
X = dataset.drop(['TARGET', 'ID', 'ID.1'], 1)
y = dataset.TARGET
```
```python
y.unique()
```
array([0, 1], dtype=int64)
```python
len(X.columns)
```
369
Okay, so there are only [two classes we're predicting][2]: 1 for unsatisfied customers, 0 for satisfied customers. I would have preferred this to be something more like a regression, or predicting multiple classes: maybe the customer isn't the most happy, but is nowhere near closing their accounts. For now though, that's just the data we're working with.
Now, I'd like to make a scatter matrix of everything going on. Unfortunately, as noted above, we have 369 different features. There's no way I can graphically make sense of that much data to start with.
We're also not told what the data actually represents: Are these survey results? Average time between contact with a customer care person? Frequency of contacting a customer care person? The idea is that I need to reduce the number of dimensions we're predicting across.
## Dimensionality Reduction pt. 1 - Binary Classifiers
My first attempt to reduce the data dimensionality is to find all the binary classifiers in the dataset (i.e. 0 or 1 values) and see if any of those are good (or anti-good) predictors of the final data.
[2]: https://www.kaggle.com/c/santander-customer-satisfaction/data
```python
cols = X.columns
b_class = []
for c in cols:
if len(X[c].unique()) == 2:
b_class.append(c)
len(b_class)
```
111
So there are 111 features in the dataset that are binary labels. Let's see if any of them are good at predicting the user's satisfaction!
```python
# First we need to `binarize` the data to 0-1; some of the labels are {0, 1},
# some are {0, 3}, etc.
from sklearn.preprocessing import binarize
X_bin = binarize(X[b_class])
accuracy = [np.mean(X_bin[:,i] == y) for i in range(0, len(b_class))]
acc_df = pd.DataFrame({"Accuracy": accuracy}, index=b_class)
acc_df.describe()
```
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>111.000000</td>
</tr>
<tr>
<th>mean</th>
<td>0.905159</td>
</tr>
<tr>
<th>std</th>
<td>0.180602</td>
</tr>
<tr>
<th>min</th>
<td>0.043598</td>
</tr>
<tr>
<th>25%</th>
<td>0.937329</td>
</tr>
<tr>
<th>50%</th>
<td>0.959372</td>
</tr>
<tr>
<th>75%</th>
<td>0.960837</td>
</tr>
<tr>
<th>max</th>
<td>0.960837</td>
</tr>
</tbody>
</table>
</div>
Wow! Looks like we've got some incredibly predictive features! So much so that we should be a bit concerned. My initial guess for what's happening is that we have a sparsity issue: so many of the values are 0, and these likely happen to line up with satisfied customers.
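One quick way to sanity-check that hunch is to look at how often each binary column is zero; this is a minimal sketch using the `X` and `b_class` variables defined above, not something from the original run:
```python
# Fraction of zero entries in each binary column - if most columns are
# almost entirely zero, "predicting" satisfied customers is nearly free.
(X[b_class] == 0).mean().describe()
```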
So the question we must now answer, which I likely should have asked long before now: What exactly is the distribution of un/satisfied customers?
```python
unsat = y[y == 1].count()
print("Satisfied customers: {}; Unsatisfied customers: {}".format(len(y) - unsat, unsat))
naive_guess = np.mean(y == np.zeros(len(y)))
print("Naive guess accuracy: {}".format(naive_guess))
```
Satisfied customers: 51131; Unsatisfied customers: 2083
Naive guess accuracy: 0.9608561656706882
This is a bit discouraging. A naive guess of "always satisfied" performs as well as our best individual binary classifier. What this tells me, then, is that these data columns aren't incredibly helpful for prediction. I'd be interested in a polynomial expansion of this dataset, but for now, that's more computation than I want to take on.
# Dimensionality Reduction pt. 2 - LDA
Knowing that our naive guess performs so well is a blessing and a curse:
- Curse: The threshold for performance is incredibly high: We can only "improve" over the naive guess by 4%
- Blessing: All the binary classification features we just discovered are worthless on their own. We can throw them out and reduce the data dimensionality from 369 features to 258.
Now, in removing these features from the dataset, I'm not saying that there is no "information" contained within them. There might be. But the only way we'd know is through a polynomial expansion, and I'm not going to take that on within this post.
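For reference, a polynomial expansion would look something like the sketch below, using scikit-learn's `PolynomialFeatures`; it isn't run as part of this analysis, since the expanded matrix gets enormous very quickly:
```python
# Hypothetical sketch only - not run in this notebook.
from sklearn.preprocessing import PolynomialFeatures

# Pairwise interactions over just the 111 binary columns would already
# produce 1 + 111 + (111 choose 2) = 6,217 features.
poly = PolynomialFeatures(degree=2, interaction_only=True)
# X_poly = poly.fit_transform(X[b_class])
```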
My initial thought for a "next guess" is to use the [LDA][3] model for dimensionality reduction. However, it can only reduce the dimensionality to at most $p - 1$, with $p$ being the number of classes. Since this is a binary classification problem, every LDA model that I try will have dimensionality one; when I actually try this, the predictor ends up being slightly less accurate than the naive guess.
Instead, let's take a different approach to dimensionality reduction: [principal component analysis][4]. This allows us to perform the dimensionality reduction without worrying about the number of classes. Then, we'll use a [Gaussian Naive Bayes][5] model to actually do the prediction. This model is chosen simply because it doesn't take a long time to fit and compute; because PCA will take so long, I just want a prediction at the end of this. We can worry about using a more sophisticated LDA/QDA/SVM model later.
Now into the actual process: we're going to test out PCA dimensionality reduction from 1 to 20 dimensions, and then predict using a Gaussian Naive Bayes model. The 20-dimension upper limit was selected because the accuracy never improves beyond that point (I found this out by running it myself). Hopefully, we'll find that we can create a model better than the naive guess.
[3]:http://scikit-learn.org/stable/modules/lda_qda.html
[4]:http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
[5]:http://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes
```python
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
X_no_bin = X.drop(b_class, 1)
def evaluate_gnb(dims):
pca = PCA(n_components=dims)
X_xform = pca.fit_transform(X_no_bin)
gnb = GaussianNB()
gnb.fit(X_xform, y)
return gnb.score(X_xform, y)
dim_range = np.arange(1, 21)
plt.plot(dim_range, [evaluate_gnb(dim) for dim in dim_range], label="Gaussian NB Accuracy")
plt.axhline(naive_guess, label="Naive Guess", c='k')
plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k')
plt.gcf().set_size_inches(12, 6)
plt.legend();
```
![png](_notebook_files/_notebook_11_0.png)
**sigh...** After all the effort and computational power, we're still at square one: we have yet to beat the naive guess threshold. With PCA in play we end up performing terribly, but not so terribly that we could simply bet against our own predictions.
Let's try one last-ditch attempt using the entire data set:
```python
def evaluate_gnb_full(dims):
pca = PCA(n_components=dims)
X_xform = pca.fit_transform(X)
gnb = GaussianNB()
gnb.fit(X_xform, y)
return gnb.score(X_xform, y)
dim_range = np.arange(1, 21)
plt.plot(dim_range, [evaluate_gnb_full(dim) for dim in dim_range], label="Gaussian NB Accuracy")
plt.axhline(naive_guess, label="Naive Guess", c='k')
plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k')
plt.gcf().set_size_inches(12, 6)
plt.legend();
```
![png](_notebook_files/_notebook_13_0.png)
Nothing. It is interesting to note that the graphs are almost exactly the same: This would imply again that the variables we removed earlier (all the binary classifiers) indeed have almost no predictive power. It seems this problem is high-dimensional, but with almost no data that can actually inform our decisions.
# Summary for Day 1
After spending a couple hours with this dataset, there seems to be a fundamental issue in play: We have very high-dimensional data, and it has no bearing on our ability to actually predict customer satisfaction. This can be a huge issue: it implies that **no matter what model we use, we fundamentally can't perform well.** I'm sure most of this is because I'm not an experienced data scientist. Even so, we have yet to develop a strategy that can actually beat out the village idiot; **so far, the bank is best off just assuming all its customers are satisfied.** Hopefully more to come soon.
```python
end = datetime.now()
print("Running time: {}".format(end - start))
```
Running time: 0:00:58.715714
# Appendix
Code used to split the initial training data:
```python
from sklearn.cross_validation import train_test_split
data = pd.read_csv('train.csv')
data.index = data.ID
data_train, data_validate = train_test_split(
data, train_size=.7)
data_train.to_csv('split_train.csv')
data_validate.to_csv('split_validate.csv')
```

Binary file not shown.


Binary file not shown.


View File

@@ -0,0 +1,256 @@
---
slug: 2016/03/predicting-santander-customer-happiness
title: Predicting Santander customer happiness
date: 2016-03-05 12:00:00
authors: [bspeice]
tags: []
---
My first Kaggle competition.
<!-- truncate -->
It's time! After embarking on a Machine Learning class this semester, and with a Saturday in which I don't have much planned, I wanted to put this class and training to work. This is my first competition submission, and I want to walk you through how I'm approaching the problem, because I thought it would be really neat. The competition is Banco Santander's [Santander Customer Satisfaction][1] competition. It seemed like an easy enough problem that I could actually make decent progress on.
## Data Exploration
First up: we need to load our data and do some exploratory work. Because we're going to be using this data for model selection prior to testing, we need to make a further split. I've already gone ahead and done this work; please see the code in the [appendix below](#appendix).
[1]: https://www.kaggle.com/c/santander-customer-satisfaction
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Record how long it takes to run the notebook - I'm curious.
from datetime import datetime
start = datetime.now()
dataset = pd.read_csv('split_train.csv')
dataset.index = dataset.ID
X = dataset.drop(['TARGET', 'ID', 'ID.1'], 1)
y = dataset.TARGET
```
```python
y.unique()
```
```
array([0, 1], dtype=int64)
```
```python
len(X.columns)
```
```
369
```
Okay, so there are only [two classes we're predicting][2]: 1 for unsatisfied customers, 0 for satisfied customers. I would have preferred this to be something more like a regression, or predicting multiple classes: maybe the customer isn't the most happy, but is nowhere near closing their accounts. For now though, that's just the data we're working with.
Now, I'd like to make a scatter matrix of everything going on. Unfortunately, as noted above, we have 369 different features. There's no way I can graphically make sense of that much data to start with.
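For what it's worth, a scatter matrix is easy to produce for a small subset of columns; this is a minimal sketch (the column choice is arbitrary, and older pandas exposes the function as `pd.tools.plotting.scatter_matrix`):
```python
# Hypothetical: scatter matrix over just the first five feature columns -
# all 369 at once would be unreadable.
from pandas.plotting import scatter_matrix

scatter_matrix(X.iloc[:, :5], figsize=(10, 10), diagonal='hist');
```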
We're also not told what the data actually represents: Are these survey results? Average time between contact with a customer care person? Frequency of contacting a customer care person? The idea is that I need to reduce the number of dimensions we're predicting across.
### Dimensionality Reduction pt. 1 - Binary Classifiers
My first attempt to reduce the data dimensionality is to find all the binary classifiers in the dataset (i.e. 0 or 1 values) and see if any of those are good (or anti-good) predictors of the final data.
[2]: https://www.kaggle.com/c/santander-customer-satisfaction/data
```python
cols = X.columns
b_class = []
for c in cols:
if len(X[c].unique()) == 2:
b_class.append(c)
len(b_class)
```
```
111
```
So there are 111 features in the dataset that are binary labels. Let's see if any of them are good at predicting the user's satisfaction!
```python
# First we need to `binarize` the data to 0-1; some of the labels are {0, 1},
# some are {0, 3}, etc.
from sklearn.preprocessing import binarize
X_bin = binarize(X[b_class])
accuracy = [np.mean(X_bin[:,i] == y) for i in range(0, len(b_class))]
acc_df = pd.DataFrame({"Accuracy": accuracy}, index=b_class)
acc_df.describe()
```
<div>
<table>
<thead>
<tr>
<th></th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>111.000000</td>
</tr>
<tr>
<th>mean</th>
<td>0.905159</td>
</tr>
<tr>
<th>std</th>
<td>0.180602</td>
</tr>
<tr>
<th>min</th>
<td>0.043598</td>
</tr>
<tr>
<th>25%</th>
<td>0.937329</td>
</tr>
<tr>
<th>50%</th>
<td>0.959372</td>
</tr>
<tr>
<th>75%</th>
<td>0.960837</td>
</tr>
<tr>
<th>max</th>
<td>0.960837</td>
</tr>
</tbody>
</table>
</div>
Wow! Looks like we've got some incredibly predictive features! So much so that we should be a bit concerned. My initial guess for what's happening is that we have a sparsity issue: so many of the values are 0, and these likely happen to line up with satisfied customers.
So the question we must now answer, which I likely should have asked long before now: What exactly is the distribution of un/satisfied customers?
```python
unsat = y[y == 1].count()
print("Satisfied customers: {}; Unsatisfied customers: {}".format(len(y) - unsat, unsat))
naive_guess = np.mean(y == np.zeros(len(y)))
print("Naive guess accuracy: {}".format(naive_guess))
```
```
Satisfied customers: 51131; Unsatisfied customers: 2083
Naive guess accuracy: 0.9608561656706882
```
This is a bit discouraging. A naive guess of "always satisfied" performs as well as our best individual binary classifier. What this tells me, then, is that these data columns aren't incredibly helpful for prediction. I'd be interested in a polynomial expansion of this dataset, but for now, that's more computation than I want to take on.
### Dimensionality Reduction pt. 2 - LDA
Knowing that our naive guess performs so well is a blessing and a curse:
- Curse: The threshold for performance is incredibly high: We can only "improve" over the naive guess by 4%
- Blessing: All the binary classification features we just discovered are worthless on their own. We can throw them out and reduce the data dimensionality from 369 features to 258.
Now, in removing these features from the dataset, I'm not saying that there is no "information" contained within them. There might be. But the only way we'd know is through a polynomial expansion, and I'm not going to take that on within this post.
My initial thought for a "next guess" is to use the [LDA][3] model for dimensionality reduction. However, it can only reduce the dimensionality to at most $p - 1$, with $p$ being the number of classes. Since this is a binary classification problem, every LDA model that I try will have dimensionality one; when I actually try this, the predictor ends up being slightly less accurate than the naive guess.
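For completeness, here's a minimal sketch of that LDA attempt, assuming scikit-learn's `LinearDiscriminantAnalysis`; this code isn't part of the original run:
```python
# Sketch of the LDA attempt - with two classes, n_components is capped at p - 1 = 1.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)  # 369 columns down to a single discriminant axis
lda.score(X, y)                  # as noted above, this lands slightly below the naive guess
```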
Instead, let's take a different approach to dimensionality reduction: [principal component analysis][4]. This allows us to perform the dimensionality reduction without worrying about the number of classes. Then, we'll use a [Gaussian Naive Bayes][5] model to actually do the prediction. This model is chosen simply because it doesn't take a long time to fit and compute; because PCA will take so long, I just want a prediction at the end of this. We can worry about using a more sophisticated LDA/QDA/SVM model later.
Now into the actual process: we're going to test out PCA dimensionality reduction from 1 to 20 dimensions, and then predict using a Gaussian Naive Bayes model. The 20-dimension upper limit was selected because the accuracy never improves beyond that point (I found this out by running it myself). Hopefully, we'll find that we can create a model better than the naive guess.
[3]:http://scikit-learn.org/stable/modules/lda_qda.html
[4]:http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
[5]:http://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes
```python
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
X_no_bin = X.drop(b_class, 1)
def evaluate_gnb(dims):
pca = PCA(n_components=dims)
X_xform = pca.fit_transform(X_no_bin)
gnb = GaussianNB()
gnb.fit(X_xform, y)
return gnb.score(X_xform, y)
dim_range = np.arange(1, 21)
plt.plot(dim_range, [evaluate_gnb(dim) for dim in dim_range], label="Gaussian NB Accuracy")
plt.axhline(naive_guess, label="Naive Guess", c='k')
plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k')
plt.gcf().set_size_inches(12, 6)
plt.legend();
```
![png](_notebook_files/_notebook_11_0.png)
**sigh...** After all the effort and computational power, we're still at square one: we have yet to beat the naive guess threshold. With PCA in play we end up performing terribly, but not so terribly that we could simply bet against our own predictions.
Let's try one last-ditch attempt using the entire data set:
```python
def evaluate_gnb_full(dims):
pca = PCA(n_components=dims)
X_xform = pca.fit_transform(X)
gnb = GaussianNB()
gnb.fit(X_xform, y)
return gnb.score(X_xform, y)
dim_range = np.arange(1, 21)
plt.plot(dim_range, [evaluate_gnb_full(dim) for dim in dim_range], label="Gaussian NB Accuracy")
plt.axhline(naive_guess, label="Naive Guess", c='k')
plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k')
plt.gcf().set_size_inches(12, 6)
plt.legend();
```
![png](_notebook_files/_notebook_13_0.png)
Nothing. It is interesting to note that the graphs are almost exactly the same: This would imply again that the variables we removed earlier (all the binary classifiers) indeed have almost no predictive power. It seems this problem is high-dimensional, but with almost no data that can actually inform our decisions.
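One quick follow-up check (a sketch, not something from the original notebook): how much variance do those twenty principal components actually capture? If a handful of components explain nearly everything, the issue isn't dimensionality so much as features that simply don't inform the target.
```python
# Cumulative explained variance of the 20-component PCA fit on the
# non-binary features.
pca = PCA(n_components=20)
pca.fit(X_no_bin)
print(pca.explained_variance_ratio_.cumsum())
```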
## Summary for Day 1
After spending a couple hours with this dataset, there seems to be a fundamental issue in play: We have very high-dimensional data, and it has no bearing on our ability to actually predict customer satisfaction. This can be a huge issue: it implies that **no matter what model we use, we fundamentally can't perform well.** I'm sure most of this is because I'm not an experienced data scientist. Even so, we have yet to develop a strategy that can actually beat out the village idiot; **so far, the bank is best off just assuming all its customers are satisfied.** Hopefully more to come soon.
```python
end = datetime.now()
print("Running time: {}".format(end - start))
```
```
Running time: 0:00:58.715714
```
## Appendix
Code used to split the initial training data:
```python
from sklearn.cross_validation import train_test_split
data = pd.read_csv('train.csv')
data.index = data.ID
data_train, data_validate = train_test_split(
data, train_size=.7)
data_train.to_csv('split_train.csv')
data_validate.to_csv('split_validate.csv')
```

View File

@@ -8,7 +8,7 @@ tags: []
If all we have is a finite number of heartbeats left, what about me?
--- <!-- truncate -->
Warning: this one is a bit creepier. But that's what you get when you come up with data science ideas as you're drifting off to sleep.

View File

@@ -0,0 +1,17 @@
Title: Event Studies and Earnings Releases
Date: 2016-06-08
Category: Blog
Tags: event study, earnings
Authors: Bradlee Speice
Summary: Looking at earnings releases to see how good people are at actually predicting earnings.
[//]: <> "Modified: "
<script type="text/javascript" src="https://cdn.jsdelivr.net/jquery/3.0.0/jquery.min.js"></script>
{% notebook 2016-6-8-event-studies-and-earnings-releases.ipynb %}
<script type="text/x-mathjax-config">
//MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
MathJax.Hub.Config({tex2jax: {inlineMath: [['\$','\$']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,724 @@
Or, being suspicious of market insiders.
---
Use the button below to show the code I've used to generate this article. Because there is significantly more code involved than in most of my other posts, it's hidden by default so people can concentrate on the important bits.
```python
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
```
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
# The Market Just Knew
I recently saw two examples of stock charts that have kept me thinking for a while. And now that the semester is complete, I finally have enough time to really look at them and give them the treatment they deserve. The first is good old Apple:
```python
from secrets import QUANDL_KEY
import matplotlib.pyplot as plt
from matplotlib.dates import date2num
from matplotlib.finance import candlestick_ohlc
from matplotlib.dates import DateFormatter, WeekdayLocator,\
DayLocator, MONDAY
import quandl
from datetime import datetime
import pandas as pd
%matplotlib inline
def fetch_ticker(ticker, start, end):
# Quandl is currently giving me issues with returning
# the entire dataset and not slicing server-side.
# So instead, we'll do it client-side!
q_format = '%Y-%m-%d'
ticker_data = quandl.get('YAHOO/' + ticker,
start_date=start.strftime(q_format),
end_date=end.strftime(q_format),
authtoken=QUANDL_KEY)
return ticker_data
def ohlc_dataframe(data, ax=None):
# Much of this code re-used from:
# http://matplotlib.org/examples/pylab_examples/finance_demo.html
if ax is None:
f, ax = plt.subplots()
vals = [(date2num(date), *(data.loc[date]))
for date in data.index]
candlestick_ohlc(ax, vals)
mondays = WeekdayLocator(MONDAY)
alldays = DayLocator()
weekFormatter = DateFormatter('%b %d')
ax.xaxis.set_major_locator(mondays)
ax.xaxis.set_minor_locator(alldays)
ax.xaxis.set_major_formatter(weekFormatter)
return ax
AAPL = fetch_ticker('AAPL', datetime(2016, 3, 1), datetime(2016, 5, 1))
ax = ohlc_dataframe(AAPL)
plt.vlines(date2num(datetime(2016, 4, 26, 12)),
ax.get_ylim()[0], ax.get_ylim()[1],
color='b',
label='Earnings Release')
plt.legend(loc=3)
plt.title("Apple Price 3/1/2016 - 5/1/2016");
```
![png](_notebook_files/_notebook_3_0.png)
The second chart is from Facebook:
```python
FB = fetch_ticker('FB', datetime(2016, 3, 1), datetime(2016, 5, 5))
ax = ohlc_dataframe(FB)
plt.vlines(date2num(datetime(2016, 4, 27, 12)),
ax.get_ylim()[0], ax.get_ylim()[1],
color='b', label='Earnings Release')
plt.title('Facebook Price 3/5/2016 - 5/5/2016')
plt.legend(loc=2);
```
![png](_notebook_files/_notebook_5_0.png)
These two charts demonstrate a very specific phenomenon: how the market prepares for earnings releases. Let's look at those charts again, but with some extra information. As we're about to see, the market "knew" in advance that Apple was going to perform poorly. The market expected Facebook to perform poorly as well, and Facebook instead shot the lights out. Let's see that trend in action:
```python
def plot_hilo(ax, start, end, data):
ax.plot([date2num(start), date2num(end)],
[data.loc[start]['High'], data.loc[end]['High']],
color='b')
ax.plot([date2num(start), date2num(end)],
[data.loc[start]['Low'], data.loc[end]['Low']],
color='b')
f, axarr = plt.subplots(1, 2)
ax_aapl = axarr[0]
ax_fb = axarr[1]
# Plot the AAPL trend up and down
ohlc_dataframe(AAPL, ax=ax_aapl)
plot_hilo(ax_aapl, datetime(2016, 3, 1), datetime(2016, 4, 15), AAPL)
plot_hilo(ax_aapl, datetime(2016, 4, 18), datetime(2016, 4, 26), AAPL)
ax_aapl.vlines(date2num(datetime(2016, 4, 26, 12)),
ax_aapl.get_ylim()[0], ax_aapl.get_ylim()[1],
color='g', label='Earnings Release')
ax_aapl.legend(loc=2)
ax_aapl.set_title('AAPL Price History')
# Plot the FB trend down and up
ohlc_dataframe(FB, ax=ax_fb)
plot_hilo(ax_fb, datetime(2016, 3, 30), datetime(2016, 4, 27), FB)
plot_hilo(ax_fb, datetime(2016, 4, 28), datetime(2016, 5, 5), FB)
ax_fb.vlines(date2num(datetime(2016, 4, 27, 12)),
ax_fb.get_ylim()[0], ax_fb.get_ylim()[1],
color='g', label='Earnings Release')
ax_fb.legend(loc=2)
ax_fb.set_title('FB Price History')
f.set_size_inches(18, 6)
```
![png](_notebook_files/_notebook_7_0.png)
As we can see above, the market broke a prevailing trend on Apple in order to go down, and ultimately predicted the earnings release. For Facebook, the opposite happened. While the trend was down, the earnings were fantastic and the market corrected itself much higher.
# Formulating the Question
While these are two specific examples, there are plenty of other examples you could cite one way or another. Even if the preponderance of evidence shows that the market correctly predicts earnings releases, we need not accuse people of collusion; for a company like Apple with many suppliers we can generally forecast how Apple has done based on those same suppliers.
The question, then, is this: **how well does the market predict earnings releases?** It's an incredibly broad question that I want to dissect in a couple of different ways:
1. Given a stock that has been trending down over the past N days before an earnings release, how likely is it to continue downward after the release?
2. Given a stock trending up, how likely is it to continue up?
3. Is there a difference in accuracy between large- and small-cap stocks?
4. How often, and for how long, do markets trend before an earnings release?
**I want to especially thank Alejandro Saltiel for helping me retrieve the data.** He's great. And now for all of the interesting bits.
# Event Studies
Before we go too much further, I want to introduce the actual event study. Each chart intends to capture a lot of information and present an easy-to-understand pattern:
```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
from datetime import datetime, timedelta
# If you remove rules, it removes them from *all* calendars
# To ensure we don't pop rules we don't want to, first make
# sure to fully copy the object
trade_calendar = USFederalHolidayCalendar()
trade_calendar.rules.pop(6) # Remove Columbus day
trade_calendar.rules.pop(7) # Remove Veteran's day
TradeDay = lambda days: CustomBusinessDay(days, calendar=trade_calendar)
def plot_study(array):
# Given a 2-d array, we assume the event happens at index `lookback`,
# and create all of our summary statistics from there.
lookback = int((array.shape[1] - 1) / 2)
norm_factor = np.repeat(array[:,lookback].reshape(-1, 1), array.shape[1], axis=1)
centered_data = array / norm_factor - 1
lookforward = centered_data.shape[1] - lookback
means = centered_data.mean(axis=0)
lookforward_data = centered_data[:,lookforward:]
std_dev = np.hstack([0, lookforward_data.std(axis=0)])
maxes = lookforward_data.max(axis=0)
mins = lookforward_data.min(axis=0)
f, axarr = plt.subplots(1, 2)
range_begin = -lookback
range_end = lookforward
axarr[0].plot(range(range_begin, range_end), means)
axarr[1].plot(range(range_begin, range_end), means)
axarr[0].fill_between(range(0, range_end),
means[-lookforward:] + std_dev,
means[-lookforward:] - std_dev,
alpha=.5, label="$\pm$ 1 s.d.")
axarr[1].fill_between(range(0, range_end),
means[-lookforward:] + std_dev,
means[-lookforward:] - std_dev,
alpha=.5, label="$\pm$ 1 s.d.")
max_err = maxes - means[-lookforward+1:]
min_err = means[-lookforward+1:] - mins
axarr[0].errorbar(range(1, range_end),
means[-lookforward+1:],
yerr=[min_err, max_err], label='Max & Min')
axarr[0].legend(loc=2)
axarr[1].legend(loc=2)
axarr[0].set_xlim((-lookback-1, lookback+1))
axarr[1].set_xlim((-lookback-1, lookback+1))
def plot_study_small(array):
# Given a 2-d array, we assume the event happens at index `lookback`,
# and create all of our summary statistics from there.
lookback = int((array.shape[1] - 1) / 2)
norm_factor = np.repeat(array[:,lookback].reshape(-1, 1), array.shape[1], axis=1)
centered_data = array / norm_factor - 1
lookforward = centered_data.shape[1] - lookback
means = centered_data.mean(axis=0)
lookforward_data = centered_data[:,lookforward:]
std_dev = np.hstack([0, lookforward_data.std(axis=0)])
maxes = lookforward_data.max(axis=0)
mins = lookforward_data.min(axis=0)
range_begin = -lookback
range_end = lookforward
plt.plot(range(range_begin, range_end), means)
plt.fill_between(range(0, range_end),
means[-lookforward:] + std_dev,
means[-lookforward:] - std_dev,
alpha=.5, label="$\pm$ 1 s.d.")
max_err = maxes - means[-lookforward+1:]
min_err = means[-lookforward+1:] - mins
plt.errorbar(range(1, range_end),
means[-lookforward+1:],
yerr=[min_err, max_err], label='Max & Min')
plt.legend(loc=2)
plt.xlim((-lookback-1, lookback+1))
def fetch_event_data(ticker, events, horizon=5):
# Use horizon+1 to account for including the day of the event,
# and half-open interval - that is, for a horizon of 5,
# we should be including 11 events. Additionally, using the
# CustomBusinessDay means we automatically handle issues if
# for example a company reports Friday afternoon - the date
# calculator will turn this into a "Saturday" release, but
# we effectively shift that to Monday with the logic below.
td_back = TradeDay(horizon+1)
td_forward = TradeDay(horizon+1)
start_date = min(events) - td_back
end_date = max(events) + td_forward
total_data = fetch_ticker(ticker, start_date, end_date)
event_data = [total_data.ix[event-td_back:event+td_forward]\
[0:horizon*2+1]\
['Adjusted Close']
for event in events]
return np.array(event_data)
# Generate a couple of random events
event_dates = [datetime(2016, 5, 27) - timedelta(days=1) - TradeDay(x*20) for x in range(1, 40)]
data = fetch_event_data('CELG', event_dates)
plot_study_small(data)
plt.legend(loc=3)
plt.gcf().set_size_inches(12, 6);
plt.annotate('Mean price for days leading up to each event',
(-5, -.01), (-4.5, .025),
arrowprops=dict(facecolor='black', shrink=0.05))
plt.annotate('', (-.1, .005), (-.5, .02),
arrowprops={'facecolor': 'black', 'shrink': .05})
plt.annotate('$\pm$ 1 std. dev. each day', (5, .055), (2.5, .085),
arrowprops={'facecolor': 'black', 'shrink': .05})
plt.annotate('Min/Max each day', (.9, -.07), (-1, -.1),
arrowprops={'facecolor': 'black', 'shrink': .05});
```
![png](_notebook_files/_notebook_11_0.png)
And as a quick textual explanation as well:
- The blue line represents the mean price for each day, represented as a percentage of the price on the '0-day'. For example, if we defined an 'event' as whenever the stock price dropped for three days, we would see a decreasing blue line to the left of the 0-day.
- The blue shaded area represents one standard deviation above and below the mean price for each day following an event. This is intended to give us an idea of what the stock price does in general following an event.
- The green bars are the minimum and maximum price for each day following an event. This tells us how much it's possible for the stock to move.
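To make the normalization concrete, here's a tiny worked example with hypothetical prices; this is the same arithmetic `plot_study` applies to each event window:
```python
import numpy as np

# One hypothetical 5-day window, with the event at index 2 (price 11.0).
prices = np.array([[10.0, 10.5, 11.0, 10.8, 11.2]])
lookback = (prices.shape[1] - 1) // 2
centered = prices / prices[:, [lookback]] - 1
print(centered.round(4))  # [[-0.0909 -0.0455  0.     -0.0182  0.0182]]
```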
# Event Type 1: Trending down over the past N days
The first type of event I want to study is how stocks perform when they've been trending down over the past couple of days prior to a release. However, we need to clarify what exactly is meant by "trending down." To do so, we'll use the following metric: **the midpoint between each day's opening and closing price goes down over a period of N days**.
It's probably helpful to have an example:
```python
f, axarr = plt.subplots(1, 2)
f.set_size_inches(18, 6)
FB_plot = axarr[0]
ohlc_dataframe(FB[datetime(2016, 4, 18):], FB_plot)
FB_truncated = FB[datetime(2016, 4, 18):datetime(2016, 4, 27)]
midpoint = FB_truncated['Open']/2 + FB_truncated['Close']/2
FB_plot.plot(FB_truncated.index, midpoint, label='Midpoint')
FB_plot.vlines(date2num(datetime(2016, 4, 27, 12)),
ax_fb.get_ylim()[0], ax_fb.get_ylim()[1],
color='g', label='Earnings Release')
FB_plot.legend(loc=2)
FB_plot.set_title('FB Midpoint Plot')
AAPL_plot = axarr[1]
ohlc_dataframe(AAPL[datetime(2016, 4, 10):], AAPL_plot)
AAPL_truncated = AAPL[datetime(2016, 4, 10):datetime(2016, 4, 26)]
midpoint = AAPL_truncated['Open']/2 + AAPL_truncated['Close']/2
AAPL_plot.plot(AAPL_truncated.index, midpoint, label='Midpoint')
AAPL_plot.vlines(date2num(datetime(2016, 4, 26, 12)),
ax_aapl.get_ylim()[0], ax_aapl.get_ylim()[1],
color='g', label='Earnings Release')
AAPL_plot.legend(loc=3)
AAPL_plot.set_title('AAPL Midpoint Plot');
```
![png](_notebook_files/_notebook_14_0.png)
Given these charts, we can see that FB was trending down for the four days preceding the earnings release, and AAPL was trending down for a whopping 8 days (we don't count the peak day). This will define the methodology that we will use for the study.
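Before running the full study, here's a minimal sketch of that midpoint downtrend check in isolation, assuming a pandas OHLC DataFrame named `prices` (the full version lives in `does_trend_down` below):
```python
# Midpoint of each day's open and close, then require every
# day-over-day change to be non-positive.
midpoints = prices['Open'] / 2 + prices['Close'] / 2
diffs = midpoints.diff().dropna()
is_downtrend = (diffs <= 0).all()
```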
So what are the results? For a given horizon, how well does the market actually perform?
```python
# Read in the events for each stock;
# The file was created using the first code block in the Appendix
import yaml
from dateutil.parser import parse
from progressbar import ProgressBar
data_str = open('earnings_dates.yaml', 'r').read()
# Need to remove invalid lines
filtered = filter(lambda x: '{' not in x, data_str.split('\n'))
earnings_data = yaml.load('\n'.join(filtered))
# Convert our earnings data into a list of (ticker, date) pairs
# to make it easy to work with.
# This is horribly inefficient, but should get us what we need
ticker_dates = []
for ticker, date_list in earnings_data.items():
for iso_str in date_list:
ticker_dates.append((ticker, parse(iso_str)))
def does_trend_down(ticker, event, horizon):
# Figure out if the `event` has a downtrend for
# the `horizon` days preceding it
# As an interpretation note: it is assumed that
# the closing price of day `event` is the reference
# point, and we want `horizon` days before that.
# The price_data.hdf was created in the second appendix code block
try:
ticker_data = pd.read_hdf('price_data.hdf', ticker)
data = ticker_data[event-TradeDay(horizon):event]
midpoints = data['Open']/2 + data['Close']/2
# Shift dates one forward into the future and subtract
# Effectively: do we trend down over all days?
elems = midpoints - midpoints.shift(1)
return len(elems)-1 == len(elems.dropna()[elems <= 0])
except KeyError:
# If the stock doesn't exist, it doesn't qualify as trending down
# Mostly this is here to make sure the entire analysis doesn't
# blow up if there were issues in data retrieval
return False
def study_trend(horizon, trend_function):
five_day_events = np.zeros((1, horizon*2 + 1))
invalid_events = []
for ticker, event in ProgressBar()(ticker_dates):
if trend_function(ticker, event, horizon):
ticker_data = pd.read_hdf('price_data.hdf', ticker)
event_data = ticker_data[event-TradeDay(horizon):event+TradeDay(horizon)]['Close']
try:
five_day_events = np.vstack([five_day_events, event_data])
except ValueError:
# Sometimes we don't get exactly the right number of values due to calendar
# issues. I've fixed most everything I can, and the few issues that are left
# I assume don't systemically bias the results (i.e. data could be missing
# because it doesn't exist, etc.). After running through, ~1% of events get
# discarded this way
invalid_events.append((ticker, event))
# Remove our initial zero row
five_day_events = five_day_events[1:,:]
plot_study(five_day_events)
plt.gcf().suptitle('Action over {} days: {} events'
.format(horizon,five_day_events.shape[0]))
plt.gcf().set_size_inches(18, 6)
# Start with a 5 day study
study_trend(5, does_trend_down)
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:21:38 Time: 0:21:38
![png](_notebook_files/_notebook_16_1.png)
When a stock has been trending down for 5 days, once the earnings are announced it really doesn't move on average. However, the variability is *incredible*. This implies two important things:
1. The market is just as often wrong about an earnings announcement before it happens as it is correct
2. The incredible width of the min/max bars and standard deviation area tell us that the market reacts *violently* after the earnings are released.
Let's repeat the same study, but over a time horizon of 8 days and 3 days. Presumably if a stock has been going down for 8 days at a time before the earnings, the market should be more accurate.
```python
# 8 day study next
study_trend(8, does_trend_down)
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:20:29 Time: 0:20:29
![png](_notebook_files/_notebook_18_1.png)
However, looking only at stocks that trended down for 8 days prior to a release, the same pattern emerges: on average, the stock doesn't move, but the market reaction is often incredibly violent.
```python
# 3 day study after that
study_trend(3, does_trend_down)
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:26:26 Time: 0:26:26
![png](_notebook_files/_notebook_20_1.png)
Finally, when we look at a 3-day horizon, we start getting some incredible outliers. Stocks have the potential to move up over ~300%, and the standard deviation width is, again, incredible. The results for a 3-day horizon follow the same pattern we've seen in the 5- and 8-day horizons.
# Event Type 2: Trending up for N days
We're now going to repeat the analysis, but do it for uptrends instead. That is, instead of looking at stocks that have been trending down over the past number of days, we focus only on stocks that have been trending up.
```python
def does_trend_up(ticker, event, horizon):
# Figure out if the `event` has an uptrend for
# the `horizon` days preceding it
# As an interpretation note: it is assumed that
# the closing price of day `event` is the reference
# point, and we want `horizon` days before that.
# The price_data.hdf was created in the second appendix code block
try:
ticker_data = pd.read_hdf('price_data.hdf', ticker)
data = ticker_data[event-TradeDay(horizon):event]
midpoints = data['Open']/2 + data['Close']/2
# Shift dates one forward into the future and subtract
# Effectively: do we trend down over all days?
elems = midpoints - midpoints.shift(1)
return len(elems)-1 == len(elems.dropna()[elems >= 0])
except KeyError:
# If the stock doesn't exist, it doesn't qualify as trending down
# Mostly this is here to make sure the entire analysis doesn't
# blow up if there were issues in data retrieval
return False
study_trend(5, does_trend_up)
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:22:51 Time: 0:22:51
![png](_notebook_files/_notebook_23_1.png)
The patterns here are very similar. Aside from noting that stocks can move up nearly 400% after an earnings announcement (most likely because a takeover announcement was involved), we still see large min/max bars and a wide standard deviation of returns.
We'll repeat the pattern for stocks going up for both 8 and 3 days straight, but at this point, the results should be very predictable:
```python
study_trend(8, does_trend_up)
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:20:51 Time: 0:20:51
![png](_notebook_files/_notebook_25_1.png)
```python
study_trend(3, does_trend_up)
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:26:56 Time: 0:26:56
![png](_notebook_files/_notebook_26_1.png)
# Conclusion and Summary
I guess the most important thing to summarize with is this: **looking at the entire market, stock performance prior to an earnings release has no bearing on the stock's performance.** Honestly: given the huge variability of returns after an earnings release, even when the stock has been trending for a long time, you're best off divesting before an earnings release and letting the market sort itself out.
*However*, there is a big caveat. These results are taken when we look at the entire market. So while we can say that the market as a whole knows nothing and just reacts violently, I want to take a closer look into this data. Does the market typically perform poorly on large-cap/high liquidity stocks? Do smaller companies have investors that know them better and can thus predict performance better? Are specific market sectors better at prediction? Presumably technology stocks are more volatile than the industrials.
So there are some more interesting questions I still want to ask with this data. Knowing that the hard work of data processing is largely already done, it should be fairly simple to continue this analysis and get much more refined with it. Until next time.
# Appendix
Export event data for Russell 3000 companies:
```python
import pandas as pd
from html.parser import HTMLParser
from datetime import datetime, timedelta
import requests
import re
from dateutil import parser
import progressbar
from concurrent import futures
import yaml
class EarningsParser(HTMLParser):
store_dates = False
earnings_offset = None
dates = []
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.dates = []
def handle_starttag(self, tag, attrs):
if tag == 'table':
self.store_dates = True
def handle_data(self, data):
if self.store_dates:
match = re.match(r'\d+/\d+/\d+', data)
if match:
self.dates.append(match.group(0))
# If a company reports before the bell, record the earnings date
# being at midnight the day before. Ex: WMT reports 5/19/2016,
# but we want the reference point to be the closing price on 5/18/2016
if 'After Close' in data:
self.earnings_offset = timedelta(days=0)
elif 'Before Open' in data:
self.earnings_offset = timedelta(days=-1)
def handle_endtag(self, tag):
if tag == 'table':
self.store_dates = False
def earnings_releases(ticker):
#print("Looking up ticker {}".format(ticker))
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) '\
'Gecko/20100101 Firefox/46.0'
headers = {'user-agent': user_agent}
base_url = 'http://www.streetinsider.com/ec_earnings.php?q={}'\
.format(ticker)
e = EarningsParser()
s = requests.Session()
a = requests.adapters.HTTPAdapter(max_retries=0)
s.mount('http://', a)
e.feed(str(s.get(base_url, headers=headers).content))
if e.earnings_offset is not None:
dates = map(lambda x: parser.parse(x) + e.earnings_offset, e.dates)
past = filter(lambda x: x < datetime.now(), dates)
return list(map(lambda d: d.isoformat(), past))
# Use a Russell-3000 ETF tracker (ticker IWV) to get a list of holdings
r3000 = pd.read_csv('https://www.ishares.com/us/products/239714/'
'ishares-russell-3000-etf/1449138789749.ajax?'
'fileType=csv&fileName=IWV_holdings&dataType=fund',
header=10)
r3000_equities = r3000[(r3000['Exchange'] == 'NASDAQ') |
(r3000['Exchange'] == 'New York Stock Exchange Inc.')]
dates_file = open('earnings_dates.yaml', 'w')
with futures.ThreadPoolExecutor(max_workers=8) as pool:
fs = {pool.submit(earnings_releases, r3000_equities.ix[t]['Ticker']): t
for t in r3000_equities.index}
pbar = progressbar.ProgressBar(term_width=80,
max_value=r3000_equities.index.max())
for future in futures.as_completed(fs):
i = fs[future]
pbar.update(i)
dates_file.write(yaml.dump({r3000_equities.ix[i]['Ticker']:
future.result()}))
```
Downloading stock price data needed for the event studies:
```python
from secrets import QUANDL_KEY
import pandas as pd
import yaml
from dateutil.parser import parse
from datetime import timedelta
import quandl
from progressbar import ProgressBar
def fetch_ticker(ticker, start, end):
# Quandl is currently giving me issues with returning
# the entire dataset and not slicing server-side.
# So instead, we'll do it client-side!
q_format = '%Y-%m-%d'
ticker_data = quandl.get('YAHOO/' + ticker,
start_date=start.strftime(q_format),
end_date=end.strftime(q_format),
authtoken=QUANDL_KEY)
return ticker_data
data_str = open('earnings_dates.yaml', 'r').read()
# Need to remove invalid lines
filtered = filter(lambda x: '{' not in x, data_str.split('\n'))
earnings_data = yaml.load('\n'.join(filtered))
# Get the first 1500 keys - split up into two statements
# because of Quandl rate limits
tickers = list(earnings_data.keys())
price_dict = {}
invalid_tickers = []
for ticker in ProgressBar()(tickers[0:1500]):
try:
# Replace '.' with '-' in name for some tickers
fixed = ticker.replace('.', '-')
event_strs = earnings_data[ticker]
events = [parse(event) for event in event_strs]
td = timedelta(days=20)
price_dict[ticker] = fetch_ticker(fixed,
min(events)-td, max(events)+td)
except quandl.NotFoundError:
invalid_tickers.append(ticker)
# Execute this after 10 minutes have passed
for ticker in ProgressBar()(tickers[1500:]):
try:
# Replace '.' with '-' in name for some tickers
fixed = ticker.replace('.', '-')
event_strs = earnings_data[ticker]
events = [parse(event) for event in event_strs]
td = timedelta(days=20)
price_dict[ticker] = fetch_ticker(fixed,
min(events)-td, max(events)+td)
except quandl.NotFoundError:
invalid_tickers.append(ticker)
prices_store = pd.HDFStore('price_data.hdf')
for ticker, prices in price_dict.items():
prices_store[ticker] = prices
```

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1,692 @@
---
slug: 2016/06/event-studies-and-earnings-releases
title: Event studies and earnings releases
date: 2016-06-08 12:00:00
authors: [bspeice]
tags: []
---
Or, being suspicious of market insiders.
<!-- truncate -->
## The Market Just Knew
I recently saw two examples of stock charts that have kept me thinking for a while. And now that the semester is complete, I finally have enough time to really look at them and give them the treatment they deserve. The first is good old Apple:
<details>
<summary>Code</summary>
```python
from secrets import QUANDL_KEY
import matplotlib.pyplot as plt
from matplotlib.dates import date2num
from matplotlib.finance import candlestick_ohlc
from matplotlib.dates import DateFormatter, WeekdayLocator,\
DayLocator, MONDAY
import quandl
from datetime import datetime
import pandas as pd
%matplotlib inline
def fetch_ticker(ticker, start, end):
# Quandl is currently giving me issues with returning
# the entire dataset and not slicing server-side.
# So instead, we'll do it client-side!
q_format = '%Y-%m-%d'
ticker_data = quandl.get('YAHOO/' + ticker,
start_date=start.strftime(q_format),
end_date=end.strftime(q_format),
authtoken=QUANDL_KEY)
return ticker_data
def ohlc_dataframe(data, ax=None):
# Much of this code re-used from:
# http://matplotlib.org/examples/pylab_examples/finance_demo.html
if ax is None:
f, ax = plt.subplots()
vals = [(date2num(date), *(data.loc[date]))
for date in data.index]
candlestick_ohlc(ax, vals)
mondays = WeekdayLocator(MONDAY)
alldays = DayLocator()
weekFormatter = DateFormatter('%b %d')
ax.xaxis.set_major_locator(mondays)
ax.xaxis.set_minor_locator(alldays)
ax.xaxis.set_major_formatter(weekFormatter)
return ax
```
</details>
```python
AAPL = fetch_ticker('AAPL', datetime(2016, 3, 1), datetime(2016, 5, 1))
ax = ohlc_dataframe(AAPL)
plt.vlines(date2num(datetime(2016, 4, 26, 12)),
ax.get_ylim()[0], ax.get_ylim()[1],
color='b',
label='Earnings Release')
plt.legend(loc=3)
plt.title("Apple Price 3/1/2016 - 5/1/2016");
```
![png](_notebook_files/_notebook_3_0.png)
The second chart is from Facebook:
```python
FB = fetch_ticker('FB', datetime(2016, 3, 1), datetime(2016, 5, 5))
ax = ohlc_dataframe(FB)
plt.vlines(date2num(datetime(2016, 4, 27, 12)),
ax.get_ylim()[0], ax.get_ylim()[1],
color='b', label='Earnings Release')
plt.title('Facebook Price 3/5/2016 - 5/5/2016')
plt.legend(loc=2);
```
![png](_notebook_files/_notebook_5_0.png)
These two charts demonstrate a very specific phenomenon: how the market prepares for earnings releases. Let's look at those charts again, but with some extra information. As we're about to see, the market "knew" in advance that Apple was going to perform poorly. The market expected Facebook to perform poorly as well, and Facebook instead shot the lights out. Let's see that trend in action:
<details>
<summary>Code</summary>
```python
def plot_hilo(ax, start, end, data):
ax.plot([date2num(start), date2num(end)],
[data.loc[start]['High'], data.loc[end]['High']],
color='b')
ax.plot([date2num(start), date2num(end)],
[data.loc[start]['Low'], data.loc[end]['Low']],
color='b')
f, axarr = plt.subplots(1, 2)
ax_aapl = axarr[0]
ax_fb = axarr[1]
# Plot the AAPL trend up and down
ohlc_dataframe(AAPL, ax=ax_aapl)
plot_hilo(ax_aapl, datetime(2016, 3, 1), datetime(2016, 4, 15), AAPL)
plot_hilo(ax_aapl, datetime(2016, 4, 18), datetime(2016, 4, 26), AAPL)
ax_aapl.vlines(date2num(datetime(2016, 4, 26, 12)),
ax_aapl.get_ylim()[0], ax_aapl.get_ylim()[1],
color='g', label='Earnings Release')
ax_aapl.legend(loc=2)
ax_aapl.set_title('AAPL Price History')
# Plot the FB trend down and up
ohlc_dataframe(FB, ax=ax_fb)
plot_hilo(ax_fb, datetime(2016, 3, 30), datetime(2016, 4, 27), FB)
plot_hilo(ax_fb, datetime(2016, 4, 28), datetime(2016, 5, 5), FB)
ax_fb.vlines(date2num(datetime(2016, 4, 27, 12)),
ax_fb.get_ylim()[0], ax_fb.get_ylim()[1],
color='g', label='Earnings Release')
ax_fb.legend(loc=2)
ax_fb.set_title('FB Price History')
f.set_size_inches(18, 6)
```
</details>
![png](_notebook_files/_notebook_7_0.png)
As we can see above, the market broke a prevailing trend on Apple in order to go down, and ultimately predicted the earnings release. For Facebook, the opposite happened. While the trend was down, the earnings were fantastic and the market corrected itself much higher.
## Formulating the Question
While these are two specific examples, there are plenty of other examples you could cite one way or another. Even if the preponderance of evidence shows that the market correctly predicts earnings releases, we need not accuse people of collusion; for a company like Apple with many suppliers we can generally forecast how Apple has done based on those same suppliers.
The question, then, is this: **how well does the market predict earnings releases?** It's an incredibly broad question that I want to dissect in a couple of different ways:
1. Given a stock that has been trending down over the past N days before an earnings release, how likely is it to continue downward after the release?
2. Given a stock trending up, how likely is it to continue up?
3. Is there a difference in accuracy between large- and small-cap stocks?
4. How often, and for how long, do markets trend before an earnings release?
**I want to especially thank Alejandro Saltiel for helping me retrieve the data.** He's great. And now for all of the interesting bits.
## Event Studies
Before we go too much further, I want to introduce the actual event study. Each chart intends to capture a lot of information and present an easy-to-understand pattern:
<details>
<summary>Code</summary>
```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
from datetime import datetime, timedelta
# If you remove rules, it removes them from *all* calendars
# To ensure we don't pop rules we don't want to, first make
# sure to fully copy the object
trade_calendar = USFederalHolidayCalendar()
trade_calendar.rules.pop(6) # Remove Columbus day
trade_calendar.rules.pop(7) # Remove Veteran's day
TradeDay = lambda days: CustomBusinessDay(days, calendar=trade_calendar)
def plot_study(array):
# Given a 2-d array, we assume the event happens at index `lookback`,
# and create all of our summary statistics from there.
lookback = int((array.shape[1] - 1) / 2)
norm_factor = np.repeat(array[:,lookback].reshape(-1, 1), array.shape[1], axis=1)
centered_data = array / norm_factor - 1
lookforward = centered_data.shape[1] - lookback
means = centered_data.mean(axis=0)
lookforward_data = centered_data[:,lookforward:]
std_dev = np.hstack([0, lookforward_data.std(axis=0)])
maxes = lookforward_data.max(axis=0)
mins = lookforward_data.min(axis=0)
f, axarr = plt.subplots(1, 2)
range_begin = -lookback
range_end = lookforward
axarr[0].plot(range(range_begin, range_end), means)
axarr[1].plot(range(range_begin, range_end), means)
axarr[0].fill_between(range(0, range_end),
means[-lookforward:] + std_dev,
means[-lookforward:] - std_dev,
alpha=.5, label="$\pm$ 1 s.d.")
axarr[1].fill_between(range(0, range_end),
means[-lookforward:] + std_dev,
means[-lookforward:] - std_dev,
alpha=.5, label="$\pm$ 1 s.d.")
max_err = maxes - means[-lookforward+1:]
min_err = means[-lookforward+1:] - mins
axarr[0].errorbar(range(1, range_end),
means[-lookforward+1:],
yerr=[min_err, max_err], label='Max & Min')
axarr[0].legend(loc=2)
axarr[1].legend(loc=2)
axarr[0].set_xlim((-lookback-1, lookback+1))
axarr[1].set_xlim((-lookback-1, lookback+1))
def plot_study_small(array):
# Given a 2-d array, we assume the event happens at index `lookback`,
# and create all of our summary statistics from there.
lookback = int((array.shape[1] - 1) / 2)
norm_factor = np.repeat(array[:,lookback].reshape(-1, 1), array.shape[1], axis=1)
centered_data = array / norm_factor - 1
lookforward = centered_data.shape[1] - lookback
means = centered_data.mean(axis=0)
lookforward_data = centered_data[:,lookforward:]
std_dev = np.hstack([0, lookforward_data.std(axis=0)])
maxes = lookforward_data.max(axis=0)
mins = lookforward_data.min(axis=0)
range_begin = -lookback
range_end = lookforward
plt.plot(range(range_begin, range_end), means)
plt.fill_between(range(0, range_end),
means[-lookforward:] + std_dev,
means[-lookforward:] - std_dev,
alpha=.5, label="$\pm$ 1 s.d.")
max_err = maxes - means[-lookforward+1:]
min_err = means[-lookforward+1:] - mins
plt.errorbar(range(1, range_end),
means[-lookforward+1:],
yerr=[min_err, max_err], label='Max & Min')
plt.legend(loc=2)
plt.xlim((-lookback-1, lookback+1))
def fetch_event_data(ticker, events, horizon=5):
# Use horizon+1 to account for including the day of the event,
# and half-open interval - that is, for a horizon of 5,
# we should be including 11 events. Additionally, using the
# CustomBusinessDay means we automatically handle issues if
# for example a company reports Friday afternoon - the date
# calculator will turn this into a "Saturday" release, but
# we effectively shift that to Monday with the logic below.
td_back = TradeDay(horizon+1)
td_forward = TradeDay(horizon+1)
start_date = min(events) - td_back
end_date = max(events) + td_forward
total_data = fetch_ticker(ticker, start_date, end_date)
event_data = [total_data.ix[event-td_back:event+td_forward]\
[0:horizon*2+1]\
['Adjusted Close']
for event in events]
return np.array(event_data)
```
</details>
```python
# Generate a couple of random events
event_dates = [datetime(2016, 5, 27) - timedelta(days=1) - TradeDay(x*20) for x in range(1, 40)]
data = fetch_event_data('CELG', event_dates)
plot_study_small(data)
plt.legend(loc=3)
plt.gcf().set_size_inches(12, 6);
plt.annotate('Mean price for days leading up to each event',
(-5, -.01), (-4.5, .025),
arrowprops=dict(facecolor='black', shrink=0.05))
plt.annotate('', (-.1, .005), (-.5, .02),
arrowprops={'facecolor': 'black', 'shrink': .05})
plt.annotate('$\pm$ 1 std. dev. each day', (5, .055), (2.5, .085),
arrowprops={'facecolor': 'black', 'shrink': .05})
plt.annotate('Min/Max each day', (.9, -.07), (-1, -.1),
arrowprops={'facecolor': 'black', 'shrink': .05});
```
![png](_notebook_files/_notebook_11_0.png)
And a quick textual explanation of each element:
- The blue line represents the mean price for each day, represented as a percentage of the price on the '0-day'. For example, if we defined an 'event' as whenever the stock price dropped for three days, we would see a decreasing blue line to the left of the 0-day.
- The blue shaded area represents one standard deviation above and below the mean price for each day following an event. This is intended to give us an idea of what the stock price does in general following an event.
- The green bars are the minimum and maximum price for each day following an event, which shows how far it's actually possible for the stock to move.
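To make that normalization concrete, here's a minimal sketch (using made-up prices rather than data from the study) of how a single event window is converted into returns relative to the 0-day, the same way `plot_study` does with its `norm_factor`:

```python
import numpy as np

# Hypothetical closing prices for one event window:
# three days before the event, the 0-day itself, and three days after
window = np.array([98.0, 99.0, 100.5, 100.0, 103.0, 97.0, 101.0])
lookback = 3

# Express every price as a percentage change from the 0-day price
centered = window / window[lookback] - 1
print(centered.round(3))
# [-0.02  -0.01   0.005  0.     0.03  -0.03   0.01 ]
```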
## Event Type 1: Trending down over the past N days
The first type of event I want to study is how stocks perform when they've been trending down over the past couple of days prior to a release. However, we need to clarify what exactly is meant by "trending down." To do so, we'll use the following metric: **the midpoint between each day's opening and closing price goes down over a period of N days**.
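As a minimal sketch of that check (with made-up prices, not the study data): compute each day's open/close midpoint and require every day-over-day change to be negative. The `does_trend_down` function used later in the study implements essentially this logic against the real price data.

```python
import pandas as pd

# Hypothetical OHLC data for the days leading up to a release
prices = pd.DataFrame({
    'Open':  [31.0, 30.2, 29.8, 29.1],
    'Close': [30.4, 29.6, 29.0, 28.7],
})

midpoints = prices['Open'] / 2 + prices['Close'] / 2
# True only if the midpoint decreased on every single day
is_trending_down = (midpoints.diff().dropna() < 0).all()
print(is_trending_down)  # True
```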
It's probably helpful to see a real-world example as well:
<details>
<summary>Code</summary>
```python
f, axarr = plt.subplots(1, 2)
f.set_size_inches(18, 6)
FB_plot = axarr[0]
ohlc_dataframe(FB[datetime(2016, 4, 18):], FB_plot)
FB_truncated = FB[datetime(2016, 4, 18):datetime(2016, 4, 27)]
midpoint = FB_truncated['Open']/2 + FB_truncated['Close']/2
FB_plot.plot(FB_truncated.index, midpoint, label='Midpoint')
FB_plot.vlines(date2num(datetime(2016, 4, 27, 12)),
FB_plot.get_ylim()[0], FB_plot.get_ylim()[1],
color='g', label='Earnings Release')
FB_plot.legend(loc=2)
FB_plot.set_title('FB Midpoint Plot')
AAPL_plot = axarr[1]
ohlc_dataframe(AAPL[datetime(2016, 4, 10):], AAPL_plot)
AAPL_truncated = AAPL[datetime(2016, 4, 10):datetime(2016, 4, 26)]
midpoint = AAPL_truncated['Open']/2 + AAPL_truncated['Close']/2
AAPL_plot.plot(AAPL_truncated.index, midpoint, label='Midpoint')
AAPL_plot.vlines(date2num(datetime(2016, 4, 26, 12)),
AAPL_plot.get_ylim()[0], AAPL_plot.get_ylim()[1],
color='g', label='Earnings Release')
AAPL_plot.legend(loc=3)
AAPL_plot.set_title('AAPL Midpoint Plot');
```
</details>
![png](_notebook_files/_notebook_14_0.png)
Given these charts, we can see that FB was trending down for the four days preceding the earnings release, and AAPL was trending down for a whopping 8 days (we don't count the peak day). This will define the methodology that we will use for the study.
So what are the results? For a given horizon, how well does the market actually perform?
<details>
<summary>Code</summary>
```python
# Read in the events for each stock;
# The file was created using the first code block in the Appendix
import yaml
from dateutil.parser import parse
from progressbar import ProgressBar
data_str = open('earnings_dates.yaml', 'r').read()
# Need to remove invalid lines
filtered = filter(lambda x: '{' not in x, data_str.split('\n'))
earnings_data = yaml.load('\n'.join(filtered))
# Convert our earnings data into a list of (ticker, date) pairs
# to make it easy to work with.
# This is horribly inefficient, but should get us what we need
ticker_dates = []
for ticker, date_list in earnings_data.items():
for iso_str in date_list:
ticker_dates.append((ticker, parse(iso_str)))
def does_trend_down(ticker, event, horizon):
# Figure out if the `event` has a downtrend for
# the `horizon` days preceding it
# As an interpretation note: it is assumed that
# the closing price of day `event` is the reference
# point, and we want `horizon` days before that.
# The price_data.hdf was created in the second appendix code block
try:
ticker_data = pd.read_hdf('price_data.hdf', ticker)
data = ticker_data[event-TradeDay(horizon):event]
midpoints = data['Open']/2 + data['Close']/2
# Shift dates one forward into the future and subtract
# Effectively: do we trend down over all days?
elems = midpoints - midpoints.shift(1)
return len(elems)-1 == len(elems.dropna()[elems <= 0])
except KeyError:
# If the stock doesn't exist, it doesn't qualify as trending down
# Mostly this is here to make sure the entire analysis doesn't
# blow up if there were issues in data retrieval
return False
def study_trend(horizon, trend_function):
five_day_events = np.zeros((1, horizon*2 + 1))
invalid_events = []
for ticker, event in ProgressBar()(ticker_dates):
if trend_function(ticker, event, horizon):
ticker_data = pd.read_hdf('price_data.hdf', ticker)
event_data = ticker_data[event-TradeDay(horizon):event+TradeDay(horizon)]['Close']
try:
five_day_events = np.vstack([five_day_events, event_data])
except ValueError:
# Sometimes we don't get exactly the right number of values due to calendar
# issues. I've fixed most everything I can, and the few issues that are left
# I assume don't systemically bias the results (i.e. data could be missing
# because it doesn't exist, etc.). After running through, ~1% of events get
# discarded this way
invalid_events.append((ticker, event))
# Remove our initial zero row
five_day_events = five_day_events[1:,:]
plot_study(five_day_events)
plt.gcf().suptitle('Action over {} days: {} events'
.format(horizon,five_day_events.shape[0]))
plt.gcf().set_size_inches(18, 6)
# Start with a 5 day study
study_trend(5, does_trend_down)
```
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:21:38 Time: 0:21:38
```
</details>
![png](_notebook_files/_notebook_16_1.png)
When a stock has been trending down for 5 days, once the earnings are announced it really doesn't move on average. However, the variability is *incredible*. This implies two important things:
1. The market is just as often wrong about an earnings announcement before it happens as it is correct
2. The incredible width of the min/max bars and standard deviation area tell us that the market reacts *violently* after the earnings are released.
Let's repeat the same study, but over time horizons of 8 days and 3 days. Presumably, if a stock has been going down for 8 days straight before the earnings, the market should be more accurate.
<details>
<summary>Code</summary>
```python
# 8 day study next
study_trend(8, does_trend_down)
```
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:20:29 Time: 0:20:29
```
</details>
![png](_notebook_files/_notebook_18_1.png)
However, looking only at stocks that trended down for 8 days prior to a release, the same pattern emerges: on average, the stock doesn't move, but the market reaction is often incredibly violent.
<details>
<summary>Code</summary>
```python
# 3 day study after that
study_trend(3, does_trend_down)
```
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:26:26 Time: 0:26:26
```
</details>
![png](_notebook_files/_notebook_20_1.png)
Finally, when we look at a 3-day horizon, we start getting some incredible outliers. Stocks have the potential to move up by roughly 300%, and the standard deviation width is, again, incredible. The results for a 3-day horizon follow the same pattern we've seen in the 5- and 8-day horizons.
## Event Type 2: Trending up for N days
We're now going to repeat the analysis, but do it for uptrends instead. That is, instead of looking at stocks that have been trending down over the past number of days, we focus only on stocks that have been trending up.
<details>
<summary>Code</summary>
```python
def does_trend_up(ticker, event, horizon):
# Figure out if the `event` has an uptrend for
# the `horizon` days preceding it
# As an interpretation note: it is assumed that
# the closing price of day `event` is the reference
# point, and we want `horizon` days before that.
# The price_data.hdf was created in the second appendix code block
try:
ticker_data = pd.read_hdf('price_data.hdf', ticker)
data = ticker_data[event-TradeDay(horizon):event]
midpoints = data['Open']/2 + data['Close']/2
# Shift dates one forward into the future and subtract
# Effectively: do we trend up over all days?
elems = midpoints - midpoints.shift(1)
return len(elems)-1 == len(elems.dropna()[elems >= 0])
except KeyError:
# If the stock doesn't exist, it doesn't qualify as trending up
# Mostly this is here to make sure the entire analysis doesn't
# blow up if there were issues in data retrieval
return False
study_trend(5, does_trend_up)
```
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:22:51 Time: 0:22:51
```
</details>
![png](_notebook_files/_notebook_23_1.png)
The patterns here are very similar. With the exception of noting that stocks can go to nearly 400% after an earnings announcement (most likely this included a takeover announcement, etc.), we still see large min/max bars and wide standard deviation of returns.
We'll repeat the pattern for stocks going up for both 8 and 3 days straight, but at this point, the results should be very predictable:
<details>
<summary>Code</summary>
```python
study_trend(8, does_trend_up)
```
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:20:51 Time: 0:20:51
```
</details>
![png](_notebook_files/_notebook_25_1.png)
<details>
<summary>Code</summary>
```python
study_trend(3, does_trend_up)
```
```
100% (47578 of 47578) |###########################################################| Elapsed Time: 0:26:56 Time: 0:26:56
```
</details>
![png](_notebook_files/_notebook_26_1.png)
## Conclusion and Summary
I guess the most important thing to summarize with is this: **looking at the entire market, stock performance prior to an earnings release has no bearing on how the stock performs after the release.** Honestly: given the huge variability of returns after an earnings release, even when the stock has been trending for a long time, you're best off divesting before an earnings release and letting the market sort itself out.
*However*, there is a big caveat. These results are taken when we look at the entire market. So while we can say that the market as a whole knows nothing and just reacts violently, I want to take a closer look into this data. Does the market typically perform poorly on large-cap/high liquidity stocks? Do smaller companies have investors that know them better and can thus predict performance better? Are specific market sectors better at prediction? Presumably technology stocks are more volatile than the industrials.
So there are some more interesting questions I still want to ask with this data. Knowing that the hard work of data processing is largely already done, it should be fairly simple to continue this analysis and get much more refined with it. Until next time.
# Appendix
Export event data for Russell 3000 companies:
<details>
<summary>Code</summary>
```python
import pandas as pd
from html.parser import HTMLParser
from datetime import datetime, timedelta
import requests
import re
from dateutil import parser
import progressbar
from concurrent import futures
import yaml
class EarningsParser(HTMLParser):
store_dates = False
earnings_offset = None
dates = []
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.dates = []
def handle_starttag(self, tag, attrs):
if tag == 'table':
self.store_dates = True
def handle_data(self, data):
if self.store_dates:
match = re.match(r'\d+/\d+/\d+', data)
if match:
self.dates.append(match.group(0))
# If a company reports before the bell, record the earnings date
# being at midnight the day before. Ex: WMT reports 5/19/2016,
# but we want the reference point to be the closing price on 5/18/2016
if 'After Close' in data:
self.earnings_offset = timedelta(days=0)
elif 'Before Open' in data:
self.earnings_offset = timedelta(days=-1)
def handle_endtag(self, tag):
if tag == 'table':
self.store_dates = False
def earnings_releases(ticker):
#print("Looking up ticker {}".format(ticker))
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) '\
'Gecko/20100101 Firefox/46.0'
headers = {'user-agent': user_agent}
base_url = 'http://www.streetinsider.com/ec_earnings.php?q={}'\
.format(ticker)
e = EarningsParser()
s = requests.Session()
a = requests.adapters.HTTPAdapter(max_retries=0)
s.mount('http://', a)
e.feed(str(s.get(base_url, headers=headers).content))
if e.earnings_offset is not None:
dates = map(lambda x: parser.parse(x) + e.earnings_offset, e.dates)
past = filter(lambda x: x < datetime.now(), dates)
return list(map(lambda d: d.isoformat(), past))
# Use a Russell-3000 ETF tracker (ticker IWV) to get a list of holdings
r3000 = pd.read_csv('https://www.ishares.com/us/products/239714/'
'ishares-russell-3000-etf/1449138789749.ajax?'
'fileType=csv&fileName=IWV_holdings&dataType=fund',
header=10)
r3000_equities = r3000[(r3000['Exchange'] == 'NASDAQ') |
(r3000['Exchange'] == 'New York Stock Exchange Inc.')]
dates_file = open('earnings_dates.yaml', 'w')
with futures.ThreadPoolExecutor(max_workers=8) as pool:
fs = {pool.submit(earnings_releases, r3000_equities.ix[t]['Ticker']): t
for t in r3000_equities.index}
pbar = progressbar.ProgressBar(term_width=80,
max_value=r3000_equities.index.max())
for future in futures.as_completed(fs):
i = fs[future]
pbar.update(i)
dates_file.write(yaml.dump({r3000_equities.ix[i]['Ticker']:
future.result()}))
```
</details>
Downloading stock price data needed for the event studies:
<details>
<summary>Code</summary>
```python
from secrets import QUANDL_KEY
import pandas as pd
import yaml
from dateutil.parser import parse
from datetime import timedelta
import quandl
from progressbar import ProgressBar
def fetch_ticker(ticker, start, end):
# Quandl is currently giving me issues with returning
# the entire dataset and not slicing server-side.
# So instead, we'll do it client-side!
q_format = '%Y-%m-%d'
ticker_data = quandl.get('YAHOO/' + ticker,
start_date=start.strftime(q_format),
end_date=end.strftime(q_format),
authtoken=QUANDL_KEY)
return ticker_data
data_str = open('earnings_dates.yaml', 'r').read()
# Need to remove invalid lines
filtered = filter(lambda x: '{' not in x, data_str.split('\n'))
earnings_data = yaml.load('\n'.join(filtered))
# Get the first 1500 keys - split up into two statements
# because of Quandl rate limits
tickers = list(earnings_data.keys())
price_dict = {}
invalid_tickers = []
for ticker in ProgressBar()(tickers[0:1500]):
try:
# Replace '.' with '-' in name for some tickers
fixed = ticker.replace('.', '-')
event_strs = earnings_data[ticker]
events = [parse(event) for event in event_strs]
td = timedelta(days=20)
price_dict[ticker] = fetch_ticker(fixed,
min(events)-td, max(events)+td)
except quandl.NotFoundError:
invalid_tickers.append(ticker)
# Execute this after 10 minutes have passed
for ticker in ProgressBar()(tickers[1500:]):
try:
# Replace '.' with '-' in name for some tickers
fixed = ticker.replace('.', '-')
event_strs = earnings_data[ticker]
events = [parse(event) for event in event_strs]
td = timedelta(days=20)
price_dict[ticker] = fetch_ticker(fixed,
min(events)-td, max(events)+td)
except quandl.NotFoundError:
invalid_tickers.append(ticker)
prices_store = pd.HDFStore('price_data.hdf')
for ticker, prices in price_dict.items():
prices_store[ticker] = prices
```
</details>
@ -0,0 +1,309 @@
Title: A Rustic Re-Podcasting Server (Part 1)
Date: 2016-10-22
Category: Blog
Tags: Rust, nutone
Authors: Bradlee Speice
Summary: Learning Rust by fire (it sounds better than learning by corrosion)
[//]: <> "Modified: "
I listen to a lot of Drum and Bass music, because it's beautiful music. And
there's a particular site, [Bassdrive.com](http://bassdrive.com/) that hosts
a lot of great content. Specifically, the
[archives](http://archives.bassdrivearchive.com/) section of the site has a
list of the past shows that you can download and listen to. The issue is, it's
just a [giant list of links to download](http://archives.bassdrivearchive.com/6%20-%20Saturday/Electronic%20Warfare%20-%20The%20Overfiend/). I'd really like
this in a podcast format to take with me on the road, etc.
So I wrote the [elektricity](https://github.com/bspeice/elektricity) web
application to actually accomplish all that. Whenever you request a feed, it
goes out to Bassdrive, processes all the links on a page, and serves up some
fresh, tasty RSS to satisfy your ears. I hosted it on Heroku using the free
tier because it's really not resource-intensive at all.
**The issue so far** is that I keep running out of free tier hours during a
month because my podcasting application likes to have a server scan for new
episodes constantly. Not sure why it's doing that, but I don't have a whole
lot of control over it. It's a phenomenal application otherwise.
**My (over-engineered) solution**: Re-write the application using the
[Rust](https://www.rust-lang.org/en-US/) programming language. I'd like to run
this on a small hacker board I own, and doing this in Rust would allow me to
easily cross-compile it. Plus, I've been very interested in the Rust language
for a while and this would be a great opportunity to really learn it well.
The code is available [here](https://github.com/bspeice/nutone) as development
progresses.
# The Setup
We'll be using the [iron](http://ironframework.io/) library to handle the
server, and [hyper](http://hyper.rs/) to fetch the data we need from elsewhere
on the interwebs. [HTML5Ever](http://doc.servo.org/html5ever/index.html) allows
us to ingest the content that will be coming from Bassdrive, and finally,
output is done with [handlebars-rust](http://sunng87.github.io/handlebars-rust/handlebars/index.html).
It will ultimately be interesting to see how much more work must be done to
actually get this working over another language like Python. Coming from a
dynamic-language state of mind, it's super easy to just chain stuff together, ship it out,
and call it a day. I think I'm going to end up getting much dirtier trying to
write all of this out.
# Issue 1: Strings
Strings in Rust are hard. I acknowledge Python can get away with some things
that make strings super easy (and Python 3 has gotten better at cracking down
on some bad cases, `str <-> bytes` specifically), but Rust is hard.
Let's take for example the `404` error handler I'm trying to write. The result
should be incredibly simple: All I want is to echo back
`Didn't find URL: <url>`. Shouldn't be that hard right? In Python I'd just do
something like:
```python
def echo_handler(request):
return "You're visiting: {}".format(request.uri)
```
And we'd call it a day. Rust isn't so simple. Let's start with the trivial
examples people post online:
```rust
fn hello_world(req: &mut Request) -> IronResult<Response> {
Ok(Response::with((status::Ok, "You found the server!")))
}
```
Doesn't look too bad right? In fact, it's essentially the same as the Python
version! All we need to do is just send back a string of some form. So, we
look up the documentation for [`Request`](http://ironframework.io/doc/iron/request/struct.Request.html) and see a `url` field that will contain
what we want. Let's try the first iteration:
```rust
fn hello_world(req: &mut Request) -> IronResult<Response> {
Ok(Response::with((status::Ok, "You found the URL: " + req.url)))
}
```
Which yields the error:
error[E0369]: binary operation `+` cannot be applied to type `&'static str`
OK, what's going on here? Time to start Googling for ["concatenate strings in Rust"](https://www.google.com/#q=concatenate+strings+in+rust). That's what we
want to do right? Concatenate a static string and the URL.
After Googling, we come across a helpful [`concat!`](https://doc.rust-lang.org/std/macro.concat!.html) macro that looks really nice! Let's try that one:
```rust
fn hello_world(req: &mut Request) -> IronResult<Response> {
Ok(Response::with((status::Ok, concat!("You found the URL: ", req.url))))
}
```
And the error:
`error: expected a literal`
Turns out Rust actually blows up because the `concat!` macro expects us to know
at compile time what `req.url` is. Which, in my outsider opinion, is a bit
strange. `println!` and `format!`, etc., all handle values they don't know at
compile time. Why can't `concat!`? In any case, we need a new plan of attack.
How about we try formatting strings?
```rust
fn hello_world(req: &mut Request) -> IronResult<Response> {
Ok(Response::with((status::Ok, format!("You found the URL: {}", req.url))))
}
```
And at long last, it works. Onwards!
# Issue 2: Fighting with the borrow checker
Rust's single coolest feature is how the compiler can guarantee safety in your
program. As long as you don't use `unsafe` pointers in Rust, you're guaranteed
safety. And not having truly manual memory management is really cool; I'm
totally OK with never having to write `malloc()` again.
That said, even [the Rust documentation](https://doc.rust-lang.org/book/ownership.html) makes a specific note:
> Many new users to Rust experience something we like to call
> fighting with the borrow checker, where the Rust compiler refuses to
> compile a program that the author thinks is valid.
If you have to put it in the documentation, it's not a helpful note:
it's hazing.
So now that we have a handler which works with information from the request, we
want to start making something that looks like an actual web application.
The router provided by `iron` isn't terribly difficult so I won't cover it.
Instead, the thing that had me stumped for a couple hours was trying to
dynamically create routes.
The unfortunate thing with Rust (in my limited experience at the moment) is that
there is a severe lack of non-trivial examples. Using the router is easy when
you want to give an example of a static function. But how do you start
working on things that are a bit more complex?
We're going to cover that here. Our first try: creating a function which returns
other functions. This is a principle called [currying](http://stackoverflow.com/a/36321/1454178). We set up a function that allows us to keep some data in scope
for another function to come later.
```rust
fn build_handler(message: String) -> Fn(&mut Request) -> IronResult<Response> {
move |_: &mut Request| {
Ok(Response::with((status::Ok, message)))
}
}
```
We've simply set up a function that returns another anonymous function with the
`message` parameter scoped in. If you compile this, you get not 1, not 2, but 5
new errors. 4 of them are the same though:
error[E0277]: the trait bound `for<'r, 'r, 'r> std::ops::Fn(&'r mut iron::Request<'r, 'r>) -> std::result::Result<iron::Response, iron::IronError> + 'static: std::marker::Sized` is not satisfied
...oookay. I for one, am not going to spend time trying to figure out what's
going on there.
And it is here that I will save the audience many hours of frustrated effort.
At this point, I decided to switch from `iron` to pure `hyper` since using
`hyper` would give me a much simpler API. All I would have to do is build a
function that took two parameters as input, and we're done. That said, it
ultimately posed many more issues because I started getting into a weird fight
with the `'static` [lifetime](https://doc.rust-lang.org/book/lifetimes.html)
and being a Rust newbie I just gave up on trying to understand it.
Instead, we will abandon (mostly) the curried function attempt, and instead
take advantage of something Rust actually intends us to use: `struct` and
`trait`.
Remember when I talked about a lack of non-trivial examples on the Internet?
This is what I was talking about. I could only find *one* example of this
available online, and it was incredibly complex and contained code we honestly
don't need or care about. There was no documentation of how to build routes that
didn't use static functions, etc. But, I'm assuming you don't really care about
my whining, so let's get to it.
The `iron` documentation mentions the [`Handler`](http://ironframework.io/doc/iron/middleware/trait.Handler.html) trait as being something we can implement.
Does the function signature for that `handle()` method look familiar? It's what
we've been working with so far.
The principle is that we need to define a new `struct` to hold our data, then
implement that `handle()` method to return the result. Something that looks
like this might do:
```rust
struct EchoHandler {
message: String
}
impl Handler for EchoHandler {
fn handle(&self, _: &mut Request) -> IronResult<Response> {
Ok(Response::with((status::Ok, self.message)))
}
}
// Later in the code when we set up the router...
let echo = EchoHandler {
message: "Is it working yet?"
}
router.get("/", echo.handle, "index");
```
We attempt to build a struct, and give its `handle` method off to the router
so the router knows what to do.
You guessed it, more errors:
error: attempted to take value of method `handle` on type `EchoHandler`
Now, the Rust compiler is actually a really nice fellow, and offers us help:
help: maybe a `()` to call it is missing? If not, try an anonymous function
We definitely don't want to call that function, so maybe try an anonymous
function as it recommends?
```rust
router.get("/", |req: &mut Request| echo.handle(req), "index");
```
Another error:
error[E0373]: closure may outlive the current function, but it borrows `echo`, which is owned by the current function
Another helpful message:
help: to force the closure to take ownership of `echo` (and any other referenced variables), use the `move` keyword
We're getting closer though! Let's implement this change:
```rust
router.get("/", move |req: &mut Request| echo.handle(req), "index");
```
And here's where things get strange:
error[E0507]: cannot move out of borrowed content
--> src/main.rs:18:40
|
18 | Ok(Response::with((status::Ok, self.message)))
| ^^^^ cannot move out of borrowed content
Now, this took me another couple hours to figure out. I'm going to explain it,
but **keep this in mind: Rust only allows one reference at a time** (exceptions
apply of course).
When we attempt to use `self.message` as it has been created in the earlier
`struct`, we essentially are trying to give it away to another piece of code.
Rust's semantics then state that *we may no longer access it* unless it is
returned to us (which `iron`'s code does not do). There are two ways to fix
this:
1. Only give away references (i.e. `&self.message` instead of `self.message`)
instead of transferring ownership
2. Make a copy of the underlying value which will be safe to give away
I didn't know these were the two options originally, so I hope this helps the
audience out. Because `iron` won't accept a reference, we are forced into the
second option: making a copy. To do so, we just need to change the function
to look like this:
```rust
Ok(Response::with((status::Ok, self.message.clone())))
```
Not so bad, huh? My only complaint is that it took so long to figure out exactly
what was going on.
And now we have a small server that we can configure dynamically. At long last.
> Final sidenote: You can actually do this without anonymous functions. Just
> change the router line to:
> `router.get("/", echo, "index");`
>
> Rust's type system seems to figure out that we want to use the `handle()` method.
# Conclusion
After a good long day's work, we now have the routing functionality set up on
our application. We should be able to scale this pretty well in the future:
the RSS content we need to deliver in the future can be treated as a string, so
the building blocks are in place.
There are two important things I learned starting with Rust today:
1. Rust is a new language, and while the code is high-quality, the mindshare is still catching up.
2. I'm a terrible programmer.
Number 1 is pretty obvious and not surprising to anyone. Number two caught me
off guard. I've gotten used to having either a garbage collector (Java, Python,
etc.) or playing a little fast and loose with scoping rules (C, C++). You don't
have to worry about object lifetime there. With Rust, it's forcing me to fully
understand and use well the memory in my applications. In the final mistake I
fixed (using `.clone()`) I would have been fine in C++ to just give away that
reference and never use it again. I wouldn't have run into a "use-after-free"
error, but I would have potentially been leaking memory. Rust forced me to be
incredibly precise about how I use it.
All said I'm excited for using Rust more. I think it's super cool, it's just
going to take me a lot longer to do this than I originally thought.
@ -0,0 +1,329 @@
---
slug: 2016/10/rustic-repodcasting
title: A Rustic re-podcasting server
date: 2016-10-22 12:00:00
authors: [bspeice]
tags: []
---
Learning Rust by fire (it sounds better than learning by corrosion)
<!-- truncate -->
I listen to a lot of Drum and Bass music, because it's beautiful music. And
there's a particular site, [Bassdrive.com](http://bassdrive.com/) that hosts
a lot of great content. Specifically, the
[archives](http://archives.bassdrivearchive.com/) section of the site has a
list of the past shows that you can download and listen to. The issue is, it's
just a [giant list of links to download](http://archives.bassdrivearchive.com/6%20-%20Saturday/Electronic%20Warfare%20-%20The%20Overfiend/). I'd really like
this in a podcast format to take with me on the road, etc.
So I wrote the [elektricity](https://github.com/bspeice/elektricity) web
application to actually accomplish all that. Whenever you request a feed, it
goes out to Bassdrive, processes all the links on a page, and serves up some
fresh, tasty RSS to satisfy your ears. I hosted it on Heroku using the free
tier because it's really not resource-intensive at all.
**The issue so far** is that I keep running out of free tier hours during a
month because my podcasting application likes to have a server scan for new
episodes constantly. Not sure why it's doing that, but I don't have a whole
lot of control over it. It's a phenomenal application otherwise.
**My (over-engineered) solution**: Re-write the application using the
[Rust](https://www.rust-lang.org/en-US/) programming language. I'd like to run
this on a small hacker board I own, and doing this in Rust would allow me to
easily cross-compile it. Plus, I've been very interested in the Rust language
for a while and this would be a great opportunity to really learn it well.
The code is available [here](https://github.com/bspeice/nutone) as development
progresses.
## The Setup
We'll be using the [iron](http://ironframework.io/) library to handle the
server, and [hyper](http://hyper.rs/) to fetch the data we need from elsewhere
on the interwebs. [HTML5Ever](http://doc.servo.org/html5ever/index.html) allows
us to ingest the content that will be coming from Bassdrive, and finally,
output is done with [handlebars-rust](http://sunng87.github.io/handlebars-rust/handlebars/index.html).
It will ultimately be interesting to see how much more work must be done to
actually get this working over another language like Python. Coming from a
dynamic-language state of mind, it's super easy to just chain stuff together, ship it out,
and call it a day. I think I'm going to end up getting much dirtier trying to
write all of this out.
## Issue 1: Strings
Strings in Rust are hard. I acknowledge Python can get away with some things
that make strings super easy (and Python 3 has gotten better at cracking down
on some bad cases, `str <-> bytes` specifically), but Rust is hard.
Let's take for example the `404` error handler I'm trying to write. The result
should be incredibly simple: All I want is to echo back
`Didn't find URL: <url>`. Shouldn't be that hard right? In Python I'd just do
something like:
```python
def echo_handler(request):
return "You're visiting: {}".format(request.uri)
```
And we'd call it a day. Rust isn't so simple. Let's start with the trivial
examples people post online:
```rust
fn hello_world(req: &mut Request) -> IronResult<Response> {
Ok(Response::with((status::Ok, "You found the server!")))
}
```
Doesn't look too bad right? In fact, it's essentially the same as the Python
version! All we need to do is just send back a string of some form. So, we
look up the documentation for [`Request`](http://ironframework.io/doc/iron/request/struct.Request.html) and see a `url` field that will contain
what we want. Let's try the first iteration:
```rust
fn hello_world(req: &mut Request) -> IronResult<Response> {
Ok(Response::with((status::Ok, "You found the URL: " + req.url)))
}
```
Which yields the error:
```
error[E0369]: binary operation `+` cannot be applied to type `&'static str`
```
OK, what's going on here? Time to start Googling for ["concatenate strings in Rust"](https://www.google.com/#q=concatenate+strings+in+rust). That's what we
want to do right? Concatenate a static string and the URL.
After Googling, we come across a helpful [`concat!`](https://doc.rust-lang.org/std/macro.concat!.html) macro that looks really nice! Let's try that one:
```rust
fn hello_world(req: &mut Request) -> IronResult<Response> {
Ok(Response::with((status::Ok, concat!("You found the URL: ", req.url))))
}
```
And the error:
```
error: expected a literal
```
Turns out Rust actually blows up because the `concat!` macro expects us to know
at compile time what `req.url` is. Which, in my outsider opinion, is a bit
strange. `println!` and `format!`, etc., all handle values they don't know at
compile time. Why can't `concat!`? In any case, we need a new plan of attack.
How about we try formatting strings?
```rust
fn hello_world(req: &mut Request) -> IronResult<Response> {
Ok(Response::with((status::Ok, format!("You found the URL: {}", req.url))))
}
```
And at long last, it works. Onwards!
## Issue 2: Fighting with the borrow checker
Rust's single coolest feature is how the compiler can guarantee safety in your
program. As long as you don't use `unsafe` pointers in Rust, you're guaranteed
safety. And not having truly manual memory management is really cool; I'm
totally OK with never having to write `malloc()` again.
That said, even [the Rust documentation](https://doc.rust-lang.org/book/ownership.html) makes a specific note:
> Many new users to Rust experience something we like to call
> fighting with the borrow checker, where the Rust compiler refuses to
> compile a program that the author thinks is valid.
If you have to put it in the documentation, it's not a helpful note:
it's hazing.
So now that we have a handler which works with information from the request, we
want to start making something that looks like an actual web application.
The router provided by `iron` isn't terribly difficult so I won't cover it.
Instead, the thing that had me stumped for a couple hours was trying to
dynamically create routes.
The unfortunate thing with Rust (in my limited experience at the moment) is that
there is a severe lack of non-trivial examples. Using the router is easy when
you want to give an example of a static function. But how do you start
working on things that are a bit more complex?
We're going to cover that here. Our first try: creating a function which returns
other functions. This is a principle called [currying](http://stackoverflow.com/a/36321/1454178). We set up a function that allows us to keep some data in scope
for another function to come later.
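Since this post keeps reaching for Python comparisons: in Python, the same idea is just a closure. A quick sketch (not part of the original server code; the `request.uri` attribute is assumed, mirroring the earlier Python example):

```python
def build_handler(message):
    # `handler` closes over `message`, keeping it in scope for
    # whenever the router eventually calls the returned function
    def handler(request):
        return "{}: {}".format(message, request.uri)
    return handler

echo = build_handler("You found the URL")
```

The first attempt at the same idea in Rust: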
```rust
fn build_handler(message: String) -> Fn(&mut Request) -> IronResult<Response> {
move |_: &mut Request| {
Ok(Response::with((status::Ok, message)))
}
}
```
We've simply set up a function that returns another anonymous function with the
`message` parameter scoped in. If you compile this, you get not 1, not 2, but 5
new errors. 4 of them are the same though:
```
error[E0277]: the trait bound `for<'r, 'r, 'r> std::ops::Fn(&'r mut iron::Request<'r, 'r>) -> std::result::Result<iron::Response, iron::IronError> + 'static: std::marker::Sized` is not satisfied
```
...oookay. I for one, am not going to spend time trying to figure out what's
going on there.
And it is here that I will save the audience many hours of frustrated effort.
At this point, I decided to switch from `iron` to pure `hyper` since using
`hyper` would give me a much simpler API. All I would have to do is build a
function that took two parameters as input, and we're done. That said, it
ultimately posed many more issues because I started getting into a weird fight
with the `'static` [lifetime](https://doc.rust-lang.org/book/lifetimes.html)
and being a Rust newbie I just gave up on trying to understand it.
Instead, we will abandon (mostly) the curried function attempt, and instead
take advantage of something Rust actually intends us to use: `struct` and
`trait`.
Remember when I talked about a lack of non-trivial examples on the Internet?
This is what I was talking about. I could only find *one* example of this
available online, and it was incredibly complex and contained code we honestly
don't need or care about. There was no documentation of how to build routes that
didn't use static functions, etc. But, I'm assuming you don't really care about
my whining, so let's get to it.
The `iron` documentation mentions the [`Handler`](http://ironframework.io/doc/iron/middleware/trait.Handler.html) trait as being something we can implement.
Does the function signature for that `handle()` method look familiar? It's what
we've been working with so far.
The principle is that we need to define a new `struct` to hold our data, then
implement that `handle()` method to return the result. Something that looks
like this might do:
```rust
struct EchoHandler {
message: String
}
impl Handler for EchoHandler {
fn handle(&self, _: &mut Request) -> IronResult<Response> {
Ok(Response::with((status::Ok, self.message)))
}
}
// Later in the code when we set up the router...
let echo = EchoHandler {
message: "Is it working yet?"
}
router.get("/", echo.handle, "index");
```
We attempt to build a struct, and give its `handle` method off to the router
so the router knows what to do.
You guessed it, more errors:
```
error: attempted to take value of method `handle` on type `EchoHandler`
```
Now, the Rust compiler is actually a really nice fellow, and offers us help:
```
help: maybe a `()` to call it is missing? If not, try an anonymous function
```
We definitely don't want to call that function, so maybe try an anonymous
function as it recommends?
```rust
router.get("/", |req: &mut Request| echo.handle(req), "index");
```
Another error:
```
error[E0373]: closure may outlive the current function, but it borrows `echo`, which is owned by the current function
```
Another helpful message:
```
help: to force the closure to take ownership of `echo` (and any other referenced variables), use the `move` keyword
```
We're getting closer though! Let's implement this change:
```rust
router.get("/", move |req: &mut Request| echo.handle(req), "index");
```
And here's where things get strange:
```
error[E0507]: cannot move out of borrowed content
--> src/main.rs:18:40
|
18 | Ok(Response::with((status::Ok, self.message)))
| ^^^^ cannot move out of borrowed content
```
Now, this took me another couple hours to figure out. I'm going to explain it,
but **keep this in mind: Rust only allows one reference at a time** (exceptions
apply of course).
When we attempt to use `self.message` as it has been created in the earlier
`struct`, we essentially are trying to give it away to another piece of code.
Rust's semantics then state that *we may no longer access it* unless it is
returned to us (which `iron`'s code does not do). There are two ways to fix
this:
1. Only give away references (i.e. `&self.message` instead of `self.message`)
instead of transferring ownership
2. Make a copy of the underlying value which will be safe to give away
I didn't know these were the two options originally, so I hope this helps the
audience out. Because `iron` won't accept a reference, we are forced into the
second option: making a copy. To do so, we just need to change the function
to look like this:
```rust
Ok(Response::with((status::Ok, self.message.clone())))
```
Not so bad, huh? My only complaint is that it took so long to figure out exactly
what was going on.
And now we have a small server that we can configure dynamically. At long last.
> Final sidenote: You can actually do this without anonymous functions. Just
> change the router line to:
> `router.get("/", echo, "index");`
>
> Rust's type system seems to figure out that we want to use the `handle()` method.
## Conclusion
After a good long day's work, we now have the routing functionality set up on
our application. We should be able to scale this pretty well in the future:
the RSS content we need to deliver in the future can be treated as a string, so
the building blocks are in place.
There are two important things I learned starting with Rust today:
1. Rust is a new language, and while the code is high-quality, the mindshare is still catching up.
2. I'm a terrible programmer.
Number 1 is pretty obvious and not surprising to anyone. Number two caught me
off guard. I've gotten used to having either a garbage collector (Java, Python,
etc.) or playing a little fast and loose with scoping rules (C, C++). You don't
have to worry about object lifetime there. With Rust, it's forcing me to fully
understand and use well the memory in my applications. In the final mistake I
fixed (using `.clone()`) I would have been fine in C++ to just give away that
reference and never use it again. I wouldn't have run into a "use-after-free"
error, but I would have potentially been leaking memory. Rust forced me to be
incredibly precise about how I use it.
All said I'm excited for using Rust more. I think it's super cool, it's just
going to take me a lot longer to do this than I originally thought.
@ -0,0 +1,9 @@
Title: Audio Compression using PCA
Date: 2016-11-01
Category: Blog
Tags: PCA, Machine Learning, Digital Signal Processing
Authors: Bradlee Speice
Summary: In which I apply Machine Learning techniques to Digital Signal Processing to astounding failure.
[//]: <> "Modified: "
{% notebook 2016-11-01-PCA-audio-compression.ipynb %}
@ -0,0 +1,362 @@
---
slug: 2016/11/pca-audio-compression
title: PCA audio compression
date: 2016-11-01 12:00:00
authors: [bspeice]
tags: []
---
In which I apply Machine Learning techniques to Digital Signal Processing to astounding failure.
<!-- truncate -->
Towards a new (and pretty poor) compression scheme
--------------------------------------------------
I'm going to be working with some audio data for a while as I get prepared for a term project this semester. I'll be working (with a partner) to design a system for separating voices from music. Given my total lack of experience with [Digital Signal Processing][1] I figured that now was as good a time as ever to work on a couple of fun projects that would get me back up to speed.
The first project I want to work on: Designing a new compression scheme for audio data.
A Brief Introduction to Audio Compression
-----------------------------------------
Audio files when uncompressed (files ending with `.wav`) are huge. Like, 10.5 Megabytes per minute huge (there's a quick sanity check of that number right after the list below). Storage is cheap these days, but that's still an incredible amount of data that we don't really need. Instead, we'd like to compress that data so that it's not taking up so much space. There are broadly two ways to accomplish this:
1. Lossless compression - Formats like [FLAC][2], [ALAC][3], and [Monkey's Audio (.ape)][4] all go down this route. The idea is that when you compress and uncompress a file, you get exactly the same as what you started with.
2. Lossy compression - Formats like [MP3][5], [Ogg][6], and [AAC (`.m4a`)][7] are far more popular, but make a crucial tradeoff: We can reduce the file size even more during compression, but the decompressed file won't be the same.
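As a quick aside, that "10.5 Megabytes per minute" figure is just arithmetic for CD-quality audio: 44,100 samples per second, two channels, two bytes per sample. A back-of-the-envelope sketch (not part of the original analysis):

```python
sample_rate = 44100   # samples per second (CD quality)
channels = 2          # stereo
bytes_per_sample = 2  # 16-bit samples

bytes_per_minute = sample_rate * channels * bytes_per_sample * 60
print(bytes_per_minute / 10**6)  # ~10.58 MB per minute
```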
There is a fundamental tradeoff at stake: Using lossy compression sacrifices some of the integrity of the resulting file to save on storage space. Most people (I personally believe it's everybody) can't hear the difference, so this is an acceptable tradeoff. You have files that take up a 10<sup>th</sup> of the space, and nobody can tell there's a difference in audio quality.
A PCA-based Compression Scheme
------------------------------
What I want to try out is a [PCA][8] approach to encoding audio. The PCA technique comes from Machine Learning, where it is used for a process called [Dimensionality Reduction][9]. Put simply, the idea is the same as lossy compression: if we can find a way that represents the data well enough, we can save on space. There are a lot of theoretical concerns that lead me to believe this compression style will not end well, but I'm interested to try it nonetheless.
PCA works as follows: Given a dataset with a number of features, I find a way to approximate those original features using some "new features" that are statistically as close as possible to the original ones. This is comparable to a scheme like MP3: Given an original signal, I want to find a way of representing it that gets approximately close to what the original was. The difference is that PCA is designed for statistical data, and not signal data. But we won't let that stop us.
The idea is as follows: Given a signal, reshape it into 1024 columns by however many rows are needed (zero-padded if necessary). Run the PCA algorithm, and do dimensionality reduction with a couple different settings. The number of components I choose determines the quality: If I use 1024 components, I will essentially be using the original signal. If I use a smaller number of components, I start losing some of the data that was in the original file. This will give me an idea of whether it's possible to actually build an encoding scheme off of this, or whether I'm wasting my time.
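To make the blocking scheme concrete, here's a quick sketch of just the zero-pad-and-reshape step (the `pca_reduce` function defined below does the same thing before handing the blocks to scikit-learn); the 3,000-sample array is a stand-in for the real audio:

```python
import numpy as np

block_size = 1024
signal = np.arange(3000)  # stand-in for the real audio samples

# Zero-pad so the signal divides evenly into 1024-sample blocks
hanging = block_size - (len(signal) % block_size)
padded = np.pad(signal, (0, hanging), 'constant')

# Each row becomes one block; PCA treats the 1024 columns as features
blocks = padded.reshape(-1, block_size)
print(blocks.shape)  # (3, 1024): 3,000 samples pad out to 3 blocks
```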
Running the Algorithm
---------------------
The audio I will be using comes from the song [Tabulasa][10], by [Broke for Free][11]. I'll be loading in the audio signal to Python and using [Scikit-Learn][12] to actually run the PCA algorithm.
We first need to convert the FLAC file I have to a WAV:
[1]: https://en.wikipedia.org/wiki/Digital_signal_processing
[2]: https://en.wikipedia.org/wiki/FLAC
[3]: https://en.wikipedia.org/wiki/Apple_Lossless
[4]: https://en.wikipedia.org/wiki/Monkey%27s_Audio
[5]: https://en.wikipedia.org/wiki/MP3
[6]: https://en.wikipedia.org/wiki/Vorbis
[7]: https://en.wikipedia.org/wiki/Advanced_Audio_Coding
[8]: https://en.wikipedia.org/wiki/Principal_component_analysis
[9]: https://en.wikipedia.org/wiki/Dimensionality_reduction
[10]: https://brokeforfree.bandcamp.com/track/tabulasa
[11]: https://brokeforfree.bandcamp.com/album/xxvii
[12]: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
```python
!ffmpeg -hide_banner -loglevel panic -i "Broke For Free/XXVII/01 Tabulasa.flac" "Tabulasa.wav" -c wav
```
Then, let's go ahead and load a small sample so you can hear what is going on.
```python
from IPython.display import Audio
from scipy.io import wavfile
samplerate, tabulasa = wavfile.read('Tabulasa.wav')
start = samplerate * 14 # 14 seconds in
end = start + samplerate * 10 # 10 second duration
Audio(data=tabulasa[start:end, 0], rate=samplerate)
```
import wav1 from "./1.wav";
<audio controls src={wav1}/>
Next, we'll define the code we will be using to do PCA. It's very short, as the PCA algorithm is very simple.
```python
from sklearn.decomposition import PCA
import numpy as np
def pca_reduce(signal, n_components, block_size=1024):
# First, zero-pad the signal so that it is divisible by the block_size
samples = len(signal)
hanging = block_size - np.mod(samples, block_size)
padded = np.lib.pad(signal, (0, hanging), 'constant', constant_values=0)
# Reshape the signal to have 1024 dimensions
reshaped = padded.reshape((len(padded) // block_size, block_size))
# Second, do the actual PCA process
pca = PCA(n_components=n_components)
pca.fit(reshaped)
transformed = pca.transform(reshaped)
reconstructed = pca.inverse_transform(transformed).reshape((len(padded)))
return pca, transformed, reconstructed
```
Now that we've got our functions set up, let's try actually running something. First, we'll use `n_components == block_size`, which implies that we should end up with the same signal we started with.
```python
tabulasa_left = tabulasa[:,0]
_, _, reconstructed = pca_reduce(tabulasa_left, 1024, 1024)
Audio(data=reconstructed[start:end], rate=samplerate)
```
import wav2 from "./2.wav";
<audio controls src={wav2}/>
OK, that does indeed sound like what we originally had. Let's drastically cut down the number of components we're doing this with as a sanity check: the audio quality should become incredibly poor.
```python
_, _, reconstructed = pca_reduce(tabulasa_left, 32, 1024)
Audio(data=reconstructed[start:end], rate=samplerate)
```
import wav3 from "./3.wav";
<audio controls src={wav3}/>
As expected, our reconstructed audio does sound incredibly poor! But there's something else very interesting going on here under the hood. Did you notice that the bassline comes across very well, but that there's no midrange or treble? The drums are almost entirely gone.
[Drop the (Treble)][13]
-----------------------
It will help to understand PCA more fully when trying to read this part, but I'll do my best to break it down. PCA tries to find a way to best represent the dataset using "components." Think of each "component" as containing some of the information you need in order to reconstruct the full audio. For example, you might have a "low frequency" component that contains all the information you need in order to hear the bassline. There might be other components that explain the high frequency things like singers, or melodies, that you also need.
What makes PCA interesting is that it attempts to find the "most important" components in explaining the signal. In a signal processing world, this means that PCA is trying to find the signal amongst the noise in your data. In our case, this means that PCA, when forced to work with small numbers of components, will chuck out the noisy components first. It's doing its best to reconstruct the signal, but it has to make sacrifices somewhere.
So I've mentioned that PCA identifies the "noisy" components in our dataset. This is equivalent to saying that PCA removes the "high frequency" components in this case: it's very easy to represent a low-frequency signal like a bassline. It's far more difficult to represent a high-frequency signal because it's changing all the time. When you force PCA to make a tradeoff by using a small number of components, the best it can hope to do is replicate the low-frequency sections and skip the high-frequency things.
This is a very interesting insight, and it also has echoes (pardon the pun) of how humans understand music in general. Other encoding schemes (like MP3, etc.) typically chop off a lot of the high-frequency range as well. There is typically a lot of high-frequency noise in audio that is nearly impossible to hear, so it's easy to remove it without anyone noticing. PCA ends up doing something similar, and while that certainly wasn't the intention, it is an interesting effect.
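One way to sanity-check that intuition (a quick probe, not part of the original analysis; it assumes the `pca_reduce` helper and `tabulasa_left` from above) is to look at `explained_variance_ratio_` on the fitted model. Scikit-learn orders components by how much variance they capture, so this shows directly how much of the signal a 32-component reconstruction can hope to retain:

```python
# Re-run the 32-component reduction and inspect the fitted PCA object
pca, _, _ = pca_reduce(tabulasa_left, 32, 1024)

# Per-component share of the total variance, and the overall total
print(pca.explained_variance_ratio_[:5])
print(pca.explained_variance_ratio_.sum())
```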
## A More Realistic Example
So we've seen the edge cases so far: Using a large number of components results in audio very close to the original, and using a small number of components acts as a low-pass filter. How about we develop something that sounds "good enough" in practice, that we can use as a benchmark for size? We'll use ourselves as judges of audio quality, and build another function to help us estimate how much space we need to store everything in.
[13]: https://youtu.be/Ua0KpfJsxKo?t=1m17s
```python
from bz2 import compress
import pandas as pd
def raw_estimate(transformed, pca):
# We assume that we'll be storing things as 16-bit WAV,
# meaning two bytes per sample
signal_bytes = transformed.tobytes()
# PCA stores the components as floating point, we'll assume
# that means 32-bit floats, so 4 bytes per element
component_bytes = transformed.tobytes()
# Return a result in megabytes
return (len(signal_bytes) + len(component_bytes)) / (2**20)
# Do an estimate for lossless compression applied on top of our
# PCA reduction
def bz2_estimate(transformed, pca):
bytestring = transformed.tobytes() + b';' + pca.components_.tobytes()
compressed = compress(bytestring)
return len(compressed) / (2**20)
compression_attempts = [
(1, 1),
(1, 2),
(1, 4),
(4, 32),
(16, 256),
(32, 256),
(64, 256),
(128, 1024),
(256, 1024),
(512, 1024),
(128, 2048),
(256, 2048),
(512, 2048),
(1024, 2048)
]
def build_estimates(signal, n_components, block_size):
pca, transformed, recon = pca_reduce(tabulasa_left, n_components, block_size)
raw_pca_estimate = raw_estimate(transformed, pca)
bz2_pca_estimate = bz2_estimate(transformed, pca)
raw_size = len(recon.tobytes()) / (2**20)
return raw_size, raw_pca_estimate, bz2_pca_estimate
pca_compression_results = pd.DataFrame([
build_estimates(tabulasa_left, n, bs)
for n, bs in compression_attempts
])
pca_compression_results.columns = ["Raw", "PCA", "PCA w/ BZ2"]
pca_compression_results.index = compression_attempts
pca_compression_results
```
<div>
<table>
<thead>
<tr>
<th></th>
<th>Raw</th>
<th>PCA</th>
<th>PCA w/ BZ2</th>
</tr>
</thead>
<tbody>
<tr>
<th>(1, 1)</th>
<td>69.054298</td>
<td>138.108597</td>
<td>16.431797</td>
</tr>
<tr>
<th>(1, 2)</th>
<td>69.054306</td>
<td>69.054306</td>
<td>32.981380</td>
</tr>
<tr>
<th>(1, 4)</th>
<td>69.054321</td>
<td>34.527161</td>
<td>16.715032</td>
</tr>
<tr>
<th>(4, 32)</th>
<td>69.054443</td>
<td>17.263611</td>
<td>8.481735</td>
</tr>
<tr>
<th>(16, 256)</th>
<td>69.054688</td>
<td>8.631836</td>
<td>4.274846</td>
</tr>
<tr>
<th>(32, 256)</th>
<td>69.054688</td>
<td>17.263672</td>
<td>8.542909</td>
</tr>
<tr>
<th>(64, 256)</th>
<td>69.054688</td>
<td>34.527344</td>
<td>17.097543</td>
</tr>
<tr>
<th>(128, 1024)</th>
<td>69.054688</td>
<td>17.263672</td>
<td>9.430644</td>
</tr>
<tr>
<th>(256, 1024)</th>
<td>69.054688</td>
<td>34.527344</td>
<td>18.870387</td>
</tr>
<tr>
<th>(512, 1024)</th>
<td>69.054688</td>
<td>69.054688</td>
<td>37.800940</td>
</tr>
<tr>
<th>(128, 2048)</th>
<td>69.062500</td>
<td>8.632812</td>
<td>6.185015</td>
</tr>
<tr>
<th>(256, 2048)</th>
<td>69.062500</td>
<td>17.265625</td>
<td>12.366942</td>
</tr>
<tr>
<th>(512, 2048)</th>
<td>69.062500</td>
<td>34.531250</td>
<td>24.736506</td>
</tr>
<tr>
<th>(1024, 2048)</th>
<td>69.062500</td>
<td>69.062500</td>
<td>49.517493</td>
</tr>
</tbody>
</table>
</div>
As we can see, there are a couple of instances where we do nearly 20 times better on storage space than the uncompressed file. Let's hear what that sounds like:
```python
_, _, reconstructed = pca_reduce(tabulasa_left, 16, 256)
Audio(data=reconstructed[start:end], rate=samplerate)
```
import wav4 from "./4.wav";
<audio controls src={wav4}/>
It sounds incredibly poor though. Let's try something that's a bit more realistic:
```python
_, _, reconstructed = pca_reduce(tabulasa_left, 1, 4)
Audio(data=reconstructed[start:end], rate=samplerate)
```
import wav5 from "./5.wav";
<audio controls src={wav5}/>
And just out of curiosity, we can try something that has the same ratio of components to block size. This should be close to an apples-to-apples comparison.
```python
_, _, reconstructed = pca_reduce(tabulasa_left, 64, 256)
Audio(data=reconstructed[start:end], rate=samplerate)
```
import wav6 from "./6.wav";
<audio controls src={wav6}/>
The smaller block size definitely has better high-end response, but I personally think the larger block size sounds better overall.
## Conclusions
So, what do I think about audio compression using PCA?
Strangely enough, it actually works pretty well relative to what I expected. That said, it's a terrible idea in general.
First off, you don't really save any space. The component matrix needed to actually run the PCA algorithm takes up a lot of space on its own, so it's very difficult to save space without sacrificing a huge amount of audio quality. And even then, codecs like AAC sound very nice even at bitrates that this PCA method could only dream of.
Second, there's the issue of audio streaming. PCA relies on two components: the datastream, and a matrix used to reconstruct the original signal. While it is easy to stream the data, you can't stream that matrix. And even if you divided the stream up into small blocks to give you a small matrix, you must guarantee that the matrix arrives; if you don't have that matrix, the data stream will make no sense whatsoever.
All said, this was an interesting experiment. It's really cool seeing PCA used for signal analysis where I haven't seen it applied before, but I don't think it will lead to any practical results. Look forward to more signal processing stuff in the future!
View File
@ -0,0 +1,244 @@
Title: Captain's Cookbook - Part 1
Date: 2018-01-16
Category: Blog
Tags: capnproto rust
Authors: Bradlee Speice
Summary: A basic introduction to getting started with Cap'N Proto
[//]: <> "Modified: "
# Captain's Cookbook - Part 1
I've been working with [Cap'N Proto](https://capnproto.org/) in Rust a lot recently, but there's a real dearth of information
on how to set up and get going quickly. In the interest of trying to get more people using this (because I think it's
fantastic), I'm going to work through a couple of examples detailing what exactly should be done to get going.
So, what is Cap'N Proto? It's a data serialization library. Its contemporaries are [Protobuf](https://developers.google.com/protocol-buffers/)
and [FlatBuffers](https://google.github.io/flatbuffers/), though it's more directly comparable to FlatBuffers. The whole point behind it
is to define a schema language and serialization format such that:
1. Applications that do not share the same base programming language can communicate
2. The data and schema you use can naturally evolve over time as your needs change
Accompanying this are typically code generators that take the schemas you define for your application and give you back
code for different languages to get data to and from that schema.
Now, what makes Cap'N Proto different from, say, Protobuf, is that there is no separate serialization/deserialization step the way
there is with Protobuf. Instead, the idea is that the message itself can be loaded into memory and used directly there.
We're going to take a look at a series of progressively more complex projects that use Cap'N Proto in an effort to provide some
examples of what idiomatic usage looks like, and shorten the startup time needed to make use of this library in Rust projects.
If you want to follow along, feel free. If not, I've posted [the final result](https://github.com/bspeice/capnp_cookbook_1)
for reference.
# Step 1: Installing `capnp`
The `capnp` binary itself is needed for taking the schema files you write and turning them into a format that can be used by the
code generation libraries. Don't ask me what that actually means; I just know that you need to make sure it's installed.
I'll refer you to [Cap'N Proto's installation instructions](https://capnproto.org/install.html) here. As a quick TLDR though:
- Linux users will likely have a binary shipped by their package manager - On Ubuntu, `apt install capnproto` is enough
- OS X users can use [Homebrew](https://brew.sh/) as an easy install path. Just `brew install capnp`
- Windows is a bit more involved. If you're using [Chocolatey](https://chocolatey.org/), there's [a package](https://chocolatey.org/packages/capnproto/) available. If that doesn't work, you'll need to download [a release zip](https://capnproto.org/capnproto-c++-win32-0.6.1.zip) and make sure that the `capnp.exe` binary is in your `%PATH%` environment variable
The way you know you're done with this step is if the following command works in your shell:
```bash
capnp id
```
# Step 2: Starting a Cap'N Proto Rust project
After the `capnp` binary is set up, it's time to actually create our Rust project. Nothing terribly complex here, just a simple
```bash
mkdir capnp_cookbook_1
cd capnp_cookbook_1
cargo init --bin
```
We'll put the following content into `Cargo.toml`:
```
[package]
name = "capnp_cookbook_1"
version = "0.1.0"
authors = ["Bradlee Speice <bspeice@kcg.com>"]
[build-dependencies]
capnpc = "0.8" # 1
[dependencies]
capnp = "0.8" # 2
```
This sets up:
1. The Rust code generator (the `capnpc` crate)
2. The Cap'N Proto runtime library (the `capnp` crate)
We've now got everything prepared that we need for writing a Cap'N Proto project.
# Step 3: Writing a basic schema
We're going to start with writing a pretty trivial data schema that we can extend later. This is just intended to make sure
you get familiar with how to start from a basic project.
First, we're going to create a top-level directory for storing the schema files in:
```bash
# Assuming we're starting from the `capnp_cookbook_1` directory created earlier
mkdir schema
cd schema
```
Now, we're going to put the following content in `point.capnp`:
```
@0xab555145c708dad2;
struct Point {
x @0 :Int32;
y @1 :Int32;
}
```
Pretty easy: we've now got the structure for an object we'll be able to quickly encode in a binary format.
# Step 4: Setting up the build process
Now it's time to actually set up the build process to make sure that Cap'N Proto generates the Rust code we'll eventually be using.
This is typically done through a `build.rs` file to invoke the schema compiler.
In the same folder as your `Cargo.toml` file, please put the following content in `build.rs`:
```rust
extern crate capnpc;
fn main() {
::capnpc::CompilerCommand::new()
.src_prefix("schema") // 1
.file("schema/point.capnp") // 2
.run().expect("compiling schema");
}
```
This sets up the protocol compiler (`capnpc` from earlier) to compile the schema we've built so far.
1. Because Cap'N Proto schema files can re-use types specified in other files, the `src_prefix()` tells the compiler
where to look for those extra files.
2. We specify the schema file we're including by hand. In a much larger project, you could presumably build the `CompilerCommand`
dynamically, but we won't worry too much about that for now (a rough sketch of what it might look like follows below).
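For the curious, here's a rough sketch of what that dynamic approach could look like. This is my own guess rather than part of the original project, and it assumes the same `schema/` directory layout as above:

```rust
extern crate capnpc;

use std::fs;

fn main() {
    let mut command = ::capnpc::CompilerCommand::new();
    command.src_prefix("schema");

    // Hand every .capnp file under schema/ to the compiler instead of
    // listing each one explicitly.
    for entry in fs::read_dir("schema").expect("reading schema directory") {
        let path = entry.expect("reading directory entry").path();
        if path.extension().and_then(|ext| ext.to_str()) == Some("capnp") {
            command.file(path);
        }
    }

    command.run().expect("compiling schema");
}
```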
# Step 5: Running the build
If you've done everything correctly so far, you should be able to actually build the project and see the auto-generated code.
Run a `cargo build` command, and if you don't see `cargo` complaining, you're doing just fine!
So where exactly does the generated code go? I think it's critically important for people to be able to see what the generated
code looks like, because you need to understand what you're actually programming against. The short answer is: the generated code lives
somewhere in the `target/` directory.
The long answer is that you're best off running a `find` command to get the actual file path:
```bash
# Assuming we're running from the capnp_cookbook_1 project folder
find . -name point_capnp.rs
```
Alternatively, if the `find` command isn't available, the path will look something like:
```
./target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs
```
See if there are any paths in your target directory that look similar.
Now, the file content looks pretty nasty. I've included an example [here](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs)
if you aren't following along at home. There are a couple things I'll try and point out though so you can get an idea of how
the schema we wrote for the "Point" message is tied to the generated code.
First, the Cap'N Proto library splits things up into `Builder` and `Reader` structs. These are best thought of in the same way
Rust separates `mut` from non-`mut` code. `Builder`s are `mut` versions of your message, and `Reader`s are immutable versions.
For example, the [`Builder` impl](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L90) for `point` defines [`get_x()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L105), [`set_x()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L109), [`get_y()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L113), and [`set_y()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L117) methods.
In comparison, the [`Reader` impl](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L38) only defines [`get_x()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L47) and [`get_y()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L51) methods.
So now we know that there are some `get` and `set` methods available for our `x` and `y` coordinates;
but what do we actually do with those?
# Step 6: Making a point
So we've installed Cap'N Proto, gotten a project set up, and can generate schema code now. It's time to actually start building
Cap'N Proto messages! I'm going to put the code you need here because it's small, and put some extra long comments inline. This code
should go in [`src/main.rs`](https://github.com/bspeice/capnp_cookbook_1/blob/master/src/main.rs):
```rust
// Note that we use `capnp` here, NOT `capnpc`
extern crate capnp;
// We create a module here to define how we are to access the code
// being included.
pub mod point_capnp {
// The environment variable OUT_DIR is set by Cargo, and
// is the location of all the code that was built as part
// of the codegen step.
// point_capnp.rs is the actual file to include
include!(concat!(env!("OUT_DIR"), "/point_capnp.rs"));
}
fn main() {
// The process of building a Cap'N Proto message is a bit tedious.
// We start by creating a generic Builder; it acts as the message
// container that we'll later be filling with content of our `Point`
let mut builder = capnp::message::Builder::new_default();
// Because we need a mutable reference to the `builder` later,
// we fence off this part of the code to allow sequential mutable
// borrows. As I understand it, non-lexical lifetimes:
// https://github.com/rust-lang/rust-roadmap/issues/16
// will make this no longer necessary
{
// And now we can set up the actual message we're trying to create
let mut point_msg = builder.init_root::<point_capnp::point::Builder>();
// Stuff our message with some content
point_msg.set_x(12);
point_msg.set_y(14);
}
// It's now time to serialize our message to binary. Let's set up a buffer for that:
let mut buffer = Vec::new();
// And actually fill that buffer with our data
capnp::serialize::write_message(&mut buffer, &builder).unwrap();
// Finally, let's deserialize the data
let deserialized = capnp::serialize::read_message(
&mut buffer.as_slice(),
capnp::message::ReaderOptions::new()
).unwrap();
// `deserialized` is currently a generic reader; it understands
// the content of the message we gave it (i.e. that there are two
// int32 values) but doesn't really know what they represent (the Point).
// This is where we map the generic data back into our schema.
let point_reader = deserialized.get_root::<point_capnp::point::Reader>().unwrap();
// We can now get our x and y values back, and make sure they match
assert_eq!(point_reader.get_x(), 12);
assert_eq!(point_reader.get_y(), 14);
}
```
And with that, we've now got a functioning project. Here's the content I'm planning to go over next as we build up
some practical examples of Cap'N Proto in action:
## Next steps:
Part 2: Using [TypedReader](https://github.com/capnproto/capnproto-rust/blob/master/src/message.rs#L181) to send messages across thread boundaries
Part 3: Serialization and Deserialization of multiple Cap'N Proto messages
View File
@ -0,0 +1,246 @@
---
slug: 2018/01/captains-cookbook-part-1
title: Captain's cookbook - part 1
date: 2018-01-16 12:00:00
authors: [bspeice]
tags: []
---
A basic introduction to getting started with Cap'N Proto.
<!-- truncate -->
I've been working with [Cap'N Proto](https://capnproto.org/) in Rust a lot recently, but there's a real dearth of information
on how to set up and get going quickly. In the interest of trying to get more people using this (because I think it's
fantastic), I'm going to work through a couple of examples detailing what exactly should be done to get going.
So, what is Cap'N Proto? It's a data serialization library. Its contemporaries are [Protobuf](https://developers.google.com/protocol-buffers/)
and [FlatBuffers](https://google.github.io/flatbuffers/), though it's more directly comparable to FlatBuffers. The whole point behind it
is to define a schema language and serialization format such that:
1. Applications that do not share the same base programming language can communicate
2. The data and schema you use can naturally evolve over time as your needs change
Accompanying this are typically code generators that take the schemas you define for your application and give you back
code for different languages to get data to and from that schema.
Now, what makes Cap'N Proto different from, say, Protobuf, is that there is no separate serialization/deserialization step the way
there is with Protobuf. Instead, the idea is that the message itself can be loaded into memory and used directly there.
We're going to take a look at a series of progressively more complex projects that use Cap'N Proto in an effort to provide some
examples of what idiomatic usage looks like, and shorten the startup time needed to make use of this library in Rust projects.
If you want to follow along, feel free. If not, I've posted [the final result](https://github.com/bspeice/capnp_cookbook_1)
for reference.
## Step 1: Installing `capnp`
The `capnp` binary itself is needed for taking the schema files you write and turning them into a format that can be used by the
code generation libraries. Don't ask me what that actually means; I just know that you need to make sure it's installed.
I'll refer you to [Cap'N Proto's installation instructions](https://capnproto.org/install.html) here. As a quick TLDR though:
- Linux users will likely have a binary shipped by their package manager - On Ubuntu, `apt install capnproto` is enough
- OS X users can use [Homebrew](https://brew.sh/) as an easy install path. Just `brew install capnp`
- Windows is a bit more involved. If you're using [Chocolatey](https://chocolatey.org/), there's [a package](https://chocolatey.org/packages/capnproto/) available. If that doesn't work, you'll need to download [a release zip](https://capnproto.org/capnproto-c++-win32-0.6.1.zip) and make sure that the `capnp.exe` binary is in your `%PATH%` environment variable
The way you know you're done with this step is if the following command works in your shell:
```bash
capnp id
```
## Step 2: Starting a Cap'N Proto Rust project
After the `capnp` binary is set up, it's time to actually create our Rust project. Nothing terribly complex here, just a simple
```bash
mkdir capnp_cookbook_1
cd capnp_cookbook_1
cargo init --bin
```
We'll put the following content into `Cargo.toml`:
```
[package]
name = "capnp_cookbook_1"
version = "0.1.0"
authors = ["Bradlee Speice <bspeice@kcg.com>"]
[build-dependencies]
capnpc = "0.8" # 1
[dependencies]
capnp = "0.8" # 2
```
This sets up:
1. The Rust code generator (the `capnpc` crate)
2. The Cap'N Proto runtime library (the `capnp` crate)
We've now got everything prepared that we need for writing a Cap'N Proto project.
## Step 3: Writing a basic schema
We're going to start with writing a pretty trivial data schema that we can extend later. This is just intended to make sure
you get familiar with how to start from a basic project.
First, we're going to create a top-level directory for storing the schema files in:
```bash
# Assuming we're starting from the `capnp_cookbook_1` directory created earlier
mkdir schema
cd schema
```
Now, we're going to put the following content in `point.capnp`:
```
@0xab555145c708dad2;
struct Point {
x @0 :Int32;
y @1 :Int32;
}
```
Pretty easy: we've now got the structure for an object we'll be able to quickly encode in a binary format.
## Step 4: Setting up the build process
Now it's time to actually set up the build process to make sure that Cap'N Proto generates the Rust code we'll eventually be using.
This is typically done through a `build.rs` file to invoke the schema compiler.
In the same folder as your `Cargo.toml` file, please put the following content in `build.rs`:
```rust
extern crate capnpc;
fn main() {
::capnpc::CompilerCommand::new()
.src_prefix("schema") // 1
.file("schema/point.capnp") // 2
.run().expect("compiling schema");
}
```
This sets up the protocol compiler (`capnpc` from earlier) to compile the schema we've built so far.
1. Because Cap'N Proto schema files can re-use types specified in other files, the `src_prefix()` tells the compiler
where to look for those extra files.
2. We specify the schema file we're including by hand. In a much larger project, you could presumably build the `CompilerCommand`
dynamically, but we won't worry too much about that for now (a rough sketch of what it might look like follows below).
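For the curious, here's a rough sketch of what that dynamic approach could look like. This is my own guess rather than part of the original project, and it assumes the same `schema/` directory layout as above:

```rust
extern crate capnpc;

use std::fs;

fn main() {
    let mut command = ::capnpc::CompilerCommand::new();
    command.src_prefix("schema");

    // Hand every .capnp file under schema/ to the compiler instead of
    // listing each one explicitly.
    for entry in fs::read_dir("schema").expect("reading schema directory") {
        let path = entry.expect("reading directory entry").path();
        if path.extension().and_then(|ext| ext.to_str()) == Some("capnp") {
            command.file(path);
        }
    }

    command.run().expect("compiling schema");
}
```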
## Step 5: Running the build
If you've done everything correctly so far, you should be able to actually build the project and see the auto-generated code.
Run a `cargo build` command, and if you don't see `cargo` complaining, you're doing just fine!
So where exactly does the generated code go? I think it's critically important for people to be able to see what the generated
code looks like, because you need to understand what you're actually programming against. The short answer is: the generated code lives
somewhere in the `target/` directory.
The long answer is that you're best off running a `find` command to get the actual file path:
```bash
# Assuming we're running from the capnp_cookbook_1 project folder
find . -name point_capnp.rs
```
Alternatively, if the `find` command isn't available, the path will look something like:
```
./target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs
```
See if there are any paths in your target directory that look similar.
Now, the file content looks pretty nasty. I've included an example [here](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs)
if you aren't following along at home. There are a couple things I'll try and point out though so you can get an idea of how
the schema we wrote for the "Point" message is tied to the generated code.
First, the Cap'N Proto library splits things up into `Builder` and `Reader` structs. These are best thought of in the same way
Rust separates `mut` from non-`mut` code. `Builder`s are `mut` versions of your message, and `Reader`s are immutable versions.
For example, the [`Builder` impl](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L90) for `point` defines [`get_x()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L105), [`set_x()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L109), [`get_y()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L113), and [`set_y()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L117) methods.
In comparison, the [`Reader` impl](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L38) only defines [`get_x()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L47) and [`get_y()`](https://github.com/bspeice/capnp_cookbook_1/blob/master/target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs#L51) methods.
So now we know that there are some `get` and `set` methods available for our `x` and `y` coordinates;
but what do we actually do with those?
## Step 6: Making a point
So we've installed Cap'N Proto, gotten a project set up, and can generate schema code now. It's time to actually start building
Cap'N Proto messages! I'm going to put the code you need here because it's small, and put some extra long comments inline. This code
should go in [`src/main.rs`](https://github.com/bspeice/capnp_cookbook_1/blob/master/src/main.rs):
```rust
// Note that we use `capnp` here, NOT `capnpc`
extern crate capnp;
// We create a module here to define how we are to access the code
// being included.
pub mod point_capnp {
// The environment variable OUT_DIR is set by Cargo, and
// is the location of all the code that was built as part
// of the codegen step.
// point_capnp.rs is the actual file to include
include!(concat!(env!("OUT_DIR"), "/point_capnp.rs"));
}
fn main() {
// The process of building a Cap'N Proto message is a bit tedious.
// We start by creating a generic Builder; it acts as the message
// container that we'll later be filling with content of our `Point`
let mut builder = capnp::message::Builder::new_default();
// Because we need a mutable reference to the `builder` later,
// we fence off this part of the code to allow sequential mutable
// borrows. As I understand it, non-lexical lifetimes:
// https://github.com/rust-lang/rust-roadmap/issues/16
// will make this no longer necessary
{
// And now we can set up the actual message we're trying to create
let mut point_msg = builder.init_root::<point_capnp::point::Builder>();
// Stuff our message with some content
point_msg.set_x(12);
point_msg.set_y(14);
}
// It's now time to serialize our message to binary. Let's set up a buffer for that:
let mut buffer = Vec::new();
// And actually fill that buffer with our data
capnp::serialize::write_message(&mut buffer, &builder).unwrap();
// Finally, let's deserialize the data
let deserialized = capnp::serialize::read_message(
&mut buffer.as_slice(),
capnp::message::ReaderOptions::new()
).unwrap();
// `deserialized` is currently a generic reader; it understands
// the content of the message we gave it (i.e. that there are two
// int32 values) but doesn't really know what they represent (the Point).
// This is where we map the generic data back into our schema.
let point_reader = deserialized.get_root::<point_capnp::point::Reader>().unwrap();
// We can now get our x and y values back, and make sure they match
assert_eq!(point_reader.get_x(), 12);
assert_eq!(point_reader.get_y(), 14);
}
```
And with that, we've now got a functioning project. Here's the content I'm planning to go over next as we build up
some practical examples of Cap'N Proto in action:
## Next steps
Part 2: Using [TypedReader](https://github.com/capnproto/capnproto-rust/blob/master/src/message.rs#L181) to send messages across thread boundaries
Part 3: Serialization and Deserialization of multiple Cap'N Proto messages
View File
@ -0,0 +1,249 @@
Title: Captain's Cookbook - Part 2
Date: 2018-01-18
Category: Blog
Tags: capnproto rust
Authors: Bradlee Speice
Summary: A look at more practical usages of Cap'N Proto
[//]: <> "Modified: "
# Captain's Cookbook - Part 2 - Using the TypedReader
[Part 1](http://bspeice.github.io/captains-cookbook-part-1.html) of this series took a look at a basic starting project
with Cap'N Proto. In this section, we're going to take the (admittedly basic) schema and look at how we can add a pretty
basic feature - sending Cap'N Proto messages between threads. It's nothing complex, but I want to make sure that there's
some documentation surrounding practical usage of the library.
As a quick refresher, we build a Cap'N Proto message and go through the serialization/deserialization steps
[here](https://github.com/bspeice/capnp_cookbook_1/blob/master/src/main.rs). Our current example is going to build on
the code we wrote there; after the deserialization step, we'll try and send the `point_reader` to a separate thread
for verification.
I'm going to walk through the attempts as I made them and my thinking throughout.
If you want to skip to the final project, check out the code available [here](https://github.com/bspeice/capnp_cookbook_2).
# Attempt 1: Move the reference
As a first attempt, we're going to try and let Rust move the reference. Our code will look something like:
```rust
fn main() {
// ...assume that we own a `buffer: Vec<u8>` containing the binary message content from
// somewhere else
let deserialized = capnp::serialize::read_message(
&mut buffer.as_slice(),
capnp::message::ReaderOptions::new()
).unwrap();
let point_reader = deserialized.get_root::<point_capnp::point::Reader>().unwrap();
// By using `point_reader` inside the new thread, we're hoping that Rust can
// safely move the reference and invalidate the original thread's usage.
// Since the original thread doesn't use `point_reader` again, this should
// be safe, right?
let handle = std::thread::spawn(move || {
assert_eq!(point_reader.get_x(), 12);
assert_eq!(point_reader.get_y(), 14);
});
handle.join().unwrap()
}
```
Well, the Rust compiler doesn't really like this. We get four distinct errors back:
```
error[E0277]: the trait bound `*const u8: std::marker::Send` is not satisfied in `[closure@src/main.rs:31:37: 36:6 point_reader:point_capnp::point::Reader<'_>]`
--> src/main.rs:31:18
|
31 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const u8` cannot be sent between threads safely
|
error[E0277]: the trait bound `*const capnp::private::layout::WirePointer: std::marker::Send` is not satisfied in `[closure@src/main.rs:31:37: 36:6 point_reader:point_capnp::point::Reader<'_>]`
--> src/main.rs:31:18
|
31 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const capnp::private::layout::WirePointer` cannot be sent between threads safely
|
error[E0277]: the trait bound `capnp::private::arena::ReaderArena: std::marker::Sync` is not satisfied
--> src/main.rs:31:18
|
31 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `capnp::private::arena::ReaderArena` cannot be shared between threads safely
|
error[E0277]: the trait bound `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>: std::marker::Send` is not satisfied in `[closure@src/main.rs:31:37: 36:6 point_reader:point_capnp::point::Reader<'_>]`
--> src/main.rs:31:18
|
31 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>` cannot be sent between threads safely
|
error: aborting due to 4 previous errors
```
Note: I've removed the help text for brevity, but suffice it to say that these errors are intimidating.
Pay attention to the text that keeps on getting repeated though: `XYZ cannot be sent between threads safely`.
This is a bit frustrating: we own the `buffer` from which all the content was derived, and we don't have any
unsafe accesses in our code. We guarantee that we wait for the child thread to stop first, so there's no possibility
of the pointer becoming invalid because the original thread exits before the child thread does. So why is Rust
preventing us from doing something that really should be legal?
This is what is known as [fighting the borrow checker](https://doc.rust-lang.org/1.8.0/book/references-and-borrowing.html).
Let our crusade begin.
# Attempt 2: Put the `Reader` in a `Box`
The [`Box`](https://doc.rust-lang.org/std/boxed/struct.Box.html) type allows us to convert a pointer we have
(in our case the `point_reader`) into an "owned" value, which should be easier to send across threads.
Our next attempt looks something like this:
```rust
fn main() {
// ...assume that we own a `buffer: Vec<u8>` containing the binary message content
// from somewhere else
let deserialized = capnp::serialize::read_message(
&mut buffer.as_slice(),
capnp::message::ReaderOptions::new()
).unwrap();
let point_reader = deserialized.get_root::<point_capnp::point::Reader>().unwrap();
let boxed_reader = Box::new(point_reader);
// Now that the reader is `Box`ed, we've proven ownership, and Rust can
// move the ownership to the new thread, right?
let handle = std::thread::spawn(move || {
assert_eq!(boxed_reader.get_x(), 12);
assert_eq!(boxed_reader.get_y(), 14);
});
handle.join().unwrap();
}
```
Spoiler alert: still doesn't work. Same errors still show up.
```
error[E0277]: the trait bound `*const u8: std::marker::Send` is not satisfied in `point_capnp::point::Reader<'_>`
--> src/main.rs:33:18
|
33 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const u8` cannot be sent between threads safely
|
error[E0277]: the trait bound `*const capnp::private::layout::WirePointer: std::marker::Send` is not satisfied in `point_capnp::point::Reader<'_>`
--> src/main.rs:33:18
|
33 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const capnp::private::layout::WirePointer` cannot be sent between threads safely
|
error[E0277]: the trait bound `capnp::private::arena::ReaderArena: std::marker::Sync` is not satisfied
--> src/main.rs:33:18
|
33 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `capnp::private::arena::ReaderArena` cannot be shared between threads safely
|
error[E0277]: the trait bound `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>: std::marker::Send` is not satisfied in `point_capnp::point::Reader<'_>`
--> src/main.rs:33:18
|
33 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>` cannot be sent between threads safely
|
error: aborting due to 4 previous errors
```
Let's be a little bit smarter about the errors this time though. What is that
[`std::marker::Send`](https://doc.rust-lang.org/std/marker/trait.Send.html) thing the compiler keeps telling us about?
The documentation is pretty clear; `Send` is used to denote:
> Types that can be transferred across thread boundaries.
In our case, we are seeing the error messages for two reasons:
1. Pointers (`*const u8`) are not safe to send across thread boundaries. While we're careful in our code
to wait on the child thread to finish before closing down, the Rust compiler can't make
that assumption, and so it complains that we're not using the pointer in a safe manner.
2. The `point_capnp::point::Reader` type is itself not safe to send across threads because it doesn't
implement the `Send` trait. Which is to say, the things that make up a `Reader` are themselves not thread-safe,
so the `Reader` is also not thread-safe. The short sketch below shows one way to check this directly.
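As an aside, here's a tiny sketch of my own (the `assert_send` helper is not part of the capnp library) that reproduces the same complaint in isolation: a function whose only job is to require `Send` on its argument.

```rust
// Only compiles for arguments whose type implements Send.
fn assert_send<T: Send>(_value: &T) {}

// assert_send(&point_reader); // error: point_capnp::point::Reader<'_> is not Send
// assert_send(&buffer);       // fine: Vec<u8> is Send
```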
So, how are we to actually transfer a parsed Cap'N Proto message between threads?
# Attempt 3: The `TypedReader`
The `TypedReader` is a new API implemented in the Cap'N Proto [Rust code](https://crates.io/crates/capnp/0.8.14).
We're interested in it here for two reasons:
1. It allows us to define an object where the _object_ owns the underlying data. In previous attempts,
the current context owned the data, but the `Reader` itself had no such control.
2. We can compose the `TypedReader` using objects that are safe to `Send` across threads, guaranteeing
that we can transfer parsed messages across threads.
The actual type info for the [`TypedReader`](https://github.com/capnproto/capnproto-rust/blob/f0efc35d7e9bd8f97ca4fdeb7c57fd7ea348e303/src/message.rs#L181)
is a bit complex. And to be honest, I'm still really not sure what the whole point of the
[`PhantomData`](https://doc.rust-lang.org/std/marker/struct.PhantomData.html) thing is either.
My impression is that it lets us enforce type safety when we know what the underlying Cap'N Proto
message represents. That is, technically the only thing we're storing is the untyped binary message;
`PhantomData` just enforces the principle that the binary represents some specific object that has been parsed.
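To make that concrete, here's a toy illustration of the pattern; this is my own sketch, not capnp's actual definition. The struct stores nothing but untyped bytes, while the `PhantomData` marker records which schema type those bytes are supposed to decode into.

```rust
use std::marker::PhantomData;

struct TypedBytes<T> {
    // The only real data: an untyped binary message.
    raw: Vec<u8>,
    // Zero-sized marker tying these bytes to a schema type at compile time.
    _schema: PhantomData<T>,
}

impl<T> TypedBytes<T> {
    fn new(raw: Vec<u8>) -> TypedBytes<T> {
        TypedBytes { raw: raw, _schema: PhantomData }
    }
}
```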
Either way, we can carefully construct something which is safe to move between threads:
```rust
fn main() {
// ...assume that we own a `buffer: Vec<u8>` containing the binary message content from somewhere else
let deserialized = capnp::serialize::read_message(
&mut buffer.as_slice(),
capnp::message::ReaderOptions::new()
).unwrap();
let point_reader: capnp::message::TypedReader<capnp::serialize::OwnedSegments, point_capnp::point::Owned> =
capnp::message::TypedReader::new(deserialized);
// Because the point_reader is now working with OwnedSegments (which are owned vectors) and an Owned message
// (which is 'static lifetime), this is now safe
let handle = std::thread::spawn(move || {
// The point_reader owns its data, and we use .get() to retrieve the actual point_capnp::point::Reader
// object from it
let point_root = point_reader.get().unwrap();
assert_eq!(point_root.get_x(), 12);
assert_eq!(point_root.get_y(), 14);
});
handle.join().unwrap();
}
```
And while we've left Rust to do the dirty work of actually moving the `point_reader` into the new thread,
we could also use things like [`mpsc` channels](https://doc.rust-lang.org/std/sync/mpsc/index.html) to achieve a similar effect.
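To make that channel idea concrete, here's a rough sketch of my own (the `send_over_channel` name is just for illustration; it reuses the `buffer` and the same types assumed above) that hands the `TypedReader` to another thread through an `mpsc` channel instead of moving it straight into the closure:

```rust
use std::sync::mpsc;

fn send_over_channel(buffer: Vec<u8>) {
    let deserialized = capnp::serialize::read_message(
        &mut buffer.as_slice(),
        capnp::message::ReaderOptions::new()
    ).unwrap();

    let typed_reader: capnp::message::TypedReader<capnp::serialize::OwnedSegments, point_capnp::point::Owned> =
        capnp::message::TypedReader::new(deserialized);

    // The channel only accepts payload types that implement Send,
    // which the TypedReader satisfies.
    let (tx, rx) = mpsc::channel();
    tx.send(typed_reader).unwrap();

    let handle = std::thread::spawn(move || {
        let received = rx.recv().unwrap();
        let point_root = received.get().unwrap();
        assert_eq!(point_root.get_x(), 12);
        assert_eq!(point_root.get_y(), 14);
    });
    handle.join().unwrap();
}
```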
So now we're able to define basic Cap'N Proto messages, and send them all around our programs.
## Next steps:
[Part 1: Setting up a basic Cap'N Proto Rust project](http://bspeice.github.io/captains-cookbook-part-1.html)
Part 3: Serialization and Deserialization of multiple Cap'N Proto messages
View File
@ -0,0 +1,247 @@
---
slug: 2018/01/captains-cookbook-part-2
title: Captain's cookbook - part 2
date: 2018-01-16 13:00:00
authors: [bspeice]
tags: []
---
A look at more practical usages of Cap'N Proto
<!-- truncate -->
[Part 1](/2018/01/captains-cookbook-part-1) of this series took a look at a basic starting project
with Cap'N Proto. In this section, we're going to take the (admittedly basic) schema and look at how we can add a pretty
basic feature - sending Cap'N Proto messages between threads. It's nothing complex, but I want to make sure that there's
some documentation surrounding practical usage of the library.
As a quick refresher, we build a Cap'N Proto message and go through the serialization/deserialization steps
[here](https://github.com/bspeice/capnp_cookbook_1/blob/master/src/main.rs). Our current example is going to build on
the code we wrote there; after the deserialization step, we'll try and send the `point_reader` to a separate thread
for verification.
I'm going to walk through the attempts as I made them and my thinking throughout.
If you want to skip to the final project, check out the code available [here](https://github.com/bspeice/capnp_cookbook_2).
## Attempt 1: Move the reference
As a first attempt, we're going to try and let Rust move the reference. Our code will look something like:
```rust
fn main() {
// ...assume that we own a `buffer: Vec<u8>` containing the binary message content from
// somewhere else
let deserialized = capnp::serialize::read_message(
&mut buffer.as_slice(),
capnp::message::ReaderOptions::new()
).unwrap();
let point_reader = deserialized.get_root::<point_capnp::point::Reader>().unwrap();
// By using `point_reader` inside the new thread, we're hoping that Rust can
// safely move the reference and invalidate the original thread's usage.
// Since the original thread doesn't use `point_reader` again, this should
// be safe, right?
let handle = std::thread::spawn(move || {
assert_eq!(point_reader.get_x(), 12);
assert_eq!(point_reader.get_y(), 14);
});
handle.join().unwrap()
}
```
Well, the Rust compiler doesn't really like this. We get four distinct errors back:
```
error[E0277]: the trait bound `*const u8: std::marker::Send` is not satisfied in `[closure@src/main.rs:31:37: 36:6 point_reader:point_capnp::point::Reader<'_>]`
--> src/main.rs:31:18
|
31 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const u8` cannot be sent between threads safely
|
error[E0277]: the trait bound `*const capnp::private::layout::WirePointer: std::marker::Send` is not satisfied in `[closure@src/main.rs:31:37: 36:6 point_reader:point_capnp::point::Reader<'_>]`
--> src/main.rs:31:18
|
31 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const capnp::private::layout::WirePointer` cannot be sent between threads safely
|
error[E0277]: the trait bound `capnp::private::arena::ReaderArena: std::marker::Sync` is not satisfied
--> src/main.rs:31:18
|
31 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `capnp::private::arena::ReaderArena` cannot be shared between threads safely
|
error[E0277]: the trait bound `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>: std::marker::Send` is not satisfied in `[closure@src/main.rs:31:37: 36:6 point_reader:point_capnp::point::Reader<'_>]`
--> src/main.rs:31:18
|
31 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>` cannot be sent between threads safely
|
error: aborting due to 4 previous errors
```
Note: I've removed the help text for brevity, but suffice it to say that these errors are intimidating.
Pay attention to the text that keeps on getting repeated though: `XYZ cannot be sent between threads safely`.
This is a bit frustrating: we own the `buffer` from which all the content was derived, and we don't have any
unsafe accesses in our code. We guarantee that we wait for the child thread to stop first, so there's no possibility
of the pointer becoming invalid because the original thread exits before the child thread does. So why is Rust
preventing us from doing something that really should be legal?
This is what is known as [fighting the borrow checker](https://doc.rust-lang.org/1.8.0/book/references-and-borrowing.html).
Let our crusade begin.
## Attempt 2: Put the `Reader` in a `Box`
The [`Box`](https://doc.rust-lang.org/std/boxed/struct.Box.html) type allows us to convert a pointer we have
(in our case the `point_reader`) into an "owned" value, which should be easier to send across threads.
Our next attempt looks something like this:
```rust
fn main() {
// ...assume that we own a `buffer: Vec<u8>` containing the binary message content
// from somewhere else
let deserialized = capnp::serialize::read_message(
&mut buffer.as_slice(),
capnp::message::ReaderOptions::new()
).unwrap();
let point_reader = deserialized.get_root::<point_capnp::point::Reader>().unwrap();
let boxed_reader = Box::new(point_reader);
// Now that the reader is `Box`ed, we've proven ownership, and Rust can
// move the ownership to the new thread, right?
let handle = std::thread::spawn(move || {
assert_eq!(boxed_reader.get_x(), 12);
assert_eq!(boxed_reader.get_y(), 14);
});
handle.join().unwrap();
}
```
Spoiler alert: still doesn't work. Same errors still show up.
```
error[E0277]: the trait bound `*const u8: std::marker::Send` is not satisfied in `point_capnp::point::Reader<'_>`
--> src/main.rs:33:18
|
33 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const u8` cannot be sent between threads safely
|
error[E0277]: the trait bound `*const capnp::private::layout::WirePointer: std::marker::Send` is not satisfied in `point_capnp::point::Reader<'_>`
--> src/main.rs:33:18
|
33 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const capnp::private::layout::WirePointer` cannot be sent between threads safely
|
error[E0277]: the trait bound `capnp::private::arena::ReaderArena: std::marker::Sync` is not satisfied
--> src/main.rs:33:18
|
33 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `capnp::private::arena::ReaderArena` cannot be shared between threads safely
|
error[E0277]: the trait bound `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>: std::marker::Send` is not satisfied in `point_capnp::point::Reader<'_>`
--> src/main.rs:33:18
|
33 | let handle = std::thread::spawn(move || {
| ^^^^^^^^^^^^^^^^^^ `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>` cannot be sent between threads safely
|
error: aborting due to 4 previous errors
```
Let's be a little bit smarter about the errors this time though. What is that
[`std::marker::Send`](https://doc.rust-lang.org/std/marker/trait.Send.html) thing the compiler keeps telling us about?
The documentation is pretty clear; `Send` is used to denote:
> Types that can be transferred across thread boundaries.
In our case, we are seeing the error messages for two reasons:
1. Pointers (`*const u8`) are not safe to send across thread boundaries. While we're careful in our code
to wait on the child thread to finish before closing down, the Rust compiler can't make
that assumption, and so it complains that we're not using the pointer in a safe manner.
2. The `point_capnp::point::Reader` type is itself not safe to send across threads because it doesn't
implement the `Send` trait. Which is to say, the things that make up a `Reader` are themselves not thread-safe,
so the `Reader` is also not thread-safe. The short sketch below shows one way to check this directly.
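As an aside, here's a tiny sketch of my own (the `assert_send` helper is not part of the capnp library) that reproduces the same complaint in isolation: a function whose only job is to require `Send` on its argument.

```rust
// Only compiles for arguments whose type implements Send.
fn assert_send<T: Send>(_value: &T) {}

// assert_send(&point_reader); // error: point_capnp::point::Reader<'_> is not Send
// assert_send(&buffer);       // fine: Vec<u8> is Send
```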
So, how are we to actually transfer a parsed Cap'N Proto message between threads?
## Attempt 3: The `TypedReader`
The `TypedReader` is a new API implemented in the Cap'N Proto [Rust code](https://crates.io/crates/capnp/0.8.14).
We're interested in it here for two reasons:
1. It allows us to define an object where the _object_ owns the underlying data. In previous attempts,
the current context owned the data, but the `Reader` itself had no such control.
2. We can compose the `TypedReader` using objects that are safe to `Send` across threads, guaranteeing
that we can transfer parsed messages across threads.
The actual type info for the [`TypedReader`](https://github.com/capnproto/capnproto-rust/blob/f0efc35d7e9bd8f97ca4fdeb7c57fd7ea348e303/src/message.rs#L181)
is a bit complex. And to be honest, I'm still really not sure what the whole point of the
[`PhantomData`](https://doc.rust-lang.org/std/marker/struct.PhantomData.html) thing is either.
My impression is that it lets us enforce type safety when we know what the underlying Cap'N Proto
message represents. That is, technically the only thing we're storing is the untyped binary message;
`PhantomData` just enforces the principle that the binary represents some specific object that has been parsed.
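To make that concrete, here's a toy illustration of the pattern; this is my own sketch, not capnp's actual definition. The struct stores nothing but untyped bytes, while the `PhantomData` marker records which schema type those bytes are supposed to decode into.

```rust
use std::marker::PhantomData;

struct TypedBytes<T> {
    // The only real data: an untyped binary message.
    raw: Vec<u8>,
    // Zero-sized marker tying these bytes to a schema type at compile time.
    _schema: PhantomData<T>,
}

impl<T> TypedBytes<T> {
    fn new(raw: Vec<u8>) -> TypedBytes<T> {
        TypedBytes { raw: raw, _schema: PhantomData }
    }
}
```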
Either way, we can carefully construct something which is safe to move between threads:
```rust
fn main() {
// ...assume that we own a `buffer: Vec<u8>` containing the binary message content from somewhere else
let deserialized = capnp::serialize::read_message(
&mut buffer.as_slice(),
capnp::message::ReaderOptions::new()
).unwrap();
let point_reader: capnp::message::TypedReader<capnp::serialize::OwnedSegments, point_capnp::point::Owned> =
capnp::message::TypedReader::new(deserialized);
// Because the point_reader is now working with OwnedSegments (which are owned vectors) and an Owned message
// (which is 'static lifetime), this is now safe
let handle = std::thread::spawn(move || {
// The point_reader owns its data, and we use .get() to retrieve the actual point_capnp::point::Reader
// object from it
let point_root = point_reader.get().unwrap();
assert_eq!(point_root.get_x(), 12);
assert_eq!(point_root.get_y(), 14);
});
handle.join().unwrap();
}
```
And while we've left Rust to do the dirty work of actually moving the `point_reader` into the new thread,
we could also use things like [`mpsc` channels](https://doc.rust-lang.org/std/sync/mpsc/index.html) to achieve a similar effect.
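To make that channel idea concrete, here's a rough sketch of my own (the `send_over_channel` name is just for illustration; it reuses the `buffer` and the same types assumed above) that hands the `TypedReader` to another thread through an `mpsc` channel instead of moving it straight into the closure:

```rust
use std::sync::mpsc;

fn send_over_channel(buffer: Vec<u8>) {
    let deserialized = capnp::serialize::read_message(
        &mut buffer.as_slice(),
        capnp::message::ReaderOptions::new()
    ).unwrap();

    let typed_reader: capnp::message::TypedReader<capnp::serialize::OwnedSegments, point_capnp::point::Owned> =
        capnp::message::TypedReader::new(deserialized);

    // The channel only accepts payload types that implement Send,
    // which the TypedReader satisfies.
    let (tx, rx) = mpsc::channel();
    tx.send(typed_reader).unwrap();

    let handle = std::thread::spawn(move || {
        let received = rx.recv().unwrap();
        let point_root = received.get().unwrap();
        assert_eq!(point_root.get_x(), 12);
        assert_eq!(point_root.get_y(), 14);
    });
    handle.join().unwrap();
}
```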
So now we're able to define basic Cap'N Proto messages, and send them all around our programs.
## Next steps:
[Part 1: Setting up a basic Cap'N Proto Rust project](http://bspeice.github.io/captains-cookbook-part-1.html)