Add new tweet like me post

This commit is contained in:
Bradlee Speice 2016-03-28 15:53:22 -04:00
parent f60cb5f8ef
commit 723eafd6d2
12 changed files with 2130 additions and 2 deletions

@ -82,6 +82,8 @@
<div class="container content archive">
<h2><a href="https://bspeice.github.io/archives.html"></a></h2>
<dl class="dl-horizontal">
<dt>Mon 28 March 2016</dt>
<dd><a href="https://bspeice.github.io/tweet-like-me.html">Tweet Like Me</a></dd>
<dt>Sat 05 March 2016</dt>
<dd><a href="https://bspeice.github.io/predicting-santander-customer-happiness.html">Predicting Santander Customer Happiness</a></dd>
<dt>Fri 26 February 2016</dt>

@ -82,6 +82,8 @@
<div class="container content archive">
<h2><a href="https://bspeice.github.io/author/bradlee-speice.html"></a></h2>
<dl class="dl-horizontal">
<dt>Mon 28 March 2016</dt>
<dd><a href="https://bspeice.github.io/tweet-like-me.html">Tweet Like Me</a></dd>
<dt>Sat 05 March 2016</dt>
<dd><a href="https://bspeice.github.io/predicting-santander-customer-happiness.html">Predicting Santander Customer Happiness</a></dd>
<dt>Fri 26 February 2016</dt>

@ -82,6 +82,8 @@
<div class="container content archive">
<h2><a href="https://bspeice.github.io/author/bradlee-speice.html">Bradlee Speice</a></h2>
<dl class="dl-horizontal">
<dt>Mon 28 March 2016</dt>
<dd><a href="https://bspeice.github.io/tweet-like-me.html">Tweet Like Me</a></dd>
<dt>Sat 05 March 2016</dt>
<dd><a href="https://bspeice.github.io/predicting-santander-customer-happiness.html">Predicting Santander Customer Happiness</a></dd>
<dt>Fri 26 February 2016</dt>

@ -82,6 +82,8 @@
<div class="container content archive">
<h2><a href="https://bspeice.github.io/category/blog.html">Blog</a></h2>
<dl class="dl-horizontal">
<dt>Mon 28 March 2016</dt>
<dd><a href="https://bspeice.github.io/tweet-like-me.html">Tweet Like Me</a></dd>
<dt>Sat 05 March 2016</dt>
<dd><a href="https://bspeice.github.io/predicting-santander-customer-happiness.html">Predicting Santander Customer Happiness</a></dd>
<dt>Fri 26 February 2016</dt>

@ -83,6 +83,8 @@
<div class="container content archive">
<h2><a href="https://bspeice.github.io/category/blog.html">Blog</a></h2>
<dl class="dl-horizontal">
<dt>Mon 28 March 2016</dt>
<dd><a href="https://bspeice.github.io/tweet-like-me.html">Tweet Like Me</a></dd>
<dt>Sat 05 March 2016</dt>
<dd><a href="https://bspeice.github.io/predicting-santander-customer-happiness.html">Predicting Santander Customer Happiness</a></dd>
<dt>Fri 26 February 2016</dt>

@ -1,5 +1,580 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Bradlee Speice</title><link href="https://bspeice.github.io/" rel="alternate"></link><link href="https://bspeice.github.io/feeds/all.atom.xml" rel="self"></link><id>https://bspeice.github.io/</id><updated>2016-03-05T00:00:00-05:00</updated><entry><title>Predicting Santander Customer Happiness</title><link href="https://bspeice.github.io/predicting-santander-customer-happiness.html" rel="alternate"></link><updated>2016-03-05T00:00:00-05:00</updated><author><name>Bradlee Speice</name></author><id>tag:bspeice.github.io,2016-03-05:predicting-santander-customer-happiness.html</id><summary type="html">&lt;p&gt;
<feed xmlns="http://www.w3.org/2005/Atom"><title>Bradlee Speice</title><link href="https://bspeice.github.io/" rel="alternate"></link><link href="https://bspeice.github.io/feeds/all.atom.xml" rel="self"></link><id>https://bspeice.github.io/</id><updated>2016-03-28T00:00:00-04:00</updated><entry><title>Tweet Like Me</title><link href="https://bspeice.github.io/tweet-like-me.html" rel="alternate"></link><updated>2016-03-28T00:00:00-04:00</updated><author><name>Bradlee Speice</name></author><id>tag:bspeice.github.io,2016-03-28:tweet-like-me.html</id><summary type="html">&lt;p&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;An experiment in creating a robot that will imitate me on Twitter.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;So, I'm taking a Machine Learning course this semester in school, and one of the topics we keep coming back to is natural language processing and the 'bag of words' data structure. That is, given a sentence:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;How much wood would a woodchuck chuck if a woodchuck could chuck wood?&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;We can represent that sentence as the following list:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;{
How: 1
much: 1
wood: 2
would: 1
a: 2
woodchuck: 2
chuck: 2
if: 1
could: 1
}&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Ignoring &lt;em&gt;where&lt;/em&gt; the words happened, we're just interested in how &lt;em&gt;often&lt;/em&gt; the words occurred. That got me thinking: I wonder what would happen if I built a robot that just imitated how often I said things? It's dangerous territory when computer scientists ask "what if," but I got curious enough I wanted to follow through.&lt;/p&gt;
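&lt;p&gt;As a minimal sketch of the idea (not part of the original notebook), Python's &lt;code&gt;collections.Counter&lt;/code&gt; builds a bag of words directly. The whitespace tokenization below is an assumption made for illustration; a real tokenizer would treat punctuation more carefully.&lt;/p&gt;

```python
from collections import Counter

sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

# Naive tokenization for illustration: drop the trailing '?' and split on whitespace.
words = sentence.rstrip("?").split()

# A bag of words keeps only how often each word occurs, not where it occurs.
bag = Counter(words)

print(bag["wood"], bag["would"], bag["could"])  # prints: 2 1 1
```

&lt;p&gt;The counts are case-sensitive, so &lt;code&gt;How&lt;/code&gt; and &lt;code&gt;how&lt;/code&gt; would be stored as separate entries unless the text is lowercased first.&lt;/p&gt;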
&lt;h2 id="The-Objective"&gt;The Objective&lt;a class="anchor-link" href="#The-Objective"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Given an input list of Tweets, build up the following things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The distribution of starting words; since there are no "prior" words to go from, we need to treat this as a special case.&lt;/li&gt;
&lt;li&gt;The distribution of words given a previous word; for example, every time I use the word &lt;code&gt;woodchuck&lt;/code&gt; in the example sentence, there is a 50% chance it is followed by &lt;code&gt;chuck&lt;/code&gt; and a 50% chance it is followed by &lt;code&gt;could&lt;/code&gt;. I need this distribution for all words.&lt;/li&gt;
&lt;li&gt;The distribution of the number of hashtags per tweet: do I most often use just one? Two? Do they follow something like a Poisson distribution?&lt;/li&gt;
&lt;li&gt;The distribution of hashtags: given a number of hashtags, what is the actual content? I'll treat hashtags as separate from the content of a tweet.&lt;/li&gt;
&lt;/ol&gt;
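&lt;p&gt;To make the second item concrete, here is a small sketch that computes the conditional distribution for the woodchuck sentence. The names &lt;code&gt;followers&lt;/code&gt; and &lt;code&gt;cond_dist&lt;/code&gt; are hypothetical and chosen for this illustration only; the approach (count adjacent word pairs, then normalize) is one reasonable way to do it, not necessarily the one used later in this post.&lt;/p&gt;

```python
from collections import Counter, defaultdict

words = "How much wood would a woodchuck chuck if a woodchuck could chuck wood".split()

# For every word, count which words come directly after it.
followers = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    followers[current][following] += 1

# Normalize the counts into conditional probabilities P(next word | current word).
cond_dist = {
    current: {nxt: count / sum(counts.values()) for nxt, count in counts.items()}
    for current, counts in followers.items()
}

print(cond_dist["woodchuck"])  # prints: {'chuck': 0.5, 'could': 0.5}
```

&lt;p&gt;Because each adjacent pair is visited exactly once, this runs in a single pass over the words.&lt;/p&gt;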
&lt;h2 id="The-Data"&gt;The Data&lt;a class="anchor-link" href="#The-Data"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I'm using as input my tweet history. I don't really use Twitter anymore, but it seems like a fun use of the dataset. I'd like to eventually build this to a point where I can imitate anyone on Twitter using their last 100 tweets or so, but I'll start with this as example code.&lt;/p&gt;
&lt;h2 id="The-Algorithm"&gt;The Algorithm&lt;a class="anchor-link" href="#The-Algorithm"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I'll be using the &lt;a href="http://www.nltk.org/"&gt;NLTK&lt;/a&gt; library for doing a lot of the heavy lifting. First, let's import the data:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[1]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;tweets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tweets.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="c1"&gt;# Don&amp;#39;t include tweets in reply to or mentioning people&lt;/span&gt;
&lt;span class="n"&gt;replies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;@&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text_norep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;replies&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;And now that we've got data, let's start crunching. First, tokenize and build out the distribution of first word:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[2]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;nltk.tokenize&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TweetTokenizer&lt;/span&gt;
&lt;span class="n"&gt;tknzr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TweetTokenizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_norep&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tknzr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;first_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;first_words_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;first_words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;first_words&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isalpha&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="n"&gt;first_word_dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;first_words_alpha&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_words_alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;Next, we need to build out the conditional distributions. That is, what is the probability of the next word given the current word is $X$? This one is a bit more involved. First, find all unique words, and then find what words follow them. This can probably be done in a more efficient manner than I'm currently doing here, but we'll ignore that for the moment.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[3]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;functools&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;
&lt;span class="c1"&gt;# Get all possible words&lt;/span&gt;
&lt;span class="n"&gt;all_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="n"&gt;unique_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;actual_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_words&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;word_dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_words&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;proceeding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;all_words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proceeding&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;Now that we've got the tweet analysis done, it's time for the fun part: hashtags! Let's count how many hashtags are in each tweet to get a sense of the distribution.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[4]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="k"&gt;matplotlib&lt;/span&gt; inline
&lt;span class="n"&gt;hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_norep&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hist&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_wrapper"&gt;
&lt;div class="output"&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt output_prompt"&gt;Out[4]:&lt;/div&gt;
&lt;div class="output_text output_subarea output_execute_result"&gt;
&lt;pre&gt;&amp;lt;matplotlib.axes._subplots.AxesSubplot at 0x18e59dc28d0&amp;gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt"&gt;&lt;/div&gt;
&lt;div class="output_png output_subarea "&gt;
&lt;img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAYkAAAEACAYAAABGYoqtAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAAEe1JREFUeJzt3X+s3XV9x/HnCzqRinadjt6NosA0CGYOUSsJM7tmG4pG
YFuGuGkEMmOCTheThZZsazXZBOOcbguJUWYqw7CCIpi5UQi7Li5KmYKixdpkFrHQC1MHogQB3/vj
fGsP9X7KObf33HNu7/ORnPT7/dzv95x3v/32vO7n8/2VqkKSpLkcNu4CJEmTy5CQJDUZEpKkJkNC
ktRkSEiSmgwJSVLTyEMiya4kX01ye5JtXdvqJFuT7EhyY5JVfctvSLIzyV1Jzhh1fZKktsXoSfwU
mK6ql1TVuq5tPXBzVZ0I3AJsAEhyMnAucBJwJnB5kixCjZKkOSxGSGSOzzkb2NxNbwbO6abPAq6u
qserahewE1iHJGksFiMkCrgpyW1J/qRrW1NVswBVtQc4ums/Brinb93dXZskaQxWLMJnnF5V9yX5
ZWBrkh30gqOf9waRpAk08pCoqvu6Px9I8hl6w0ezSdZU1WySKeD+bvHdwLF9q6/t2p4kiaEiSfNQ
VUMd5x3pcFOSlUmO6qafAZwB3AncAJzfLfYW4Ppu+gbgvCRPS3I88Hxg21zvXVW+qti4cePYa5iU
l9vCbeG2OPBrPkbdk1gDXNf95r8CuKqqtib5b2BLkguBu+md0URVbU+yBdgOPAZcVPP9m0mSDtpI
Q6Kqvg2cMkf794HfaazzPuB9o6xLkjQYr7he4qanp8ddwsRwW+zjttjHbXFwshRHc5I4CiVJQ0pC
TdKBa0nS0mZISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktS0rEJiauo4kgz1mpo6
btxlS9LYLKvbcvQelz3sepn3LXYlaZJ4Ww5J0oIyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKa
DAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQ
kCQ1GRKSpCZDQpLUZEhIkpoWJSSSHJbkK0lu6OZXJ9maZEeSG5Os6lt2Q5KdSe5KcsZi1CdJmtti
9STeBWzvm18P3FxVJwK3ABsAkpwMnAucBJwJXJ4ki1SjJGk/Iw+JJGuB1wIf62s+G9jcTW8Gzumm
zwKurqrHq2oXsBNYN+oaJUlzW4yexN8Bfw5UX9uaqpoFqKo9wNFd+zHAPX3L7e7aJEljsGKUb57k
dcBsVd2RZPoAi9YBfjanTZs2/Wx6enqa6ekDvb0kLT8zMzPMzMwc1Hukaujv58HfPPkb4E3A48CR
wDOB64CXAdNVNZtkCviPqjopyXqgquqybv1/BzZW1a37vW/Np+7e4Y1h1wuj3EaStFiSUFVDHecd
6XBTVV1SVc+tqhOA84BbqurNwGeB87vF3gJc303fAJyX5GlJjgeeD2wbZY2SpLaRDjcdwKXAliQX
AnfTO6OJqtqeZAu9M6EeAy6aV5dBkrQgRjrcNCoON0nS8CZuuEmStLQZEpKkJkNCktRkSEiSmgwJ
SVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAk
NY3r8aUH7b3vfe9Qy69cuXJElUjSoWvJPr4U/nKodY444goeffRefHyppOVqPo8vXcIhMVzdq1ad
xoMP3oohIWm58hnXkqQFZUhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKa
DAlJUpMhIUlqMiQkSU2GhCSpyZCQJDWNNCSSHJHk1iS3J7kzycaufXWSrUl2JLkxyaq+dTYk2Znk
riRnjLI+SdKBjTQkqupR4FVV9RLgFODMJOuA9cDNVXUicAuwASDJycC5wEnAmcDlSYZ6QIYkaeGM
fLipqn7cTR5B75naBZwNbO7aNwPndNNnAVdX1eNVtQvYCawbdY2SpLkNFBJJfn2+H5DksCS3A3uA
m6rqNmBNVc0CVNUe4Ohu8WOAe/pW3921SZLGYNCexOVJtiW5qP/4wSCq6qfdcNNaYF2SF/HzD5r2
IdKSNIFWDLJQVb0yyQuAC4EvJ9kGfLyqbhr0g6rqoSQzwGuA2SRrqmo2yRRwf7fYbuDYvtXWdm1z
2NQ3Pd29JEl7zczMMDMzc1DvkarBf4lPcji94wd/DzwEBLikqj7dWP45wGNV9WCSI4EbgUuB3wK+
X1WXJbkYWF1V67sD11cBr6A3zHQT8ILar8gkNWznY9Wq03jwwVsZvtMShtlGkjSpklBVQ50MNFBP
IsmLgQuA19H74n59VX0lya8CXwTmDAngV4DNSQ6jN7T1L1X1uSRfArYkuRC4m94ZTVTV9iRbgO3A
Y8BF+weEJGnxDNSTSPJ54GPAtVX1yH4/e3NVXTmi+lr12JOQpCGNrCdBrwfxSFU90X3QYcDTq+rH
ix0QkqTFM+jZTTcDR/bNr+zaJEmHsEFD4ulV9fDemW565WhKkiRNikFD4kdJTt07k+SlwCMHWF6S
dAgY9JjEnwHXJLmX3mmvU8AbRlaVJGkiDHox3W1JXgic2DXtqKrHRleWJGkSDNqTAHg5cFy3zqnd
qVSfGElVkqSJMOjFdFcCvwbcATzRNRdgSEjSIWzQnsTLgJO9+lmSlpdBz276Or2D1ZKkZWTQnsRz
gO3d3V8f3dtYVWeNpCpJ0kQYNCQ2jbIISdJkGvQU2M8neR6923bfnGQlcPhoS5Mkjdugjy99K3At
8JGu6RjgM6MqSpI0GQY9cP124HR6Dxqiqnay77nUkqRD1KAh8WhV/WTvTJIV+FxqSTrkDRoSn09y
CXBkkt8FrgE+O7qyJEmTYNCQWA88ANwJvA34HPAXoypKkjQZBnp86aTx8aWSNLyRPb40ybeZ49u1
qk4Y5sMkSUvLMPdu2uvpwB8Cv7Tw5UiSJslAxySq6nt9r91V9SHgdSOuTZI0ZoMON53aN3sYvZ7F
MM+ikCQtQYN+0f9t3/TjwC7g3AWvRpI0UQa9d9OrRl2IJGnyDDrc9O4D/byqPrgw5UiSJskwZze9
HLihm389sA3YOYqiJEmTYdCQWAucWlU/BEiyCfjXqnrTqAqTJI3foLflWAP8pG/+J12bJOkQNmhP
4hPAtiTXdfPnAJtHU5IkaVIMenbTXyf5N+CVXdMFVXX76MqSJE2CQYebAFYCD1XVh4HvJjl+RDVJ
kibEoI8v3QhcDGzomn4B+OdRFSVJmgyD9iR+DzgL+BFAVd0LPHNURUmSJsOgIfGT6j1UoQCSPGN0
JUmSJsWgIbElyUeAX0zyVuBm4KOjK0uSNAkGvVX4B4BrgU8BJwJ/VVX/8FTrJVmb5JYk30hyZ5J3
du2rk2xNsiPJjUlW9a2zIcnOJHclOWN+fy1J0kJ4yseXJjkcuHk+N/lLMgVMVdUdSY4CvgycDVwA
fK+q3p/kYmB1Va1PcjJwFb1bgKyl12N5Qe1XpI8vlaThzefxpU/Zk6iqJ4Cf9v+2P6iq2lNVd3TT
DwN30fvyP5t9F+NtpndxHvQOjl9dVY9X1S5694ZaN+znSpIWxqBXXD8M3JnkJroznACq6p2DflCS
44BTgC8Ba6pqtnuPPUmO7hY7Bvhi32q7uzZJ0hgMGhKf7l7z0g01XQu8q6oe7g0XPYnjOZI0gQ4Y
EkmeW1Xfqap536cpyQp6AXFlVV3fNc8mWVNVs91xi/u79t3AsX2rr+3a5rCpb3q6e0mS9pqZmWFm
Zuag3uOAB66TfKWqTu2mP1VVfzD0BySfAP63qt7d13YZ8P2quqxx4PoV9IaZbsID15K0IOZz4Pqp
hpv63+yEeRR0OvDH9I5n3E7vG/oS4DJ6115cCNxN97zsqtqeZAuwHXgMuGj/gJAkLZ6nColqTA+k
qv4LOLzx499prPM+4H3DfpYkaeE9VUj8RpKH6PUojuym6earqp410uokSWN1wJCoqlYvQJK0DAzz
PAlJ0jJjSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoy
JCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNC
ktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRJP6QiSDP2amjpu3IVL0kFbMe4CJt+jQA29
1uxsFr4USVpk9iQkSU0jDYkkVySZTfK1vrbVSbYm2ZHkxiSr+n62IcnOJHclOWOUtUmSntqoexIf
B169X9t64OaqOhG4BdgAkORk4FzgJOBM4PIkjtlI0hiNNCSq6gvAD/ZrPhvY3E1vBs7pps8Crq6q
x6tqF7ATWDfK+iRJBzaOYxJHV9UsQFXtAY7u2o8B7ulbbnfXJkkak0k4u2n4U4cA2NQ3Pd29JEl7
zczMMDMzc1DvMY6QmE2ypqpmk0wB93ftu4Fj+5Zb27U1bBpVfZJ0SJienmZ6evpn8+95z3uGfo/F
GG5K99rrBuD8bvotwPV97ecleVqS44HnA9sWoT5JUsNIexJJPklvHOjZSb4DbAQuBa5JciFwN70z
mqiq7Um2ANuBx4CLqmqeQ1GSpIWQpfg9nKSGPZSxatVpPPjgrQx/CCTzWKe33lLctpIOXUmoqqEu
LfCKa0lSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSp
yZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoM
CUlSkyExMkeQZKjX1NRx4y5akp5kxbgLOHQ9CtRQa8zOZjSlSNI82ZOQJDUZEpKkJkNCktRkSEiS
mgwJSVKTISFJajIkJElNhoQkqWkiQyLJa5J8M8m3klw87nokabmauJBIchjwj8CrgRcBb0zywvFW
NblmZmbGXcLEcFvs47bYx21xcCYuJIB1wM6quruqHgOuBs4ec00Ty/8A+7gt9nFb7OO2ODiTGBLH
APf0zX+3a9McPvCBDw19I0FvJihpUEv2Bn/Petbrh1r+kUe+OaJKFlLvzrHDG+5GguDNBKVRmJo6
jtnZu4daZ82a57Fnz67RFLQAUjX8F8woJTkN2FRVr+nm1wNVVZf1LTNZRUvSElFVQ/2GOIkhcTiw
A/ht4D5gG/DGqrprrIVJ0jI0ccNNVfVEkncAW+kdM7nCgJCk8Zi4noQkaXJM4tlNB+SFdvsk2ZXk
q0luT7Jt3PUspiRXJJlN8rW+ttVJtibZkeTGJKvGWeNiaWyLjUm+m+Qr3es146xxsSRZm+SWJN9I
cmeSd3bty2rfmGM7/GnXPvR+saR6Et2Fdt+id7ziXuA24LyqWgqnLi24JP8DvLSqfjDuWhZbkt8E
HgY+UVUv7touA75XVe/vfoFYXVXrx1nnYmhsi43AD6vqg2MtbpElmQKmquqOJEcBX6Z3ndUFLKN9
4wDb4Q0MuV8stZ6EF9o9WVh6/4YLoqq+AOwfjmcDm7vpzcA5i1rUmDS2BfT2j2WlqvZU1R3d9MPA
XcBaltm+0dgOe683G2q/WGpfMF5o92QF3JTktiRvHXcxE+DoqpqF3n8S4Ogx1zNu70hyR5KPHerD
K3NJchxwCvAlYM1y3Tf6tsOtXdNQ+8VSCwk92elVdSrwWuDt3bCD9lk6Y6kL73LghKo6BdgDLLdh
p6OAa4F3db9J778vLIt9Y47tMPR+sdRCYjfw3L75tV3bslRV93V/PgBcR284bjmbTbIGfjYme/+Y
6xmbqnqg9h1w/Cjw8nHWs5iSrKD3xXhlVV3fNS+7fWOu7TCf/WKphcRtwPOTPC/J04DzgBvGXNNY
JFnZ/ZZAkmcAZwBfH29Viy48eXz1BuD8bvotwPX7r3AIe9K26L4I9/p9lte+8U/A9qr6cF/bctw3
fm47zGe/WFJnN0HvFFjgw+y70O7SMZc0FkmOp9d7KHoXRV61nLZFkk8C08CzgVlgI/AZ4BrgWOBu
4Nyq+r9x1bhYGtviVfTGoX8K7ALetndM/lCW5HTgP4E76f3fKOASendu2MIy2TcOsB3+iCH3iyUX
EpKkxbPUhpskSYvIkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU3/DzepYDZSwMuQAAAA
AElFTkSuQmCC
"
&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;That looks like a Poisson distribution, kind of as I expected. I'm guessing my number of hashtags per tweet is $\sim Poi(1)$, but let's actually find the &lt;a href="https://en.wikipedia.org/wiki/Poisson_distribution#Maximum_likelihood"&gt;maximum likelihood estimator&lt;/a&gt;, which in this case is just the sample mean, $\hat{\lambda} = \bar{x}$:&lt;/p&gt;
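&lt;p&gt;As a rough way to check a Poisson fit, one can compare the observed frequency of each count against the Poisson probability at the estimated rate. The sketch below uses a small made-up list of hashtag counts (an assumption, standing in for the real data) and a hypothetical helper &lt;code&gt;poisson_pmf&lt;/code&gt; written with the standard library only.&lt;/p&gt;

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Hypothetical hashtag counts per tweet, standing in for the real data.
counts = [0, 1, 0, 2, 1, 0, 3, 1, 0, 1]

# For a Poisson model, the maximum likelihood estimate of the rate is the sample mean.
lam_hat = sum(counts) / len(counts)

# Compare observed frequencies with what the fitted Poisson predicts.
for k in range(4):
    observed = counts.count(k) / len(counts)
    print(k, round(observed, 2), round(poisson_pmf(k, lam_hat), 2))
```

&lt;p&gt;If the two columns are close for every count, the Poisson assumption is at least plausible for that data.&lt;/p&gt;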
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[5]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;mle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mle&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_wrapper"&gt;
&lt;div class="output"&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt output_prompt"&gt;Out[5]:&lt;/div&gt;
&lt;div class="output_text output_subarea output_execute_result"&gt;
&lt;pre&gt;0.870236869207003&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;Pretty close! So we can now simulate how many hashtags are in a tweet. Let's also find what hashtags are actually used:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[6]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_words&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;n_hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;unique_hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_words&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;hashtag_dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hashtags&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;unique_hashtags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;prob&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;all_words&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_hashtags&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_hashtags&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_wrapper"&gt;
&lt;div class="output"&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt output_prompt"&gt;Out[6]:&lt;/div&gt;
&lt;div class="output_text output_subarea output_execute_result"&gt;
&lt;pre&gt;603&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;Turns out I have used 603 different hashtags during my time on Twitter. That means I was using a unique hashtag for about every third tweet.&lt;/p&gt;
&lt;p&gt;In better news though, we now have all the data we need to go about actually constructing tweets! The process will happen in a few steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Randomly select what the first word will be.&lt;/li&gt;
&lt;li&gt;Randomly select the number of hashtags for this tweet, and then select the actual hashtags.&lt;/li&gt;
&lt;li&gt;Fill in the remaining space of 140 characters with random words taken from my tweets.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;And hopefully, we won't have anything too crazy come out the other end. The way we do the selection follows a &lt;a href="https://en.wikipedia.org/wiki/Multinomial_distribution"&gt;Multinomial Distribution&lt;/a&gt; with a single trial: given a set of values, each with its own probability, pick one. Let's give a quick example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;x: .33
y: .5
z: .17&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That is, I pick &lt;code&gt;x&lt;/code&gt; with probability 33%, &lt;code&gt;y&lt;/code&gt; with probability 50%, and so on. In the context of our sentence construction, I've already built out the probabilities of specific words - now I just need to simulate that distribution. Time for the engine to actually be developed!&lt;/p&gt;
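&lt;p&gt;To make the quick example concrete, here is a minimal sketch of drawing values according to those probabilities. It uses the standard library's &lt;code&gt;random.choices&lt;/code&gt; as a stand-in rather than NumPy, and the helper name &lt;code&gt;draw&lt;/code&gt; is an assumption made up for this illustration.&lt;/p&gt;

```python
import random

# The three values and probabilities from the quick example.
values = ["x", "y", "z"]
probs = [0.33, 0.5, 0.17]

def draw(values, probs, n, rng):
    # Hypothetical helper: pick n values, each chosen independently
    # with the given probabilities (weights are normalized internally).
    return rng.choices(values, weights=probs, k=n)

rng = random.Random(42)  # seeded only so that runs are repeatable
sample = draw(values, probs, 10, rng)
print(sample)
```

&lt;p&gt;Over many draws, roughly half of the picks should come out as &lt;code&gt;y&lt;/code&gt;, about a third as &lt;code&gt;x&lt;/code&gt;, and the rest as &lt;code&gt;z&lt;/code&gt;.&lt;/p&gt;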
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[7]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multinom_sim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;occurrences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;multinomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;occurrences&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;vals&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sim_n_hashtags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtag_freq&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;poisson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtag_freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sim_hashtags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;multinom_sim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sim_first_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_word_dist&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_word_dist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;multinom_sim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_word_dist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;index&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;multinom_sim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;h2 id="Pulling-it-all-together"&gt;Pulling it all together&lt;a class="anchor-link" href="#Pulling-it-all-together"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I've now built out all the code I need to actually simulate a sentence written by me. Let's try doing an example with five words and a single hashtag:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[8]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_first_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;second&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;third&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fourth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;third&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fifth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fourth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;hashtag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_hashtags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;third&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fourth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fifth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_wrapper"&gt;
&lt;div class="output"&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt output_prompt"&gt;Out[8]:&lt;/div&gt;
&lt;div class="output_text output_subarea output_execute_result"&gt;
&lt;pre&gt;&amp;apos;My first all-nighter of friends #oldschool&amp;apos;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;Let's go ahead and put everything together! We're going to simulate a first word, simulate the hashtags, and then keep simulating words to fill the gap until we've either used up all the space or reached sentence-ending punctuation (a period or an exclamation mark).&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[9]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;simulate_tweet&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="n"&gt;chars_remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;140&lt;/span&gt;
&lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_first_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;n_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_n_hashtags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_hashtags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chars_remaining&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;
&lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;chars_remaining&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;!&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;
&lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;h2 id="The-results"&gt;The results&lt;a class="anchor-link" href="#The-results"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;And now for something completely different: twenty random tweets dreamed up by my computer and my Twitter data. Here you go:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[12]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;simulate_tweet&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_wrapper"&gt;
&lt;div class="output"&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt"&gt;&lt;/div&gt;
&lt;div class="output_subarea output_stream output_stdout output_text"&gt;
&lt;pre&gt;Also , I&amp;apos;m at 8 this morning. #thursdaysgohard #ornot
Turns out of us breathe the code will want to my undergraduate career is becoming more night trying ? Religion is now as a chane #HYPE
You know what recursion is to review the UNCC. #ornot
There are really sore 3 bonfires in my first writing the library ground floor if awesome. #realtalk #impressed
So we can make it out there&amp;apos;s nothing but I&amp;apos;m not let us so hot I could think I may be good. #SwingDance
Happy Christmas , at Harris Teeter to be be godly or Roman Catholic ). #4b392b#4b392b #Isaiah26
For context , I in the most decisive factor of the same for homework. #accomplishment
Freaking done. #loveyouall
New blog post : Don&amp;apos;t jump in a quiz in with a knife fight. #haskell #earlybirthday
God shows me legitimately want to get some food and one day.
Stormed the queen city. #mindblown
The day of a cold at least outside right before the semester ..
Finished with the way back. #winners
Waking up , OJ , I feel like Nick Jonas today.
First draft of so hard drive. #humansvszombies
Eric Whitacre is the wise creation.
Ethics paper first , music in close to everyone who just be posting up with my sin , and Jerry Springr #TheLittleThings
Love that you know enough time I&amp;apos;ve eaten at 8 PM. #deepthoughts #stillblownaway
Lead. #ThinkingTooMuch #Christmas
Aamazing conference when you married #DepartmentOfRedundancyDepartment Yep , but there&amp;apos;s a legitimate challenge.
&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;...Which all ended up being a whole lot more nonsensical than I had hoped for. There are some good ones, so I'll call that an accomplishment! I was banking on grammar not being an issue: since my tweets use impeccable grammar, the program modeled off them should have pretty good grammar as well. There are going to be some hilarious edge cases (I'm looking at you, &lt;code&gt;Ethics paper first, music in close to everyone&lt;/code&gt;) that make no sense, and some hilarious edge cases (&lt;code&gt;Waking up, OJ, I feel like Nick Jonas today&lt;/code&gt;) that make me feel like I should have a Twitter rap career. On the whole though, the structure came out alright.&lt;/p&gt;
&lt;h2 id="Moving-on-from-here"&gt;Moving on from here&lt;a class="anchor-link" href="#Moving-on-from-here"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;During class we also talked about an interesting idea: trying to analyze corporate documents and corporate speech. I'd be interested to know what this analysis applied to something like a couple of bank press releases could do. By any means, the code needs some work to clean it up before I get that far.&lt;/p&gt;
&lt;h2 id="For-further-reading"&gt;For further reading&lt;a class="anchor-link" href="#For-further-reading"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I'm pretty confident I re-invented a couple wheels along the way - what I'm doing feels a lot like what &lt;a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo"&gt;Markov Chain Monte Carlo&lt;/a&gt; is intended to do. But I've never worked explicitly with that before, so more research is needed.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/p&gt;
&lt;script type="text/x-mathjax-config"&gt;
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
&lt;/script&gt;
&lt;script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'&gt;&lt;/script&gt;</summary><category term="twitter"></category><category term="MCMC"></category></entry><entry><title>Predicting Santander Customer Happiness</title><link href="https://bspeice.github.io/predicting-santander-customer-happiness.html" rel="alternate"></link><updated>2016-03-05T00:00:00-05:00</updated><author><name>Bradlee Speice</name></author><id>tag:bspeice.github.io,2016-03-05:predicting-santander-customer-happiness.html</id><summary type="html">&lt;p&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;

View File

@ -1,5 +1,580 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Bradlee Speice</title><link href="https://bspeice.github.io/" rel="alternate"></link><link href="https://bspeice.github.io/feeds/blog.atom.xml" rel="self"></link><id>https://bspeice.github.io/</id><updated>2016-03-05T00:00:00-05:00</updated><entry><title>Predicting Santander Customer Happiness</title><link href="https://bspeice.github.io/predicting-santander-customer-happiness.html" rel="alternate"></link><updated>2016-03-05T00:00:00-05:00</updated><author><name>Bradlee Speice</name></author><id>tag:bspeice.github.io,2016-03-05:predicting-santander-customer-happiness.html</id><summary type="html">&lt;p&gt;
<feed xmlns="http://www.w3.org/2005/Atom"><title>Bradlee Speice</title><link href="https://bspeice.github.io/" rel="alternate"></link><link href="https://bspeice.github.io/feeds/blog.atom.xml" rel="self"></link><id>https://bspeice.github.io/</id><updated>2016-03-28T00:00:00-04:00</updated><entry><title>Tweet Like Me</title><link href="https://bspeice.github.io/tweet-like-me.html" rel="alternate"></link><updated>2016-03-28T00:00:00-04:00</updated><author><name>Bradlee Speice</name></author><id>tag:bspeice.github.io,2016-03-28:tweet-like-me.html</id><summary type="html">&lt;p&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;An experiment in creating a robot that will imitate me on Twitter.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;So, I'm taking a Machine Learning course this semester in school, and one of the topics we keep coming back to is natural language processing and the 'bag of words' data structure. That is, given a sentence:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;How much wood would a woodchuck chuck if a woodchuck could chuck wood?&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;We can represent that sentence as the following list:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;{
How: 1
much: 1
wood: 2
would: 1
a: 2
woodchuck: 2
chuck: 2
if: 1
could: 1
}&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Ignoring &lt;em&gt;where&lt;/em&gt; the words happened, we're just interested in how &lt;em&gt;often&lt;/em&gt; the words occurred. That got me thinking: I wonder what would happen if I built a robot that just imitated how often I said things? It's dangerous territory when computer scientists ask "what if," but I got curious enough I wanted to follow through.&lt;/p&gt;
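&lt;p&gt;As a rough illustration of the bag-of-words idea, here is a minimal sketch using only the Python standard library. The regular-expression tokenizer and the name &lt;code&gt;bag_of_words&lt;/code&gt; are assumptions made up for this example; they are not part of the code used later in the post.&lt;/p&gt;

```python
import re
from collections import Counter

def bag_of_words(sentence):
    # Hypothetical tokenizer: treat each run of letters or apostrophes
    # as one word and drop everything else (spaces, punctuation).
    words = re.findall(r"[A-Za-z']+", sentence)
    # Counter keeps how often each word occurs and forgets the order.
    return Counter(words)

sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
bag = bag_of_words(sentence)
for word, count in bag.items():
    print(word, count)
```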
&lt;h2 id="The-Objective"&gt;The Objective&lt;a class="anchor-link" href="#The-Objective"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Given an input list of Tweets, build up the following things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The distribution of starting words; since there are no "prior" words to go from, we need to treat this as a special case.&lt;/li&gt;
&lt;li&gt;The distribution of words given a previous word; for example, every time I use the word &lt;code&gt;woodchuck&lt;/code&gt; in the example sentence, there is a 50% chance it is followed by &lt;code&gt;chuck&lt;/code&gt; and a 50% chance it is followed by &lt;code&gt;could&lt;/code&gt;. I need this distribution for all words.&lt;/li&gt;
&lt;li&gt;The distribution of quantity of hashtags; Do I most often use just one? Two? Do they follow something like a Poisson distribution?&lt;/li&gt;
&lt;li&gt;Distribution of hashtags; Given a number of hashtags, what is the actual content? I'll treat hashtags as separate from the content of a tweet.&lt;/li&gt;
&lt;/ol&gt;
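&lt;p&gt;For the third item, one simple modelling assumption is that the number of hashtags per tweet is Poisson distributed. Under that assumption, the maximum likelihood estimate of the rate is just the average number of hashtags per tweet. The sketch below uses made-up counts and a hand-rolled sampler (Knuth's multiplication method) so that it needs nothing beyond the standard library; the names &lt;code&gt;poisson_rate_mle&lt;/code&gt; and &lt;code&gt;sample_poisson&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import math
import random

def poisson_rate_mle(counts):
    # For a Poisson model, the MLE of the rate is the sample mean.
    return sum(counts) / len(counts)

def sample_poisson(lam, rng):
    # Knuth's method: multiply uniform draws together until the
    # product falls to exp(-lam) or below; the number of draws that
    # stayed above the threshold is the Poisson sample.
    threshold = math.exp(-lam)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return k
        k += 1

# Made-up data: number of hashtags in each of eight tweets.
hashtag_counts = [0, 1, 0, 2, 1, 0, 1, 1]
rate = poisson_rate_mle(hashtag_counts)

rng = random.Random(7)  # seeded only so that runs are repeatable
print(rate, [sample_poisson(rate, rng) for _ in range(10)])
```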
&lt;h2 id="The-Data"&gt;The Data&lt;a class="anchor-link" href="#The-Data"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I'm using as input my tweet history. I don't really use Twitter anymore, but it seems like a fun use of the dataset. I'd like to eventually build this to a point where I can imitate anyone on Twitter using their last 100 tweets or so, but I'll start with this as example code.&lt;/p&gt;
&lt;h2 id="The-Algorithm"&gt;The Algorithm&lt;a class="anchor-link" href="#The-Algorithm"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I'll be using the &lt;a href="http://www.nltk.org/"&gt;NLTK&lt;/a&gt; library for doing a lot of the heavy lifting. First, let's import the data:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[1]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;tweets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;tweets.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="c1"&gt;# Don&amp;#39;t include tweets in reply to or mentioning people&lt;/span&gt;
&lt;span class="n"&gt;replies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;@&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text_norep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;replies&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;And now that we've got data, let's start crunching. First, tokenize and build out the distribution of first word:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[2]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;nltk.tokenize&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TweetTokenizer&lt;/span&gt;
&lt;span class="n"&gt;tknzr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TweetTokenizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_norep&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tknzr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;first_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;first_words_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;first_words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;first_words&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isalpha&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="n"&gt;first_word_dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;first_words_alpha&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_words_alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;Next, we need to build out the conditional distributions. That is, what is the probability of the next word given the current word is $X$? This one is a bit more involved. First, find all unique words, and then find what words follow them. This can probably be done in a more efficient manner than I'm currently doing here, but we'll ignore that for the moment.&lt;/p&gt;
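&lt;p&gt;For what it's worth, one possible way to build these conditional distributions in a single pass is sketched below. It is only an illustration under a few assumptions: the function names are made up, the input is a plain list of token lists, and word pairs are only counted within a tweet rather than across tweet boundaries, which may differ from what the notebook code does.&lt;/p&gt;

```python
from collections import Counter, defaultdict

def next_word_counts(tokenized_tweets):
    # Map each word to a Counter of the words seen directly after it.
    followers = defaultdict(Counter)
    for tokens in tokenized_tweets:
        # zip pairs every token with its successor in the same tweet.
        for current, following in zip(tokens, tokens[1:]):
            followers[current][following] += 1
    return followers

def to_probabilities(counter):
    # Turn raw follow-up counts into a conditional distribution.
    total = sum(counter.values())
    return {word: count / total for word, count in counter.items()}

tweets = [
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood".split(),
]
followers = next_word_counts(tweets)
print(to_probabilities(followers["woodchuck"]))
```

&lt;p&gt;Each tweet is scanned once, so the work grows with the total number of tokens instead of with the number of unique words times the number of tokens.&lt;/p&gt;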
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[3]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;functools&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;
&lt;span class="c1"&gt;# Get all possible words&lt;/span&gt;
&lt;span class="n"&gt;all_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="n"&gt;unique_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;actual_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_words&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;word_dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_words&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;proceeding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;all_words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proceeding&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
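&lt;p&gt;As a hedged aside, here is a minimal sketch of one possibly faster way to build the same kind of mapping in a single pass. The names &lt;code&gt;sample_tokens&lt;/code&gt; and &lt;code&gt;build_followers&lt;/code&gt; are hypothetical and not part of the original notebook, and unlike the loop above this version never pairs the last word of one tweet with the first word of the next:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical stand-in for the post's `tokens`: one list of words per tweet.
sample_tokens = [
    ['I', 'like', 'pie', '.'],
    ['I', 'like', 'tea', '.'],
]

def build_followers(token_lists):
    """Map each word to the list of words that directly follow it."""
    followers = defaultdict(list)
    for words in token_lists:
        # Pair each word with its successor inside the same tweet only.
        for current, nxt in zip(words, words[1:]):
            if current[0] != '.':  # skip sentence-ending tokens, as the post does
                followers[current].append(nxt)
    return dict(followers)

sample_dist = build_followers(sample_tokens)
```

&lt;p&gt;Keeping duplicates in each list means that a uniform draw from it is already weighted by how often each word followed.&lt;/p&gt;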
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;Now that we've got the tweet analysis done, it's time for the fun part: hashtags! Let's count how many hashtags are in each tweet; I want to get a sense of the distribution.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[4]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="k"&gt;matplotlib&lt;/span&gt; inline
&lt;span class="n"&gt;hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_norep&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hist&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_wrapper"&gt;
&lt;div class="output"&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt output_prompt"&gt;Out[4]:&lt;/div&gt;
&lt;div class="output_text output_subarea output_execute_result"&gt;
&lt;pre&gt;&amp;lt;matplotlib.axes._subplots.AxesSubplot at 0x18e59dc28d0&amp;gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt"&gt;&lt;/div&gt;
&lt;div class="output_png output_subarea "&gt;
&lt;img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAYkAAAEACAYAAABGYoqtAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAAEe1JREFUeJzt3X+s3XV9x/HnCzqRinadjt6NosA0CGYOUSsJM7tmG4pG
YFuGuGkEMmOCTheThZZsazXZBOOcbguJUWYqw7CCIpi5UQi7Li5KmYKixdpkFrHQC1MHogQB3/vj
fGsP9X7KObf33HNu7/ORnPT7/dzv95x3v/32vO7n8/2VqkKSpLkcNu4CJEmTy5CQJDUZEpKkJkNC
ktRkSEiSmgwJSVLTyEMiya4kX01ye5JtXdvqJFuT7EhyY5JVfctvSLIzyV1Jzhh1fZKktsXoSfwU
mK6ql1TVuq5tPXBzVZ0I3AJsAEhyMnAucBJwJnB5kixCjZKkOSxGSGSOzzkb2NxNbwbO6abPAq6u
qserahewE1iHJGksFiMkCrgpyW1J/qRrW1NVswBVtQc4ums/Brinb93dXZskaQxWLMJnnF5V9yX5
ZWBrkh30gqOf9waRpAk08pCoqvu6Px9I8hl6w0ezSdZU1WySKeD+bvHdwLF9q6/t2p4kiaEiSfNQ
VUMd5x3pcFOSlUmO6qafAZwB3AncAJzfLfYW4Ppu+gbgvCRPS3I88Hxg21zvXVW+qti4cePYa5iU
l9vCbeG2OPBrPkbdk1gDXNf95r8CuKqqtib5b2BLkguBu+md0URVbU+yBdgOPAZcVPP9m0mSDtpI
Q6Kqvg2cMkf794HfaazzPuB9o6xLkjQYr7he4qanp8ddwsRwW+zjttjHbXFwshRHc5I4CiVJQ0pC
TdKBa0nS0mZISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktS0rEJiauo4kgz1mpo6
btxlS9LYLKvbcvQelz3sepn3LXYlaZJ4Ww5J0oIyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKa
DAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQ
kCQ1GRKSpCZDQpLUZEhIkpoWJSSSHJbkK0lu6OZXJ9maZEeSG5Os6lt2Q5KdSe5KcsZi1CdJmtti
9STeBWzvm18P3FxVJwK3ABsAkpwMnAucBJwJXJ4ki1SjJGk/Iw+JJGuB1wIf62s+G9jcTW8Gzumm
zwKurqrHq2oXsBNYN+oaJUlzW4yexN8Bfw5UX9uaqpoFqKo9wNFd+zHAPX3L7e7aJEljsGKUb57k
dcBsVd2RZPoAi9YBfjanTZs2/Wx6enqa6ekDvb0kLT8zMzPMzMwc1Hukaujv58HfPPkb4E3A48CR
wDOB64CXAdNVNZtkCviPqjopyXqgquqybv1/BzZW1a37vW/Np+7e4Y1h1wuj3EaStFiSUFVDHecd
6XBTVV1SVc+tqhOA84BbqurNwGeB87vF3gJc303fAJyX5GlJjgeeD2wbZY2SpLaRDjcdwKXAliQX
AnfTO6OJqtqeZAu9M6EeAy6aV5dBkrQgRjrcNCoON0nS8CZuuEmStLQZEpKkJkNCktRkSEiSmgwJ
SVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAk
NY3r8aUH7b3vfe9Qy69cuXJElUjSoWvJPr4U/nKodY444goeffRefHyppOVqPo8vXcIhMVzdq1ad
xoMP3oohIWm58hnXkqQFZUhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKa
DAlJUpMhIUlqMiQkSU2GhCSpyZCQJDWNNCSSHJHk1iS3J7kzycaufXWSrUl2JLkxyaq+dTYk2Znk
riRnjLI+SdKBjTQkqupR4FVV9RLgFODMJOuA9cDNVXUicAuwASDJycC5wEnAmcDlSYZ6QIYkaeGM
fLipqn7cTR5B75naBZwNbO7aNwPndNNnAVdX1eNVtQvYCawbdY2SpLkNFBJJfn2+H5DksCS3A3uA
m6rqNmBNVc0CVNUe4Ohu8WOAe/pW3921SZLGYNCexOVJtiW5qP/4wSCq6qfdcNNaYF2SF/HzD5r2
IdKSNIFWDLJQVb0yyQuAC4EvJ9kGfLyqbhr0g6rqoSQzwGuA2SRrqmo2yRRwf7fYbuDYvtXWdm1z
2NQ3Pd29JEl7zczMMDMzc1DvkarBf4lPcji94wd/DzwEBLikqj7dWP45wGNV9WCSI4EbgUuB3wK+
X1WXJbkYWF1V67sD11cBr6A3zHQT8ILar8gkNWznY9Wq03jwwVsZvtMShtlGkjSpklBVQ50MNFBP
IsmLgQuA19H74n59VX0lya8CXwTmDAngV4DNSQ6jN7T1L1X1uSRfArYkuRC4m94ZTVTV9iRbgO3A
Y8BF+weEJGnxDNSTSPJ54GPAtVX1yH4/e3NVXTmi+lr12JOQpCGNrCdBrwfxSFU90X3QYcDTq+rH
ix0QkqTFM+jZTTcDR/bNr+zaJEmHsEFD4ulV9fDemW565WhKkiRNikFD4kdJTt07k+SlwCMHWF6S
dAgY9JjEnwHXJLmX3mmvU8AbRlaVJGkiDHox3W1JXgic2DXtqKrHRleWJGkSDNqTAHg5cFy3zqnd
qVSfGElVkqSJMOjFdFcCvwbcATzRNRdgSEjSIWzQnsTLgJO9+lmSlpdBz276Or2D1ZKkZWTQnsRz
gO3d3V8f3dtYVWeNpCpJ0kQYNCQ2jbIISdJkGvQU2M8neR6923bfnGQlcPhoS5Mkjdugjy99K3At
8JGu6RjgM6MqSpI0GQY9cP124HR6Dxqiqnay77nUkqRD1KAh8WhV/WTvTJIV+FxqSTrkDRoSn09y
CXBkkt8FrgE+O7qyJEmTYNCQWA88ANwJvA34HPAXoypKkjQZBnp86aTx8aWSNLyRPb40ybeZ49u1
qk4Y5sMkSUvLMPdu2uvpwB8Cv7Tw5UiSJslAxySq6nt9r91V9SHgdSOuTZI0ZoMON53aN3sYvZ7F
MM+ikCQtQYN+0f9t3/TjwC7g3AWvRpI0UQa9d9OrRl2IJGnyDDrc9O4D/byqPrgw5UiSJskwZze9
HLihm389sA3YOYqiJEmTYdCQWAucWlU/BEiyCfjXqnrTqAqTJI3foLflWAP8pG/+J12bJOkQNmhP
4hPAtiTXdfPnAJtHU5IkaVIMenbTXyf5N+CVXdMFVXX76MqSJE2CQYebAFYCD1XVh4HvJjl+RDVJ
kibEoI8v3QhcDGzomn4B+OdRFSVJmgyD9iR+DzgL+BFAVd0LPHNURUmSJsOgIfGT6j1UoQCSPGN0
JUmSJsWgIbElyUeAX0zyVuBm4KOjK0uSNAkGvVX4B4BrgU8BJwJ/VVX/8FTrJVmb5JYk30hyZ5J3
du2rk2xNsiPJjUlW9a2zIcnOJHclOWN+fy1J0kJ4yseXJjkcuHk+N/lLMgVMVdUdSY4CvgycDVwA
fK+q3p/kYmB1Va1PcjJwFb1bgKyl12N5Qe1XpI8vlaThzefxpU/Zk6iqJ4Cf9v+2P6iq2lNVd3TT
DwN30fvyP5t9F+NtpndxHvQOjl9dVY9X1S5694ZaN+znSpIWxqBXXD8M3JnkJroznACq6p2DflCS
44BTgC8Ba6pqtnuPPUmO7hY7Bvhi32q7uzZJ0hgMGhKf7l7z0g01XQu8q6oe7g0XPYnjOZI0gQ4Y
EkmeW1Xfqap536cpyQp6AXFlVV3fNc8mWVNVs91xi/u79t3AsX2rr+3a5rCpb3q6e0mS9pqZmWFm
Zuag3uOAB66TfKWqTu2mP1VVfzD0BySfAP63qt7d13YZ8P2quqxx4PoV9IaZbsID15K0IOZz4Pqp
hpv63+yEeRR0OvDH9I5n3E7vG/oS4DJ6115cCNxN97zsqtqeZAuwHXgMuGj/gJAkLZ6nColqTA+k
qv4LOLzx499prPM+4H3DfpYkaeE9VUj8RpKH6PUojuym6earqp410uokSWN1wJCoqlYvQJK0DAzz
PAlJ0jJjSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoy
JCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNC
ktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRJP6QiSDP2amjpu3IVL0kFbMe4CJt+jQA29
1uxsFr4USVpk9iQkSU0jDYkkVySZTfK1vrbVSbYm2ZHkxiSr+n62IcnOJHclOWOUtUmSntqoexIf
B169X9t64OaqOhG4BdgAkORk4FzgJOBM4PIkjtlI0hiNNCSq6gvAD/ZrPhvY3E1vBs7pps8Crq6q
x6tqF7ATWDfK+iRJBzaOYxJHV9UsQFXtAY7u2o8B7ulbbnfXJkkak0k4u2n4U4cA2NQ3Pd29JEl7
zczMMDMzc1DvMY6QmE2ypqpmk0wB93ftu4Fj+5Zb27U1bBpVfZJ0SJienmZ6evpn8+95z3uGfo/F
GG5K99rrBuD8bvotwPV97ecleVqS44HnA9sWoT5JUsNIexJJPklvHOjZSb4DbAQuBa5JciFwN70z
mqiq7Um2ANuBx4CLqmqeQ1GSpIWQpfg9nKSGPZSxatVpPPjgrQx/CCTzWKe33lLctpIOXUmoqqEu
LfCKa0lSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSp
yZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoM
CUlSkyExMkeQZKjX1NRx4y5akp5kxbgLOHQ9CtRQa8zOZjSlSNI82ZOQJDUZEpKkJkNCktRkSEiS
mgwJSVKTISFJajIkJElNhoQkqWkiQyLJa5J8M8m3klw87nokabmauJBIchjwj8CrgRcBb0zywvFW
NblmZmbGXcLEcFvs47bYx21xcCYuJIB1wM6quruqHgOuBs4ec00Ty/8A+7gt9nFb7OO2ODiTGBLH
APf0zX+3a9McPvCBDw19I0FvJihpUEv2Bn/Petbrh1r+kUe+OaJKFlLvzrHDG+5GguDNBKVRmJo6
jtnZu4daZ82a57Fnz67RFLQAUjX8F8woJTkN2FRVr+nm1wNVVZf1LTNZRUvSElFVQ/2GOIkhcTiw
A/ht4D5gG/DGqrprrIVJ0jI0ccNNVfVEkncAW+kdM7nCgJCk8Zi4noQkaXJM4tlNB+SFdvsk2ZXk
q0luT7Jt3PUspiRXJJlN8rW+ttVJtibZkeTGJKvGWeNiaWyLjUm+m+Qr3es146xxsSRZm+SWJN9I
cmeSd3bty2rfmGM7/GnXPvR+saR6Et2Fdt+id7ziXuA24LyqWgqnLi24JP8DvLSqfjDuWhZbkt8E
HgY+UVUv7touA75XVe/vfoFYXVXrx1nnYmhsi43AD6vqg2MtbpElmQKmquqOJEcBX6Z3ndUFLKN9
4wDb4Q0MuV8stZ6EF9o9WVh6/4YLoqq+AOwfjmcDm7vpzcA5i1rUmDS2BfT2j2WlqvZU1R3d9MPA
XcBaltm+0dgOe683G2q/WGpfMF5o92QF3JTktiRvHXcxE+DoqpqF3n8S4Ogx1zNu70hyR5KPHerD
K3NJchxwCvAlYM1y3Tf6tsOtXdNQ+8VSCwk92elVdSrwWuDt3bCD9lk6Y6kL73LghKo6BdgDLLdh
p6OAa4F3db9J778vLIt9Y47tMPR+sdRCYjfw3L75tV3bslRV93V/PgBcR284bjmbTbIGfjYme/+Y
6xmbqnqg9h1w/Cjw8nHWs5iSrKD3xXhlVV3fNS+7fWOu7TCf/WKphcRtwPOTPC/J04DzgBvGXNNY
JFnZ/ZZAkmcAZwBfH29Viy48eXz1BuD8bvotwPX7r3AIe9K26L4I9/p9lte+8U/A9qr6cF/bctw3
fm47zGe/WFJnN0HvFFjgw+y70O7SMZc0FkmOp9d7KHoXRV61nLZFkk8C08CzgVlgI/AZ4BrgWOBu
4Nyq+r9x1bhYGtviVfTGoX8K7ALetndM/lCW5HTgP4E76f3fKOASendu2MIy2TcOsB3+iCH3iyUX
EpKkxbPUhpskSYvIkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU3/DzepYDZSwMuQAAAA
AElFTkSuQmCC
"
&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;That looks like a Poisson distribution, roughly as I expected. I'm guessing my number of hashtags per tweet is $\sim Poi(1)$, but let's actually find the &lt;a href="https://en.wikipedia.org/wiki/Poisson_distribution#Maximum_likelihood"&gt;maximum likelihood estimator&lt;/a&gt;, which in this case is just the sample mean, $\hat{\lambda} = \bar{x}$:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[5]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;mle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mle&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_wrapper"&gt;
&lt;div class="output"&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt output_prompt"&gt;Out[5]:&lt;/div&gt;
&lt;div class="output_text output_subarea output_execute_result"&gt;
&lt;pre&gt;0.870236869207003&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;Pretty close to my guess of 1! So we can now simulate how many hashtags are in a tweet. Let's also find which hashtags are actually used:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[6]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_words&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;n_hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;unique_hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_words&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;hashtag_dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hashtags&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;unique_hashtags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;prob&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;all_words&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_hashtags&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_hashtags&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_wrapper"&gt;
&lt;div class="output"&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt output_prompt"&gt;Out[6]:&lt;/div&gt;
&lt;div class="output_text output_subarea output_execute_result"&gt;
&lt;pre&gt;603&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;Turns out I have used 603 different hashtags during my time on Twitter. That means I was using a unique hashtag for about every third tweet.&lt;/p&gt;
&lt;p&gt;In better news though, we now have all the data we need to go about actually constructing tweets! The process will happen in a few steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Randomly select what the first word will be.&lt;/li&gt;
&lt;li&gt;Randomly select the number of hashtags for this tweet, and then select the actual hashtags.&lt;/li&gt;
&lt;li&gt;Fill in the remaining space of 140 characters with random words taken from my tweets.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;And hopefully, we won't have anything too crazy come out the other end. The way we do the selection follows a &lt;a href="https://en.wikipedia.org/wiki/Multinomial_distribution"&gt;Multinomial Distribution&lt;/a&gt;: given a set of different values, each with its own probability, pick one. Let's give a quick example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;x: .33
y: .5
z: .17&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That is, I pick &lt;code&gt;x&lt;/code&gt; with probability 33%, &lt;code&gt;y&lt;/code&gt; with probability 50%, and &lt;code&gt;z&lt;/code&gt; with probability 17%. In the context of our sentence construction, I've already built out the probabilities of specific words; now I just need to simulate that distribution. Time for the engine to actually be developed!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[7]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multinom_sim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;occurrences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;multinomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;occurrences&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;vals&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sim_n_hashtags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtag_freq&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;poisson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtag_freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sim_hashtags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;multinom_sim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sim_first_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_word_dist&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_word_dist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;multinom_sim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_word_dist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;index&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;multinom_sim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
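&lt;p&gt;To make the selection step concrete, here is a small, hedged sketch that samples from the toy &lt;code&gt;x&lt;/code&gt;/&lt;code&gt;y&lt;/code&gt;/&lt;code&gt;z&lt;/code&gt; distribution given earlier. It is an illustration rather than the notebook's own code: the names &lt;code&gt;vals&lt;/code&gt;, &lt;code&gt;probs&lt;/code&gt;, &lt;code&gt;picked&lt;/code&gt; and the seed are assumptions, and it uses NumPy's &lt;code&gt;default_rng&lt;/code&gt; generator instead of the legacy &lt;code&gt;np.random.multinomial&lt;/code&gt; call used above:&lt;/p&gt;

```python
import numpy as np

# Toy distribution from the example; the values and seed are illustrative.
vals = np.array(['x', 'y', 'z'])
probs = np.array([0.33, 0.5, 0.17])

rng = np.random.default_rng(0)

# One multinomial trial returns a count per value; exactly one count is 1.
counts = rng.multinomial(1, probs)
picked = vals[counts == 1][0]

# Over many trials the observed frequencies should approach probs.
many = rng.multinomial(10000, probs)
freqs = many / many.sum()
```

&lt;p&gt;A single trial is how one word or hashtag gets picked; asking for several trials at once, as &lt;code&gt;sim_hashtags&lt;/code&gt; does, returns how many times each value was chosen.&lt;/p&gt;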
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;h2 id="Pulling-it-all-together"&gt;Pulling it all together&lt;a class="anchor-link" href="#Pulling-it-all-together"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I've now built out all the code I need to actually simulate a sentence written by me. Let's try doing an example with five words and a single hashtag:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[8]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_first_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;second&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;third&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fourth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;third&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fifth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fourth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;hashtag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_hashtags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;third&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fourth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fifth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_wrapper"&gt;
&lt;div class="output"&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt output_prompt"&gt;Out[8]:&lt;/div&gt;
&lt;div class="output_text output_subarea output_execute_result"&gt;
&lt;pre&gt;&amp;apos;My first all-nighter of friends #oldschool&amp;apos;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;Let's go ahead and put everything together! We're going to simulate a first word, simulate the hashtags, and then simulate words to fill the gap until we've either taken up all the space or reached a period or exclamation mark.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[9]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;simulate_tweet&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="n"&gt;chars_remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;140&lt;/span&gt;
&lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_first_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;n_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_n_hashtags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;hashtags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_hashtags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtag_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chars_remaining&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;
&lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;chars_remaining&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;!&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim_next_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word_dist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;
&lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; &amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashtags&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;h2 id="The-results"&gt;The results&lt;a class="anchor-link" href="#The-results"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;And now for something completely different: twenty random tweets dreamed up by my computer and my Twitter data. Here you go:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing code_cell rendered"&gt;
&lt;div class="input"&gt;
&lt;div class="prompt input_prompt"&gt;In&amp;nbsp;[12]:&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="input_area"&gt;
&lt;div class=" highlight hl-ipython3"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;simulate_tweet&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="output_wrapper"&gt;
&lt;div class="output"&gt;
&lt;div class="output_area"&gt;&lt;div class="prompt"&gt;&lt;/div&gt;
&lt;div class="output_subarea output_stream output_stdout output_text"&gt;
&lt;pre&gt;Also , I&amp;apos;m at 8 this morning. #thursdaysgohard #ornot
Turns out of us breathe the code will want to my undergraduate career is becoming more night trying ? Religion is now as a chane #HYPE
You know what recursion is to review the UNCC. #ornot
There are really sore 3 bonfires in my first writing the library ground floor if awesome. #realtalk #impressed
So we can make it out there&amp;apos;s nothing but I&amp;apos;m not let us so hot I could think I may be good. #SwingDance
Happy Christmas , at Harris Teeter to be be godly or Roman Catholic ). #4b392b#4b392b #Isaiah26
For context , I in the most decisive factor of the same for homework. #accomplishment
Freaking done. #loveyouall
New blog post : Don&amp;apos;t jump in a quiz in with a knife fight. #haskell #earlybirthday
God shows me legitimately want to get some food and one day.
Stormed the queen city. #mindblown
The day of a cold at least outside right before the semester ..
Finished with the way back. #winners
Waking up , OJ , I feel like Nick Jonas today.
First draft of so hard drive. #humansvszombies
Eric Whitacre is the wise creation.
Ethics paper first , music in close to everyone who just be posting up with my sin , and Jerry Springr #TheLittleThings
Love that you know enough time I&amp;apos;ve eaten at 8 PM. #deepthoughts #stillblownaway
Lead. #ThinkingTooMuch #Christmas
Aamazing conference when you married #DepartmentOfRedundancyDepartment Yep , but there&amp;apos;s a legitimate challenge.
&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;
&lt;div class="inner_cell"&gt;
&lt;div class="text_cell_render border-box-sizing rendered_html"&gt;
&lt;p&gt;...Which all ended up being a whole lot more nonsensical than I had hoped for. There are some good ones, so I'll call that an accomplishment! I was banking on grammar not being an issue: since my tweets use impeccable grammar, the program modeled off them should have pretty good grammar as well. There are going to be some hilarious edge cases (I'm looking at you, &lt;code&gt;Ethics paper first, music in close to everyone&lt;/code&gt;) that make no sense, and some hilarious edge cases (&lt;code&gt;Waking up, OJ, I feel like Nick Jonas today&lt;/code&gt;) that make me feel like I should have a Twitter rap career. On the whole though, the structure came out alright.&lt;/p&gt;
&lt;h2 id="Moving-on-from-here"&gt;Moving on from here&lt;a class="anchor-link" href="#Moving-on-from-here"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;During class we also talked about an interesting idea: trying to analyze corporate documents and corporate speech. I'd be interested to know what this analysis applied to something like a couple of bank press releases could do. By any means, the code needs some work to clean it up before I get that far.&lt;/p&gt;
&lt;h2 id="For-further-reading"&gt;For further reading&lt;a class="anchor-link" href="#For-further-reading"&gt;&amp;#182;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I'm pretty confident I re-invented a couple wheels along the way - what I'm doing feels a lot like what &lt;a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo"&gt;Markov Chain Monte Carlo&lt;/a&gt; is intended to do. But I've never worked explicitly with that before, so more research is needed.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;/p&gt;
&lt;script type="text/x-mathjax-config"&gt;
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
&lt;/script&gt;
&lt;script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'&gt;&lt;/script&gt;</summary><category term="twitter"></category><category term="MCMC"></category></entry><entry><title>Predicting Santander Customer Happiness</title><link href="https://bspeice.github.io/predicting-santander-customer-happiness.html" rel="alternate"></link><updated>2016-03-05T00:00:00-05:00</updated><author><name>Bradlee Speice</name></author><id>tag:bspeice.github.io,2016-03-05:predicting-santander-customer-happiness.html</id><summary type="html">&lt;p&gt;
&lt;div class="cell border-box-sizing text_cell rendered"&gt;
&lt;div class="prompt input_prompt"&gt;
&lt;/div&gt;


@ -83,6 +83,8 @@
<div class="container content archive">
<h2><a href="https://bspeice.github.io/index.html"> </a></h2>
<dl class="dl-horizontal">
<dt>Mon 28 March 2016</dt>
<dd><a href="https://bspeice.github.io/tweet-like-me.html">Tweet Like Me</a></dd>
<dt>Sat 05 March 2016</dt>
<dd><a href="https://bspeice.github.io/predicting-santander-customer-happiness.html">Predicting Santander Customer Happiness</a></dd>
<dt>Fri 26 February 2016</dt>

123
tag/mcmc.html Normal file

@ -0,0 +1,123 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content=" MCMC">
<meta name="keywords" content="">
<link rel="icon" href="https://bspeice.github.io/favicon.ico">
<title> MCMC - Bradlee Speice</title>
<!-- Stylesheets -->
<link href="https://bspeice.github.io/theme/css/bootstrap.min.css" rel="stylesheet">
<link href="https://bspeice.github.io/theme/css/fonts.css" rel="stylesheet">
<link href="https://bspeice.github.io/theme/css/nest.css" rel="stylesheet">
<link href="https://bspeice.github.io/theme/css/pygment.css" rel="stylesheet">
<!-- /Stylesheets -->
<!-- RSS Feeds -->
<link href="https://bspeice.github.io/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Bradlee Speice Full Atom Feed" />
<!-- /RSS Feeds -->
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<!-- Google Analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-74711362-1', 'auto');
ga('send', 'pageview');
</script>
<!-- /Google Analytics -->
</head>
<body>
<!-- Header -->
<div class="header-container gradient">
<!-- Static navbar -->
<div class="container">
<div class="header-nav">
<div class="header-logo">
<a class="pull-left" href="https://bspeice.github.io/"><img class="mr20" src="https://bspeice.github.io/images/logo.svg" alt="logo">Bradlee Speice</a>
</div>
<div class="nav pull-right">
</div>
</div>
</div>
<!-- /Static navbar -->
<!-- Header -->
<div class="container header-wrapper">
<div class="row">
<div class="col-lg-12">
<div class="header-content">
<h1 class="header-title text-uppercase"> : #MCMC</h1>
<div class="header-underline"></div>
<p class="header-subtitle header-subtitle-homepage"> #MCMC</p>
</div>
</div>
</div>
</div>
<!-- /Header -->
</div>
<!-- /Header -->
<!-- Content -->
<div class="archive-container">
<div class="container content archive">
<h2><a href="https://bspeice.github.io/tag/mcmc.html">MCMC</a></h2>
<dl class="dl-horizontal">
<dt>Mon 28 March 2016</dt>
<dd><a href="https://bspeice.github.io/tweet-like-me.html">Tweet Like Me</a></dd>
</dl>
</div>
</div>
<!-- /Content -->
<!-- Footer -->
<div class="footer gradient-2">
<div class="container footer-container ">
<div class="row">
<div class="col-xs-4 col-sm-3 col-md-3 col-lg-3">
<div class="footer-title"></div>
<ul class="list-unstyled">
<li><a href="https://bspeice.github.io/feeds/all.atom.xml" type="application/atom+xml" rel="alternate"></a></li>
</ul>
</div>
<div class="col-xs-4 col-sm-3 col-md-3 col-lg-3">
<div class="footer-title"></div>
<ul class="list-unstyled">
<li><a href="https://github.com/bspeice" target="_blank">Github</a></li>
<li><a href="https://www.linkedin.com/in/bradleespeice" target="_blank">LinkedIn</a></li>
</ul>
</div>
<div class="col-xs-4 col-sm-3 col-md-3 col-lg-3">
</div>
<div class="col-xs-12 col-sm-3 col-md-3 col-lg-3">
<p class="pull-right text-right">
<small><em>Proudly powered by <a href="http://docs.getpelican.com/" target="_blank">pelican</a></em></small><br/>
<small><em>Theme and code by <a href="https://github.com/molivier" target="_blank">molivier</a></em></small><br/>
<small></small>
</p>
</div>
</div>
</div>
</div>
<!-- /Footer -->
</body>
</html>

123
tag/twitter.html Normal file

@ -0,0 +1,123 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content=" twitter">
<meta name="keywords" content="">
<link rel="icon" href="https://bspeice.github.io/favicon.ico">
<title> twitter - Bradlee Speice</title>
<!-- Stylesheets -->
<link href="https://bspeice.github.io/theme/css/bootstrap.min.css" rel="stylesheet">
<link href="https://bspeice.github.io/theme/css/fonts.css" rel="stylesheet">
<link href="https://bspeice.github.io/theme/css/nest.css" rel="stylesheet">
<link href="https://bspeice.github.io/theme/css/pygment.css" rel="stylesheet">
<!-- /Stylesheets -->
<!-- RSS Feeds -->
<link href="https://bspeice.github.io/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Bradlee Speice Full Atom Feed" />
<!-- /RSS Feeds -->
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<!-- Google Analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-74711362-1', 'auto');
ga('send', 'pageview');
</script>
<!-- /Google Analytics -->
</head>
<body>
<!-- Header -->
<div class="header-container gradient">
<!-- Static navbar -->
<div class="container">
<div class="header-nav">
<div class="header-logo">
<a class="pull-left" href="https://bspeice.github.io/"><img class="mr20" src="https://bspeice.github.io/images/logo.svg" alt="logo">Bradlee Speice</a>
</div>
<div class="nav pull-right">
</div>
</div>
</div>
<!-- /Static navbar -->
<!-- Header -->
<div class="container header-wrapper">
<div class="row">
<div class="col-lg-12">
<div class="header-content">
<h1 class="header-title text-uppercase"> : #twitter</h1>
<div class="header-underline"></div>
<p class="header-subtitle header-subtitle-homepage"> #twitter</p>
</div>
</div>
</div>
</div>
<!-- /Header -->
</div>
<!-- /Header -->
<!-- Content -->
<div class="archive-container">
<div class="container content archive">
<h2><a href="https://bspeice.github.io/tag/twitter.html">twitter</a></h2>
<dl class="dl-horizontal">
<dt>Mon 28 March 2016</dt>
<dd><a href="https://bspeice.github.io/tweet-like-me.html">Tweet Like Me</a></dd>
</dl>
</div>
</div>
<!-- /Content -->
<!-- Footer -->
<div class="footer gradient-2">
<div class="container footer-container ">
<div class="row">
<div class="col-xs-4 col-sm-3 col-md-3 col-lg-3">
<div class="footer-title"></div>
<ul class="list-unstyled">
<li><a href="https://bspeice.github.io/feeds/all.atom.xml" type="application/atom+xml" rel="alternate"></a></li>
</ul>
</div>
<div class="col-xs-4 col-sm-3 col-md-3 col-lg-3">
<div class="footer-title"></div>
<ul class="list-unstyled">
<li><a href="https://github.com/bspeice" target="_blank">Github</a></li>
<li><a href="https://www.linkedin.com/in/bradleespeice" target="_blank">LinkedIn</a></li>
</ul>
</div>
<div class="col-xs-4 col-sm-3 col-md-3 col-lg-3">
</div>
<div class="col-xs-12 col-sm-3 col-md-3 col-lg-3">
<p class="pull-right text-right">
<small><em>Proudly powered by <a href="http://docs.getpelican.com/" target="_blank">pelican</a></em></small><br/>
<small><em>Theme and code by <a href="https://github.com/molivier" target="_blank">molivier</a></em></small><br/>
<small></small>
</p>
</div>
</div>
</div>
</div>
<!-- /Footer -->
</body>
</html>


@ -99,6 +99,8 @@
<dt><span class="label label-default">1</span> article </dt>
<dd><a href="https://bspeice.github.io/tag/martingale.html">martingale</a></dd>
<dt><span class="label label-default">1</span> article </dt>
<dd><a href="https://bspeice.github.io/tag/mcmc.html">MCMC</a></dd>
<dt><span class="label label-default">1</span> article </dt>
<dd><a href="https://bspeice.github.io/tag/monte-carlo.html">monte carlo</a></dd>
<dt><span class="label label-default">1</span> article </dt>
<dd><a href="https://bspeice.github.io/tag/python.html">python</a></dd>
@ -108,6 +110,8 @@
<dd><a href="https://bspeice.github.io/tag/strategy.html">strategy</a></dd>
<dt><span class="label label-default">1</span> article </dt>
<dd><a href="https://bspeice.github.io/tag/trading.html">trading</a></dd>
<dt><span class="label label-default">1</span> article </dt>
<dd><a href="https://bspeice.github.io/tag/twitter.html">twitter</a></dd>
<dt><span class="label label-default">2</span> articles </dt>
<dd><a href="https://bspeice.github.io/tag/weather.html">weather</a></dd>
</dl>

716
tweet-like-me.html Normal file

@ -0,0 +1,716 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="An experiment in creating a robot that will imitate me on Twitter. So, I&#39;m taking a Machine Learning course this semester in school, and one of the topics we keep coming back to is natural ...">
<meta name="keywords" content="MCMC, twitter">
<link rel="icon" href="https://bspeice.github.io/favicon.ico">
<title>Tweet Like Me - Bradlee Speice</title>
<!-- Stylesheets -->
<link href="https://bspeice.github.io/theme/css/bootstrap.min.css" rel="stylesheet">
<link href="https://bspeice.github.io/theme/css/fonts.css" rel="stylesheet">
<link href="https://bspeice.github.io/theme/css/nest.css" rel="stylesheet">
<link href="https://bspeice.github.io/theme/css/pygment.css" rel="stylesheet">
<!-- /Stylesheets -->
<!-- RSS Feeds -->
<link href="https://bspeice.github.io/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Bradlee Speice Full Atom Feed" />
<link href="https://bspeice.github.io/feeds/blog.atom.xml" type="application/atom+xml" rel="alternate" title="Bradlee Speice Categories Atom Feed" />
<!-- /RSS Feeds -->
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<!-- Google Analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-74711362-1', 'auto');
ga('send', 'pageview');
</script>
<!-- /Google Analytics -->
</head>
<body>
<!-- Header -->
<div class="header-container gradient">
<!-- Static navbar -->
<div class="container">
<div class="header-nav">
<div class="header-logo">
<a class="pull-left" href="https://bspeice.github.io/"><img class="mr20" src="https://bspeice.github.io/images/logo.svg" alt="logo">Bradlee Speice</a>
</div>
<div class="nav pull-right">
</div>
</div>
</div>
<!-- /Static navbar -->
<!-- Header -->
<!-- Header -->
<div class="container header-wrapper">
<div class="row">
<div class="col-lg-12">
<div class="header-content">
<h1 class="header-title">Tweet Like Me</h1>
<p class="header-date"> <a href="https://bspeice.github.io/author/bradlee-speice.html">Bradlee Speice</a>, Mon 28 March 2016, <a href="https://bspeice.github.io/category/blog.html">Blog</a></p>
<div class="header-underline"></div>
<div class="clearfix"></div>
<p class="pull-right header-tags">
<span class="glyphicon glyphicon-tags mr5" aria-hidden="true"></span>
<a href="https://bspeice.github.io/tag/mcmc.html">MCMC</a>, <a href="https://bspeice.github.io/tag/twitter.html">twitter</a> </p>
</div>
</div>
</div>
</div>
<!-- /Header -->
<!-- /Header -->
</div>
<!-- /Header -->
<!-- Content -->
<div class="container content">
<p>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>An experiment in creating a robot that will imitate me on Twitter.</p>
<hr>
<p>So, I'm taking a Machine Learning course this semester in school, and one of the topics we keep coming back to is natural language processing and the 'bag of words' data structure. That is, given a sentence:</p>
<p><code>How much wood would a woodchuck chuck if a woodchuck could chuck wood?</code></p>
<p>We can represent that sentence as the following set of word counts:</p>
<p><code>{
How: 1
much: 1
wood: 2
would: 1
a: 2
woodchuck: 2
chuck: 2
if: 1
could: 1
}</code></p>
<p>Ignoring <em>where</em> the words happened, we're just interested in how <em>often</em> the words occurred. That got me thinking: I wonder what would happen if I built a robot that just imitated how often I said things? It's dangerous territory when computer scientists ask "what if," but I got curious enough I wanted to follow through.</p>
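<p>As a minimal sketch of the idea (not part of the original notebook), the counts above can be reproduced with Python's standard library. The helper name <code>bag_of_words</code> and the simple regular-expression tokenizer are assumptions made for illustration only:</p>

```python
from collections import Counter
import re


def bag_of_words(sentence):
    """Count how often each word occurs, ignoring word order and punctuation."""
    # Assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", sentence)
    return Counter(words)


sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
counts = bag_of_words(sentence)
print(counts)
```

<p>A <code>Counter</code> is just a dictionary of counts, so where each word appeared is discarded and only how often it appeared is kept.</p>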
<h2 id="The-Objective">The Objective<a class="anchor-link" href="#The-Objective">&#182;</a></h2><p>Given an input list of Tweets, build up the following things:</p>
<ol>
<li>The distribution of starting words; since there are no "prior" words to go from, we need to treat this as a special case.</li>
<li>The distribution of words given a previous word; for example, every time I use the word <code>woodchuck</code> in the example sentence, there is a 50% chance it is followed by <code>chuck</code> and a 50% chance it is followed by <code>could</code>. I need this distribution for all words.</li>
<li>The distribution of the number of hashtags per tweet: do I most often use just one? Two? Do the counts follow something like a Poisson distribution?</li>
<li>The distribution of the hashtags themselves: given a number of hashtags, what is the actual content? I'll treat hashtags as separate from the content of a tweet.</li>
</ol>
<h2 id="The-Data">The Data<a class="anchor-link" href="#The-Data">&#182;</a></h2><p>I'm using my tweet history as input. I don't really use Twitter anymore, but it seems like a fun use of the dataset. I'd like to eventually build this to a point where I can imitate anyone on Twitter using their last 100 tweets or so, but I'll start with this as example code.</p>
<h2 id="The-Algorithm">The Algorithm<a class="anchor-link" href="#The-Algorithm">&#182;</a></h2><p>I'll be using the <a href="http://www.nltk.org/">NLTK</a> library for doing a lot of the heavy lifting. First, let's import the data:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[1]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">tweets</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;tweets.csv&#39;</span><span class="p">)</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">tweets</span><span class="o">.</span><span class="n">text</span>
<span class="c1"># Don&#39;t include tweets in reply to or mentioning people</span>
<span class="n">replies</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">&#39;@&#39;</span><span class="p">)</span>
<span class="n">text_norep</span> <span class="o">=</span> <span class="n">text</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="o">~</span><span class="n">replies</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>And now that we've got data, let's start crunching. First, tokenize and build out the distribution of first words:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[2]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">nltk.tokenize</span> <span class="k">import</span> <span class="n">TweetTokenizer</span>
<span class="n">tknzr</span> <span class="o">=</span> <span class="n">TweetTokenizer</span><span class="p">()</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">text_norep</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">tknzr</span><span class="o">.</span><span class="n">tokenize</span><span class="p">)</span>
<span class="n">first_words</span> <span class="o">=</span> <span class="n">tokens</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">first_words_alpha</span> <span class="o">=</span> <span class="n">first_words</span><span class="p">[</span><span class="n">first_words</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">isalpha</span><span class="p">()]</span>
<span class="n">first_word_dist</span> <span class="o">=</span> <span class="n">first_words_alpha</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">first_words_alpha</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Next, we need to build out the conditional distributions. That is, what is the probability of the next word given the current word is $X$? This one is a bit more involved. First, find all unique words, and then find which words follow them. This can probably be done in a more efficient manner than I'm currently doing here, but we'll ignore that for the moment.</p>
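<p>One possible way to sketch that conditional distribution, shown here as an illustration rather than the code this post actually runs: pair each word with the word after it, one tweet at a time, so the last word of one tweet is never linked to the first word of the next. The names <code>build_word_dist</code> and <code>sample_next_word</code> are hypothetical, and the input is assumed to be a list of already-tokenized tweets:</p>

```python
from collections import defaultdict
import random


def build_word_dist(tokenized_tweets):
    """Map each word to the list of words observed immediately after it.

    Pairs are formed within a single tweet only, so tweet boundaries
    never create a spurious word-to-word transition.
    """
    word_dist = defaultdict(list)
    for tokens in tokenized_tweets:
        # zip stops at the shorter sequence, so the final token has no follower.
        for current, following in zip(tokens, tokens[1:]):
            word_dist[current].append(following)
    return dict(word_dist)


def sample_next_word(current, word_dist, rng=random):
    """Draw a follower of `current`; frequent followers are drawn more often."""
    return rng.choice(word_dist[current])


# The example sentence from earlier, split on whitespace (hypothetical input).
tokens = "How much wood would a woodchuck chuck if a woodchuck could chuck wood".split()
dist = build_word_dist([tokens])
print(dist["woodchuck"])
print(sample_next_word("woodchuck", dist))
```

<p>Keeping repeated followers in the list means a uniform random choice from that list already reproduces the observed frequencies, so no explicit probabilities are needed.</p>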
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[3]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">functools</span> <span class="k">import</span> <span class="n">reduce</span>
<span class="c1"># Get all possible words</span>
<span class="n">all_words</span> <span class="o">=</span> <span class="n">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="n">x</span><span class="o">+</span><span class="n">y</span><span class="p">,</span> <span class="n">tokens</span><span class="p">,</span> <span class="p">[])</span>
<span class="n">unique_words</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">all_words</span><span class="p">)</span>
<span class="n">actual_words</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">unique_words</span> <span class="k">if</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">!=</span> <span class="s1">&#39;.&#39;</span><span class="p">)</span>
<span class="n">word_dist</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="nb">iter</span><span class="p">(</span><span class="n">actual_words</span><span class="p">):</span>
<span class="n">indices</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">all_words</span><span class="p">)</span> <span class="k">if</span> <span class="n">j</span> <span class="o">==</span> <span class="n">word</span><span class="p">]</span>
<span class="n">proceeding</span> <span class="o">=</span> <span class="p">[</span><span class="n">all_words</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">indices</span><span class="p">]</span>
<span class="n">word_dist</span><span class="p">[</span><span class="n">word</span><span class="p">]</span> <span class="o">=</span> <span class="n">proceeding</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now that we've got the tweet analysis done, it's time for the fun part: hashtags! Let's count how many hashtags are in each tweet; I want to get a sense of the distribution.</p>
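<p>For readers without pandas at hand, the counting step can be sketched in plain Python. This is an illustration under assumptions, not the notebook's code: <code>hashtag_count_dist</code> and <code>sample_hashtag_count</code> are hypothetical helpers, and the sample tweets are made-up placeholders:</p>

```python
from collections import Counter
import random


def hashtag_count_dist(tweets):
    """Return {number_of_hashtags: share_of_tweets} for a list of tweet strings."""
    counts = Counter(tweet.count("#") for tweet in tweets)
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}


def sample_hashtag_count(dist, rng=random):
    """Draw a hashtag count with probability equal to its observed share."""
    values = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(values, weights=weights, k=1)[0]


# Made-up placeholder tweets, not taken from the real dataset.
sample_tweets = [
    "Finished my homework. #winners",
    "Coffee first. #mornings",
    "No hashtags in this one.",
    "Long day. #tired #stilldone",
]
dist = hashtag_count_dist(sample_tweets)
print(dist)
print(sample_hashtag_count(dist))
```

<p>Sampling straight from the observed shares avoids having to decide up front whether a Poisson fit is appropriate; that question can be answered later by comparing the two.</p>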
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[4]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="o">%</span><span class="k">matplotlib</span> inline
<span class="n">hashtags</span> <span class="o">=</span> <span class="n">text_norep</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">&#39;#&#39;</span><span class="p">)</span>
<span class="n">bins</span> <span class="o">=</span> <span class="n">hashtags</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>
<span class="n">hashtags</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s1">&#39;hist&#39;</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="n">bins</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area"><div class="prompt output_prompt">Out[4]:</div>
<div class="output_text output_subarea output_execute_result">
<pre>&lt;matplotlib.axes._subplots.AxesSubplot at 0x18e59dc28d0&gt;</pre>
</div>
</div>
<div class="output_area"><div class="prompt"></div>
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAYkAAAEACAYAAABGYoqtAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAAEe1JREFUeJzt3X+s3XV9x/HnCzqRinadjt6NosA0CGYOUSsJM7tmG4pG
YFuGuGkEMmOCTheThZZsazXZBOOcbguJUWYqw7CCIpi5UQi7Li5KmYKixdpkFrHQC1MHogQB3/vj
fGsP9X7KObf33HNu7/ORnPT7/dzv95x3v/32vO7n8/2VqkKSpLkcNu4CJEmTy5CQJDUZEpKkJkNC
ktRkSEiSmgwJSVLTyEMiya4kX01ye5JtXdvqJFuT7EhyY5JVfctvSLIzyV1Jzhh1fZKktsXoSfwU
mK6ql1TVuq5tPXBzVZ0I3AJsAEhyMnAucBJwJnB5kixCjZKkOSxGSGSOzzkb2NxNbwbO6abPAq6u
qserahewE1iHJGksFiMkCrgpyW1J/qRrW1NVswBVtQc4ums/Brinb93dXZskaQxWLMJnnF5V9yX5
ZWBrkh30gqOf9waRpAk08pCoqvu6Px9I8hl6w0ezSdZU1WySKeD+bvHdwLF9q6/t2p4kiaEiSfNQ
VUMd5x3pcFOSlUmO6qafAZwB3AncAJzfLfYW4Ppu+gbgvCRPS3I88Hxg21zvXVW+qti4cePYa5iU
l9vCbeG2OPBrPkbdk1gDXNf95r8CuKqqtib5b2BLkguBu+md0URVbU+yBdgOPAZcVPP9m0mSDtpI
Q6Kqvg2cMkf794HfaazzPuB9o6xLkjQYr7he4qanp8ddwsRwW+zjttjHbXFwshRHc5I4CiVJQ0pC
TdKBa0nS0mZISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktS0rEJiauo4kgz1mpo6
btxlS9LYLKvbcvQelz3sepn3LXYlaZJ4Ww5J0oIyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKa
DAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQ
kCQ1GRKSpCZDQpLUZEhIkpoWJSSSHJbkK0lu6OZXJ9maZEeSG5Os6lt2Q5KdSe5KcsZi1CdJmtti
9STeBWzvm18P3FxVJwK3ABsAkpwMnAucBJwJXJ4ki1SjJGk/Iw+JJGuB1wIf62s+G9jcTW8Gzumm
zwKurqrHq2oXsBNYN+oaJUlzW4yexN8Bfw5UX9uaqpoFqKo9wNFd+zHAPX3L7e7aJEljsGKUb57k
dcBsVd2RZPoAi9YBfjanTZs2/Wx6enqa6ekDvb0kLT8zMzPMzMwc1Hukaujv58HfPPkb4E3A48CR
wDOB64CXAdNVNZtkCviPqjopyXqgquqybv1/BzZW1a37vW/Np+7e4Y1h1wuj3EaStFiSUFVDHecd
6XBTVV1SVc+tqhOA84BbqurNwGeB87vF3gJc303fAJyX5GlJjgeeD2wbZY2SpLaRDjcdwKXAliQX
AnfTO6OJqtqeZAu9M6EeAy6aV5dBkrQgRjrcNCoON0nS8CZuuEmStLQZEpKkJkNCktRkSEiSmgwJ
SVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAk
NY3r8aUH7b3vfe9Qy69cuXJElUjSoWvJPr4U/nKodY444goeffRefHyppOVqPo8vXcIhMVzdq1ad
xoMP3oohIWm58hnXkqQFZUhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKa
DAlJUpMhIUlqMiQkSU2GhCSpyZCQJDWNNCSSHJHk1iS3J7kzycaufXWSrUl2JLkxyaq+dTYk2Znk
riRnjLI+SdKBjTQkqupR4FVV9RLgFODMJOuA9cDNVXUicAuwASDJycC5wEnAmcDlSYZ6QIYkaeGM
fLipqn7cTR5B75naBZwNbO7aNwPndNNnAVdX1eNVtQvYCawbdY2SpLkNFBJJfn2+H5DksCS3A3uA
m6rqNmBNVc0CVNUe4Ohu8WOAe/pW3921SZLGYNCexOVJtiW5qP/4wSCq6qfdcNNaYF2SF/HzD5r2
IdKSNIFWDLJQVb0yyQuAC4EvJ9kGfLyqbhr0g6rqoSQzwGuA2SRrqmo2yRRwf7fYbuDYvtXWdm1z
2NQ3Pd29JEl7zczMMDMzc1DvkarBf4lPcji94wd/DzwEBLikqj7dWP45wGNV9WCSI4EbgUuB3wK+
X1WXJbkYWF1V67sD11cBr6A3zHQT8ILar8gkNWznY9Wq03jwwVsZvtMShtlGkjSpklBVQ50MNFBP
IsmLgQuA19H74n59VX0lya8CXwTmDAngV4DNSQ6jN7T1L1X1uSRfArYkuRC4m94ZTVTV9iRbgO3A
Y8BF+weEJGnxDNSTSPJ54GPAtVX1yH4/e3NVXTmi+lr12JOQpCGNrCdBrwfxSFU90X3QYcDTq+rH
ix0QkqTFM+jZTTcDR/bNr+zaJEmHsEFD4ulV9fDemW565WhKkiRNikFD4kdJTt07k+SlwCMHWF6S
dAgY9JjEnwHXJLmX3mmvU8AbRlaVJGkiDHox3W1JXgic2DXtqKrHRleWJGkSDNqTAHg5cFy3zqnd
qVSfGElVkqSJMOjFdFcCvwbcATzRNRdgSEjSIWzQnsTLgJO9+lmSlpdBz276Or2D1ZKkZWTQnsRz
gO3d3V8f3dtYVWeNpCpJ0kQYNCQ2jbIISdJkGvQU2M8neR6923bfnGQlcPhoS5Mkjdugjy99K3At
8JGu6RjgM6MqSpI0GQY9cP124HR6Dxqiqnay77nUkqRD1KAh8WhV/WTvTJIV+FxqSTrkDRoSn09y
CXBkkt8FrgE+O7qyJEmTYNCQWA88ANwJvA34HPAXoypKkjQZBnp86aTx8aWSNLyRPb40ybeZ49u1
qk4Y5sMkSUvLMPdu2uvpwB8Cv7Tw5UiSJslAxySq6nt9r91V9SHgdSOuTZI0ZoMON53aN3sYvZ7F
MM+ikCQtQYN+0f9t3/TjwC7g3AWvRpI0UQa9d9OrRl2IJGnyDDrc9O4D/byqPrgw5UiSJskwZze9
HLihm389sA3YOYqiJEmTYdCQWAucWlU/BEiyCfjXqnrTqAqTJI3foLflWAP8pG/+J12bJOkQNmhP
4hPAtiTXdfPnAJtHU5IkaVIMenbTXyf5N+CVXdMFVXX76MqSJE2CQYebAFYCD1XVh4HvJjl+RDVJ
kibEoI8v3QhcDGzomn4B+OdRFSVJmgyD9iR+DzgL+BFAVd0LPHNURUmSJsOgIfGT6j1UoQCSPGN0
JUmSJsWgIbElyUeAX0zyVuBm4KOjK0uSNAkGvVX4B4BrgU8BJwJ/VVX/8FTrJVmb5JYk30hyZ5J3
du2rk2xNsiPJjUlW9a2zIcnOJHclOWN+fy1J0kJ4yseXJjkcuHk+N/lLMgVMVdUdSY4CvgycDVwA
fK+q3p/kYmB1Va1PcjJwFb1bgKyl12N5Qe1XpI8vlaThzefxpU/Zk6iqJ4Cf9v+2P6iq2lNVd3TT
DwN30fvyP5t9F+NtpndxHvQOjl9dVY9X1S5694ZaN+znSpIWxqBXXD8M3JnkJroznACq6p2DflCS
44BTgC8Ba6pqtnuPPUmO7hY7Bvhi32q7uzZJ0hgMGhKf7l7z0g01XQu8q6oe7g0XPYnjOZI0gQ4Y
EkmeW1Xfqap536cpyQp6AXFlVV3fNc8mWVNVs91xi/u79t3AsX2rr+3a5rCpb3q6e0mS9pqZmWFm
Zuag3uOAB66TfKWqTu2mP1VVfzD0BySfAP63qt7d13YZ8P2quqxx4PoV9IaZbsID15K0IOZz4Pqp
hpv63+yEeRR0OvDH9I5n3E7vG/oS4DJ6115cCNxN97zsqtqeZAuwHXgMuGj/gJAkLZ6nColqTA+k
qv4LOLzx499prPM+4H3DfpYkaeE9VUj8RpKH6PUojuym6earqp410uokSWN1wJCoqlYvQJK0DAzz
PAlJ0jJjSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoy
JCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNC
ktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRJP6QiSDP2amjpu3IVL0kFbMe4CJt+jQA29
1uxsFr4USVpk9iQkSU0jDYkkVySZTfK1vrbVSbYm2ZHkxiSr+n62IcnOJHclOWOUtUmSntqoexIf
B169X9t64OaqOhG4BdgAkORk4FzgJOBM4PIkjtlI0hiNNCSq6gvAD/ZrPhvY3E1vBs7pps8Crq6q
x6tqF7ATWDfK+iRJBzaOYxJHV9UsQFXtAY7u2o8B7ulbbnfXJkkak0k4u2n4U4cA2NQ3Pd29JEl7
zczMMDMzc1DvMY6QmE2ypqpmk0wB93ftu4Fj+5Zb27U1bBpVfZJ0SJienmZ6evpn8+95z3uGfo/F
GG5K99rrBuD8bvotwPV97ecleVqS44HnA9sWoT5JUsNIexJJPklvHOjZSb4DbAQuBa5JciFwN70z
mqiq7Um2ANuBx4CLqmqeQ1GSpIWQpfg9nKSGPZSxatVpPPjgrQx/CCTzWKe33lLctpIOXUmoqqEu
LfCKa0lSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSp
yZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoM
CUlSkyExMkeQZKjX1NRx4y5akp5kxbgLOHQ9CtRQa8zOZjSlSNI82ZOQJDUZEpKkJkNCktRkSEiS
mgwJSVKTISFJajIkJElNhoQkqWkiQyLJa5J8M8m3klw87nokabmauJBIchjwj8CrgRcBb0zywvFW
NblmZmbGXcLEcFvs47bYx21xcCYuJIB1wM6quruqHgOuBs4ec00Ty/8A+7gt9nFb7OO2ODiTGBLH
APf0zX+3a9McPvCBDw19I0FvJihpUEv2Bn/Petbrh1r+kUe+OaJKFlLvzrHDG+5GguDNBKVRmJo6
jtnZu4daZ82a57Fnz67RFLQAUjX8F8woJTkN2FRVr+nm1wNVVZf1LTNZRUvSElFVQ/2GOIkhcTiw
A/ht4D5gG/DGqrprrIVJ0jI0ccNNVfVEkncAW+kdM7nCgJCk8Zi4noQkaXJM4tlNB+SFdvsk2ZXk
q0luT7Jt3PUspiRXJJlN8rW+ttVJtibZkeTGJKvGWeNiaWyLjUm+m+Qr3es146xxsSRZm+SWJN9I
cmeSd3bty2rfmGM7/GnXPvR+saR6Et2Fdt+id7ziXuA24LyqWgqnLi24JP8DvLSqfjDuWhZbkt8E
HgY+UVUv7touA75XVe/vfoFYXVXrx1nnYmhsi43AD6vqg2MtbpElmQKmquqOJEcBX6Z3ndUFLKN9
4wDb4Q0MuV8stZ6EF9o9WVh6/4YLoqq+AOwfjmcDm7vpzcA5i1rUmDS2BfT2j2WlqvZU1R3d9MPA
XcBaltm+0dgOe683G2q/WGpfMF5o92QF3JTktiRvHXcxE+DoqpqF3n8S4Ogx1zNu70hyR5KPHerD
K3NJchxwCvAlYM1y3Tf6tsOtXdNQ+8VSCwk92elVdSrwWuDt3bCD9lk6Y6kL73LghKo6BdgDLLdh
p6OAa4F3db9J778vLIt9Y47tMPR+sdRCYjfw3L75tV3bslRV93V/PgBcR284bjmbTbIGfjYme/+Y
6xmbqnqg9h1w/Cjw8nHWs5iSrKD3xXhlVV3fNS+7fWOu7TCf/WKphcRtwPOTPC/J04DzgBvGXNNY
JFnZ/ZZAkmcAZwBfH29Viy48eXz1BuD8bvotwPX7r3AIe9K26L4I9/p9lte+8U/A9qr6cF/bctw3
fm47zGe/WFJnN0HvFFjgw+y70O7SMZc0FkmOp9d7KHoXRV61nLZFkk8C08CzgVlgI/AZ4BrgWOBu
4Nyq+r9x1bhYGtviVfTGoX8K7ALetndM/lCW5HTgP4E76f3fKOASendu2MIy2TcOsB3+iCH3iyUX
EpKkxbPUhpskSYvIkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU3/DzepYDZSwMuQAAAA
AElFTkSuQmCC
"
>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>That looks like a Poisson distribution, roughly as I expected. I'm guessing my number of hashtags per tweet is $\sim Poi(1)$, but let's actually find the <a href="https://en.wikipedia.org/wiki/Poisson_distribution#Maximum_likelihood">maximum likelihood estimator</a>, which in this case is just the sample mean, $\hat{\lambda} = \bar{x}$:</p>
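<p>As a quick sanity check on why the mean is the right estimate, here is a minimal sketch showing that the Poisson log-likelihood peaks at the sample mean. It uses made-up counts rather than the real tweet data, and the <code>counts</code> array and <code>poisson_log_likelihood</code> helper are illustrative names that do not appear elsewhere in this notebook:</p>

```python
import math

import numpy as np


def poisson_log_likelihood(lam, counts):
    """Log-likelihood of i.i.d. Poisson(lam) observations."""
    counts = np.asarray(counts)
    # log(k!) computed via the log-gamma function: log(k!) = lgamma(k + 1)
    log_factorials = np.array([math.lgamma(k + 1) for k in counts])
    return float(np.sum(counts * np.log(lam) - lam - log_factorials))


# Hypothetical hashtag counts for ten tweets (not real data).
counts = np.array([0, 1, 0, 2, 1, 0, 1, 3, 0, 1])
sample_mean = counts.mean()

# Evaluate the log-likelihood on a grid of candidate rates (step 0.01).
grid = np.linspace(0.1, 3.0, 291)
log_likelihoods = [poisson_log_likelihood(lam, counts) for lam in grid]
best = grid[np.argmax(log_likelihoods)]

print(sample_mean, best)
```

<p>The grid search is only there to make the point visible; in practice calling <code>.mean()</code> on the counts is all that is needed.</p>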
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[5]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">mle</span> <span class="o">=</span> <span class="n">hashtags</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">mle</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area"><div class="prompt output_prompt">Out[5]:</div>
<div class="output_text output_subarea output_execute_result">
<pre>0.870236869207003</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Pretty close! So we can now simulate how many hashtags are in a tweet. Let's also find what hashtags are actually used:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[6]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">hashtags</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">all_words</span> <span class="k">if</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="s1">&#39;#&#39;</span><span class="p">]</span>
<span class="n">n_hashtags</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">hashtags</span><span class="p">)</span>
<span class="n">unique_hashtags</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">set</span><span class="p">([</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">unique_words</span> <span class="k">if</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="s1">&#39;#&#39;</span><span class="p">]))</span>
<span class="n">hashtag_dist</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">&#39;hashtags&#39;</span><span class="p">:</span> <span class="n">unique_hashtags</span><span class="p">,</span>
                             <span class="s1">&#39;prob&#39;</span><span class="p">:</span> <span class="p">[</span><span class="n">all_words</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="n">h</span><span class="p">)</span> <span class="o">/</span> <span class="n">n_hashtags</span>
                                      <span class="k">for</span> <span class="n">h</span> <span class="ow">in</span> <span class="n">unique_hashtags</span><span class="p">]})</span>
<span class="nb">len</span><span class="p">(</span><span class="n">hashtag_dist</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area"><div class="prompt output_prompt">Out[6]:</div>
<div class="output_text output_subarea output_execute_result">
<pre>603</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Turns out I have used 603 different hashtags during my time on Twitter. That means I was using a unique hashtag for about every third tweet.</p>
<p>In better news though, we now have all the data we need to go about actually constructing tweets! The process will happen in a few steps:</p>
<ol>
<li>Randomly select what the first word will be.</li>
<li>Randomly select the number of hashtags for this tweet, and then select the actual hashtags.</li>
<li>Fill in the remaining space of 140 characters with random words taken from my tweets.</li>
</ol>
<p>And hopefully, we won't have anything too crazy come out the other end. The way we do the selection follows a <a href="https://en.wikipedia.org/wiki/Multinomial_distribution">Multinomial Distribution</a>: given a set of values, each with its own probability, pick one. Let's give a quick example:</p>
<pre><code>x: .33
y: .5
z: .17</code></pre>
<p>That is, I pick <code>x</code> with probability 33%, <code>y</code> with probability 50%, and <code>z</code> with probability 17%. In the context of our sentence construction, I've built out the probabilities of specific words already - now I just need to simulate that distribution. Time to actually build the engine!</p>
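<p>To make the x/y/z example concrete, here is a small sketch of drawing from that distribution with NumPy. A single pick like this is a multinomial with one trial (also called a categorical distribution). The sketch uses <code>Generator.choice</code> instead of <code>np.random.multinomial</code> purely for brevity, and the names <code>values</code>, <code>probs</code>, <code>single</code>, and <code>draws</code> are illustrative assumptions:</p>

```python
import numpy as np

values = np.array(["x", "y", "z"])
probs = np.array([0.33, 0.50, 0.17])

# Seeded generator so the sketch is reproducible.
rng = np.random.default_rng(0)

# A single draw: one value picked according to probs.
single = rng.choice(values, p=probs)

# Many draws: the empirical frequencies should approach probs.
draws = rng.choice(values, size=10_000, p=probs)
freqs = {str(v): float(np.mean(draws == v)) for v in values}

print(single, freqs)
```

<p>With enough draws the frequencies settle near 33%, 50%, and 17%, which is all the simulation needs from this step.</p>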
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[7]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="k">def</span> <span class="nf">multinom_sim</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">vals</span><span class="p">,</span> <span class="n">probs</span><span class="p">):</span>
    <span class="n">occurrences</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">multinomial</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">probs</span><span class="p">)</span>
    <span class="n">results</span> <span class="o">=</span> <span class="n">occurrences</span> <span class="o">*</span> <span class="n">vals</span>
    <span class="k">return</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">results</span><span class="p">[</span><span class="n">results</span> <span class="o">!=</span> <span class="s1">&#39;&#39;</span><span class="p">])</span>

<span class="k">def</span> <span class="nf">sim_n_hashtags</span><span class="p">(</span><span class="n">hashtag_freq</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">poisson</span><span class="p">(</span><span class="n">hashtag_freq</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">sim_hashtags</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">hashtag_dist</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">multinom_sim</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">hashtag_dist</span><span class="o">.</span><span class="n">hashtags</span><span class="p">,</span> <span class="n">hashtag_dist</span><span class="o">.</span><span class="n">prob</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">sim_first_word</span><span class="p">(</span><span class="n">first_word_dist</span><span class="p">):</span>
    <span class="n">probs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">float64</span><span class="p">(</span><span class="n">first_word_dist</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">multinom_sim</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">first_word_dist</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()[</span><span class="s1">&#39;index&#39;</span><span class="p">],</span> <span class="n">probs</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">sim_next_word</span><span class="p">(</span><span class="n">current</span><span class="p">,</span> <span class="n">word_dist</span><span class="p">):</span>
    <span class="n">dist</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">word_dist</span><span class="p">[</span><span class="n">current</span><span class="p">])</span>
    <span class="n">probs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">dist</span><span class="p">))</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">dist</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">multinom_sim</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">dist</span><span class="p">,</span> <span class="n">probs</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Pulling-it-all-together">Pulling it all together<a class="anchor-link" href="#Pulling-it-all-together">&#182;</a></h2><p>I've now built out all the code I need to actually simulate a sentence written by me. Let's try doing an example with five words and a single hashtag:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[8]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">first</span> <span class="o">=</span> <span class="n">sim_first_word</span><span class="p">(</span><span class="n">first_word_dist</span><span class="p">)</span>
<span class="n">second</span> <span class="o">=</span> <span class="n">sim_next_word</span><span class="p">(</span><span class="n">first</span><span class="p">,</span> <span class="n">word_dist</span><span class="p">)</span>
<span class="n">third</span> <span class="o">=</span> <span class="n">sim_next_word</span><span class="p">(</span><span class="n">second</span><span class="p">,</span> <span class="n">word_dist</span><span class="p">)</span>
<span class="n">fourth</span> <span class="o">=</span> <span class="n">sim_next_word</span><span class="p">(</span><span class="n">third</span><span class="p">,</span> <span class="n">word_dist</span><span class="p">)</span>
<span class="n">fifth</span> <span class="o">=</span> <span class="n">sim_next_word</span><span class="p">(</span><span class="n">fourth</span><span class="p">,</span> <span class="n">word_dist</span><span class="p">)</span>
<span class="n">hashtag</span> <span class="o">=</span> <span class="n">sim_hashtags</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">hashtag_dist</span><span class="p">)</span>
<span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">((</span><span class="n">first</span><span class="p">,</span> <span class="n">second</span><span class="p">,</span> <span class="n">third</span><span class="p">,</span> <span class="n">fourth</span><span class="p">,</span> <span class="n">fifth</span><span class="p">,</span> <span class="n">hashtag</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area"><div class="prompt output_prompt">Out[8]:</div>
<div class="output_text output_subarea output_execute_result">
<pre>&apos;My first all-nighter of friends #oldschool&apos;</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's go ahead and put everything together! We're going to simulate a first word, simulate the hashtags, and then keep simulating words to fill the gap until we've either taken up all the space or reached a period or exclamation point.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[9]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">simulate_tweet</span><span class="p">():</span>
    <span class="n">chars_remaining</span> <span class="o">=</span> <span class="mi">140</span>
    <span class="n">first</span> <span class="o">=</span> <span class="n">sim_first_word</span><span class="p">(</span><span class="n">first_word_dist</span><span class="p">)</span>
    <span class="n">n_hash</span> <span class="o">=</span> <span class="n">sim_n_hashtags</span><span class="p">(</span><span class="n">mle</span><span class="p">)</span>
    <span class="n">hashtags</span> <span class="o">=</span> <span class="n">sim_hashtags</span><span class="p">(</span><span class="n">n_hash</span><span class="p">,</span> <span class="n">hashtag_dist</span><span class="p">)</span>
    <span class="n">chars_remaining</span> <span class="o">-=</span> <span class="nb">len</span><span class="p">(</span><span class="n">first</span><span class="p">)</span> <span class="o">+</span> <span class="nb">len</span><span class="p">(</span><span class="n">hashtags</span><span class="p">)</span>
    <span class="n">tweet</span> <span class="o">=</span> <span class="n">first</span>
    <span class="n">current</span> <span class="o">=</span> <span class="n">first</span>
    <span class="k">while</span> <span class="n">chars_remaining</span> <span class="o">&gt;</span> <span class="nb">len</span><span class="p">(</span><span class="n">tweet</span><span class="p">)</span> <span class="o">+</span> <span class="nb">len</span><span class="p">(</span><span class="n">hashtags</span><span class="p">)</span> <span class="ow">and</span> <span class="n">current</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">!=</span> <span class="s1">&#39;.&#39;</span> <span class="ow">and</span> <span class="n">current</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">!=</span> <span class="s1">&#39;!&#39;</span><span class="p">:</span>
        <span class="n">current</span> <span class="o">=</span> <span class="n">sim_next_word</span><span class="p">(</span><span class="n">current</span><span class="p">,</span> <span class="n">word_dist</span><span class="p">)</span>
        <span class="n">tweet</span> <span class="o">+=</span> <span class="s1">&#39; &#39;</span> <span class="o">+</span> <span class="n">current</span>
    <span class="n">tweet</span> <span class="o">=</span> <span class="n">tweet</span><span class="p">[:</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="n">tweet</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="k">return</span> <span class="s1">&#39; &#39;</span><span class="o">.</span><span class="n">join</span><span class="p">((</span><span class="n">tweet</span><span class="p">,</span> <span class="n">hashtags</span><span class="p">))</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="The-results">The results<a class="anchor-link" href="#The-results">&#182;</a></h2><p>And now for something completely different: twenty random tweets dreamed up by my computer and my Twitter data. Here you go:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In&nbsp;[12]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">20</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">simulate_tweet</span><span class="p">())</span>
    <span class="nb">print</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area"><div class="prompt"></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>Also , I&apos;m at 8 this morning. #thursdaysgohard #ornot
Turns out of us breathe the code will want to my undergraduate career is becoming more night trying ? Religion is now as a chane #HYPE
You know what recursion is to review the UNCC. #ornot
There are really sore 3 bonfires in my first writing the library ground floor if awesome. #realtalk #impressed
So we can make it out there&apos;s nothing but I&apos;m not let us so hot I could think I may be good. #SwingDance
Happy Christmas , at Harris Teeter to be be godly or Roman Catholic ). #4b392b#4b392b #Isaiah26
For context , I in the most decisive factor of the same for homework. #accomplishment
Freaking done. #loveyouall
New blog post : Don&apos;t jump in a quiz in with a knife fight. #haskell #earlybirthday
God shows me legitimately want to get some food and one day.
Stormed the queen city. #mindblown
The day of a cold at least outside right before the semester ..
Finished with the way back. #winners
Waking up , OJ , I feel like Nick Jonas today.
First draft of so hard drive. #humansvszombies
Eric Whitacre is the wise creation.
Ethics paper first , music in close to everyone who just be posting up with my sin , and Jerry Springr #TheLittleThings
Love that you know enough time I&apos;ve eaten at 8 PM. #deepthoughts #stillblownaway
Lead. #ThinkingTooMuch #Christmas
Aamazing conference when you married #DepartmentOfRedundancyDepartment Yep , but there&apos;s a legitimate challenge.
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>...Which all ended up being a whole lot more nonsensical than I had hoped for. There are some good ones, so I'll call that an accomplishment! I was banking on grammar not being an issue: since my tweets use impeccable grammar, the program modeled off them should have pretty good grammar as well. There are some hilarious edge cases that make no sense (I'm looking at you, <code>Ethics paper first, music in close to everyone</code>), and others (<code>Waking up, OJ, I feel like Nick Jonas today</code>) that make me feel like I should have a Twitter rap career. On the whole though, the structure came out alright.</p>
<h2 id="Moving-on-from-here">Moving on from here<a class="anchor-link" href="#Moving-on-from-here">&#182;</a></h2><p>During class we also talked about an interesting idea: trying to analyze corporate documents and corporate speech. I'd be interested to know what this analysis could do when applied to something like a couple of bank press releases. In any case, the code needs some cleanup before I get that far.</p>
<h2 id="For-further-reading">For further reading<a class="anchor-link" href="#For-further-reading">&#182;</a></h2><p>I'm pretty confident I re-invented a couple wheels along the way - what I'm doing feels a lot like what <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov Chain Monte Carlo</a> is intended to do. But I've never worked explicitly with that before, so more research is needed.</p>
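<p>As a starting point for that research: picking each word based only on the word before it is what defines a first-order Markov chain. Here is a minimal, self-contained sketch of that idea on a toy corpus. The <code>build_chain</code> and <code>generate</code> helpers and the sample sentences are hypothetical, and this is a simplified illustration rather than a drop-in replacement for the functions in this post:</p>

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, first, max_chars=140, rng=random):
    """Walk the chain from `first` until a dead end or the length cap."""
    words = [first]
    length = len(first)
    current = first
    while chain.get(current):
        following = rng.choice(chain[current])
        # +1 accounts for the space that joins the two words.
        if length + 1 + len(following) > max_chars:
            break
        words.append(following)
        length += 1 + len(following)
        current = following
    return " ".join(words)


# Toy corpus standing in for real tweets (made-up sentences).
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
chain = build_chain(corpus)
tweet = generate(chain, "the", max_chars=40, rng=random.Random(42))
print(tweet)
```

<p>Because repeated followers are kept in each list, words that follow more often are proportionally more likely to be picked, and the walk stops on its own when it reaches a word with no recorded follower.</p>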
</div>
</div>
</div></p>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\(','\)']]}});
</script>
<script async src='https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML'></script>
<div class="comments">
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_shortname = 'bradleespeice';
var disqus_identifier = 'tweet-like-me.html';
var disqus_url = 'https://bspeice.github.io/tweet-like-me.html';
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the comments.</noscript>
</div>
</div>
<!-- /Content -->
<!-- Footer -->
<div class="footer gradient-2">
<div class="container footer-container ">
<div class="row">
<div class="col-xs-4 col-sm-3 col-md-3 col-lg-3">
<div class="footer-title"></div>
<ul class="list-unstyled">
<li><a href="https://bspeice.github.io/feeds/all.atom.xml" type="application/atom+xml" rel="alternate"></a></li>
</ul>
</div>
<div class="col-xs-4 col-sm-3 col-md-3 col-lg-3">
<div class="footer-title"></div>
<ul class="list-unstyled">
<li><a href="https://github.com/bspeice" target="_blank">Github</a></li>
<li><a href="https://www.linkedin.com/in/bradleespeice" target="_blank">LinkedIn</a></li>
</ul>
</div>
<div class="col-xs-4 col-sm-3 col-md-3 col-lg-3">
</div>
<div class="col-xs-12 col-sm-3 col-md-3 col-lg-3">
<p class="pull-right text-right">
<small><em>Proudly powered by <a href="http://docs.getpelican.com/" target="_blank">pelican</a></em></small><br/>
<small><em>Theme and code by <a href="https://github.com/molivier" target="_blank">molivier</a></em></small><br/>
<small></small>
</p>
</div>
</div>
</div>
</div>
<!-- /Footer -->
</body>
</html>