<!doctype html><htmllang=endir=ltrclass="blog-wrapper blog-post-page plugin-blog plugin-id-default"data-has-hydrated=false><metacharset=UTF-8><metaname=generatorcontent="Docusaurus v3.7.0"><titledata-rh=true>Tweet like me | The Old Speice Guy</title><metadata-rh=truename=viewportcontent="width=device-width, initial-scale=1.0"><metadata-rh=truename=twitter:cardcontent=summary_large_image><metadata-rh=trueproperty=og:urlcontent=https://speice.io/2016/03/tweet-like-me><metadata-rh=trueproperty=og:localecontent=en><metadata-rh=truename=docusaurus_localecontent=en><metadata-rh=truename=docusaurus_tagcontent=default><metadata-rh=truename=docsearch:languagecontent=en><metadata-rh=truename=docsearch:docusaurus_tagcontent=default><metadata-rh=trueproperty=og:titlecontent="Tweet like me | The Old Speice Guy"><metadata-rh=truename=descriptioncontent="In which I try to create a robot that will tweet like I tweet."><metadata-rh=trueproperty=og:descriptioncontent="In which I try to create a robot that will tweet like I tweet."><metadata-rh=trueproperty=og:typecontent=article><metadata-rh=trueproperty=article:published_timecontent=2016-03-28T12:00:00.000Z><linkdata-rh=truerel=iconhref=/img/favicon.ico><linkdata-rh=truerel=canonicalhref=https://speice.io/2016/03/tweet-like-me><linkdata-rh=truerel=alternatehref=https://speice.io/2016/03/tweet-like-mehreflang=en><linkdata-rh=truerel=alternatehref=https://speice.io/2016/03/tweet-like-mehreflang=x-default><scriptdata-rh=truetype=application/ld+json>{"@context":"https://schema.org","@id":"https://speice.io/2016/03/tweet-like-me","@type":"BlogPosting","author":{"@type":"Person","name":"Bradlee Speice"},"dateModified":"2024-11-03T23:57:32.000Z","datePublished":"2016-03-28T12:00:00.000Z","description":"In which I try to create a robot that will tweet like I tweet.","headline":"Tweet like 
me","isPartOf":{"@id":"https://speice.io/","@type":"Blog","name":"Blog"},"keywords":[],"mainEntityOfPage":"https://speice.io/2016/03/tweet-like-me","name":"Tweet like me","url":"https://speice.io/2016/03/tweet-like-me"}</script><linkrel=alternatetype=application/rss+xmlhref=/rss.xmltitle="The Old Speice Guy RSS Feed"><linkrel=alternatetype=application/atom+xmlhref=/atom.xmltitle="The Old Speice Guy Atom Feed"><linkrel=stylesheethref=/katex/katex.min.csstype=text/css><linkrel=stylesheethref=/assets/css/styles.24ac2c37.css><scriptsrc=/assets/js/runtime~main.8ba92cdd.jsdefer></script><scriptsrc=/assets/js/main.a392e665.jsdefer></script><bodyclass=navigation-with-keyboard><script>!function(){vart,e=function(){try{returnnewURLSearchParams(window.location.search).get("docusaurus-theme")}catch(t){}}()||function(){try{returnwindow.localStorage.getItem("theme")}catch(t){}}();t=null!==e?e:"light",document.documentElement.setAttribute("data-theme",t)}(),function(){try{for(var[t,e]ofnewURLSearchParams(window.location.search).entries())if(t.startsWith("docusaurus-data-")){vara=t.replace("docusaurus-data-","data-");document.documentElement.setAttribute(a,e)}}catch(t){}}()</script><divid=__docusaurus><divrole=regionaria-label="Skip to main content"><aclass=skipToContent_fXgnhref=#__docusaurus_skipToContent_fallback>Skip to main content</a></div><navaria-label=Mainclass="navbar navbar--fixed-top"><divclass=navbar__inner><divclass=navbar__items><buttonaria-label="Toggle navigation bar"aria-expanded=falseclass="navbar__toggle clean-btn"type=button><svgwidth=30height=30viewBox="0 0 30 30"aria-hidden=true><pathstroke=currentColorstroke-linecap=roundstroke-miterlimit=10stroke-width=2d="M4 7h22M4 15h22M4 23h22"/></svg></button><aclass=navbar__brandhref=/><divclass=navbar__logo><imgsrc=/img/logo.svgalt="Sierpinski Gasket"class="themedComponent_mlkZ themedComponent--light_NVdE"><imgsrc=/img/logo-dark.svgalt="Sierpinski Gasket"class="themedComponent_mlkZ 
themedComponent--dark_xIcU"></div><bclass="navbar__title text--truncate">The Old Speice Guy</b></a></div><divclass="navbar__items navbar__items--right"><ahref=http
<p>So, I'm taking a Machine Learning course this semester in school, and one of the topics we keep coming back to is natural language processing and the 'bag of words' data structure. That is, given a sentence:</p>
<p><code>How much wood would a woodchuck chuck if a woodchuck could chuck wood?</code></p>
<p>We can represent that sentence as the following list:</p>
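<p>As a rough sketch of that representation (using only the standard library here rather than NLTK, and with a helper name, <code>bag_of_words</code>, that is my own for illustration), the counts could be built like this:</p>

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Count how often each word occurs, ignoring order, case, and punctuation."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return Counter(words)


sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
bag = bag_of_words(sentence)

# Each entry is (word, number of occurrences); word order is thrown away.
print(bag.most_common())
```

<p>Lowercasing and stripping punctuation are simplifying assumptions; a real tokenizer makes more careful choices.</p>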
<p>Ignoring <em>where</em> the words occurred, we're just interested in how <em>often</em> they occurred. That got me thinking: I wonder what would happen if I built a robot that just imitated how often I said things? It's dangerous territory when computer scientists ask "what if," but I got curious enough that I wanted to follow through.</p>
<h2class="anchor anchorWithStickyNavbar_LWe7"id=the-objective>The Objective<ahref=#the-objectiveclass=hash-linkaria-label="Direct link to The Objective"title="Direct link to The Objective"></a></h2>
<p>Given an input list of Tweets, build up the following things:</p>
<ol>
<li>The distribution of starting words; since there are no "prior" words to go from, we need to treat this as a special case.</li>
<li>The distribution of words given a previous word; for example, every time I use the word <code>woodchuck</code> in the example sentence, there is a 50% chance it is followed by <code>chuck</code> and a 50% chance it is followed by <code>could</code>. I need this distribution for all words.</li>
<li>The distribution of the number of hashtags; do I most often use just one? Two? Do they follow something like a Poisson distribution?</li>
<li>The distribution of hashtags; given a number of hashtags, what is the actual content? I'll treat hashtags as separate from the content of a tweet.</li>
</ol>
<h2class="anchor anchorWithStickyNavbar_LWe7"id=the-data>The Data<ahref=#the-dataclass=hash-linkaria-label="Direct link to The Data"title="Direct link to The Data"></a></h2>
<p>I'm using my tweet history as input. I don't really use Twitter anymore, but it seems like a fun use of the dataset. I'd like to eventually build this to a point where I can imitate anyone on Twitter using their last 100 tweets or so, but I'll start with this as example code.</p>
<h2class="anchor anchorWithStickyNavbar_LWe7"id=the-algorithm>The Algorithm<ahref=#the-algorithmclass=hash-linkaria-label="Direct link to The Algorithm"title="Direct link to The Algorithm"></a></h2>
<p>I'll be using the <ahref=http://www.nltk.org/target=_blankrel="noopener noreferrer">NLTK</a> library for doing a lot of the heavy lifting. First, let's import the data:</p>
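<p>Here is a minimal sketch of this step, assuming the tweets are already available as a plain list of strings. The sample tweets and the helper names <code>split_tweet</code> and <code>starting_word_distribution</code> are hypothetical, and whitespace splitting stands in for a proper NLTK tokenizer:</p>

```python
from collections import Counter

# Hypothetical stand-in for a real tweet history; swap in your own data.
tweets = [
    "Finished with the homework. #winners",
    "Finished the paper. #accomplishment #winners",
    "So tired this morning.",
]


def split_tweet(tweet):
    """Separate a tweet into its ordinary words and its hashtags."""
    tokens = tweet.split()
    words = [t for t in tokens if not t.startswith("#")]
    hashtags = [t for t in tokens if t.startswith("#")]
    return words, hashtags


def starting_word_distribution(tweets):
    """Map each first word to the fraction of tweets that start with it."""
    first_words = Counter()
    for tweet in tweets:
        words, _ = split_tweet(tweet)
        if words:  # skip tweets that contain only hashtags
            first_words[words[0]] += 1
    total = sum(first_words.values())
    return {word: count / total for word, count in first_words.items()}


start_dist = starting_word_distribution(tweets)
print(start_dist)
```

<p>Keeping hashtags out of the word list here matches the plan of treating them separately from the tweet content.</p>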
<p>Next, we need to build out the conditional distributions. That is, what is the probability of the next word given the current word is <spanclass=katex><spanclass=katex-mathml><mathxmlns=http://www.w3.org/1998/Math/MathML><semantics><mrow><mi>X</mi></mrow><annotationencoding=application/x-tex>X</annotation></semantics></math></span><spanclass=katex-htmlaria-hidden=true><spanclass=base><spanclass=strutstyle=height:0.6833em></span><spanclass="mord mathnormal"style=margin-right:0.07847em>X</span></span></span></span>? This one is a bit more involved. First, find all unique words, and then find which words follow them. This can probably be done in a more efficient manner than I'm currently doing here, but we'll ignore that for the moment.</p>
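<p>One way to sketch the conditional distributions is a single pass over adjacent word pairs, which is not necessarily how the code for this post does it. The function name <code>next_word_distribution</code> is made up for this example, and the woodchuck sentence serves as the test input:</p>

```python
from collections import Counter, defaultdict


def next_word_distribution(sentences):
    """For each word, estimate P(next word | current word) from pair counts."""
    followers = defaultdict(Counter)
    for words in sentences:
        # zip(words, words[1:]) yields every (current, following) pair in order.
        for current, following in zip(words, words[1:]):
            followers[current][following] += 1
    return {
        current: {w: c / sum(counts.values()) for w, c in counts.items()}
        for current, counts in followers.items()
    }


words = "how much wood would a woodchuck chuck if a woodchuck could chuck wood".split()
cond = next_word_distribution([words])

# "woodchuck" is followed once by "chuck" and once by "could".
print(cond["woodchuck"])
```

<p>A word that only ever appears at the end of a sentence gets no entry at all, so anything that consumes this table has to handle that case.</p>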
<p>Now that we've got the tweet analysis done, it's time for the fun part: hashtags! Let's count how many hashtags are in each tweet; I want to get a sense of the distribution.</p>
<p>That looks like a Poisson distribution, kind of as I expected. I'm guessing my number of hashtags per tweet is <spanclass=katex><spanclass=katex-mathml><mathxmlns=http://www.w3.org/1998/Math/MathML><semantics><mrow><mo>∼</mo><mi>P</mi><mi>o</mi><mi>i</mi><mostretchy=false>(</mo><mn>1</mn><mostretchy=false>)</mo></mrow><annotationencoding=application/x-tex>\sim Poi(1)</annotation></semantics></math></span><spanclass=katex-htmlaria-hidden=true><spanclass=base><spanclass=strutstyle=height:0.3669em></span><spanclass=mrel>∼</span><spanclass=mspacestyle=margin-right:0.2778em></span></span><spanclass=base><spanclass=strutstyle=height:1em;vertical-align:-0.25em></span><spanclass="mord mathnormal"style=margin-right:0.13889em>P</span><spanclass="mord mathnormal">o</span><spanclass="mord mathnormal">i</span><spanclass=mopen>(</span><spanclass=mord>1</span><spanclass=mclose>)</span></span></span></span>, but let's actually find the <ahref=https://en.wikipedia.org/wiki/Poisson_distribution#Maximum_likelihoodtarget=_blankrel="noopener noreferrer">maximum likelihood estimator</a>, which in this case is just the sample mean, <spanclass=katex><spanclass=katex-mathml><mathxmlns=http://www.w3.org/1998/Math/MathML><semantics><mrow><moveraccent=true><mi>λ</mi><mo>ˉ</mo></mover></mrow><annotationencoding=application/x-tex>\bar{\lambda}</annotation></semantics></math></span><spanclass=katex-htmlaria-hidden=true><spanclass=base><spanclass=strutstyle=height:0.8312em></span><spanclass="mord accent"><spanclass=vlist-t><spanclass=vlist-r><spanclass=vliststyle=height:0.8312em><spanstyle=top:-3em><spanclass=pstrutstyle=height:3em></span><spanclass="mord mathnormal">λ</span></span><spanstyle=top:-3.2634em><spanclass=pstrutstyle=height:3em></span><spanclass=accent-bodystyle=left:-0.25em><spanclass=mord>ˉ</span></span></span></span></span></span></span></span></span></span>:</p>
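<p>For a Poisson model the maximum likelihood estimate of the rate is the average of the observed counts, so a sketch of the calculation is short. The sample tweets and function names below are hypothetical and only illustrate the arithmetic:</p>

```python
import math

# Hypothetical tweets; only the number of hashtags in each one matters here.
tweets = [
    "Finished with the homework. #winners",
    "Finished the paper. #accomplishment #winners",
    "So tired this morning.",
]


def hashtag_counts(tweets):
    """Number of whitespace-separated tokens starting with '#' in each tweet."""
    return [sum(1 for t in tweet.split() if t.startswith("#")) for tweet in tweets]


def poisson_mle(counts):
    """Maximum likelihood estimate of the Poisson rate: the sample mean."""
    return sum(counts) / len(counts)


def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


counts = hashtag_counts(tweets)
lam = poisson_mle(counts)
print(counts, lam, poisson_pmf(0, lam))
```

<p>Comparing <code>poisson_pmf</code> against the observed fraction of tweets with 0, 1, 2, ... hashtags is a quick sanity check on whether the Poisson guess is reasonable.</p>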
<p>Turns out I have used 603 different hashtags during my time on Twitter. That means I was using a unique hashtag for about every third tweet.</p>
<p>In better news though, we now have all the data we need to go about actually constructing tweets! The process will happen in a few steps:</p>
<ol>
<li>Randomly select what the first word will be.</li>
<li>Randomly select the number of hashtags for this tweet, and then select the actual hashtags.</li>
<li>Fill in the remaining space of 140 characters with random words taken from my tweets.</li>
</ol>
<p>And hopefully, we won't have anything too crazy come out the other end. The way we do the selection follows a <ahref=https://en.wikipedia.org/wiki/Multinomial_distributiontarget=_blankrel="noopener noreferrer">Multinomial Distribution</a>: given a lot of different values with specific probability, pick one. Let's give a quick example:</p>
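<p>A minimal sketch of that selection follows. A single draw from a multinomial is a categorical draw, so the standard library's <code>random.choices</code> is enough. The value <code>z</code> and its 17% weight are an assumption added so the probabilities sum to one:</p>

```python
import random


def sample(distribution, rng=random):
    """Pick one key from a {value: probability} mapping, weighted by probability."""
    values = list(distribution)
    weights = [distribution[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


# "z" and its weight are assumed here to make the probabilities sum to 1.
probabilities = {"x": 0.33, "y": 0.50, "z": 0.17}

rng = random.Random(0)  # seeded so repeated runs draw the same sequence
draws = [sample(probabilities, rng) for _ in range(10_000)]
frequencies = {v: draws.count(v) / len(draws) for v in probabilities}

# With this many draws the observed frequencies land close to the weights.
print(frequencies)
```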
<p>That is, I pick <code>x</code> with probability 33%, <code>y</code> with probability 50%, and so on. In the context of our sentence construction, I've already built out the probabilities of specific words; now I just need to simulate that distribution. Time to actually build the engine!</p>
<h2class="anchor anchorWithStickyNavbar_LWe7"id=pulling-it-all-together>Pulling it all together<ahref=#pulling-it-all-togetherclass=hash-linkaria-label="Direct link to Pulling it all together"title="Direct link to Pulling it all together"></a></h2>
<p>I've now built out all the code I need to actually simulate a sentence written by me. Let's try doing an example with five words and a single hashtag:</p>
<p>Let's go ahead and put everything together! We're going to simulate a first word, simulate the hashtags, and then simulate to fill the gap until we've either taken up all the space or reached a period.</p>
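<p>The three steps can be sketched roughly as follows. Everything here is illustrative: the toy distributions, the function names, and the choice of Knuth's method for drawing the Poisson hashtag count are my assumptions, not the code behind the results below:</p>

```python
import math
import random


def sample(distribution, rng):
    """Pick one key from a {value: probability} mapping."""
    values = list(distribution)
    return rng.choices(values, weights=[distribution[v] for v in values], k=1)[0]


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def generate_tweet(start_dist, next_dist, hashtag_dist, lam, rng, limit=140):
    # 1. Pick the first word from the starting-word distribution.
    words = [sample(start_dist, rng)]
    # 2. Pick how many hashtags to use, then which ones (repeats are possible).
    hashtags = [sample(hashtag_dist, rng) for _ in range(sample_poisson(lam, rng))]
    suffix = "".join(" " + tag for tag in hashtags)
    # 3. Keep adding words until we hit a period, run out of known
    #    followers, or the next word would push us past the limit.
    while not words[-1].endswith(".") and words[-1] in next_dist:
        candidate = sample(next_dist[words[-1]], rng)
        if len(" ".join(words + [candidate])) + len(suffix) > limit:
            break
        words.append(candidate)
    return " ".join(words) + suffix


# Toy distributions standing in for the ones estimated from real tweets.
start_dist = {"Finished": 0.5, "So": 0.5}
next_dist = {
    "Finished": {"with": 0.5, "the": 0.5},
    "with": {"the": 1.0},
    "the": {"homework.": 0.5, "paper.": 0.5},
    "So": {"tired.": 1.0},
}
hashtag_dist = {"#winners": 0.5, "#accomplishment": 0.5}

print(generate_tweet(start_dist, next_dist, hashtag_dist, 1.0, random.Random(0)))
```

<p>Choosing the hashtags before filling in words means the word loop knows exactly how much of the 140 characters is left.</p>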
<h2class="anchor anchorWithStickyNavbar_LWe7"id=the-results>The results<ahref=#the-resultsclass=hash-linkaria-label="Direct link to The results"title="Direct link to The results"></a></h2>
<p>And now for something completely different: twenty random tweets dreamed up by my computer and my Twitter data. Here you go:</p>
<divclass="codeBlockContainer_Ckt0 theme-code-block"style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><divclass=codeBlockContent_biex><pretabindex=0class="prism-code language-text codeBlock_bY9V thin-scrollbar"style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><codeclass=codeBlockLines_e6Vv><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> Also , I'm at 8 this morning. #thursdaysgohard #ornot</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> Turns out of us breathe the code will want to my undergraduate career is becoming more night trying ? Religion is now as a chane #HYPE</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> You know what recursion is to review the UNCC. #ornot</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> There are really sore 3 bonfires in my first writing the library ground floor if awesome. #realtalk #impressed</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> So we can make it out there's nothing but I'm not let us so hot I could think I may be good. #SwingDance</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> Happy Christmas , at Harris Teeter to be be godly or Roman Catholic ). 
#4b392b#4b392b #Isaiah26</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> For context , I in the most decisive factor of the same for homework. #accomplishment</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> Freaking done. #loveyouall</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> New blog post : Don't jump in a quiz in with a knife fight. #haskell #earlybirthday</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> God shows me legitimately want to get some food and one day.</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> Stormed the queen city. #mindblown</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> The day of a cold at least outside right before the semester ..</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> Finished with the way back. 
#winners</span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"></span><br></span><spanclass=token-linestyle="color:hsl(230, 8%, 24%)"><spanclass="token plain"> Waking up , OJ , I feel like Nick Jonas today.</
<p>...Which all ended up being a whole lot more nonsensical than I had hoped for. There are some good ones, so I'll call that an accomplishment! I was banking on grammar not being an issue: since my tweets use impeccable grammar, the program modeled off them should have pretty good grammar as well. There are going to be some hilarious edge cases (I'm looking at you, <code>Ethics paper first, music in close to everyone</code>) that make no sense, and some hilarious edge cases (<code>Waking up, OJ, I feel like Nick Jonas today</code>) that make me feel like I should have a Twitter rap career. On the whole though, the structure came out alright.</p>
<h2class="anchor anchorWithStickyNavbar_LWe7"id=moving-on-from-here>Moving on from here<ahref=#moving-on-from-hereclass=hash-linkaria-label="Direct link to Moving on from here"title="Direct link to Moving on from here"></a></h2>
<p>During class we also talked about an interesting idea: trying to analyze corporate documents and corporate speech. I'd be interested to know what this analysis could do when applied to something like a couple of bank press releases. In any case, the code needs some work to clean it up before I get that far.</p>
<h2class="anchor anchorWithStickyNavbar_LWe7"id=for-further-reading>For further reading<ahref=#for-further-readingclass=hash-linkaria-label="Direct link to For further reading"title="Direct link to For further reading"></a></h2>