<!doctype html><html lang=en dir=ltr class="blog-wrapper blog-post-page plugin-blog plugin-id-default" data-has-hydrated=false><meta charset=UTF-8><meta name=generator content="Docusaurus v3.6.0"><title data-rh=true>Tweet like me | The Old Speice Guy</title><meta data-rh=true name=viewport content="width=device-width,initial-scale=1.0"><meta data-rh=true name=twitter:card content=summary_large_image><meta data-rh=true property=og:url content=https://speice.io/2016/03/tweet-like-me><meta data-rh=true property=og:locale content=en><meta data-rh=true name=docusaurus_locale content=en><meta data-rh=true name=docusaurus_tag content=default><meta data-rh=true name=docsearch:language content=en><meta data-rh=true name=docsearch:docusaurus_tag content=default><meta data-rh=true property=og:title content="Tweet like me | The Old Speice Guy"><meta data-rh=true name=description content="In which I try to create a robot that will tweet like I tweet."><meta data-rh=true property=og:description content="In which I try to create a robot that will tweet like I tweet."><meta data-rh=true property=og:type content=article><meta data-rh=true property=article:published_time content=2016-03-28T12:00:00.000Z><link data-rh=true rel=icon href=/img/favicon.ico><link data-rh=true rel=canonical href=https://speice.io/2016/03/tweet-like-me><link data-rh=true rel=alternate href=https://speice.io/2016/03/tweet-like-me hreflang=en><link data-rh=true rel=alternate href=https://speice.io/2016/03/tweet-like-me hreflang=x-default><script data-rh=true type=application/ld+json>{"@context":"https://schema.org","@id":"https://speice.io/2016/03/tweet-like-me","@type":"BlogPosting","author":{"@type":"Person","name":"Bradlee Speice"},"dateModified":"2024-11-03T23:57:32.000Z","datePublished":"2016-03-28T12:00:00.000Z","description":"In which I try to create a robot that will tweet like I tweet.","headline":"Tweet like 
me","isPartOf":{"@id":"https://speice.io/","@type":"Blog","name":"Blog"},"keywords":[],"mainEntityOfPage":"https://speice.io/2016/03/tweet-like-me","name":"Tweet like me","url":"https://speice.io/2016/03/tweet-like-me"}</script><link rel=alternate type=application/rss+xml href=/rss.xml title="The Old Speice Guy RSS Feed"><link rel=alternate type=application/atom+xml href=/atom.xml title="The Old Speice Guy Atom Feed"><link rel=stylesheet href=https://cdn.jsdelivr.net/npm/katex@0.13.24/dist/katex.min.css integrity=sha384-odtC+0UGzzFL/6PNoE8rX/SPcQDXBJ+uRepguP4QkPCm2LBxH3FA3y+fKSiJ+AmM crossorigin><link rel=stylesheet href=/assets/css/styles.ae6ff4a3.css><script src=/assets/js/runtime~main.751b419d.js defer></script><script src=/assets/js/main.62ce6156.js defer></script><body class=navigation-with-keyboard><script>!function(){var t,e=function(){try{return new URLSearchParams(window.location.search).get("docusaurus-theme")}catch(t){}}()||function(){try{return window.localStorage.getItem("theme")}catch(t){}}();t=null!==e?e:"light",document.documentElement.setAttribute("data-theme",t)}(),function(){try{for(var[t,e]of new URLSearchParams(window.location.search).entries())if(t.startsWith("docusaurus-data-")){var a=t.replace("docusaurus-data-","data-");document.documentElement.setAttribute(a,e)}}catch(t){}}()</script><div id=__docusaurus><div role=region aria-label="Skip to main content"><a class=skipToContent_fXgn href=#__docusaurus_skipToContent_fallback>Skip to main content</a></div><nav aria-label=Main class="navbar navbar--fixed-top"><div class=navbar__inner><div class=navbar__items><button aria-label="Toggle navigation bar" aria-expanded=false class="navbar__toggle clean-btn" type=button><svg width=30 height=30 viewBox="0 0 30 30" aria-hidden=true><path stroke=currentColor stroke-linecap=round stroke-miterlimit=10 stroke-width=2 d="M4 7h22M4 15h22M4 23h22"/></svg></button><a class=navbar__brand href=/><div class=navbar__logo><img src=/img/logo.svg alt="Sierpinski 
Gasket" class="themedComponent_mlkZ themedComponent--light_NVdE"><img src=/img/logo-dark.svg alt="Sierpinski Gasket" class="themedComponent_mlkZ themedComponent--dark_xIcU"></div><b class="navbar__title text--truncate">The Old Speice Guy</b></a></div><div class="navbar__items navbar__items--right"><a href=https://github.com/bspeice target=_blank rel="noopener noreferrer" class="navbar__item navbar__link header-github-link"></a><div class="toggle_vylO colorModeToggle_DEke"><button class="clean-btn toggleButton_gllP toggleButtonDisabled_aARS" type=button disabled title="Switch between dark and light mode (currently light mode)" aria-label="Switch between dark and light mode (currently light mode)" aria-live=polite aria-pressed=false><svg viewBox="0 0 24 24" width=24 height=24 class=lightToggleIcon_pyhR><path fill=currentColor d="M12,9c1.65,0,3,1.35,3,3s-1.35,3-3,3s-3-1.35-3-3S10.35,9,12,9 M12,7c-2.76,0-5,2.24-5,5s2.24,5,5,5s5-2.24,5-5 S14.76,7,12,7L12,7z M2,13l2,0c0.55,0,1-0.45,1-1s-0.45-1-1-1l-2,0c-0.55,0-1,0.45-1,1S1.45,13,2,13z M20,13l2,0c0.55,0,1-0.45,1-1 s-0.45-1-1-1l-2,0c-0.55,0-1,0.45-1,1S19.45,13,20,13z M11,2v2c0,0.55,0.45,1,1,1s1-0.45,1-1V2c0-0.55-0.45-1-1-1S11,1.45,11,2z M11,20v2c0,0.55,0.45,1,1,1s1-0.45,1-1v-2c0-0.55-0.45-1-1-1C11.45,19,11,19.45,11,20z M5.99,4.58c-0.39-0.39-1.03-0.39-1.41,0 c-0.39,0.39-0.39,1.03,0,1.41l1.06,1.06c0.39,0.39,1.03,0.39,1.41,0s0.39-1.03,0-1.41L5.99,4.58z M18.36,16.95 c-0.39-0.39-1.03-0.39-1.41,0c-0.39,0.39-0.39,1.03,0,1.41l1.06,1.06c0.39,0.39,1.03,0.39,1.41,0c0.39-0.39,0.39-1.03,0-1.41 L18.36,16.95z M19.42,5.99c0.39-0.39,0.39-1.03,0-1.41c-0.39-0.39-1.03-0.39-1.41,0l-1.06,1.06c-0.39,0.39-0.39,1.03,0,1.41 s1.03,0.39,1.41,0L19.42,5.99z M7.05,18.36c0.39-0.39,0.39-1.03,0-1.41c-0.39-0.39-1.03-0.39-1.41,0l-1.06,1.06 c-0.39,0.39-0.39,1.03,0,1.41s1.03,0.39,1.41,0L7.05,18.36z"/></svg><svg viewBox="0 0 24 24" width=24 height=24 class=darkToggleIcon_wfgR><path fill=currentColor 
d="M9.37,5.51C9.19,6.15,9.1,6.82,9.1,7.5c0,4.08,3.32,7.4,7.4,7.4c0.68,0,1.35-0.09,1.99-0.27C17.45,17.19,14.93,19,12,19 c-3.86,0-7-3.14-7-7C5,9.07,6.81,6.55,9.37,5.51z M12,3c-4.97,0-9,4.03-9,9s4.03,9,9,9s9-4.03,9-9c0-0.46-0.04-0.92-0.1-1.36 c-0.98,1.37-2.58,2.26-4.4,2.26c-2.98,0-5.4-2.42-5.4-5.4c0-1.81,0.89-3.42,2.26-4.4C12.92,3.04,12.46,3,12,3L12,3z"/></svg></button></div><div class=navbarSearchContainer_Bca1><div class=navbar__search><span aria-label="expand searchbar" role=button class=search-icon tabindex=0></span><input id=search_input_react type=search placeholder=Loading... aria-label=Search class="navbar__search-input search-bar" disabled></div></div></div></div><div role=presentation class=navbar-sidebar__backdrop></div></nav><div id=__docusaurus_skipToContent_fallback class="main-wrapper mainWrapper_z2l0"><div class="container margin-vert--lg"><div class=row><aside class="col col--3"><nav class="sidebar_re4s thin-scrollbar" aria-label="Blog recent posts navigation"><div class="sidebarItemTitle_pO2u margin-bottom--md">All posts</div><div role=group><h3>2022</h3><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2011/11/webpack-industrial-complex>The webpack industrial complex</a></ul></div><div role=group><h3>2019</h3><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2019/12/release-the-gil>Release the GIL</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2019/09/binary-format-shootout>Binary format shootout</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2019/06/high-performance-systems>On building high performance systems</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2019/05/making-bread>Making bread</a></ul><div role=group><h4>Allocations in 
Rust</h4><ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2019/02/understanding-allocations-in-rust>Foreword</a><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2019/02/the-whole-world>Global memory</a><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2019/02/stacking-up>Fixed memory</a><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2019/02/a-heaping-helping>Dynamic memory</a><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2019/02/08/compiler-optimizations>Compiler optimizations</a><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2019/02/summary>Summary</a></ul></ul></div></div><div role=group><h3>2018</h3><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2018/12/allocation-safety>QADAPT - debug_assert! for allocations</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2018/12/what-small-business-really-means>More "what companies really mean"</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2018/10/case-study-optimization>A case study in heaptrack</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2018/09/isomorphic-apps>Isomorphic desktop apps with Rust</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2018/09/primitives-in-rust-are-weird>Primitives in Rust are weird (and cool)</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2018/06/dateutil-parser-to-rust>What I learned porting dateutil to Rust</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2018/05/hello>Hello!</a></ul><div 
role=group><h4>Captain's Cookbook</h4><ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2018/01/captains-cookbook-part-1>Project setup</a><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2018/01/captains-cookbook-part-2>Practical usage</a></ul></ul></div></div><div role=group><h3>2016</h3><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2016/11/pca-audio-compression>PCA audio compression</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2016/10/rustic-repodcasting>A Rustic re-podcasting server</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2016/06/event-studies-and-earnings-releases>Event studies and earnings releases</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2016/05/the-unfair-casino>The unfair casino</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2016/04/tick-tock>Tick tock...</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a aria-current=page class="sidebarItemLink_mo7H sidebarItemLinkActive_I1ZP" href=/2016/03/tweet-like-me>Tweet like me</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2016/03/predicting-santander-customer-happiness>Predicting Santander customer happiness</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2016/02/profitability-using-the-investment-formula>Profitability using the investment formula</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2016/02/guaranteed-money-maker>Guaranteed money maker</a></ul><ul 
class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2016/01/cloudy-in-seattle>Cloudy in Seattle</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2016/01/complaining-about-the-weather>Complaining about the weather</a></ul></div><div role=group><h3>2015</h3><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2015/12/testing-cramer>Testing Cramer</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2015/11/autocallable>Autocallable Bonds</a></ul><ul class="sidebarItemList_Yudw clean-list"><li class=sidebarItem__DBe><a class=sidebarItemLink_mo7H href=/2015/11/welcome>Welcome, and an algorithm</a></ul></div></nav></aside><main class="col col--7"><article><header><h1 class=title_f1Hy>Tweet like me</h1><div class="container_mt6G margin-vert--md"><time datetime=2016-03-28T12:00:00.000Z>March 28, 2016</time> · <!-- -->9 min read</div><div class="margin-top--md margin-bottom--sm row"><div class="col col--12 authorCol_Hf19"><div class="avatar margin-bottom--sm"><div class="avatar__intro authorDetails_lV9A"><div class=avatar__name><span class=authorName_yefp>Bradlee Speice</span></div><div class=authorSocials_rSDt><a href=https://github.com/bspeice target=_blank rel="noopener noreferrer" class=authorSocialLink_owbf title=GitHub><svg viewBox="0 0 256 250" width=1em height=1em class="authorSocialLink_owbf githubSvg_Uu4N" style=--dark:#000;--light:#fff preserveAspectRatio=xMidYMid><path d="M128.001 0C57.317 0 0 57.307 0 128.001c0 56.554 36.676 104.535 87.535 121.46 6.397 1.185 8.746-2.777 8.746-6.158 0-3.052-.12-13.135-.174-23.83-35.61 7.742-43.124-15.103-43.124-15.103-5.823-14.795-14.213-18.73-14.213-18.73-11.613-7.944.876-7.78.876-7.78 12.853.902 19.621 13.19 19.621 13.19 11.417 19.568 29.945 13.911 37.249 10.64 1.149-8.272 4.466-13.92 
8.127-17.116-28.431-3.236-58.318-14.212-58.318-63.258 0-13.975 5-25.394 13.188-34.358-1.329-3.224-5.71-16.242 1.24-33.874 0 0 10.749-3.44 35.21 13.121 10.21-2.836 21.16-4.258 32.038-4.307 10.878.049 21.837 1.47 32.066 4.307 24.431-16.56 35.165-13.12 35.165-13.12 6.967 17.63 2.584 30.65 1.255 33.873 8.207 8.964 13.173 20.383 13.173 34.358 0 49.163-29.944 59.988-58.447 63.157 4.591 3.972 8.682 11.762 8.682 23.704 0 17.126-.148 30.91-.148 35.126 0 3.407 2.304 7.398 8.792 6.14C219.37 232.5 256 184.537 256 128.002 256 57.307 198.691 0 128.001 0Zm-80.06 182.34c-.282.636-1.283.827-2.194.39-.929-.417-1.45-1.284-1.15-1.922.276-.655 1.279-.838 2.205-.399.93.418 1.46 1.293 1.139 1.931Zm6.296 5.618c-.61.566-1.804.303-2.614-.591-.837-.892-.994-2.086-.375-2.66.63-.566 1.787-.301 2.626.591.838.903 1 2.088.363 2.66Zm4.32 7.188c-.785.545-2.067.034-2.86-1.104-.784-1.138-.784-2.503.017-3.05.795-.547 2.058-.055 2.861 1.075.782 1.157.782 2.522-.019 3.08Zm7.304 8.325c-.701.774-2.196.566-3.29-.49-1.119-1.032-1.43-2.496-.726-3.27.71-.776 2.213-.558 3.315.49 1.11 1.03 1.45 2.505.701 3.27Zm9.442 2.81c-.31 1.003-1.75 1.459-3.199 1.033-1.448-.439-2.395-1.613-2.103-2.626.301-1.01 1.747-1.484 3.207-1.028 1.446.436 2.396 1.602 2.095 2.622Zm10.744 1.193c.036 1.055-1.193 1.93-2.715 1.95-1.53.034-2.769-.82-2.786-1.86 0-1.065 1.202-1.932 2.733-1.958 1.522-.03 2.768.818 2.768 1.868Zm10.555-.405c.182 1.03-.875 2.088-2.387 2.37-1.485.271-2.861-.365-3.05-1.386-.184-1.056.893-2.114 2.376-2.387 1.514-.263 2.868.356 3.061 1.403Z"/></svg></a></div></div></div></div></div></header><div id=__blog-post-container class=markdown><p>In which I try to create a robot that will tweet like I tweet.</p>
<p>So, I'm taking a Machine Learning course this semester in school, and one of the topics we keep coming back to is natural language processing and the 'bag of words' data structure. That is, given a sentence:</p>
<p><code>How much wood would a woodchuck chuck if a woodchuck could chuck wood?</code></p>
<p>We can represent that sentence as the following set of word counts:</p>
<p><code>{ How: 1 much: 1 wood: 2 would: 1 a: 2 woodchuck: 2 chuck: 2 if: 1 could: 1 }</code></p>
<p>Ignoring <em>where</em> the words happened, we're just interested in how <em>often</em> the words occurred. That got me thinking: I wonder what would happen if I built a robot that just imitated how often I said things? It's dangerous territory when computer scientists ask "what if," but I got curious enough that I wanted to follow through.</p>
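<p>As a quick illustration of the bag-of-words idea, here is a minimal sketch using only the standard library. The whitespace split and the stripped question mark are simplifying assumptions for this one sentence, not the tokenization used later in the post.</p>

```python
from collections import Counter

sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

# Naive tokenization: drop the trailing '?' and split on whitespace.
words = sentence.rstrip("?").split()

# Counter maps each distinct word to the number of times it occurs,
# discarding the order in which the words appeared.
bag_of_words = Counter(words)

print(bag_of_words)
```

<p>Note that the counts are case-sensitive here, so <code>How</code> and <code>how</code> would be treated as different words.</p>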
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=the-objective>The Objective<a href=#the-objective class=hash-link aria-label="Direct link to The Objective" title="Direct link to The Objective"></a></h2>
<p>Given an input list of Tweets, build up the following things:</p>
<ol>
<li>The distribution of starting words; since there are no "prior" words to go from, we need to treat this as a special case.</li>
<li>The distribution of words given a previous word; for example, every time I use the word <code>woodchuck</code> in the example sentence, there is a 50% chance it is followed by <code>chuck</code> and a 50% chance it is followed by <code>could</code>. I need this distribution for all words.</li>
<li>The distribution of the number of hashtags; do I most often use just one? Two? Do they follow something like a Poisson distribution?</li>
<li>The distribution of hashtags; given a number of hashtags, what is the actual content? I'll treat hashtags as separate from the content of a tweet.</li>
</ol>
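<p>To make the second item concrete, here is a small sketch that computes the next-word distribution for the woodchuck sentence. The helper name <code>next_word_distribution</code> and the whitespace tokenization are assumptions made for illustration; the real code further down works on tokenized tweets instead.</p>

```python
from collections import Counter, defaultdict

def next_word_distribution(tokens):
    """Map each word to the relative frequency of the words that follow it."""
    counts = defaultdict(Counter)
    # Pair every token with its successor; the final token has no successor.
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }

tokens = "How much wood would a woodchuck chuck if a woodchuck could chuck wood".split()
dist = next_word_distribution(tokens)

print(dist["woodchuck"])
```

<p>For <code>woodchuck</code> this gives an even split between <code>chuck</code> and <code>could</code>, matching the 50/50 example in the list.</p>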
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=the-data>The Data<a href=#the-data class=hash-link aria-label="Direct link to The Data" title="Direct link to The Data"></a></h2>
<p>I'm using my tweet history as input. I don't really use Twitter anymore, but it seems like a fun use of the dataset. I'd like to eventually build this to a point where I can imitate anyone on Twitter using their last 100 tweets or so, but I'll start with this as example code.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=the-algorithm>The Algorithm<a href=#the-algorithm class=hash-link aria-label="Direct link to The Algorithm" title="Direct link to The Algorithm"></a></h2>
<p>I'll be using the <a href=http://www.nltk.org/ target=_blank rel="noopener noreferrer">NLTK</a> library for doing a lot of the heavy lifting. First, let's import the data:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token keyword" style="color:hsl(301, 63%, 40%)">import</span><span class="token plain"> pandas </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">as</span><span class="token plain"> pd</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">tweets </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> pd</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">read_csv</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token string" style="color:hsl(119, 34%, 47%)">'tweets.csv'</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">text </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> tweets</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">text</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token comment" style="color:hsl(230, 4%, 64%)"># Don't include tweets in reply to or 
mentioning people</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">replies </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> text</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token builtin" style="color:hsl(119, 34%, 47%)">str</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">contains</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token string" style="color:hsl(119, 34%, 47%)">'@'</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">text_norep </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> text</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">loc</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token operator" style="color:hsl(221, 87%, 60%)">~</span><span class="token plain">replies</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><br></span></code></pre><div class=buttonGroup__atx><button type=button aria-label="Copy code to clipboard" title=Copy class=clean-btn><span class=copyButtonIcons_eSgA aria-hidden=true><svg viewBox="0 0 24 24" class=copyButtonIcon_y97N><path fill=currentColor d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"/></svg><svg viewBox="0 0 24 24" class=copyButtonSuccessIcon_LjdS><path fill=currentColor d=M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z /></svg></span></button></div></div></div>
<p>And now that we've got data, let's start crunching. First, tokenize the tweets and build out the distribution of first words:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token keyword" style="color:hsl(301, 63%, 40%)">from</span><span class="token plain"> nltk</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">tokenize </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">import</span><span class="token plain"> TweetTokenizer</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">tknzr </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> TweetTokenizer</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">tokens </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> text_norep</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token builtin" style="color:hsl(119, 34%, 47%)">map</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">tknzr</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">tokenize</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" 
style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">first_words </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> tokens</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token builtin" style="color:hsl(119, 34%, 47%)">map</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token keyword" style="color:hsl(301, 63%, 40%)">lambda</span><span class="token plain"> x</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"> x</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token number" style="color:hsl(35, 99%, 36%)">0</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">first_words_alpha </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> first_words</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token plain">first_words</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token builtin" style="color:hsl(119, 34%, 47%)">str</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">isalpha</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">first_word_dist </span><span class="token 
operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> first_words_alpha</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">value_counts</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">/</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">len</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">first_words_alpha</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><br></span></code></pre><div class=buttonGroup__atx><button type=button aria-label="Copy code to clipboard" title=Copy class=clean-btn><span class=copyButtonIcons_eSgA aria-hidden=true><svg viewBox="0 0 24 24" class=copyButtonIcon_y97N><path fill=currentColor d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"/></svg><svg viewBox="0 0 24 24" class=copyButtonSuccessIcon_LjdS><path fill=currentColor d=M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z /></svg></span></button></div></div></div>
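<p>Once a first-word distribution exists, choosing a starting word is just a weighted random draw. The sketch below shows one way that could look using only the standard library; the <code>sample_tokens</code> data and the <code>sample_first_word</code> helper are hypothetical stand-ins, not part of the original code.</p>

```python
import random
from collections import Counter

# Hypothetical, already-tokenized tweets standing in for the real tokens.
sample_tokens = [
    ["Hello", "world"],
    ["Hello", "again"],
    ["Testing", "the", "robot"],
    ["42", "is", "the", "answer"],
]

# Keep only alphabetic first tokens, mirroring the isalpha() filter above.
first_words = [t[0] for t in sample_tokens if t and t[0].isalpha()]
counts = Counter(first_words)
total = sum(counts.values())
first_word_dist = {word: n / total for word, n in counts.items()}

def sample_first_word(dist, rng=random):
    """Draw one word, weighted by its probability in dist."""
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

print(first_word_dist)
print(sample_first_word(first_word_dist, random.Random(0)))
```

<p>Passing a seeded <code>random.Random</code> instance keeps the draw reproducible, which is handy when comparing generated tweets between runs.</p>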
<p>Next, we need to build out the conditional distributions. That is, what is the probability of the next word given the current word is <span class=katex><span class=katex-mathml><math><semantics><mrow><mi>X</mi></mrow><annotation encoding=application/x-tex>X</annotation></semantics></math></span><span class=katex-html aria-hidden=true><span class=base><span class=strut style=height:0.6833em></span><span class="mord mathnormal" style=margin-right:0.07847em>X</span></span></span></span>? This one is a bit more involved. First, find all unique words, and then find which words follow them. This can probably be done more efficiently than I'm doing here, but we'll ignore that for the moment.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token keyword" style="color:hsl(301, 63%, 40%)">from</span><span class="token plain"> functools </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">import</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">reduce</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token comment" style="color:hsl(230, 4%, 64%)"># Get all possible words</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">all_words </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">reduce</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token keyword" style="color:hsl(301, 63%, 40%)">lambda</span><span class="token plain"> x</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> y</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"> x</span><span class="token operator" style="color:hsl(221, 87%, 60%)">+</span><span class="token plain">y</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> 
tokens</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">unique_words </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">set</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">all_words</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">actual_words </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">set</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token plain">x </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">if</span><span class="token plain"> x</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token number" style="color:hsl(35, 99%, 36%)">0</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">!=</span><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">'.'</span><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">else</span><span class="token 
plain"> </span><span class="token boolean" style="color:hsl(35, 99%, 36%)">None</span><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">for</span><span class="token plain"> x </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">in</span><span class="token plain"> unique_words</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">word_dist </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">{</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">}</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token keyword" style="color:hsl(301, 63%, 40%)">for</span><span class="token plain"> word </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">in</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">iter</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">actual_words</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> indices </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(119, 34%, 
47%)">[</span><span class="token plain">i </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">for</span><span class="token plain"> i</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> j </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">in</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">enumerate</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">all_words</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">if</span><span class="token plain"> j </span><span class="token operator" style="color:hsl(221, 87%, 60%)">==</span><span class="token plain"> word</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> proceeding </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token plain">all_words</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token plain">i</span><span class="token operator" style="color:hsl(221, 87%, 60%)">+</span><span class="token number" style="color:hsl(35, 99%, 36%)">1</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">for</span><span class="token plain"> i </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">in</span><span class="token plain"> indices</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token 
plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token plain">word</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> proceeding</span><br></span></code></pre></div></div>
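<p>The loop above rescans <code>all_words</code> once for every unique word, and <code>all_words[i+1]</code> would raise an <code>IndexError</code> if the very last token were one of the words being looked up. A minimal single-pass sketch of the same idea, assuming <code>all_words</code> is a flat list of string tokens; <code>build_successors</code> and the toy data are hypothetical names used only for illustration:</p>

```python
from collections import defaultdict


def build_successors(words):
    """Map each word to the list of words that directly follow it."""
    successors = defaultdict(list)
    # zip pairs every token with its neighbour, so the final token
    # (which has no successor) never causes an out-of-range lookup.
    for current, following in zip(words, words[1:]):
        if current.startswith('.'):
            continue  # tokens starting with '.' are filtered out, as in the post
        successors[current].append(following)
    return dict(successors)


# Hypothetical toy token list standing in for all_words
toy_words = ['i', 'like', 'tea', '.', 'i', 'like', 'code', '.']
toy_dist = build_successors(toy_words)
```

<p>One difference worth noting: a word that only ever appears as the final token gets no entry here, rather than an error.</p>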
<p>Now that we've got the tweet analysis done, it's time for the fun part: hashtags! Let's count how many hashtags are in each tweet to get a sense of the distribution.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token keyword" style="color:hsl(301, 63%, 40%)">import</span><span class="token plain"> matplotlib</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">pyplot </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">as</span><span class="token plain"> plt</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token operator" style="color:hsl(221, 87%, 60%)">%</span><span class="token plain">matplotlib inline</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">hashtags </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> text_norep</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token builtin" style="color:hsl(119, 34%, 47%)">str</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">count</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token string" style="color:hsl(119, 34%, 47%)">'#'</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">bins </span><span class="token operator" style="color:hsl(221, 87%, 
60%)">=</span><span class="token plain"> hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">unique</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token builtin" style="color:hsl(119, 34%, 47%)">max</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">plot</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">kind</span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token string" style="color:hsl(119, 34%, 47%)">'hist'</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> bins</span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain">bins</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><br></span></code></pre></div></div>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> &lt;matplotlib.axes._subplots.AxesSubplot at 0x18e59dc28d0></span><br></span></code></pre></div></div>
<p><img decoding=async loading=lazy alt=png src=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAYkAAAEACAYAAABGYoqtAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAEe1JREFUeJzt3X+s3XV9x/HnCzqRinadjt6NosA0CGYOUSsJM7tmG4pGYFuGuGkEMmOCTheThZZsazXZBOOcbguJUWYqw7CCIpi5UQi7Li5KmYKixdpkFrHQC1MHogQB3/vjfGsP9X7KObf33HNu7/ORnPT7/dzv95x3v/32vO7n8/2VqkKSpLkcNu4CJEmTy5CQJDUZEpKkJkNCktRkSEiSmgwJSVLTyEMiya4kX01ye5JtXdvqJFuT7EhyY5JVfctvSLIzyV1Jzhh1fZKktsXoSfwUmK6ql1TVuq5tPXBzVZ0I3AJsAEhyMnAucBJwJnB5kixCjZKkOSxGSGSOzzkb2NxNbwbO6abPAq6uqserahewE1iHJGksFiMkCrgpyW1J/qRrW1NVswBVtQc4ums/Brinb93dXZskaQxWLMJnnF5V9yX5ZWBrkh30gqOf9waRpAk08pCoqvu6Px9I8hl6w0ezSdZU1WySKeD+bvHdwLF9q6/t2p4kiaEiSfNQVUMd5x3pcFOSlUmO6qafAZwB3AncAJzfLfYW4Ppu+gbgvCRPS3I88Hxg21zvXVW+qti4cePYa5iUl9vCbeG2OPBrPkbdk1gDXNf95r8CuKqqtib5b2BLkguBu+md0URVbU+yBdgOPAZcVPP9m0mSDtpIQ6Kqvg2cMkf794HfaazzPuB9o6xLkjQYr7he4qanp8ddwsRwW+zjttjHbXFwshRHc5I4CiVJQ0pCTdKBa0nS0mZISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktS0rEJiauo4kgz1mpo6btxlS9LYLKvbcvQelz3sepn3LXYlaZJ4Ww5J0oIyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoWJSSSHJbkK0lu6OZXJ9maZEeSG5Os6lt2Q5KdSe5KcsZi1CdJmtti9STeBWzvm18P3FxVJwK3ABsAkpwMnAucBJwJXJ4ki1SjJGk/Iw+JJGuB1wIf62s+G9jcTW8GzummzwKurqrHq2oXsBNYN+oaJUlzW4yexN8Bfw5UX9uaqpoFqKo9wNFd+zHAPX3L7e7aJEljsGKUb57kdcBsVd2RZPoAi9YBfjanTZs2/Wx6enqa6ekDvb0kLT8zMzPMzMwc1Hukaujv58HfPPkb4E3A48CRwDOB64CXAdNVNZtkCviPqjopyXqgquqybv1/BzZW1a37vW/Np+7e4Y1h1wuj3EaStFiSUFVDHecd6XBTVV1SVc+tqhOA84BbqurNwGeB87vF3gJc303fAJyX5GlJjgeeD2wbZY2SpLaRDjcdwKXAliQXAnfTO6OJqtqeZAu9M6EeAy6aV5dBkrQgRjrcNCoON0nS8CZuuEmStLQZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNY3r8aUH7b3vfe9Qy69cuXJElUjSoWvJPr4U/nKodY444goeffRefHyppOVqPo8vXcIhMVzdq1adxoMP3oohIWm58hnXkqQFZUhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDWNNCSSHJHk1iS3J7kzycaufXWSrUl2JLkxyaq+dTYk2ZnkriRnjLI+SdKBjTQkqupR4FVV9RLgFO
DMJOuA9cDNVXUicAuwASDJycC5wEnAmcDlSYZ6QIYkaeGMfLipqn7cTR5B75naBZwNbO7aNwPndNNnAVdX1eNVtQvYCawbdY2SpLkNFBJJfn2+H5DksCS3A3uAm6rqNmBNVc0CVNUe4Ohu8WOAe/pW3921SZLGYNCexOVJtiW5qP/4wSCq6qfdcNNaYF2SF/HzD5r2IdKSNIFWDLJQVb0yyQuAC4EvJ9kGfLyqbhr0g6rqoSQzwGuA2SRrqmo2yRRwf7fYbuDYvtXWdm1z2NQ3Pd29JEl7zczMMDMzc1DvkarBf4lPcji94wd/DzwEBLikqj7dWP45wGNV9WCSI4EbgUuB3wK+X1WXJbkYWF1V67sD11cBr6A3zHQT8ILar8gkNWznY9Wq03jwwVsZvtMShtlGkjSpklBVQ50MNFBPIsmLgQuA19H74n59VX0lya8CXwTmDAngV4DNSQ6jN7T1L1X1uSRfArYkuRC4m94ZTVTV9iRbgO3AY8BF+weEJGnxDNSTSPJ54GPAtVX1yH4/e3NVXTmi+lr12JOQpCGNrCdBrwfxSFU90X3QYcDTq+rHix0QkqTFM+jZTTcDR/bNr+zaJEmHsEFD4ulV9fDemW565WhKkiRNikFD4kdJTt07k+SlwCMHWF6SdAgY9JjEnwHXJLmX3mmvU8AbRlaVJGkiDHox3W1JXgic2DXtqKrHRleWJGkSDNqTAHg5cFy3zqndqVSfGElVkqSJMOjFdFcCvwbcATzRNRdgSEjSIWzQnsTLgJO9+lmSlpdBz276Or2D1ZKkZWTQnsRzgO3d3V8f3dtYVWeNpCpJ0kQYNCQ2jbIISdJkGvQU2M8neR6923bfnGQlcPhoS5Mkjdugjy99K3At8JGu6RjgM6MqSpI0GQY9cP124HR6Dxqiqnay77nUkqRD1KAh8WhV/WTvTJIV+FxqSTrkDRoSn09yCXBkkt8FrgE+O7qyJEmTYNCQWA88ANwJvA34HPAXoypKkjQZBnp86aTx8aWSNLyRPb40ybeZ49u1qk4Y5sMkSUvLMPdu2uvpwB8Cv7Tw5UiSJslAxySq6nt9r91V9SHgdSOuTZI0ZoMON53aN3sYvZ7FMM+ikCQtQYN+0f9t3/TjwC7g3AWvRpI0UQa9d9OrRl2IJGnyDDrc9O4D/byqPrgw5UiSJskwZze9HLihm389sA3YOYqiJEmTYdCQWAucWlU/BEiyCfjXqnrTqAqTJI3foLflWAP8pG/+J12bJOkQNmhP4hPAtiTXdfPnAJtHU5IkaVIMenbTXyf5N+CVXdMFVXX76MqSJE2CQYebAFYCD1XVh4HvJjl+RDVJkibEoI8v3QhcDGzomn4B+OdRFSVJmgyD9iR+DzgL+BFAVd0LPHNURUmSJsOgIfGT6j1UoQCSPGN0JUmSJsWgIbElyUeAX0zyVuBm4KOjK0uSNAkGvVX4B4BrgU8BJwJ/VVX/8FTrJVmb5JYk30hyZ5J3du2rk2xNsiPJjUlW9a2zIcnOJHclOWN+fy1J0kJ4yseXJjkcuHk+N/lLMgVMVdUdSY4CvgycDVwAfK+q3p/kYmB1Va1PcjJwFb1bgKyl12N5Qe1XpI8vlaThzefxpU/Zk6iqJ4Cf9v+2P6iq2lNVd3TTDwN30fvyP5t9F+NtpndxHvQOjl9dVY9X1S5694ZaN+znSpIWxqBXXD8M3JnkJroznACq6p2DflCS44BTgC8Ba6pqtnuPPUmO7hY7Bvhi32q7uzZJ0hgMGhKf7l7z0g01XQu8q6oe7g0XPYnjOZI0gQ4YEkmeW1Xfqap536cpyQp6AXFlVV3fNc8mWVNVs91xi/u79t3AsX2rr+3a5rCpb3q6e0mS9pqZmWFmZuag3uOAB66TfKWqTu2mP1VVfzD0BySfAP63qt7d13YZ8P2quqxx4PoV9IaZbsID15K0IOZz4Pqphpv63+yEeRR0OvDH9I5n3E7vG/oS4DJ6115cCNxN97zsqtqeZAuwHX
gMuGj/gJAkLZ6nColqTA+kqv4LOLzx499prPM+4H3DfpYkaeE9VUj8RpKH6PUojuym6earqp410uokSWN1wJCoqlYvQJK0DAzzPAlJ0jJjSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRJP6QiSDP2amjpu3IVL0kFbMe4CJt+jQA291uxsFr4USVpk9iQkSU0jDYkkVySZTfK1vrbVSbYm2ZHkxiSr+n62IcnOJHclOWOUtUmSntqoexIfB169X9t64OaqOhG4BdgAkORk4FzgJOBM4PIkjtlI0hiNNCSq6gvAD/ZrPhvY3E1vBs7pps8Crq6qx6tqF7ATWDfK+iRJBzaOYxJHV9UsQFXtAY7u2o8B7ulbbnfXJkkak0k4u2n4U4cA2NQ3Pd29JEl7zczMMDMzc1DvMY6QmE2ypqpmk0wB93ftu4Fj+5Zb27U1bBpVfZJ0SJienmZ6evpn8+95z3uGfo/FGG5K99rrBuD8bvotwPV97ecleVqS44HnA9sWoT5JUsNIexJJPklvHOjZSb4DbAQuBa5JciFwN70zmqiq7Um2ANuBx4CLqmqeQ1GSpIWQpfg9nKSGPZSxatVpPPjgrQx/CCTzWKe33lLctpIOXUmoqqEuLfCKa0lSkyEhSWoyJCRJTYaEJKnJkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU2GhCSpyZCQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqcmQkCQ1GRKSpCZDQpLUZEhIkpoMCUlSkyExMkeQZKjX1NRx4y5akp5kxbgLOHQ9CtRQa8zOZjSlSNI82ZOQJDUZEpKkJkNCktRkSEiSmgwJSVKTISFJajIkJElNhoQkqWkiQyLJa5J8M8m3klw87nokabmauJBIchjwj8CrgRcBb0zywvFWNblmZmbGXcLEcFvs47bYx21xcCYuJIB1wM6quruqHgOuBs4ec00Ty/8A+7gt9nFb7OO2ODiTGBLHAPf0zX+3a9McPvCBDw19I0FvJihpUEv2Bn/Petbrh1r+kUe+OaJKFlLvzrHDG+5GguDNBKVRmJo6jtnZu4daZ82a57Fnz67RFLQAUjX8F8woJTkN2FRVr+nm1wNVVZf1LTNZRUvSElFVQ/2GOIkhcTiwA/ht4D5gG/DGqrprrIVJ0jI0ccNNVfVEkncAW+kdM7nCgJCk8Zi4noQkaXJM4tlNB+SFdvsk2ZXkq0luT7Jt3PUspiRXJJlN8rW+ttVJtibZkeTGJKvGWeNiaWyLjUm+m+Qr3es146xxsSRZm+SWJN9IcmeSd3bty2rfmGM7/GnXPvR+saR6Et2Fdt+id7ziXuA24LyqWgqnLi24JP8DvLSqfjDuWhZbkt8EHgY+UVUv7touA75XVe/vfoFYXVXrx1nnYmhsi43AD6vqg2MtbpElmQKmquqOJEcBX6Z3ndUFLKN94wDb4Q0MuV8stZ6EF9o9WVh6/4YLoqq+AOwfjmcDm7vpzcA5i1rUmDS2BfT2j2WlqvZU1R3d9MPAXcBaltm+0dgOe683G2q/WGpfMF5o92QF3JTktiRvHXcxE+DoqpqF3n8S4Ogx1zNu70hyR5KPHerDK3NJchxwCvAlYM1y3Tf6tsOtXdNQ+8VSCwk92elVdSrwWuDt3bCD9lk6Y6kL73LghKo6BdgDLLdhp6OAa4F3db9J778vLIt9Y47tMPR+sdRCYjfw3L75tV3bslRV93V/PgBcR284bjmbTbIGfjYme/+Y6xmbqnqg9h1w/Cjw8nHWs5iSrKD3xXhlVV3fNS+7fWOu7TCf/WKphcRtwPOTPC/J04DzgBvGXNNYJF
nZ/ZZAkmcAZwBfH29Viy48eXz1BuD8bvotwPX7r3AIe9K26L4I9/p9lte+8U/A9qr6cF/bctw3fm47zGe/WFJnN0HvFFjgw+y70O7SMZc0FkmOp9d7KHoXRV61nLZFkk8C08CzgVlgI/AZ4BrgWOBu4Nyq+r9x1bhYGtviVfTGoX8K7ALetndM/lCW5HTgP4E76f3fKOASendu2MIy2TcOsB3+iCH3iyUXEpKkxbPUhpskSYvIkJAkNRkSkqQmQ0KS1GRISJKaDAlJUpMhIUlqMiQkSU3/DzepYDZSwMuQAAAAAElFTkSuQmCC width=393 height=256 class=img_ev3q></p>
<p>That looks like a Poisson distribution, kind of as I expected. I'm guessing my number of hashtags per tweet is <span class=katex><span class=katex-mathml><math><semantics><mrow><mo>∼</mo><mi>P</mi><mi>o</mi><mi>i</mi><mo stretchy=false>(</mo><mn>1</mn><mo stretchy=false>)</mo></mrow><annotation encoding=application/x-tex>\sim Poi(1)</annotation></semantics></math></span><span class=katex-html aria-hidden=true><span class=base><span class=strut style=height:0.3669em></span><span class=mrel>∼</span><span class=mspace style=margin-right:0.2778em></span></span><span class=base><span class=strut style=height:1em;vertical-align:-0.25em></span><span class="mord mathnormal" style=margin-right:0.13889em>P</span><span class="mord mathnormal">o</span><span class="mord mathnormal">i</span><span class=mopen>(</span><span class=mord>1</span><span class=mclose>)</span></span></span></span>, but let's actually find the <a href=https://en.wikipedia.org/wiki/Poisson_distribution#Maximum_likelihood target=_blank rel="noopener noreferrer">maximum likelihood estimator</a>, which in this case is just the sample mean <span class=katex><span class=katex-mathml><math><semantics><mrow><mover accent=true><mi>λ</mi><mo>ˉ</mo></mover></mrow><annotation encoding=application/x-tex>\bar{\lambda}</annotation></semantics></math></span><span class=katex-html aria-hidden=true><span class=base><span class=strut style=height:0.8312em></span><span class="mord accent"><span class=vlist-t><span class=vlist-r><span class=vlist style=height:0.8312em><span style=top:-3em><span class=pstrut style=height:3em></span><span class="mord mathnormal">λ</span></span><span style=top:-3.2634em><span class=pstrut style=height:3em></span><span class=accent-body style=left:-0.25em><span class=mord>ˉ</span></span></span></span></span></span></span></span></span></span>:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">mle </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">mean</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">mle</span><br></span></code></pre></div></div>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> 0.870236869207003</span><br></span></code></pre></div></div>
<p>Pretty close! So we can now simulate how many hashtags are in a tweet. Let's also find which hashtags are actually used:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">hashtags </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token plain">x </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">for</span><span class="token plain"> x </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">in</span><span class="token plain"> all_words </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">if</span><span class="token plain"> x</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token number" style="color:hsl(35, 99%, 36%)">0</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">==</span><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">'#'</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">n_hashtags </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">len</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 
47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">unique_hashtags </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">list</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token builtin" style="color:hsl(119, 34%, 47%)">set</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token plain">x </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">for</span><span class="token plain"> x </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">in</span><span class="token plain"> unique_words </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">if</span><span class="token plain"> x</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token number" style="color:hsl(35, 99%, 36%)">0</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">==</span><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">'#'</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">hashtag_dist </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token 
plain"> pd</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">DataFrame</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">{</span><span class="token string" style="color:hsl(119, 34%, 47%)">'hashtags'</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"> unique_hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">'prob'</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token plain">all_words</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">count</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">h</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">/</span><span class="token plain"> n_hashtags</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">for</span><span class="token plain"> h </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">in</span><span class="token plain"> unique_hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">}</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token 
plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token builtin" style="color:hsl(119, 34%, 47%)">len</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">hashtag_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><br></span></code></pre></div></div>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> 603</span><br></span></code></pre></div></div>
<p>Turns out I have used 603 different hashtags during my time on Twitter. That means I was using a unique hashtag for about every third tweet.</p>
<p>In better news though, we now have all the data we need to go about actually constructing tweets! The process will happen in a few steps:</p>
<ol>
<li>Randomly select what the first word will be.</li>
<li>Randomly select the number of hashtags for this tweet, and then select the actual hashtags.</li>
<li>Fill in the remaining space of 140 characters with random words taken from my tweets.</li>
</ol>
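<p>The three steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions: every word list and probability below is made up, <code>toy_tweet</code> is a hypothetical helper, and the filler words are drawn uniformly for brevity rather than from the distributions built out of the real tweets. The only number taken from the post is the hashtag rate of roughly 0.87.</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical toy inputs; the real ones come from the tweet archive.
first_words = ['just', 'today', 'why']
first_probs = [0.5, 0.3, 0.2]
tags = ['#python', '#stats', '#coffee']
tag_probs = [0.6, 0.3, 0.1]
filler = ['is', 'the', 'best', 'thing', 'ever']
hashtag_rate = 0.87  # roughly the mean hashtags per tweet found above


def toy_tweet(limit=140):
    # Step 1: pick the opening word according to its probability.
    words = [str(rng.choice(first_words, p=first_probs))]
    # Step 2: draw how many hashtags to use, then which ones.
    n_tags = min(int(rng.poisson(hashtag_rate)), len(tags))
    chosen = []
    if n_tags:
        picks = rng.choice(tags, size=n_tags, replace=False, p=tag_probs)
        chosen = [str(t) for t in picks]
    # Step 3: keep adding filler words while the tweet still fits.
    # Reserve room for the hashtags plus one separating space.
    budget = limit - len(' '.join(chosen)) - 1
    while True:
        candidate = words + [str(rng.choice(filler))]
        if len(' '.join(candidate)) > budget:
            break
        words = candidate
    return ' '.join(words + chosen)


tweet = toy_tweet()
```

<p>Reserving the hashtag space before filling in words is a design choice of this sketch; it guarantees the result stays within the character limit no matter how many hashtags were drawn.</p>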
<p>And hopefully, we won't have anything too crazy come out the other end. The way we do the selection follows a <a href=https://en.wikipedia.org/wiki/Multinomial_distribution target=_blank rel="noopener noreferrer">Multinomial Distribution</a>: given a set of values, each with its own probability, pick one. (Strictly speaking, a single pick like this is a multinomial with just one trial.) Let's give a quick example:</p>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">x: .33</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">y: .5</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">z: .17</span><br></span></code></pre></div></div>
<p>That is, I pick <code>x</code> with probability 33%, <code>y</code> with probability 50%, and <code>z</code> with probability 17%. In the context of our sentence construction, I've already built out the probabilities of specific words; now I just need to simulate that distribution. Time for the engine to actually be developed!</p>
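<p>Before the full engine, a single pick from the <code>x</code>/<code>y</code>/<code>z</code> example can be sketched on its own. This assumes NumPy's <code>Generator</code> interface; the seed, the variable names, and the trial count are arbitrary choices made for illustration and are not part of the engine itself.</p>

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for repeatability

vals = np.array(['x', 'y', 'z'])
probs = [0.33, 0.5, 0.17]

# One multinomial trial yields a one-hot count vector such as [0, 1, 0];
# the position of the 1 marks the value that was picked.
counts = rng.multinomial(1, probs)
picked = vals[counts.argmax()]

# Many trials at once: the counts divided by the number of trials
# should land close to the probabilities we asked for.
n_trials = 10000
freqs = rng.multinomial(n_trials, probs) / n_trials
```

<p>Comparing <code>freqs</code> against <code>probs</code> is a quick sanity check that the simulated picks really follow the intended distribution.</p>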
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token keyword" style="color:hsl(301, 63%, 40%)">import</span><span class="token plain"> numpy </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">as</span><span class="token plain"> np</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token keyword" style="color:hsl(301, 63%, 40%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(221, 87%, 60%)">multinom_sim</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">n</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> vals</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> probs</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> occurrences </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> np</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">random</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token 
plain">multinomial</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">n</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> probs</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> results </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> occurrences </span><span class="token operator" style="color:hsl(221, 87%, 60%)">*</span><span class="token plain"> vals</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">return</span><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">' '</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">join</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">results</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token plain">results </span><span class="token operator" style="color:hsl(221, 87%, 60%)">!=</span><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">''</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token keyword" style="color:hsl(301, 63%, 40%)">def</span><span class="token plain"> </span><span class="token 
function" style="color:hsl(221, 87%, 60%)">sim_n_hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">hashtag_freq</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">return</span><span class="token plain"> np</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">random</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">poisson</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">hashtag_freq</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token keyword" style="color:hsl(301, 63%, 40%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(221, 87%, 60%)">sim_hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">n</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> hashtag_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 
40%)">return</span><span class="token plain"> multinom_sim</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">n</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> hashtag_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> hashtag_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">prob</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token keyword" style="color:hsl(301, 63%, 40%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(221, 87%, 60%)">sim_first_word</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">first_word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> probs </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> np</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">float64</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">first_word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token 
plain">values</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">return</span><span class="token plain"> multinom_sim</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token number" style="color:hsl(35, 99%, 36%)">1</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> first_word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">reset_index</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token string" style="color:hsl(119, 34%, 47%)">'index'</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> probs</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token keyword" style="color:hsl(301, 63%, 40%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(221, 87%, 60%)">sim_next_word</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">current</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> word_dist</span><span class="token punctuation" 
style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> dist </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> pd</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">Series</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token plain">current</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> probs </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> np</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">ones</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token builtin" style="color:hsl(119, 34%, 47%)">len</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">/</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">len</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span 
class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">return</span><span class="token plain"> multinom_sim</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token number" style="color:hsl(35, 99%, 36%)">1</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> probs</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><br></span></code></pre><div class=buttonGroup__atx><button type=button aria-label="Copy code to clipboard" title=Copy class=clean-btn><span class=copyButtonIcons_eSgA aria-hidden=true><svg viewBox="0 0 24 24" class=copyButtonIcon_y97N><path fill=currentColor d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"/></svg><svg viewBox="0 0 24 24" class=copyButtonSuccessIcon_LjdS><path fill=currentColor d=M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z /></svg></span></button></div></div></div>
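<p>As a side note, <code>multinom_sim</code> multiplies each count by its string, so a value drawn more than once comes back repeated with no separator; that seems to be where output such as <code>#4b392b#4b392b</code> comes from. Here is a minimal sketch of an alternative that draws the tokens directly. The name <code>sample_tokens</code> and its <code>rng</code> parameter are my own for illustration and are not part of the code above.</p>

```python
import numpy as np

def sample_tokens(n, vals, probs, rng=None):
    """Draw n tokens from vals with the given probabilities, joined by spaces.

    Hypothetical helper: an alternative to multinom_sim, not the original code.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Generator.choice samples with replacement by default, so a token can
    # repeat, but every draw stays a separate element of the result.
    draws = rng.choice(np.asarray(vals, dtype=object), size=n, p=probs)
    return ' '.join(draws)

# Example: two draws from a toy hashtag distribution.
print(sample_tokens(2, ['#ornot', '#HYPE', '#winners'], [0.2, 0.5, 0.3]))
```

<p>Passing a seeded generator, for example <code>np.random.default_rng(0)</code>, makes the draws repeatable, which is handy when comparing runs.</p>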
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=pulling-it-all-together>Pulling it all together<a href=#pulling-it-all-together class=hash-link aria-label="Direct link to Pulling it all together" title="Direct link to Pulling it all together"></a></h2>
<p>I've now built out all the code I need to actually simulate a sentence written by me. Let's try an example with five words and a single hashtag:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">first </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> sim_first_word</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">first_word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">second </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> sim_next_word</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">first</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">third </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> sim_next_word</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">second</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span 
class="token plain">fourth </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> sim_next_word</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">third</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">fifth </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> sim_next_word</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">fourth</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain">hashtag </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> sim_hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token number" style="color:hsl(35, 99%, 36%)">1</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> hashtag_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain" style=display:inline-block></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token string" style="color:hsl(119, 34%, 47%)">' '</span><span class="token punctuation" style="color:hsl(119, 34%, 
47%)">.</span><span class="token plain">join</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">first</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> second</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> third</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> fourth</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> fifth</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> hashtag</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><br></span></code></pre><div class=buttonGroup__atx><button type=button aria-label="Copy code to clipboard" title=Copy class=clean-btn><span class=copyButtonIcons_eSgA aria-hidden=true><svg viewBox="0 0 24 24" class=copyButtonIcon_y97N><path fill=currentColor d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"/></svg><svg viewBox="0 0 24 24" class=copyButtonSuccessIcon_LjdS><path fill=currentColor d=M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z /></svg></span></button></div></div></div>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> 'My first all-nighter of friends #oldschool'</span><br></span></code></pre><div class=buttonGroup__atx><button type=button aria-label="Copy code to clipboard" title=Copy class=clean-btn><span class=copyButtonIcons_eSgA aria-hidden=true><svg viewBox="0 0 24 24" class=copyButtonIcon_y97N><path fill=currentColor d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"/></svg><svg viewBox="0 0 24 24" class=copyButtonSuccessIcon_LjdS><path fill=currentColor d=M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z /></svg></span></button></div></div></div>
<p>Let's go ahead and put everything together! We're going to simulate a first word, simulate the hashtags, and then keep simulating words to fill the gap until we've either used up all the space or reached a period or exclamation point.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token keyword" style="color:hsl(301, 63%, 40%)">def</span><span class="token plain"> </span><span class="token function" style="color:hsl(221, 87%, 60%)">simulate_tweet</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> chars_remaining </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> </span><span class="token number" style="color:hsl(35, 99%, 36%)">140</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> first </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> sim_first_word</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">first_word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> n_hash </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> sim_n_hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">mle</span><span class="token punctuation" 
style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> hashtags </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> sim_hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">n_hash</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> hashtag_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> chars_remaining </span><span class="token operator" style="color:hsl(221, 87%, 60%)">-=</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">len</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">first</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">+</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">len</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> tweet </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> first</span><br></span><span class=token-line 
style="color:hsl(230, 8%, 24%)"><span class="token plain"> current </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> first</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">while</span><span class="token plain"> chars_remaining </span><span class="token operator" style="color:hsl(221, 87%, 60%)">></span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">len</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">tweet</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">+</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">len</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">and</span><span class="token plain"> current</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token number" style="color:hsl(35, 99%, 36%)">0</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">!=</span><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">'.'</span><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">and</span><span class="token plain"> current</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token number" style="color:hsl(35, 
99%, 36%)">0</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">!=</span><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">'!'</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> current </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> sim_next_word</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">current</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> word_dist</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> tweet </span><span class="token operator" style="color:hsl(221, 87%, 60%)">+=</span><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">' '</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">+</span><span class="token plain"> current</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> tweet </span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token plain"> tweet</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token operator" style="color:hsl(221, 87%, 60%)">-</span><span class="token number" style="color:hsl(35, 99%, 36%)">2</span><span 
class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"> </span><span class="token operator" style="color:hsl(221, 87%, 60%)">+</span><span class="token plain"> tweet</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token operator" style="color:hsl(221, 87%, 60%)">-</span><span class="token number" style="color:hsl(35, 99%, 36%)">1</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">return</span><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">' '</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">join</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">tweet</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> hashtags</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">strip</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><br></span></code></pre><div class=buttonGroup__atx><button type=button aria-label="Copy code to clipboard" title=Copy class=clean-btn><span class=copyButtonIcons_eSgA aria-hidden=true><svg viewBox="0 0 24 24" class=copyButtonIcon_y97N><path fill=currentColor 
d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"/></svg><svg viewBox="0 0 24 24" class=copyButtonSuccessIcon_LjdS><path fill=currentColor d=M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z /></svg></span></button></div></div></div>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=the-results>The results<a href=#the-results class=hash-link aria-label="Direct link to The results" title="Direct link to The results"></a></h2>
<p>And now for something completely different: twenty random tweets dreamed up by my computer and my Twitter data. Here you go:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token keyword" style="color:hsl(301, 63%, 40%)">for</span><span class="token plain"> i </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">in</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">range</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token number" style="color:hsl(35, 99%, 36%)">0</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> </span><span class="token number" style="color:hsl(35, 99%, 36%)">20</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">print</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">simulate_tweet</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">print</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span 
class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><br></span></code></pre><div class=buttonGroup__atx><button type=button aria-label="Copy code to clipboard" title=Copy class=clean-btn><span class=copyButtonIcons_eSgA aria-hidden=true><svg viewBox="0 0 24 24" class=copyButtonIcon_y97N><path fill=currentColor d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"/></svg><svg viewBox="0 0 24 24" class=copyButtonSuccessIcon_LjdS><path fill=currentColor d=M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z /></svg></span></button></div></div></div>
<div class="codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class=codeBlockContent_biex><pre tabindex=0 class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class=codeBlockLines_e6Vv><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Also , I'm at 8 this morning. #thursdaysgohard #ornot</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Turns out of us breathe the code will want to my undergraduate career is becoming more night trying ? Religion is now as a chane #HYPE</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> You know what recursion is to review the UNCC. #ornot</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> There are really sore 3 bonfires in my first writing the library ground floor if awesome. #realtalk #impressed</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> So we can make it out there's nothing but I'm not let us so hot I could think I may be good. #SwingDance</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Happy Christmas , at Harris Teeter to be be godly or Roman Catholic ). 
#4b392b#4b392b #Isaiah26</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> For context , I in the most decisive factor of the same for homework. #accomplishment</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Freaking done. #loveyouall</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> New blog post : Don't jump in a quiz in with a knife fight. #haskell #earlybirthday</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> God shows me legitimately want to get some food and one day.</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Stormed the queen city. #mindblown</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> The day of a cold at least outside right before the semester ..</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Finished with the way back. 
#winners</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Waking up , OJ , I feel like Nick Jonas today.</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> First draft of so hard drive. #humansvszombies</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Eric Whitacre is the wise creation.</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Ethics paper first , music in close to everyone who just be posting up with my sin , and Jerry Springr #TheLittleThings</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Love that you know enough time I've eaten at 8 PM. #deepthoughts #stillblownaway</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Lead. 
#ThinkingTooMuch #Christmas</span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> </span><br></span><span class=token-line style="color:hsl(230, 8%, 24%)"><span class="token plain"> Aamazing conference when you married #DepartmentOfRedundancyDepartment Yep , but there's a legitimate challenge.</span><br></span></code></pre><div class=buttonGroup__atx><button type=button aria-label="Copy code to clipboard" title=Copy class=clean-btn><span class=copyButtonIcons_eSgA aria-hidden=true><svg viewBox="0 0 24 24" class=copyButtonIcon_y97N><path fill=currentColor d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"/></svg><svg viewBox="0 0 24 24" class=copyButtonSuccessIcon_LjdS><path fill=currentColor d=M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z /></svg></span></button></div></div></div>
<p>...Which all ended up being a whole lot more nonsensical than I had hoped for. There are some good ones though, so I'll call that an accomplishment! I was banking on grammar not being an issue: since my tweets use impeccable grammar, a program modeled on them should have pretty good grammar as well. There are always going to be some hilarious edge cases that make no sense (I'm looking at you, <code>Ethics paper first, music in close to everyone</code>), and some others (<code>Waking up, OJ, I feel like Nick Jonas today</code>) that make me feel like I should have a Twitter rap career. On the whole, though, the structure came out alright.</p>
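<p>To illustrate why the local structure holds up even when whole tweets don't make sense, here is a minimal sketch (not the code used above) of a first-order word chain. It assumes the generator picks each word based only on the word before it; the names <code>build_transitions</code> and <code>generate</code> and the toy corpus are made up for illustration.</p>

```python
import random
from collections import defaultdict


def build_transitions(sentences):
    """Map each word to the list of words that followed it in the corpus."""
    transitions = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            transitions[current].append(following)
    return transitions


def generate(transitions, start, max_words, rng):
    """Walk the chain from `start`, stopping at a dead end or at max_words."""
    words = [start]
    for _ in range(max_words - 1):
        followers = transitions.get(words[-1])
        if not followers:
            break  # dead end: this word never had a successor in the corpus
        words.append(rng.choice(followers))
    return words


# Hypothetical toy corpus, standing in for a real tweet history.
corpus = [
    "I feel like a winner today",
    "I feel like the code will want to run",
    "the code is done today",
]

transitions = build_transitions(corpus)

# Every adjacent word pair that actually occurs in the corpus.
seen_pairs = {
    (a, b)
    for sentence in corpus
    for a, b in zip(sentence.split(), sentence.split()[1:])
}

rng = random.Random(0)  # fixed seed so repeated runs give the same walk
tweet = generate(transitions, "I", 10, rng)
print(" ".join(tweet))
```

<p>Every adjacent word pair in the output appeared somewhere in the corpus, which keeps short stretches grammatical, but nothing ties the start of a sentence to its end. That would be one plausible explanation for tweets that read fine two words at a time and fall apart as a whole.</p>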
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=moving-on-from-here>Moving on from here<a href=#moving-on-from-here class=hash-link aria-label="Direct link to Moving on from here" title="Direct link to Moving on from here"></a></h2>
<p>During class we also talked about an interesting idea: trying to analyze corporate documents and corporate speech. I'd be interested to see what this kind of analysis could do when applied to something like a couple of bank press releases. Either way, the code needs some cleanup before I get that far.</p>
<h2 class="anchor anchorWithStickyNavbar_LWe7" id=for-further-reading>For further reading<a href=#for-further-reading class=hash-link aria-label="Direct link to For further reading" title="Direct link to For further reading"></a></h2>
<p>I'm pretty confident I re-invented a couple wheels along the way - what I'm doing feels a lot like what <a href=https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo target=_blank rel="noopener noreferrer">Markov Chain Monte Carlo</a> is intended to do. But I've never worked explicitly with that before, so more research is needed.</div></article><nav class="pagination-nav docusaurus-mt-lg" aria-label="Blog post page navigation"><a class="pagination-nav__link pagination-nav__link--prev" href=/2016/03/predicting-santander-customer-happiness><div class=pagination-nav__sublabel>Older post</div><div class=pagination-nav__label>Predicting Santander customer happiness</div></a><a class="pagination-nav__link pagination-nav__link--next" href=/2016/04/tick-tock><div class=pagination-nav__sublabel>Newer post</div><div class=pagination-nav__label>Tick tock...</div></a></nav></main><div class="col col--2"><div class="tableOfContents_bqdL thin-scrollbar"><ul class="table-of-contents table-of-contents__left-border"><li><a href=#the-objective class="table-of-contents__link toc-highlight">The Objective</a><li><a href=#the-data class="table-of-contents__link toc-highlight">The Data</a><li><a href=#the-algorithm class="table-of-contents__link toc-highlight">The Algorithm</a><li><a href=#pulling-it-all-together class="table-of-contents__link toc-highlight">Pulling it all together</a><li><a href=#the-results class="table-of-contents__link toc-highlight">The results</a><li><a href=#moving-on-from-here class="table-of-contents__link toc-highlight">Moving on from here</a><li><a href=#for-further-reading class="table-of-contents__link toc-highlight">For further reading</a></ul></div></div></div></div></div><footer class=footer><div class="container container-fluid"><div class="footer__bottom text--center"><div class=footer__copyright>Copyright © 2024 Bradlee Speice</div></div></div></footer></div>