mirror of
https://github.com/bspeice/speice.io
synced 2024-11-14 14:08:09 -05:00
1 line
307 KiB
JSON
{"searchDocs":[{"title":"The webpack industrial complex","type":0,"sectionRef":"#","url":"/2011/11/webpack-industrial-complex","content":"","keywords":"","version":null},{"title":"Starting strong","type":1,"pageTitle":"The webpack industrial complex","url":"/2011/11/webpack-industrial-complex#starting-strong","content":" The sole starting requirement was to write everything in TypeScript. Not because of project scale, but because guardrails help with unfamiliar territory. Keeping that in mind, the first question was: how does one start a new project? All I actually need is "compile TypeScript, show it in a browser." Create React App (CRA) came to the rescue and the rest of that evening was a joy. My TypeScript/JavaScript skills were rusty, but the online documentation was helpful. I had never understood the appeal of JSX (why put a DOM in JavaScript?) until it made connecting an onEvent handler and a function easy. Some quick dimensional analysis later and there was a sine wave oscillator playing A=440 through the speakers. I specifically remember thinking "modern browsers are magical." ","version":null,"tagName":"h2"},{"title":"Continuing on","type":1,"pageTitle":"The webpack industrial complex","url":"/2011/11/webpack-industrial-complex#continuing-on","content":" Now comes the first mistake: I began to worry about "scale" before encountering an actual problem. Rather than rendering audio in the main thread, why not use audio worklets and render in a background thread instead? The first sign something was amiss came from the TypeScript compiler errors showing the audio worklet API was missing. After searching out Github issues and (unsuccessfully) tweaking the .tsconfig settings, I settled on installing a package and moving on. The next problem came from actually using the API. Worklets must load from separate "modules," but it wasn't clear how to guarantee the worklet code stayed separate from the application. 
I saw recommendations to use new URL(<local path>, import.meta.url) and it worked! Well, kind of: That file has the audio processor code, so why does it get served with Content-Type: video/mp2t? ","version":null,"tagName":"h2"},{"title":"Floundering about","type":1,"pageTitle":"The webpack industrial complex","url":"/2011/11/webpack-industrial-complex#floundering-about","content":" Now comes the second mistake: even though I didn't understand the error, I ignored recommendations to just use JavaScript and stuck by the original TypeScript requirement. I tried different project structures. Moving the worklet code to a new folder didn't help, nor did setting up a monorepo and placing it in a new package. I tried three different CRA tools - react-app-rewired, craco, customize-react-app - but got the same problem. Each has varying levels of compatibility with recent CRA versions, so it wasn't clear if I had the right solution but implemented it incorrectly. After attempting to eject the application and panicking after seeing the configuration, I abandoned that as well. I tried changing the webpack configuration: using new loaders, setting asset rules, even changing how webpack detects worker resources. In hindsight, entry points may have been the answer. But because CRA actively resists attempts to change its webpack configuration, and I couldn't find audio worklet examples in any other framework, I gave up. I tried so many application frameworks. Next.js looked like a good candidate, but added its own bespoke webpack complexity to the existing confusion. Astro had the best "getting started" experience, but I refuse to install an IDE-specific plugin. I first used Deno while exploring Lume, but it couldn't import the audio worklet types (maybe because of module compatibility?). Each framework was unique in its own way (shout-out to SvelteKit) but I couldn't figure out how to make them work. 
","version":null,"tagName":"h2"},{"title":"Learning and reflecting","type":1,"pageTitle":"The webpack industrial complex","url":"/2011/11/webpack-industrial-complex#learning-and-reflecting","content":" I ended up using Vite and vite-plugin-react-pages to handle both "build the app" and "bundle worklets," but the specific tool choice isn't important. Instead, the focus should be on lessons learned. For myself: I'm obsessed with tooling, to the point it can derail the original goal. While it comes from a good place (for example: "types are awesome"), it can get in the way of more important work. I tend to reach for online resources right after seeing a new problem. While finding help online is often faster, spending time understanding the problem would have been more productive than cycling through (often outdated) blog posts. For the tools: Resource bundling is great and solves a genuine challenge. I've heard too many horror stories of developers writing modules by hand to believe this is unnecessary complexity. Webpack is a build system and modern frameworks are deeply dependent on it (hence the "webpack industrial complex"). While this often saves users from unnecessary complexity, there's no path forward if something breaks. There's little ability to mix and match tools across frameworks. Next.js and Gatsby let users extend webpack, but because each framework adds its own modules, changes aren't portable. After spending a week looking at webpack, I had an example running with parcel in thirty minutes, but couldn't integrate it. In the end, learning new systems is fun, but a focus on tools that "just work" can leave users out in the cold if they break down. 
","version":null,"tagName":"h2"},{"title":"Autocallable Bonds","type":0,"sectionRef":"#","url":"/2015/11/autocallable","content":"","keywords":"","version":null},{"title":"Underlying simulation","type":1,"pageTitle":"Autocallable Bonds","url":"/2015/11/autocallable#underlying-simulation","content":" In order to price the autocallable bonds, we need to simulate the underlying assets. Let's go ahead and set up the simulation first, as this lays the foundation for what we're trying to do. We're going to use JNJ as the basis for our simulation. This implies the following parameters: S_0 = $102.2 (as of time of writing); q = 2.84%; r = [.49, .9, 1.21, 1.45, 1.69] (term structure as of time of writing, linear interpolation); μ = r − q (note that this implies a negative drift because of current low rates); σ = σ_imp = 15.62% (from VIX implied volatility). We additionally define some parameters for simulation: T: The number of years to simulate; m: The number of paths to simulate; n: The number of steps to simulate in a year. S0 = 102.2 nominal = 100 q = 2.84 / 100 σ = 15.37 / 100 term = [0, .49, .9, 1.21, 1.45, 1.69] / 100 + 1 ### # Potential: Based on PEP # S0 = 100.6 # σ = 14.86 # q = 2.7 ### # Simulation parameters T = 5 # Using years as the unit of time n = 250 # simulations per year m = 100000 # paths num_simulations = 5; # simulation rounds per price ","version":null,"tagName":"h2"},{"title":"Defining the simulation","type":1,"pageTitle":"Autocallable Bonds","url":"/2015/11/autocallable#defining-the-simulation","content":" To make things simpler, we simulate a single year at a time. This allows us to easily add in a dividend policy without too much difficulty, and update the simulation every year to match the term structure. The underlying uses GBM for simulation between years. 
simulate_gbm = function(S0, μ, σ, T, n) # Set the initial state m = length(S0) t = T / n motion = zeros(m, n) motion[:,1] = S0 # Build out all states for i=1:(n-1) motion[:,i+1] = motion[:,i] .* exp((μ - σ^2/2)*t) .* exp(sqrt(t) * σ .* randn(m)) end return motion end function display_motion(motion, T) # Given a matrix of paths, display the motion n = length(motion[1,:]) m = length(motion[:,1]) x = repmat(1:n, m) # Calculate the ticks we're going to use. We'd like to # have an xtick every month, so calculate where those # ticks will actually be at. if (T > 3) num_ticks = T xlabel = "Years" else num_ticks = T * 12 xlabel = "Months" end tick_width = n / num_ticks x_ticks = [] for i=1:round(num_ticks) x_ticks = vcat(x_ticks, i*tick_width) end # Use one color for each path. I'm not sure if there's # a better way to do this without going through DataFrames colors = [] for i = 1:m colors = vcat(colors, ones(n)*i) end plot(x=x, y=motion', color=colors, Geom.line, Guide.xticks(ticks=x_ticks, label=false), Guide.xlabel(xlabel), Guide.ylabel("Value")) end; ","version":null,"tagName":"h3"},{"title":"Example simulation","type":1,"pageTitle":"Autocallable Bonds","url":"/2015/11/autocallable#example-simulation","content":" Let's go ahead and run a sample simulation to see what the functions got us! initial = ones(5) * S0 # Using μ=0, T=.25 for now, we'll use the proper values later motion = simulate_gbm(initial, 0, σ, .25, 200) display_motion(motion, .25) ","version":null,"tagName":"h3"},{"title":"Computing the term structure","type":1,"pageTitle":"Autocallable Bonds","url":"/2015/11/autocallable#computing-the-term-structure","content":" Now that we've got the basic motion set up, let's start making things a bit more sophisticated for the model. We're going to assume that the drift of the stock is the difference between the implied forward rate and the quarterly dividend rate. 
We're given the yearly term structure, and need to calculate the quarterly forward rate to match this structure. The term structure is assumed to follow: d(0, t) = d(0, t−1) ⋅ f_{i−1, i}, where f_{i−1, i} is the quarterly forward rate. forward_term = function(yearly_term) # It is assumed that we have a yearly term structure passed in, and starts at year 0 # This implies a nominal rate above 0 for the first year! years = length(yearly_term)-1 # because we start at 0 structure = [(yearly_term[i+1] / yearly_term[i]) for i=1:years] end; ","version":null,"tagName":"h3"},{"title":"Illustrating the term structure","type":1,"pageTitle":"Autocallable Bonds","url":"/2015/11/autocallable#illustrating-the-term-structure","content":" Now that we've got our term structure, let's validate that we're getting the correct results! If we've done this correctly, then: term[2] == term[1] * structure[1] # Example term structure taken from: # http://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield # Linear interpolation used years in-between periods, assuming real-dollar # interest rates forward_yield = forward_term(term) calculated_term2 = term[1] * forward_yield[1] println("Actual term[2]: $(term[2]); Calculated term[2]: $(calculated_term2)") Actual term[2]: 1.0049; Calculated term[2]: 1.0049 ","version":null,"tagName":"h3"},{"title":"The full underlying simulation","type":1,"pageTitle":"Autocallable Bonds","url":"/2015/11/autocallable#the-full-underlying-simulation","content":" Now that we have the term structure set up, we can actually start doing some real simulation! Let's construct some paths through the full 5-year time frame. In order to do this, we will simulate 1 year at a time, and use the forward rates at those times to compute the drift. Thus, there will be 5 total simulations batched together. 
full_motion = ones(5) * S0 full_term = vcat(term[1], forward_yield) for i=1:T μ = (full_term[i] - 1 - q) year_motion = simulate_gbm(full_motion[:,end], μ, σ, 1, n) full_motion = hcat(full_motion, year_motion) end display_motion(full_motion, T) ","version":null,"tagName":"h3"},{"title":"Final simulation","type":1,"pageTitle":"Autocallable Bonds","url":"/2015/11/autocallable#final-simulation","content":" We're now going to actually build out the full motion that we'll use for computing the pricing of our autocallable products. It will be largely the same, but we will use far more sample paths for the simulation. full_simulation = function(S0, T, n, m, term) forward = vcat(term[1], forward_term(term)) # And an S0 to kick things off. final_motion = ones(m) * S0 for i=1:T μ = (forward[i] - 1 - q) year_motion = simulate_gbm(final_motion[:,end], μ, σ, 1, n) final_motion = hcat(final_motion, year_motion) end return final_motion end tic() full_simulation(S0, T, n, m, term) time = toq() @printf("Time to run simulation: %.2fs", time) Time to run simulation: 5.34s ","version":null,"tagName":"h3"},{"title":"Athena Simulation","type":1,"pageTitle":"Autocallable Bonds","url":"/2015/11/autocallable#athena-simulation","content":" Now that we've defined our underlying simulation, let's actually try and price an Athena note. 
Athena has the following characteristics: Automatically called if the underlying is above the call barrier at observation; Accelerated coupon paid if the underlying is above the call barrier at observation (the coupon paid is c ⋅ i, with i as the current year and c the coupon rate); Principal protection up until a protection barrier at observation, with all principal at risk if this barrier is not met; Observed yearly. call_barrier = S0 strike = S0 protection_barrier = S0 * .6 coupon = nominal * .07 price_athena = function(initial_price, year_prices, call_barrier, protection_barrier, coupon, forward_structure) total_coupons = 0 t = length(year_prices) for i=1:t price = year_prices[i] if price ≥ call_barrier return (nominal + coupon*i) * exp((prod(forward_structure[i:end])-1)*(t-i)) end end # We've reached maturity, time to check capital protection if year_prices[end] > protection_barrier return nominal else put = (strike - year_prices[end]) / strike return nominal*(1-put) end end forward_structure = forward_term(term) price_function = (year_prices) -> price_athena(S0, year_prices, call_barrier, protection_barrier, coupon, forward_structure) athena = function() year_indexes = [n*i for i=1:T] motion = full_simulation(S0, T, n, m, term) payoffs = [price_function(motion[i, year_indexes]) for i=1:m] return mean(payoffs) end mean_payoffs = zeros(num_simulations) for i=1:num_simulations tic() mean_payoffs[i] = athena() time = toq() @printf("Mean of simulation %i: \\$%.4f; Simulation time: %.2fs\\n", i, mean_payoffs[i], time) end final_mean = mean(mean_payoffs) println("Mean over $num_simulations simulations: $(mean(mean_payoffs))") pv = final_mean * (exp(-(prod(forward_structure)-1)*T)) @printf("Present value of Athena note: \\$%.2f, notional: \\$%.2f", pv, nominal) Mean of simulation 1: $103.2805; Simulation time: 5.59s Mean of simulation 2: $103.3796; Simulation time: 5.05s Mean of simulation 3: $103.4752; Simulation time: 5.18s Mean of simulation 4: $103.4099; 
Simulation time: 5.37s Mean of simulation 5: $103.3260; Simulation time: 5.32s Mean over 5 simulations: 103.37421610015554 Present value of Athena note: $95.00, notional: $100.00 ","version":null,"tagName":"h2"},{"title":"Phoenix without Memory Simulation","type":1,"pageTitle":"Autocallable Bonds","url":"/2015/11/autocallable#phoenix-without-memory-simulation","content":" Let's move into pricing a Phoenix without memory. It's very similar to the Athena product, with the exception that we introduce a coupon barrier so coupons are paid even when the underlying is below the initial price. The Phoenix product has the following characteristics (example here): Automatically called if the underlying is above the call barrier at observation; Coupon paid if the underlying is above a coupon barrier at observation; Principal protection up until a protection barrier at observation, with all principal at risk if this barrier is not met; Observed yearly. Some example paths (all assume a call barrier at the current price, and a coupon barrier some level below that): At the end of year 1, the stock is above the call barrier; the note is called and you receive the value of the stock plus the coupon being paid. At the end of year 1, the stock is above the coupon barrier, but not the call barrier; you receive the coupon. At the end of year 2, the stock is below the coupon barrier; you receive nothing. At the end of year 3, the stock is above the call barrier; the note is called and you receive the value of the stock plus a coupon for year 3. 
We're going to re-use the same simulation, with the following parameters: Call barrier: 100%Coupon barrier: 70%Coupon: 6%Capital protection until 70% (at maturity) call_barrier = S0 coupon_barrier = S0 * .8 protection_barrier = S0 * .6 coupon = nominal * .06 price_phoenix_no_memory = function(initial_price, year_prices, call_barrier, coupon_barrier, protection_barrier, coupon, forward_structure) total_coupons = 0 t = length(year_prices) for i=1:t price = year_prices[i] if price ≥ call_barrier return (nominal + coupon + total_coupons)*exp((prod(forward_structure[i:end])-1)*(t-i)) elseif price ≥ coupon_barrier total_coupons = total_coupons * exp(forward_structure[i]-1) + coupon else total_coupons *= exp(forward_structure[i]-1) end end # We've reached maturity, time to check capital protection if year_prices[end] > protection_barrier return nominal + total_coupons else put = (strike - year_prices[end]) / strike return nominal*(1-put) end end forward_structure = forward_term(term) price_function = (year_prices) -> price_phoenix_no_memory(S0, year_prices, call_barrier, coupon_barrier, protection_barrier, coupon, forward_structure) phoenix_no_memory = function() year_indexes = [n*i for i=1:T] motion = full_simulation(S0, T, n, m, term) payoffs = [price_function(motion[i, year_indexes]) for i=1:m] return mean(payoffs) end mean_payoffs = zeros(num_simulations) for i=1:num_simulations tic() mean_payoffs[i] = phoenix_no_memory() time = toq() @printf("Mean of simulation %i: \\$%.4f; Simulation time: %.2fs\\n", i, mean_payoffs[i], time) end final_mean = mean(mean_payoffs) println("Mean over $num_simulations simulations: $(mean(mean_payoffs))") pv = final_mean * exp(-(prod(forward_structure)-1)*(T)) @printf("Present value of Phoenix without memory note: \\$%.2f", pv) Mean of simulation 1: $106.0562; Simulation time: 5.72s Mean of simulation 2: $106.0071; Simulation time: 5.85s Mean of simulation 3: $105.9959; Simulation time: 5.87s Mean of simulation 4: $106.0665; Simulation 
time: 5.93s Mean of simulation 5: $106.0168; Simulation time: 5.81s Mean over 5 simulations: 106.02850857209883 Present value of Phoenix without memory note: $97.44 ","version":null,"tagName":"h2"},{"title":"Phoenix with Memory Simulation","type":1,"pageTitle":"Autocallable Bonds","url":"/2015/11/autocallable#phoenix-with-memory-simulation","content":" The Phoenix with Memory structure is very similar to the Phoenix, but as the name implies, has a special "memory" property: It remembers any coupons that haven't been paid at prior observation times, and pays them all if the underlying crosses the coupon barrier. For example: Note issued with 100% call barrier, 70% coupon barrier. At year 1, the underlying is at 50%, so no coupons are paid. At year 2, the underlying is at 80%, so coupons for both year 1 and 2 are paid, resulting in a double coupon. You can also find an example here. Let's go ahead and set up the simulation! The parameters will be the same, but we can expect that the value will go up because of the memory attribute call_barrier = S0 coupon_barrier = S0 * .8 protection_barrier = S0 * .6 coupon = nominal * .07 price_phoenix_with_memory = function(initial_price, year_prices, call_barrier, coupon_barrier, protection_barrier, coupon, forward_structure) last_coupon = 0 total_coupons = 0 t = length(year_prices) for i=1:t price = year_prices[i] if price > call_barrier return (nominal + coupon + total_coupons)*exp((prod(forward_structure[i:end])-1)*(t-i)) elseif price > coupon_barrier #################################################################### # The only difference between with/without memory is the below lines memory_coupons = (i - last_coupon) * coupon last_coupon = i total_coupons = total_coupons * exp(forward_structure[i]-1) + memory_coupons #################################################################### else total_coupons *= exp(forward_structure[i]-1) end end # We've reached maturity, time to check capital protection if year_prices[end] > 
protection_barrier return nominal + total_coupons else put = (strike - year_prices[end]) / strike return nominal*(1-put) end end forward_structure = forward_term(term) price_function = (year_prices) -> price_phoenix_with_memory(S0, year_prices, call_barrier, coupon_barrier, protection_barrier, coupon, forward_structure) phoenix_with_memory = function() year_indexes = [n*i for i=1:T] motion = full_simulation(S0, T, n, m, term) payoffs = [price_function(motion[i, year_indexes]) for i=1:m] return mean(payoffs) end mean_payoffs = zeros(num_simulations) for i=1:num_simulations tic() mean_payoffs[i] = phoenix_with_memory() time = toq() @printf("Mean of simulation %i: \\$%.4f; Simulation time: %.2fs\\n", i, mean_payoffs[i], time) end final_mean = mean(mean_payoffs) println("Mean over $num_simulations simulations: $(mean(mean_payoffs))") pv = final_mean * exp(-(prod(forward_structure)-1)*(T)) @printf("Present value of Phoenix with memory note: \\$%.2f", pv) Mean of simulation 1: $108.8612; Simulation time: 5.89s Mean of simulation 2: $109.0226; Simulation time: 5.90s Mean of simulation 3: $108.9175; Simulation time: 5.92s Mean of simulation 4: $108.9426; Simulation time: 5.94s Mean of simulation 5: $108.8087; Simulation time: 6.06s Mean over 5 simulations: 108.91052564051816 Present value of Phoenix with memory note: $100.09 ","version":null,"tagName":"h2"},{"title":"Welcome, and an algorithm","type":0,"sectionRef":"#","url":"/2015/11/welcome","content":"","keywords":"","version":null},{"title":"Trading Competition Optimization","type":1,"pageTitle":"Welcome, and an algorithm","url":"/2015/11/welcome#trading-competition-optimization","content":" Goal: Max return given maximum Sharpe and Drawdown from IPython.display import display import Quandl from datetime import datetime, timedelta tickers = ['XOM', 'CVX', 'CLB', 'OXY', 'SLB'] market_ticker = 'GOOG/NYSE_VOO' lookback = 30 d_col = 'Close' data = {tick: Quandl.get('YAHOO/{}'.format(tick))[-lookback:] for tick in tickers} 
market = Quandl.get(market_ticker) ","version":null,"tagName":"h2"},{"title":"Calculating the Return","type":1,"pageTitle":"Welcome, and an algorithm","url":"/2015/11/welcome#calculating-the-return","content":" We first want to know how much each ticker returned over the prior period. returns = {tick: data[tick][d_col].pct_change() for tick in tickers} display({tick: returns[tick].mean() for tick in tickers}) {'CLB': -0.0016320202164526894, 'CVX': 0.0010319531629488911, 'OXY': 0.00093418904454400551, 'SLB': 0.00098431254720448159, 'XOM': 0.00044165797556096868} ","version":null,"tagName":"h2"},{"title":"Calculating the Sharpe ratio","type":1,"pageTitle":"Welcome, and an algorithm","url":"/2015/11/welcome#calculating-the-sharpe-ratio","content":" Sharpe: R−RMσ{R - R_M \\over \\sigma}σR−RM We use the average return over the lookback period, minus the market average return, over the ticker standard deviation to calculate the Sharpe. Shorting a stock turns a negative Sharpe positive. market_returns = market.pct_change() sharpe = lambda ret: (ret.mean() - market_returns[d_col].mean()) / ret.std() sharpes = {tick: sharpe(returns[tick]) for tick in tickers} display(sharpes) {'CLB': -0.10578734457846127, 'CVX': 0.027303529817677398, 'OXY': 0.022622210057414487, 'SLB': 0.026950946344858676, 'XOM': -0.0053519259698605499} ","version":null,"tagName":"h2"},{"title":"Calculating the drawdown","type":1,"pageTitle":"Welcome, and an algorithm","url":"/2015/11/welcome#calculating-the-drawdown","content":" This one is easy - what is the maximum daily change over the lookback period? That is, because we will allow short positions, we are not concerned strictly with maximum downturn, but in general, what is the largest 1-day change? 
drawdown = lambda ret: ret.abs().max() drawdowns = {tick: drawdown(returns[tick]) for tick in tickers} display(drawdowns) {'CLB': 0.043551495607375035, 'CVX': 0.044894389686214398, 'OXY': 0.051424517867144637, 'SLB': 0.034774627850375328, 'XOM': 0.035851524605672758} Performing the optimization max μ⋅ωs.t. 1⃗ω=1S⃗ω≥sD⃗⋅∣ω∣≤d∣ω∣≤l\\begin{align*} max\\ \\ & \\mu \\cdot \\omega\\\\ s.t.\\ \\ & \\vec{1} \\omega = 1\\\\ & \\vec{S} \\omega \\ge s\\\\ & \\vec{D} \\cdot | \\omega | \\le d\\\\ & \\left|\\omega\\right| \\le l\\\\ \\end{align*}max s.t. μ⋅ω1ω=1Sω≥sD⋅∣ω∣≤d∣ω∣≤l We want to maximize average return subject to having a full portfolio, Sharpe above a specific level, drawdown below a level, and leverage not too high - that is, don't have huge long/short positions. import numpy as np from scipy.optimize import minimize #sharpe_limit = .1 drawdown_limit = .05 leverage = 250 # Use the map so we can guarantee we maintain the correct order # So we can write as upper-bound # sharpe_a = np.array(list(map(lambda tick: sharpes[tick], tickers))) * -1 dd_a = np.array(list(map(lambda tick: drawdowns[tick], tickers))) # Because minimizing returns_a = np.array(list(map(lambda tick: returns[tick].mean(), tickers))) meets_sharpe = lambda x: sum(abs(x) * sharpe_a) - sharpe_limit def meets_dd(x): portfolio = sum(abs(x)) if portfolio < .1: # If there are no stocks in the portfolio, # we can accidentally induce division by 0, # or division by something small enough to cause infinity return 0 return drawdown_limit - sum(abs(x) * dd_a) / sum(abs(x)) is_portfolio = lambda x: sum(x) - 1 def within_leverage(x): return leverage - sum(abs(x)) objective = lambda x: sum(x * returns_a) * -1 # Because we're minimizing bounds = ((None, None),) * len(tickers) x = np.zeros(len(tickers)) constraints = [ { 'type': 'eq', 'fun': is_portfolio }, { 'type': 'ineq', 'fun': within_leverage #}, { # 'type': 'ineq', # 'fun': meets_sharpe }, { 'type': 'ineq', 'fun': meets_dd } ] optimal = minimize(objective, x, 
bounds=bounds, constraints=constraints, options={'maxiter': 500}) # Optimization time! display(optimal.message) display("Holdings: {}".format(list(zip(tickers, optimal.x)))) # multiply by -100 to scale, and compensate for minimizing expected_return = optimal.fun * -100 display("Expected Return: {:.3f}%".format(expected_return)) expected_drawdown = sum(abs(optimal.x) * dd_a) / sum(abs(optimal.x)) * 100 display("Expected Max Drawdown: {0:.2f}%".format(expected_drawdown)) # TODO: Calculate expected Sharpe 'Optimization terminated successfully.' "Holdings: [('XOM', 5.8337945679814904), ('CVX', 42.935064321851307), ('CLB', -124.5), ('OXY', 36.790387773552119), ('SLB', 39.940753336615096)]" 'Expected Return: 32.375%' 'Expected Max Drawdown: 4.34%' ","version":null,"tagName":"h2"},{"title":"Testing Cramer","type":0,"sectionRef":"#","url":"/2015/12/testing-cramer","content":"","keywords":"","version":null},{"title":"Downloading Futures data from Seeking Alpha","type":1,"pageTitle":"Testing Cramer","url":"/2015/12/testing-cramer#downloading-futures-data-from-seeking-alpha","content":" We're going to define two HTML parsing classes - one to get the article URL's from a page, and one to get the actual data from each article. 
class ArticleListParser(HTMLParser): """Given a web page with articles on it, parse out the article links""" articles = [] def handle_starttag(self, tag, attrs): #if tag == 'div' and ("id", "author_articles_wrapper") in attrs: # self.fetch_links = True if tag == 'a' and ('class', 'dashboard_article_link') in attrs: href = list(filter(lambda x: x[0] == 'href', attrs))[0][1] self.articles.append(href) base_url = "http://seekingalpha.com/author/wall-street-breakfast/articles" article_page_urls = [base_url] + [base_url + '/{}'.format(i) for i in range(2, 20)] global_articles = [] for page in article_page_urls: # We need to switch the user agent, as SA blocks the standard requests agent articles_html = requests.get(page, headers={"User-Agent": "Wget/1.13.4"}) parser = ArticleListParser() parser.feed(articles_html.text) global_articles += (parser.articles) class ArticleReturnParser(HTMLParser): "Given an article, parse out the futures returns in it" record_font_tags = False in_font_tag = False counter = 0 # data = {} # See __init__ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.data = {} def handle_starttag(self, tag, attrs): if tag == 'span' and ('itemprop', 'datePublished') in attrs: date_string = list(filter(lambda x: x[0] == 'content', attrs))[0][1] date = dtparser.parse(date_string) self.data['date'] = date self.in_font_tag = tag == 'font' def safe_float(self, string): try: return float(string[:-1]) / 100 except ValueError: return np.NaN def handle_data(self, content): if not self.record_font_tags and "Futures at 6" in content: self.record_font_tags = True if self.record_font_tags and self.in_font_tag: if self.counter == 0: self.data['DOW'] = self.safe_float(content) elif self.counter == 1: self.data['S&P'] = self.safe_float(content) elif self.counter == 2: self.data['NASDAQ'] = self.safe_float(content) elif self.counter == 3: self.data['Crude'] = self.safe_float(content) elif self.counter == 4: self.data['Gold'] = 
self.safe_float(content) self.counter += 1 def handle_endtag(self, tag): self.in_font_tag = False def retrieve_data(url): sa = "http://seekingalpha.com" article_html = requests.get(sa + url, headers={"User-Agent": "Wget/1.13.4"}) parser = ArticleReturnParser() parser.feed(article_html.text) parser.data.update({"url": url}) parser.data.update({"text": article_html.text}) return parser.data # This copy **MUST** be in place. I'm not sure why, # as you'd think that the data being returned would already # represent a different memory location. Even so, it blows up # if you don't do this. article_list = list(set(global_articles)) article_data = [copy(retrieve_data(url)) for url in article_list] # If there's an issue downloading the article, drop it. article_df = pd.DataFrame.from_dict(article_data).dropna() ","version":null,"tagName":"h2"},{"title":"Fetching the Returns data","type":1,"pageTitle":"Testing Cramer","url":"/2015/12/testing-cramer#fetching-the-returns-data","content":" Now that we have the futures data, we're going to compare across 4 different indices - the S&P 500 index, Dow Jones Industrial, Russell 2000, and NASDAQ 100. Let's get the data off of Quandl to make things easier! # article_df is sorted by date, so we get the first row. start_date = article_df.sort_values(by='date').iloc[0]['date'] - relativedelta(days=1) SPY = Quandl.get("GOOG/NYSE_SPY", trim_start=start_date) DJIA = Quandl.get("GOOG/AMS_DIA", trim_start=start_date) RUSS = Quandl.get("GOOG/AMEX_IWM", trim_start=start_date) NASDAQ = Quandl.get("GOOG/EPA_QQQ", trim_start=start_date) ","version":null,"tagName":"h2"},{"title":"Running the Comparison","type":1,"pageTitle":"Testing Cramer","url":"/2015/12/testing-cramer#running-the-comparison","content":" There are two types of tests I want to determine: How accurate each futures category is at predicting the index's opening change over the close before, and predicting the index's daily return. 
Let's first calculate how good each future is at predicting the opening return over the previous day. I expect that the futures will be more than 50% accurate, since the information is recorded 3 hours before the markets open. def calculate_opening_ret(frame): # I'm not a huge fan of the appending for loop, # but it's a bit verbose for a comprehension data = {} for i in range(1, len(frame)): date = frame.iloc[i].name prior_close = frame.iloc[i-1]['Close'] open_val = frame.iloc[i]['Open'] data[date] = (open_val - prior_close) / prior_close return data SPY_open_ret = calculate_opening_ret(SPY) DJIA_open_ret = calculate_opening_ret(DJIA) RUSS_open_ret = calculate_opening_ret(RUSS) NASDAQ_open_ret = calculate_opening_ret(NASDAQ) def signs_match(list_1, list_2): # This is a surprisingly difficult task - we have to match # up the dates in order to check if opening returns actually match index_dict_dt = {key.to_datetime(): list_2[key] for key in list_2.keys()} matches = [] for row in list_1.iterrows(): row_dt = row[1][1] row_value = row[1][0] index_dt = datetime(row_dt.year, row_dt.month, row_dt.day) if index_dt in list_2: index_value = list_2[index_dt] if (row_value > 0 and index_value > 0) or \\ (row_value < 0 and index_value < 0) or \\ (row_value == 0 and index_value == 0): matches += [1] else: matches += [0] #print("{}".format(list_2[index_dt])) return matches prediction_dict = {} matches_dict = {} count_dict = {} index_dict = {"SPY": SPY_open_ret, "DJIA": DJIA_open_ret, "RUSS": RUSS_open_ret, "NASDAQ": NASDAQ_open_ret} indices = ["SPY", "DJIA", "RUSS", "NASDAQ"] futures = ["Crude", "Gold", "DOW", "NASDAQ", "S&P"] for index in indices: matches_dict[index] = {future: signs_match(article_df[[future, 'date']], index_dict[index]) for future in futures} count_dict[index] = {future: len(matches_dict[index][future]) for future in futures} prediction_dict[index] = {future: np.mean(matches_dict[index][future]) for future in futures} print("Articles Checked: ") 
print(pd.DataFrame.from_dict(count_dict)) print() print("Prediction Accuracy:") print(pd.DataFrame.from_dict(prediction_dict)) Articles Checked: DJIA NASDAQ RUSS SPY Crude 268 268 271 271 DOW 268 268 271 271 Gold 268 268 271 271 NASDAQ 268 268 271 271 S&P 268 268 271 271 Prediction Accuracy: DJIA NASDAQ RUSS SPY Crude 0.544776 0.522388 0.601476 0.590406 DOW 0.611940 0.604478 0.804428 0.841328 Gold 0.462687 0.455224 0.464945 0.476015 NASDAQ 0.615672 0.608209 0.797048 0.830258 S&P 0.604478 0.597015 0.811808 0.848708 This data is very interesting. Some insights: Both DOW and NASDAQ futures are pretty bad at predicting their actual market openings. NASDAQ and Dow are fairly unpredictable; Russell 2000 and S&P are very predictable. Gold is a poor predictor in general - intuitively Gold should move inverse to the market, but it appears to be about as accurate as a coin flip. All said though it appears that futures data is important for determining market direction for both the S&P 500 and Russell 2000. Cramer is half-right: futures data isn't very helpful for the Dow and NASDAQ indices, but is great for the S&P and Russell indices. ","version":null,"tagName":"h2"},{"title":"The next step - Predicting the close","type":1,"pageTitle":"Testing Cramer","url":"/2015/12/testing-cramer#the-next-step---predicting-the-close","content":" Given the code we currently have, I'd like to predict the close of the market as well. 
We can re-use most of the code, so let's see what happens: def calculate_closing_ret(frame): # I'm not a huge fan of the appending for loop, # but it's a bit verbose for a comprehension data = {} for i in range(0, len(frame)): date = frame.iloc[i].name open_val = frame.iloc[i]['Open'] close_val = frame.iloc[i]['Close'] data[date] = (close_val - open_val) / open_val return data SPY_close_ret = calculate_closing_ret(SPY) DJIA_close_ret = calculate_closing_ret(DJIA) RUSS_close_ret = calculate_closing_ret(RUSS) NASDAQ_close_ret = calculate_closing_ret(NASDAQ) def signs_match(list_1, list_2): # This is a surprisingly difficult task - we have to match # up the dates in order to check if opening returns actually match index_dict_dt = {key.to_datetime(): list_2[key] for key in list_2.keys()} matches = [] for row in list_1.iterrows(): row_dt = row[1][1] row_value = row[1][0] index_dt = datetime(row_dt.year, row_dt.month, row_dt.day) if index_dt in list_2: index_value = list_2[index_dt] if (row_value > 0 and index_value > 0) or \\ (row_value < 0 and index_value < 0) or \\ (row_value == 0 and index_value == 0): matches += [1] else: matches += [0] #print("{}".format(list_2[index_dt])) return matches matches_dict = {} count_dict = {} prediction_dict = {} index_dict = {"SPY": SPY_close_ret, "DJIA": DJIA_close_ret, "RUSS": RUSS_close_ret, "NASDAQ": NASDAQ_close_ret} indices = ["SPY", "DJIA", "RUSS", "NASDAQ"] futures = ["Crude", "Gold", "DOW", "NASDAQ", "S&P"] for index in indices: matches_dict[index] = {future: signs_match(article_df[[future, 'date']], index_dict[index]) for future in futures} count_dict[index] = {future: len(matches_dict[index][future]) for future in futures} prediction_dict[index] = {future: np.mean(matches_dict[index][future]) for future in futures} print("Articles Checked:") print(pd.DataFrame.from_dict(count_dict)) print() print("Prediction Accuracy:") print(pd.DataFrame.from_dict(prediction_dict)) Articles Checked: DJIA NASDAQ RUSS SPY Crude 268 268 271 
271 DOW 268 268 271 271 Gold 268 268 271 271 NASDAQ 268 268 271 271 S&P 268 268 271 271 Prediction Accuracy: DJIA NASDAQ RUSS SPY Crude 0.533582 0.529851 0.501845 0.542435 DOW 0.589552 0.608209 0.535055 0.535055 Gold 0.455224 0.451493 0.483395 0.512915 NASDAQ 0.582090 0.626866 0.531365 0.538745 S&P 0.585821 0.608209 0.535055 0.535055 Well, it appears that the futures data is terrible at predicting market close. NASDAQ predicting NASDAQ is the most interesting data point, but 63% accuracy isn't accurate enough to make money consistently. ","version":null,"tagName":"h2"},{"title":"Final sentiments","type":1,"pageTitle":"Testing Cramer","url":"/2015/12/testing-cramer#final-sentiments","content":" The data bears out very close to what I expected would happen: Futures data is more accurate than a coin flip for predicting openings, which makes sense since it is recorded only 3 hours before the actual opening. Futures data is about as accurate as a coin flip for predicting closings, which means there is no money to be made in trying to predict the market direction for the day given the futures data. In summary: Cramer is half right: Futures data is not good for predicting the market open of the Dow and NASDAQ indices. Contrary to Cramer though, it is very good for predicting the S&P and Russell indices - we can achieve an accuracy slightly over 80% for each. Making money in the market is hard. We can't just go to the futures and treat them as an oracle for where the market will close. I hope you've enjoyed this, I quite enjoyed taking a deep dive in the analytics this way. I'll be posting more soon! 
","version":null,"tagName":"h2"},{"title":"Cloudy in Seattle","type":0,"sectionRef":"#","url":"/2016/01/cloudy-in-seattle","content":"","keywords":"","version":null},{"title":"Examining other cities","type":1,"pageTitle":"Cloudy in Seattle","url":"/2016/01/cloudy-in-seattle#examining-other-cities","content":" After taking some time to explore how the weather in North Carolina stacked up over the past years, I was interested in doing the same analysis for other cities. Growing up with family from Binghamton, NY I was always told it was very cloudy there. And Seattle has a nasty reputation for being very depressing and cloudy. All said, the cities I want to examine are: Binghamton, NY; Cary, NC; Seattle, WA; and New York City, NY. I'd be interested to try this analysis worldwide at some point - comparing London and Seattle might be an interesting analysis. For now though, we'll stick with trying out the US data. There will be plenty of charts. I want to know: How has average cloud cover and precipitation chance changed over the years for each city mentioned? This will hopefully tell us whether Seattle has actually earned its reputation for being a depressing city. city_forecasts = pickle.load(open('city_forecasts.p', 'rb')) forecasts_df = pd.DataFrame.from_dict(city_forecasts) cities = ['binghamton', 'cary', 'nyc', 'seattle'] city_colors = {cities[i]: Palette[i] for i in range(0, 4)} def safe_cover(frame): if frame and 'cloudCover' in frame: return frame['cloudCover'] else: return np.NaN def monthly_avg_cloudcover(city, year, month): dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12), end=datetime(year, month + 1, 1, 12), freq='D', closed='left') cloud_cover_vals = list(map(lambda x: safe_cover(forecasts_df[city][x]['currently']), dates)) cloud_cover_samples = len(list(filter(lambda x: x is not np.NaN, cloud_cover_vals))) # Ignore an issue with nanmean having all NaN values. We'll discuss the data issues below. 
with warnings.catch_warnings(): warnings.simplefilter('ignore') return np.nanmean(cloud_cover_vals), cloud_cover_samples years = range(1990, 2016) def city_avg_cc(city, month): return [monthly_avg_cloudcover(city, y, month) for y in years] months = [ ('July', 7), ('August', 8), ('September', 9), ('October', 10), ('November', 11) ] for month, month_id in months: month_averages = {city: city_avg_cc(city, month_id) for city in cities} f = figure(title="{} Average Cloud Cover".format(month), x_axis_label='Year', y_axis_label='Cloud Cover Percentage') for city in cities: f.line(years, [x[0] for x in month_averages[city]], legend=city, color=city_colors[city]) show(f) Well, as it so happens it looks like there are some data issues. July's data is a bit sporadic, and 2013 seems to be missing from most months as well. I think only two things can really be confirmed here: Seattle, specifically for the months of October and November, is in fact significantly more cloudy on average than are other cities. All cities surveyed have seen average cloud cover decline over the months studied. There are data issues, but the trend seems clear. Let's now move from cloud cover data to looking at average rainfall chance. def safe_precip(frame): if frame and 'precipProbability' in frame: return frame['precipProbability'] else: return np.NaN def monthly_avg_precip(city, year, month): dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12), end=datetime(year, month + 1, 1, 12), freq='D', closed='left') precip_vals = list(map(lambda x: safe_precip(forecasts_df[city][x]['currently']), dates)) precip_samples = len(list(filter(lambda x: x is not np.NaN, precip_vals))) # Ignore an issue with nanmean having all NaN values. We'll discuss the data issues below. 
with warnings.catch_warnings(): warnings.simplefilter('ignore') return np.nanmean(precip_vals), precip_samples def city_avg_precip(city, month): return [monthly_avg_precip(city, y, month) for y in years] for month, month_id in months: month_averages = {city: city_avg_precip(city, month_id) for city in cities} f = figure(title="{} Average Precipitation Chance".format(month), x_axis_label='Year', y_axis_label='Precipitation Chance Percentage') for city in cities: f.line(years, [x[0] for x in month_averages[city]], legend=city, color=city_colors[city]) show(f) The same data issue caveats apply here: 2013 seems to be missing some data, and July has some issues as well. However, this seems to confirm the trends we saw with cloud cover: Seattle, specifically for the months of August, October, and November has had a consistently higher chance of rain than other cities surveyed. Average precipitation chance, just like average cloud cover, has been trending down over time. ","version":null,"tagName":"h2"},{"title":"Conclusion","type":1,"pageTitle":"Cloudy in Seattle","url":"/2016/01/cloudy-in-seattle#conclusion","content":" I have to admit I was a bit surprised after doing this analysis. Seattle showed a higher average cloud cover and average precipitation chance than did the other cities surveyed. Maybe Seattle is actually an objectively more depressing city to live in. Well that's all for weather data at the moment. It's been a great experiment, but I think this is about as far as I'll be able to get with weather data without some domain knowledge. Talk again soon! ","version":null,"tagName":"h2"},{"title":"Complaining about the weather","type":0,"sectionRef":"#","url":"/2016/01/complaining-about-the-weather","content":"Figuring out whether people should be complaining about the recent weather in North Carolina. 
from bokeh.plotting import figure, output_notebook, show from bokeh.palettes import PuBuGn9 as Palette import pandas as pd import numpy as np from datetime import datetime import pickle output_notebook() BokehJS successfully loaded. I'm originally from North Carolina, and I've been hearing a lot of people talking about how often it's been raining recently. They're excited for any day that has sun. So I got a bit curious: Has North Carolina over the past few months actually had more cloudy and rainy days recently than in previous years? This shouldn't be a particularly challenging task, but I'm interested to know if people's perceptions actually reflect reality. The data we'll use comes from forecast.io, since they can give us a cloud cover percentage. I've gone ahead and retrieved the data to a pickle file, and included the code that was used to generate it. First up: What was the average cloud cover in North Carolina during August - November, and how many days were cloudy? We're going to assume that a "cloudy" day is defined as any day in which the cloud cover is above 50%. 
city_forecasts = pickle.load(open('city_forecasts.p', 'rb')) forecast_df = pd.DataFrame.from_dict(city_forecasts) cary_forecast = forecast_df['cary'] years = range(1990, 2016) months = range(7, 12) months_str = ['July', 'August', 'September', 'October', 'November'] def safe_cover(frame): if frame and 'cloudCover' in frame: return frame['cloudCover'] else: return np.NaN def monthly_avg_cloudcover(year, month): dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12), end=datetime(year, month + 1, 1, 12), freq='D', closed='left') cloud_cover_vals = list(map(lambda x: safe_cover(cary_forecast[x]['currently']), dates)) cloud_cover_samples = len(list(filter(lambda x: x is not np.NaN, cloud_cover_vals))) return np.nanmean(cloud_cover_vals), cloud_cover_samples monthly_cover_vals = [[monthly_avg_cloudcover(y, m)[0] for y in years] for m in months] f = figure(title='Monthly Average Cloud Cover', x_range=(1990, 2015), x_axis_label='Year') for x in range(0, len(months)): f.line(years, monthly_cover_vals[x], legend=months_str[x], color=Palette[x]) show(f) As we can see from the chart above, on the whole the monthly average cloud cover has been generally trending down over time. The average cloud cover is also lower than it was last year - it seems people are mostly just complaining. There are some data issues that start in 2012 that we need to be aware of - the cloud cover percentage doesn't exist for all days. Even so, the data that we have seems to reflect the wider trend, so we'll assume for now that the missing data doesn't skew our results. There's one more metric we want to check though - how many cloudy days were there? This is probably a better gauge of sentiment than the average monthly cover. 
def monthly_cloudy_days(year, month): dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12), end=datetime(year, month + 1, 1, 12), freq='D', closed='left') cloud_cover_vals = list(map(lambda x: safe_cover(cary_forecast[x]['currently']), dates)) cloud_cover_samples = len(list(filter(lambda x: x is not np.NaN, cloud_cover_vals))) cloudy_days = [cover > .5 for cover in cloud_cover_vals] return np.count_nonzero(cloudy_days), cloud_cover_samples monthly_days_vals = [[monthly_cloudy_days(y, m)[0] for y in years] for m in months] monthly_cover_samples = [[monthly_cloudy_days(y, m)[1] for y in years] for m in months] f = figure(title='Monthly Cloudy Days', x_range=(1990, 2015), x_axis_label='Year') for x in range(0, len(months)): f.line(years, monthly_days_vals[x], legend=months_str[x], color=Palette[x]) show(f) f = figure(title='Monthly Cloud Cover Samples', x_range=(1990, 2015), x_axis_label='Year', height=300) for x in range(0, len(months)): f.line(years, monthly_cover_samples[x], legend=months_str[x], color=Palette[x]) show(f) On the whole, the number of cloudy days seems to reflect the trend with average cloud cover - it's actually becoming more sunny as time progresses. That said, we need to be careful in how we view this number - because there weren't as many samples in 2015 as previous years, the number of days can get thrown off. In context though, even if most days not recorded were in fact cloudy, the overall count for 2015 would still be lower than previous years. In addition to checking cloud cover, I wanted to check precipitation data as well - what is the average precipitation chance over a month, and how many days during a month is rain likely? The thinking is that days with a high-precipitation chance will also be days in which it is cloudy or depressing. 
def safe_precip(frame): if frame and 'precipProbability' in frame: return frame['precipProbability'] else: return np.NaN def monthly_avg_precip(year, month): dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12), end=datetime(year, month + 1, 1, 12), freq='D', closed='left') precip_vals = list(map(lambda x: safe_precip(cary_forecast[x]['currently']), dates)) precip_samples = len(list(filter(lambda x: x is not np.NaN, precip_vals))) return np.nanmean(precip_vals), precip_samples monthly_avg_precip_vals = [[monthly_avg_precip(y, m)[0] for y in years] for m in months] f = figure(title='Monthly Average Precipitation Chance', x_range=(1990, 2015), x_axis_label='Year') for x in range(0, len(months)): f.line(years, monthly_avg_precip_vals[x], legend=months_str[x], color=Palette[x]) show(f) As we can see from the chart, the average chance of precipitation over a month more or less stays within a band of 0 - .1 for all months over all years. This is further evidence that the past few months are no more cloudy or rainy than previous years. Like the cloud cover though, we still want to get a count of all the rainy days, in addition to the average chance. We'll define a "rainy day" as any day in which the chance of rain is greater than 25%. 
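One subtlety with a threshold count like this: a missing reading (NaN) never compares greater than the threshold, so days without data are silently counted as "not rainy." A minimal sketch on made-up probabilities, using only the standard library; the `count_above` helper and the sample week are hypothetical and not part of the original analysis:

```python
import math


def count_above(values, threshold):
    """Count readings strictly above threshold and report how many readings exist."""
    # Drop missing readings first so the sample count is explicit.
    present = [v for v in values if not math.isnan(v)]
    hits = sum(1 for v in present if v > threshold)
    return hits, len(present)


# Hypothetical week of precipitation probabilities; NaN marks a missing day.
week = [0.10, 0.30, float("nan"), 0.26, 0.25, float("nan"), 0.00]
rainy, samples = count_above(week, 0.25)
print("rainy days:", rainy, "out of", samples, "days with data")
```

Reporting the sample count next to the day count makes it easier to judge whether a low number of rainy days reflects the weather or just missing data.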
def monthly_rainy_days(year, month): dates = pd.DatetimeIndex(start=datetime(year, month, 1, 12), end=datetime(year, month + 1, 1, 12), freq='D', closed='left') precip_prob_vals = list(map(lambda x: safe_precip(cary_forecast[x]['currently']), dates)) precip_prob_samples = len(list(filter(lambda x: x is not np.NaN, precip_prob_vals))) precip_days = [prob > .25 for prob in precip_prob_vals] return np.count_nonzero(precip_days), precip_prob_samples monthly_precip_days_vals = [[monthly_rainy_days(y, m)[0] for y in years] for m in months] monthly_precip_samples = [[monthly_rainy_days(y, m)[1] for y in years] for m in months] f = figure(title='Monthly Rainy Days', x_range=(1990, 2015), x_axis_label='Year') for x in range(0, len(months)): f.line(years, monthly_precip_days_vals[x], legend=months_str[x], color=Palette[x]) show(f) f = figure(title='Monthly Rainy Days Samples', x_range=(1990, 2015), x_axis_label='Year', height=300) for x in range(0, len(months)): f.line(years, monthly_precip_samples[x], legend=months_str[x], color=Palette[x]) show(f) After trying to find the number of days that are rainy, we can see that November hit its max value for rainy days in 2015. However, that value is 6, as compared to a previous maximum of 5. While it is a new record, the value isn't actually all that different. And for other months, the values are mostly in-line with the averages. Summary and Conclusions After having looked at forecast data for Cary, it appears that the months of July - November this year in terms of weather were at worst on par with prior years, if not slightly more sunny. This seems to be a case of confirmation bias: someone complains about a string of cloudy or rainy days, and suddenly you start noticing them more. While this analysis doesn't take into account other areas of North Carolina, my initial guess would be to assume that other areas also will show similar results: nothing interesting is happening. Maybe that will be for another blog post later! 
Coming soon: I'll compare rain/cloud conditions in North Carolina to some other places in the U.S.! Generating the Forecast file The following code generates the file that was used throughout the blog post. Please note that I'm retrieving data for other cities to use in a future blog post; only Cary data was used for this post. import pandas as pd from functools import reduce import requests from datetime import datetime # Coordinate data from http://itouchmap.com/latlong.html cary_loc = (35.79154,-78.781117) nyc_loc = (40.78306,-73.971249) seattle_loc = (47.60621,-122.332071) binghamton_loc = (42.098687,-75.917974) cities = { 'cary': cary_loc, 'nyc': nyc_loc, 'seattle': seattle_loc, 'binghamton': binghamton_loc } apikey = '' # My super-secret API Key def get_forecast(lat, long, date=None): forecast_base = "https://api.forecast.io/forecast/" if date is None: url = forecast_base + apikey + '/{},{}'.format(lat, long) else: epoch = int(date.timestamp()) url = forecast_base + apikey + '/{},{},{}'.format(lat, long, epoch) return requests.get(url).json() years = range(1990,2016) # For datetimes, the 12 is for getting the weather at noon. # We're doing this over midnight because we're more concerned # with what people see, and people don't typically see the weather # at midnight. 
dt_indices = [pd.date_range(start=datetime(year, 7, 1, 12), end=datetime(year, 11, 30, 12)) for year in years] dt_merge = reduce(lambda x, y: x.union(y), dt_indices) # Because we have to pay a little bit to use the API, we use for loops here # instead of a comprehension - if something breaks, we want to preserve the # data already retrieved city_forecasts = {} for city, loc in cities.items(): print("Retrieving data for {} starting at {}".format(city, datetime.now().strftime("%I:%M:%S %p"))) for dt in dt_merge: try: city_forecasts[(city, dt)] = get_forecast(*loc, dt) except Exception as e: print(e) city_forecasts[(city, dt)] = None print("End forecast retrieval: {}".format(datetime.now().strftime("%I:%M:%S %p"))) import pickle pickle.dump(city_forecasts, open('city_forecasts.p', 'wb')) ### Output: # Retrieving data for binghamton starting at 05:13:42 PM # Retrieving data for seattle starting at 05:30:51 PM # Retrieving data for nyc starting at 05:48:30 PM # Retrieving data for cary starting at 06:08:32 PM # End forecast retrieval: 06:25:21 PM ","keywords":"","version":null},{"title":"Guaranteed money maker","type":0,"sectionRef":"#","url":"/2016/02/guaranteed-money-maker","content":"","keywords":"","version":null},{"title":"Applying the Martingale Strategy","type":1,"pageTitle":"Guaranteed money maker","url":"/2016/02/guaranteed-money-maker#applying-the-martingale-strategy","content":" But we're all realistic people, and once you start talking about "unlimited money" eyebrows should be raised. Even still, this is an interesting strategy to investigate, and I want to apply it to the stock market. As long as we can guarantee there's a single day in which the stock goes up, we should be able to make money right? The question is just how much we have to invest to guarantee this. Now it's time for the math. 
We'll use the following definitions: $o_i$ = the share price at the opening of day $i$; $c_i$ = the share price at the close of day $i$; $d_i$ = the amount of money we want to invest at the beginning of day $i$. With those definitions in place, I'd like to present the formula that is guaranteed to make you money. I call it Bradlee's Investment Formula: $$c_n \\sum_{i=1}^n \\frac{d_i}{o_i} > \\sum_{i=1}^{n} d_i$$ It might not look like much, but if you can manage to make it so that this formula holds true, you will be guaranteed to make money. The intuition behind the formula is this: The closing share price times the number of shares you have purchased ends up greater than the amount of money you invested. That is, on day $n$, if you know what the closing price will be you can set up the amount of money you invest that day to guarantee you make money. I'll even teach you to figure out how much money that is! Take a look (the last step assumes $c_n > o_n$): $$\\begin{align*} c_n \\sum_{i=1}^{n-1} \\frac{d_i}{o_i} + \\frac{c_nd_n}{o_n} &> \\sum_{i=1}^{n-1}d_i + d_n\\\\ \\frac{c_nd_n}{o_n} - d_n &> \\sum_{i=1}^{n-1}(d_i - \\frac{c_nd_i}{o_i})\\\\ d_n (\\frac{c_n - o_n}{o_n}) &> \\sum_{i=1}^{n-1} d_i(1 - \\frac{c_n}{o_i})\\\\ d_n &> \\frac{o_n}{c_n - o_n} \\sum_{i=1}^{n-1} d_i(1 - \\frac{c_n}{o_i}) \\end{align*}$$ If you invest exactly the quantity on the right that day, you'll break even. But if you can make sure the money you invest is greater than that quantity (which requires that you have a crystal ball tell you the stock's closing price) you are guaranteed to make money! 
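To make the break-even amount concrete, here is a minimal sketch of the final inequality with made-up prices. The function name `break_even_investment` and all of the numbers are hypothetical, chosen only to illustrate the arithmetic; the sketch assumes the expected close is above the day's open.

```python
def break_even_investment(open_today, expected_close, purchases, opens):
    """Cash to invest today so the whole position breaks even at expected_close.

    purchases[i] is the cash invested on an earlier day at opening price opens[i].
    """
    gain = expected_close / open_today - 1  # (c_n - o_n) / o_n
    if gain <= 0:
        raise ValueError("expected close must be above today's open")
    # Sum of d_i * (1 - c_n / o_i): the shortfall on earlier purchases at the expected close.
    shortfall = sum(d * (1 - expected_close / o) for d, o in zip(purchases, opens))
    return shortfall / gain


# Hypothetical scenario: $100 invested at an open of 50; today opens at 48
# and we expect it to close up 2% at 48.96.
d_n = break_even_investment(48.0, 48.96, [100.0], [50.0])
shares = 100.0 / 50.0 + d_n / 48.0
position_value = shares * 48.96
total_invested = 100.0 + d_n
print("invest today:", round(d_n, 2))
print("profit at expected close:", round(position_value - total_invested, 6))
```

Investing anything above the returned amount yields a profit if the expected close is reached. A negative result means the earlier purchases are already profitable at that close.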
","version":null,"tagName":"h2"},{"title":"Interesting Implications","type":1,"pageTitle":"Guaranteed money maker","url":"/2016/02/guaranteed-money-maker#interesting-implications","content":" On a more serious note though, the formula above tells us a couple of interesting things: It's impossible to make money without the closing price at some point being greater than the opening price (or vice-versa if you are short selling) - there is no amount of money you can invest that will turn things in your favor. Close prices of the past aren't important if you're concerned about the bottom line. While chart technicians use price history to make judgment calls, in the end, the closing price on anything other than the last day is irrelevant. It's possible to make money as long as there is a single day where the closing price is greater than the opening price! You might have to invest a lot to do so, but it's possible. You must make a prediction about where the stock will close if you want to know how much to invest. That is, we can set up our investment for the day to make money if the stock goes up 1%, but if it only goes up .5% we'll still lose money. It's possible the winning move is to scale back your position. Consider the scenario: You invest money and the stock closes down .5% on the day. You invest tomorrow expecting the stock to go up 1%. The winning investment to break even (assuming a 1% increase) is to scale back the position, since the shares you purchased at the beginning would then be profitable. ","version":null,"tagName":"h2"},{"title":"Running the simulation","type":1,"pageTitle":"Guaranteed money maker","url":"/2016/02/guaranteed-money-maker#running-the-simulation","content":" So now that we've defined our investment formula, we need to tweak a couple things in order to make an investment strategy we can actually work with. There are two issues we need to address: The formula only tells us how much to invest if we want to break even ($d_n$). 
If we actually want to turn a profit, we need to invest more than that, which we will refer to as the bias. The formula assumes we know what the closing price will be on any given day. If we don't know this, we can still invest assuming the stock price will close at a level we choose. If the price doesn't meet this objective, we try again tomorrow! This predetermined closing price will be referred to as the expectation. Now that we've defined our bias and expectation, we can actually build a strategy we can simulate. Much like the martingale strategy told you to bet twice your previous bet in order to make money, we've designed a system that tells us how much to bet in order to make money as well. Now, let's get to the code! using Quandl api_key = "" daily_investment = function(current_open, current_close, purchase_history, open_history) # We're not going to safeguard against divide by 0 - that's the user's responsibility t1 = current_close / current_open - 1 t2 = sum(purchase_history - purchase_history*current_close ./ open_history) return t2 / t1 end; And let's code a way to run simulations quickly: is_profitable = function(current_price, purchase_history, open_history) shares = sum(purchase_history ./ open_history) return current_price*shares > sum(purchase_history) end simulate = function(name, start, init, expected, bias) ticker_info = quandlget(name, from=start, api_key=api_key) open_vals = ticker_info["Open"].values close_vals = ticker_info["Close"].values invested = [init] # The simulation stops once we've made a profit day = 1 profitable = is_profitable(close_vals[day], invested, open_vals[1:length(invested)]) || is_profitable(open_vals[day+1], invested, open_vals[1:length(invested)]) while !profitable expected_close = open_vals[day+1] * expected todays_purchase = daily_investment(open_vals[day+1], expected_close, invested, open_vals[1:day]) invested = [invested; todays_purchase + bias] # expected_profit = expected_close * sum(invested ./ 
open_vals[1:length(invested)]) - sum(invested) day += 1 profitable = is_profitable(close_vals[day], invested, open_vals[1:length(invested)]) || is_profitable(open_vals[day+1], invested, open_vals[1:length(invested)]) end shares = sum(invested ./ open_vals[1:length(invested)]) max_profit = max(close_vals[day], open_vals[day+1]) profit = shares * max_profit - sum(invested) return (invested, profit) end sim_summary = function(investments, profit) leverages = [sum(investments[1:i]) for i=1:length(investments)] max_leverage = maximum(leverages) / investments[1] println("Max leverage: $(max_leverage)") println("Days invested: $(length(investments))") println("Profit: $profit") end; Now, let's get some data and run a simulation! Our first test: We'll invest 100 dollars in LMT, and expect that the stock will close up 1% every day. We'll invest $d_n$ + 10 dollars every day that we haven't turned a profit, and end the simulation once we've made a profit. investments, profit = simulate("YAHOO/LMT", Date(2015, 11, 29), 100, 1.01, 10) sim_summary(investments, profit) Max leverage: 5.590373200042106 Days invested: 5 Profit: 0.6894803101560001 The result: We need to invest 5.6x our initial position over a period of 5 days to make approximately 69¢. Now let's try the same thing, but we'll assume the stock closes up 2% instead. investments, profit = simulate("YAHOO/LMT", Date(2015, 11, 29), 100, 1.02, 10) sim_summary(investments, profit) Max leverage: 1.854949900247809 Days invested: 25 Profit: 0.08304813163696423 In this example, we only get up to a 1.85x leveraged position, but it takes 25 days to turn a profit of 8¢. ","version":null,"tagName":"h2"},{"title":"Summary","type":1,"pageTitle":"Guaranteed money maker","url":"/2016/02/guaranteed-money-maker#summary","content":" We've defined an investment strategy that can tell us how much to invest when we know what the closing position of a stock will be. 
We can tweak the strategy to actually make money, but plenty of work needs to be done so that we can optimize the money invested. In the next post I'm going to post more information about some backtests and strategy tests on this strategy (unless of course this experiment actually produces a significant profit potential, and then I'm keeping it for myself). ","version":null,"tagName":"h2"},{"title":"Side note and disclaimer","type":1,"pageTitle":"Guaranteed money maker","url":"/2016/02/guaranteed-money-maker#side-note-and-disclaimer","content":" The claims made in this presentation about being able to guarantee making money are intended as a joke and do not constitute investment advice of any sort. ","version":null,"tagName":"h3"},{"title":"Profitability using the investment formula","type":0,"sectionRef":"#","url":"/2016/02/profitability-using-the-investment-formula","content":"","keywords":"","version":null},{"title":"Theoretical Justification","type":1,"pageTitle":"Profitability using the investment formula","url":"/2016/02/profitability-using-the-investment-formula#theoretical-justification","content":" The formula itself is designed to be simple in principle: I like making a profit, and I want to penalize the leverage you incur and days you have to invest. Ideally, we want to have a stock that goes up all the time. However, the investment formula takes advantage of a different case: trying to profit from highly volatile assets. If we can make money when the investment only has one day up, let's do it! Even so, there are two potential issues: First, stocks that trend upward will have a higher profitability score - both leverage and days invested will be 1. To protect against only investing in this trend, I can do things like taking $\\log(d)$. I don't want to start biasing the scoring function until I have a practical reason to do so, so right now I'll leave it standing. 
The second issue is how to penalize leverage and days invested relative to each other. As it currently stands, a leverage of 6x with only 1 day invested is the same as leveraging 2x with 3 days invested. In the future, I'd again want to look at making the impact of days invested smaller - I can get over an extra 3 days in the market if it means that I don't have to incur a highly leveraged position. So there could be things about the scoring function we change in the future, but I want to run some actual tests before we start worrying about things like that! ","version":null,"tagName":"h2"},{"title":"Running a simulation","type":1,"pageTitle":"Profitability using the investment formula","url":"/2016/02/profitability-using-the-investment-formula#running-a-simulation","content":" This won't be an incredibly rigorous backtest, I just want to see some results from the work so far. Let's set up the simulation code again, and start looking into some random stocks. If you've read the last blog post, you can skip over the code. The only difference is that it's been ported to python to make the data-wrangling easier. Julia doesn't yet support some of the multi-index things I'm trying to do. 
import numpy as np import pandas as pd import matplotlib.pyplot as plt from Quandl import get as qget %matplotlib inline api_key = '' profitability = lambda p, i, m, d: 1000*p / (m + i*d) def is_profitable(current_price, purchase_history, open_history): shares = (purchase_history / open_history).sum() return current_price * shares > sum(purchase_history) def daily_investment(current_open, current_close, purchase_history, open_history): t1 = current_close / current_open - 1 t2 = (purchase_history - purchase_history * current_close / open_history).sum() return t2 / t1 def simulate_day(open_vals, close_vals, init, expected, bias): invested = np.array([init]) day = 1 profitable = is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]) \\ or is_profitable(open_vals[day], invested, open_vals[0:len(invested)]) while not profitable: expected_close = open_vals[day] * expected todays_purchase = daily_investment(open_vals[day], expected_close, invested, open_vals[0:day]) invested = np.append(invested, todays_purchase + bias) # expected_profit = expected_close * (invested / open_vals[0:len(invested)]).sum() - invested.sum() day += 1 profitable = is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]) \\ or is_profitable(open_vals[day], invested, open_vals[0:len(invested)]) shares = (invested / open_vals[0:len(invested)]).sum() # Make sure we can't see into the future - we know either today's close or tomorrow's open # will be profitable, but we need to check which one. if is_profitable(close_vals[day-1], invested, open_vals[0:len(invested)]): ending_price = close_vals[day-1] else: ending_price = open_vals[day] profit = shares * ending_price - sum(invested) return invested, profit def simulate_ts(name, start, end, initial, expected, bias): ticker_info = qget(name, trim_start=start, api_key=api_key) evaluation_times = ticker_info[:end].index # Handle Google vs. 
YFinance data if "Adjusted Close" in ticker_info.columns: close_column = "Adjusted Close" else: close_column = "Close" sim = {d: simulate_day(ticker_info[d:]["Open"], ticker_info[d:][close_column], 100, 1.02, 10) for d in evaluation_times} sim_series = pd.Series(sim) result = pd.DataFrame() result["profit"] = sim_series.apply(lambda x: x[1]) result["max"] = sim_series.apply(lambda x: max(x[0])) result["days"] = sim_series.apply(lambda x: len(x[0])) result["score"] = sim_series.apply(lambda x: profitability(x[1], x[0][0], max(x[0]), len(x[0]))) result["investments"] = sim_series.apply(lambda x: x[0]) return result def simulate_tickers(tickers): from datetime import datetime results = {} for ticker in tickers: start = datetime(2015, 1, 1) results_df = simulate_ts(ticker, start, datetime(2016, 1, 1), 100, 1.01, 10) results[ticker] = results_df return pd.concat(list(results.values()), keys=list(results.keys()), axis=1) ","version":null,"tagName":"h2"},{"title":"And now the interesting part","type":1,"pageTitle":"Profitability using the investment formula","url":"/2016/02/profitability-using-the-investment-formula#and-now-the-interesting-part","content":" Let's start looking into the data! FANG stocks have been big over the past year, let's see how they look: fang_df = simulate_tickers(["YAHOO/FB", "YAHOO/AAPL", "YAHOO/NFLX", "YAHOO/GOOG"]) fang_df.xs('days', axis=1, level=1).hist() plt.gcf().set_size_inches(18, 8); plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18); fang_df.xs('score', axis=1, level=1).plot() plt.gcf().set_size_inches(18, 6) plt.gcf().suptitle("Profitability score over time", fontsize=18); Let's think about these graphs. First, the histogram. What we like seeing is a lot of 1's - that means there were a lot of days that the stock went up and we didn't have to worry about actually implementing the strategy - we were able to close the trade at a profit. 
Looking at the profitability score over time though is a bit more interesting. First off, stocks that are more volatile will tend to have a higher profitability score, no two ways about that. However, Netflix consistently outperformed on this metric. We know that 2015 was a good year for Netflix, so that's a (small) sign the strategy is performing as expected. The final interesting note happens around the end of August 2015. Around this period, the markets were selling off in a big way due to issues in China (not unlike what's happening now). Even so, all of the FANG stocks saw an uptick in profitability around this time. This is another sign that the strategy being developed performs better during periods of volatility, rather than from riding markets up or down. What about FANG vs. some cyclicals? cyclic_df = simulate_tickers(["YAHOO/X", "YAHOO/CAT", "YAHOO/NFLX", "YAHOO/GOOG"]) cyclic_df.xs('days', axis=1, level=1).hist() plt.gcf().set_size_inches(18, 8); plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18); cyclic_df.xs('score', axis=1, level=1).plot() plt.gcf().set_size_inches(18, 6) plt.gcf().suptitle("Profitability score over time", fontsize=18); Some more interesting results come from this as well. First off, US Steel (X) has a much smoother distribution of days until profitability - it doesn't have a huge number of values at 1 and then drop off. Intuitively, we're not terribly large fans of this, we want a stock to go up! However, on the profitability score it is the only serious contender to Netflix. Second, we see the same trend around August - the algorithm performs well in volatile markets. For a final test, let's try some biotech and ETFs! 
biotech_df = simulate_tickers(['YAHOO/REGN', 'YAHOO/CELG', 'GOOG/NASDAQ_BIB', 'GOOG/NASDAQ_IBB']) biotech_df.xs('days', axis=1, level=1).hist() plt.gcf().set_size_inches(18, 8); plt.gcf().suptitle("Distribution of Days Until Profitability", fontsize=18); biotech_df.xs('score', axis=1, level=1).plot() plt.gcf().set_size_inches(18, 6) plt.gcf().suptitle("Profitability score over time", fontsize=18); In this example, we don't see a whole lot of interesting things: the scores are all fairly close together with notable exceptions in late August, and mid-October. What is interesting is that during the volatile period, the ETF's performed significantly better than the stocks did in terms of profitability. The leveraged ETF (BIB) performed far above anyone else, and it appears that indeed, it is most profitable during volatile periods. Even so, it was far more likely to take multiple days to give a return. Its count of 1-day investments trails the other ETF and both stocks by a decent margin. And consider me an OCD freak, but I just really like Celgene's distribution - it looks nice and smooth. ","version":null,"tagName":"h2"},{"title":"Summary and plans for the next post","type":1,"pageTitle":"Profitability using the investment formula","url":"/2016/02/profitability-using-the-investment-formula#summary-and-plans-for-the-next-post","content":" So far I'm really enjoying playing with this strategy - there's a lot of depth here to understand, though the preliminary results seem to indicate that it profits mostly from taking the other side of a volatile trade. I'd be interested to run results later on data from January - It's been a particularly volatile start to the year so it would be neat to see whether this strategy would work then. For the next post, I want to start playing with some of the parameters: How do the bias and expected close influence the process? 
The values have been fairly conservative so far, it will be interesting to see how the simulations respond afterward. ","version":null,"tagName":"h2"},{"title":"Predicting Santander customer happiness","type":0,"sectionRef":"#","url":"/2016/03/predicting-santander-customer-happiness","content":"","keywords":"","version":null},{"title":"Data Exploration","type":1,"pageTitle":"Predicting Santander customer happiness","url":"/2016/03/predicting-santander-customer-happiness#data-exploration","content":" First up: we need to load our data and do some exploratory work. Because we're going to be using this data for model selection prior to testing, we need to make a further split. I've already gone ahead and done this work, please see the code in the appendix below. import pandas as pd import numpy as np import matplotlib.pyplot as plt %matplotlib inline # Record how long it takes to run the notebook - I'm curious. from datetime import datetime start = datetime.now() dataset = pd.read_csv('split_train.csv') dataset.index = dataset.ID X = dataset.drop(['TARGET', 'ID', 'ID.1'], 1) y = dataset.TARGET y.unique() array([0, 1], dtype=int64) len(X.columns) 369 Okay, so there are only two classes we're predicting: 1 for unsatisfied customers, 0 for satisfied customers. I would have preferred this to be something more like a regression, or predicting multiple classes: maybe the customer isn't the most happy, but is nowhere near closing their accounts. For now though, that's just the data we're working with. Now, I'd like to make a scatter matrix of everything going on. Unfortunately as noted above, we have 369 different features. There's no way I can graphically make sense of that much data to start with. We're also not told what the data actually represents: Are these survey results? Average time between contact with a customer care person? Frequency of contacting a customer care person? The idea is that I need to reduce the number of dimensions we're predicting across. 
","version":null,"tagName":"h2"},{"title":"Dimensionality Reduction pt. 1 - Binary Classifiers","type":1,"pageTitle":"Predicting Santander customer happiness","url":"/2016/03/predicting-santander-customer-happiness#dimensionality-reduction-pt-1---binary-classifiers","content":" My first attempt to reduce the data dimensionality is to find all the binary classifiers in the dataset (i.e. 0 or 1 values) and see if any of those are good (or anti-good) predictors of the final data. cols = X.columns b_class = [] for c in cols: if len(X[c].unique()) == 2: b_class.append(c) len(b_class) 111 So there are 111 features in the dataset that are a binary label. Let's see if any of them are good at predicting the users satisfaction! # First we need to `binarize` the data to 0-1; some of the labels are {0, 1}, # some are {0, 3}, etc. from sklearn.preprocessing import binarize X_bin = binarize(X[b_class]) accuracy = [np.mean(X_bin[:,i] == y) for i in range(0, len(b_class))] acc_df = pd.DataFrame({"Accuracy": accuracy}, index=b_class) acc_df.describe() \tAccuracycount\t111.000000 mean\t0.905159 std\t0.180602 min\t0.043598 25%\t0.937329 50%\t0.959372 75%\t0.960837 max\t0.960837 Wow! Looks like we've got some incredibly predictive features! So much so that we should be a bit concerned. My initial guess for what's happening is that we have a sparsity issue: so many of the values are 0, and these likely happen to line up with satisfied customers. So the question we must now answer, which I likely should have asked long before now: What exactly is the distribution of un/satisfied customers? unsat = y[y == 1].count() print("Satisfied customers: {}; Unsatisfied customers: {}".format(len(y) - unsat, unsat)) naive_guess = np.mean(y == np.zeros(len(y))) print("Naive guess accuracy: {}".format(naive_guess)) Satisfied customers: 51131; Unsatisfied customers: 2083 Naive guess accuracy: 0.9608561656706882 This is a bit discouraging. 
A naive guess of "always satisfied" performs as well as our best individual binary classifier. What this tells me then, is that these data columns aren't incredibly helpful in prediction. I'd be interested in a polynomial expansion of this data-set, but for now, that's more computation than I want to take on. ","version":null,"tagName":"h3"},{"title":"Dimensionality Reduction pt. 2 - LDA","type":1,"pageTitle":"Predicting Santander customer happiness","url":"/2016/03/predicting-santander-customer-happiness#dimensionality-reduction-pt-2---lda","content":" Knowing that our naive guess performs so well is a blessing and a curse: Curse: The threshold for performance is incredibly high: We can only "improve" over the naive guess by 4%. Blessing: All the binary classification features we just discovered are worthless on their own. We can throw them out and reduce the data dimensionality from 369 to 258. Now, in removing these features from the dataset, I'm not saying that there is no "information" contained within them. There might be. But the only way we'd know is through a polynomial expansion, and I'm not going to take that on within this post. My initial thought for a "next guess" is to use the LDA model for dimensionality reduction. However, it can only reduce dimensions to at most $p - 1$, with $p$ being the number of classes. Since this is a binary classification, every LDA model that I try will have dimensionality one; when I actually try this, the predictor ends up being slightly less accurate than the naive guess. Instead, let's take a different approach to dimensionality reduction: principal component analysis. This allows us to perform the dimensionality reduction without worrying about the number of classes. Then, we'll use a Gaussian Naive Bayes model to actually do the prediction. This model is chosen simply because it doesn't take a long time to fit and compute; because PCA will take so long, I just want a prediction at the end of this. 
We can worry about using a more sophisticated LDA/QDA/SVM model later. Now into the actual process: We're going to test out PCA dimensionality reduction from 1 to 20 dimensions, and then predict using a Gaussian Naive Bayes model. The 20 dimensions upper limit was selected because the accuracy never improves after you get beyond that (I found out by running it myself). Hopefully, we'll find that we can create a model better than the naive guess. from sklearn.naive_bayes import GaussianNB from sklearn.decomposition import PCA X_no_bin = X.drop(b_class, 1) def evaluate_gnb(dims): pca = PCA(n_components=dims) X_xform = pca.fit_transform(X_no_bin) gnb = GaussianNB() gnb.fit(X_xform, y) return gnb.score(X_xform, y) dim_range = np.arange(1, 21) plt.plot(dim_range, [evaluate_gnb(dim) for dim in dim_range], label="Gaussian NB Accuracy") plt.axhline(naive_guess, label="Naive Guess", c='k') plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k') plt.gcf().set_size_inches(12, 6) plt.legend(); sigh... After all the effort and computational power, we're still at square one: we have yet to beat out the naive guess threshold. With PCA in play we end up performing terribly, but not terribly enough that we can guess against ourselves. Let's try one last-ditch attempt using the entire data set: def evaluate_gnb_full(dims): pca = PCA(n_components=dims) X_xform = pca.fit_transform(X) gnb = GaussianNB() gnb.fit(X_xform, y) return gnb.score(X_xform, y) dim_range = np.arange(1, 21) plt.plot(dim_range, [evaluate_gnb_full(dim) for dim in dim_range], label="Gaussian NB Accuracy") plt.axhline(naive_guess, label="Naive Guess", c='k') plt.axhline(1 - naive_guess, label="Inverse Naive Guess", c='k') plt.gcf().set_size_inches(12, 6) plt.legend(); Nothing. It is interesting to note that the graphs are almost exactly the same: This would imply again that the variables we removed earlier (all the binary classifiers) indeed have almost no predictive power. 
It seems this problem is high-dimensional, but with almost no data that can actually inform our decisions. ","version":null,"tagName":"h3"},{"title":"Summary for Day 1","type":1,"pageTitle":"Predicting Santander customer happiness","url":"/2016/03/predicting-santander-customer-happiness#summary-for-day-1","content":" After spending a couple hours with this dataset, there seems to be a fundamental issue in play: We have very high-dimensional data, and it has no bearing on our ability to actually predict customer satisfaction. This can be a huge issue: it implies that no matter what model we use, we fundamentally can't perform well. I'm sure most of this is because I'm not an experienced data scientist. Even so, we have yet to develop a strategy that can actually beat out the village idiot; so far, the bank is best off just assuming all its customers are satisfied. Hopefully more to come soon. end = datetime.now() print("Running time: {}".format(end - start)) Running time: 0:00:58.715714 ","version":null,"tagName":"h2"},{"title":"Appendix","type":1,"pageTitle":"Predicting Santander customer happiness","url":"/2016/03/predicting-santander-customer-happiness#appendix","content":" Code used to split the initial training data: from sklearn.model_selection import train_test_split data = pd.read_csv('train.csv') data.index = data.ID data_train, data_validate = train_test_split( data, train_size=.7) data_train.to_csv('split_train.csv') data_validate.to_csv('split_validate.csv') ","version":null,"tagName":"h2"},{"title":"Tweet like me","type":0,"sectionRef":"#","url":"/2016/03/tweet-like-me","content":"","keywords":"","version":null},{"title":"The Objective","type":1,"pageTitle":"Tweet like me","url":"/2016/03/tweet-like-me#the-objective","content":" Given an input list of Tweets, build up the following things: The distribution of starting words; since there are no "prior" words to go from, we need to treat this as a special case. The distribution of words given a previous 
word; for example, every time I use the word woodchuck in the example sentence, there is a 50% chance it is followed by chuck and a 50% chance it is followed by could. I need this distribution for all words. The distribution of quantity of hashtags; Do I most often use just one? Two? Do they follow something like a Poisson distribution? The distribution of hashtags; Given a number of hashtags, what is the actual content? I'll treat hashtags as separate from the content of a tweet. ","version":null,"tagName":"h2"},{"title":"The Data","type":1,"pageTitle":"Tweet like me","url":"/2016/03/tweet-like-me#the-data","content":" I'm using as input my tweet history. I don't really use Twitter anymore, but it seems like a fun use of the dataset. I'd like to eventually build this to a point where I can imitate anyone on Twitter using their last 100 tweets or so, but I'll start with this as example code. ","version":null,"tagName":"h2"},{"title":"The Algorithm","type":1,"pageTitle":"Tweet like me","url":"/2016/03/tweet-like-me#the-algorithm","content":" I'll be using the NLTK library for doing a lot of the heavy lifting. First, let's import the data: import pandas as pd tweets = pd.read_csv('tweets.csv') text = tweets.text # Don't include tweets in reply to or mentioning people replies = text.str.contains('@') text_norep = text.loc[~replies] And now that we've got data, let's start crunching. First, tokenize and build out the distribution of first word: from nltk.tokenize import TweetTokenizer tknzr = TweetTokenizer() tokens = text_norep.map(tknzr.tokenize) first_words = tokens.map(lambda x: x[0]) first_words_alpha = first_words[first_words.str.isalpha()] first_word_dist = first_words_alpha.value_counts() / len(first_words_alpha) Next, we need to build out the conditional distributions. That is, what is the probability of the next word given the current word is $X$? This one is a bit more involved. First, find all unique words, and then find what words follow them. 
This can probably be done in a more efficient manner than I'm currently doing here, but we'll ignore that for the moment. from functools import reduce # Get all possible words all_words = reduce(lambda x, y: x+y, tokens, []) unique_words = set(all_words) actual_words = set([x if x[0] != '.' else None for x in unique_words]) word_dist = {} for word in iter(actual_words): indices = [i for i, j in enumerate(all_words) if j == word] proceeding = [all_words[i+1] for i in indices] word_dist[word] = proceeding Now that we've got the tweet analysis done, it's time for the fun part: hashtags! Let's count how many hashtags are in each tweet; I want to get a sense of the distribution. import matplotlib.pyplot as plt %matplotlib inline hashtags = text_norep.str.count('#') bins = hashtags.unique().max() hashtags.plot(kind='hist', bins=bins) <matplotlib.axes._subplots.AxesSubplot at 0x18e59dc28d0> That looks like a Poisson distribution, kind of as I expected. I'm guessing my number of hashtags per tweet is $\\sim Poi(1)$, but let's actually find the maximum likelihood estimator, which in this case is just the sample mean $\\bar{\\lambda}$: mle = hashtags.mean() mle 0.870236869207003 Pretty close! So we can now simulate how many hashtags are in a tweet. Let's also find what hashtags are actually used: hashtags = [x for x in all_words if x[0] == '#'] n_hashtags = len(hashtags) unique_hashtags = list(set([x for x in unique_words if x[0] == '#'])) hashtag_dist = pd.DataFrame({'hashtags': unique_hashtags, 'prob': [all_words.count(h) / n_hashtags for h in unique_hashtags]}) len(hashtag_dist) 603 Turns out I have used 603 different hashtags during my time on Twitter. That means I was using a unique hashtag for about every third tweet. In better news though, we now have all the data we need to go about actually constructing tweets! 
The process will happen in a few steps: Randomly select what the first word will be. Randomly select the number of hashtags for this tweet, and then select the actual hashtags. Fill in the remaining space of 140 characters with random words taken from my tweets. And hopefully, we won't have anything too crazy come out the other end. The way we do the selection follows a Multinomial Distribution: given a lot of different values with specific probability, pick one. Let's give a quick example: x: .33 y: .5 z: .17 That is, I pick x with probability 33%, y with probability 50%, and so on. In context of our sentence construction, I've built out the probabilities of specific words already - now I just need to simulate that distribution. Time for the engine to actually be developed! import numpy as np def multinom_sim(n, vals, probs): occurrences = np.random.multinomial(n, probs) results = occurrences * vals return ' '.join(results[results != '']) def sim_n_hashtags(hashtag_freq): return np.random.poisson(hashtag_freq) def sim_hashtags(n, hashtag_dist): return multinom_sim(n, hashtag_dist.hashtags, hashtag_dist.prob) def sim_first_word(first_word_dist): probs = np.float64(first_word_dist.values) return multinom_sim(1, first_word_dist.reset_index()['index'], probs) def sim_next_word(current, word_dist): dist = pd.Series(word_dist[current]) probs = np.ones(len(dist)) / len(dist) return multinom_sim(1, dist, probs) ","version":null,"tagName":"h2"},{"title":"Pulling it all together","type":1,"pageTitle":"Tweet like me","url":"/2016/03/tweet-like-me#pulling-it-all-together","content":" I've now built out all the code I need to actually simulate a sentence written by me. 
Let's try doing an example with five words and a single hashtag: first = sim_first_word(first_word_dist) second = sim_next_word(first, word_dist) third = sim_next_word(second, word_dist) fourth = sim_next_word(third, word_dist) fifth = sim_next_word(fourth, word_dist) hashtag = sim_hashtags(1, hashtag_dist) ' '.join((first, second, third, fourth, fifth, hashtag)) 'My first all-nighter of friends #oldschool' Let's go ahead and put everything together! We're going to simulate a first word, simulate the hashtags, and then simulate to fill the gap until we've either taken up all the space or reached a period. def simulate_tweet(): chars_remaining = 140 first = sim_first_word(first_word_dist) n_hash = sim_n_hashtags(mle) hashtags = sim_hashtags(n_hash, hashtag_dist) chars_remaining -= len(first) + len(hashtags) tweet = first current = first while chars_remaining > len(tweet) + len(hashtags) and current[0] != '.' and current[0] != '!': current = sim_next_word(current, word_dist) tweet += ' ' + current tweet = tweet[:-2] + tweet[-1] return ' '.join((tweet, hashtags)).strip() ","version":null,"tagName":"h2"},{"title":"The results","type":1,"pageTitle":"Tweet like me","url":"/2016/03/tweet-like-me#the-results","content":" And now for something completely different: twenty random tweets dreamed up by my computer and my Twitter data. Here you go: for i in range(0, 20): print(simulate_tweet()) print() Also , I'm at 8 this morning. #thursdaysgohard #ornot Turns out of us breathe the code will want to my undergraduate career is becoming more night trying ? Religion is now as a chane #HYPE You know what recursion is to review the UNCC. #ornot There are really sore 3 bonfires in my first writing the library ground floor if awesome. #realtalk #impressed So we can make it out there's nothing but I'm not let us so hot I could think I may be good. #SwingDance Happy Christmas , at Harris Teeter to be be godly or Roman Catholic ). 
#4b392b#4b392b #Isaiah26 For context , I in the most decisive factor of the same for homework. #accomplishment Freaking done. #loveyouall New blog post : Don't jump in a quiz in with a knife fight. #haskell #earlybirthday God shows me legitimately want to get some food and one day. Stormed the queen city. #mindblown The day of a cold at least outside right before the semester .. Finished with the way back. #winners Waking up , OJ , I feel like Nick Jonas today. First draft of so hard drive. #humansvszombies Eric Whitacre is the wise creation. Ethics paper first , music in close to everyone who just be posting up with my sin , and Jerry Springr #TheLittleThings Love that you know enough time I've eaten at 8 PM. #deepthoughts #stillblownaway Lead. #ThinkingTooMuch #Christmas Aamazing conference when you married #DepartmentOfRedundancyDepartment Yep , but there's a legitimate challenge. ...Which all ended up being a whole lot more nonsensical than I had hoped for. There are some good ones, so I'll call that an accomplishment! I was banking on grammar not being an issue: since my tweets use impeccable grammar, the program modeled off them should have pretty good grammar as well. There are going to be some hilarious edge cases (I'm looking at you, Ethics paper first, music in close to everyone) that make no sense, and some hilarious edge cases (Waking up, OJ, I feel like Nick Jonas today) that make me feel like I should have a Twitter rap career. On the whole though, the structure came out alright. ","version":null,"tagName":"h2"},{"title":"Moving on from here","type":1,"pageTitle":"Tweet like me","url":"/2016/03/tweet-like-me#moving-on-from-here","content":" During class we also talked about an interesting idea: trying to analyze corporate documents and corporate speech. I'd be interested to know what this analysis applied to something like a couple of bank press releases could do. In any case, the code needs some work to clean it up before I get that far. 
","version":null,"tagName":"h2"},{"title":"For further reading","type":1,"pageTitle":"Tweet like me","url":"/2016/03/tweet-like-me#for-further-reading","content":" I'm pretty confident I re-invented a couple wheels along the way - what I'm doing feels a lot like what Markov Chain Monte Carlo is intended to do. But I've never worked explicitly with that before, so more research is needed. ","version":null,"tagName":"h2"},{"title":"Tick tock...","type":0,"sectionRef":"#","url":"/2016/04/tick-tock","content":"","keywords":"","version":null},{"title":"2.5 Billion","type":1,"pageTitle":"Tick tock...","url":"/2016/04/tick-tock#25-billion","content":" If PBS is right, that's the total number of heartbeats we get. Approximately once every second that number goes down, and down, and down again... total_heartbeats = 2500000000 I got a Fitbit this past Christmas season, mostly because I was interested in the data and trying to work on some data science projects with it. This is going to be the first project, but there will likely be more (and not nearly as morbid). My idea was: If this is the final number that I'm running up against, how far have I come, and how far am I likely to go? I've currently had about 3 months' time to estimate what my data will look like, so let's go ahead and see: given a lifetime 2.5 billion heart beats, how much time do I have left? ","version":null,"tagName":"h2"},{"title":"Statistical Considerations","type":1,"pageTitle":"Tick tock...","url":"/2016/04/tick-tock#statistical-considerations","content":" Since I'm starting to work with health data, there are a few considerations I think are important before I start digging through my data. The concept of 2.5 billion as an agreed-upon number is tenuous at best. I've seen anywhere from 2.21 billion to 3.4 billion so even if I knew exactly how many times my heart had beaten so far, the ending result is suspect at best. 
I'm using 2.5 billion because that seems to be about the midpoint of the estimates I've seen so far. Most of the numbers I've seen so far are based on extrapolating number of heart beats from life expectancy. As life expectancy goes up, the number of expected heart beats goes up too. My estimation of the number of heartbeats in my life so far is based on 3 months' worth of data, and I'm extrapolating an entire lifetime based on this. So while the ending number is not useful in any medical context, it is still an interesting project to work with the data I have on hand. ","version":null,"tagName":"h2"},{"title":"Getting the data","type":1,"pageTitle":"Tick tock...","url":"/2016/04/tick-tock#getting-the-data","content":" Fitbit has an API available for people to pull their personal data off the system. It requires registering an application, authentication with OAuth, and some other complicated things. If you're not interested in how I fetch the data, skip here. ","version":null,"tagName":"h2"},{"title":"Registering an application","type":1,"pageTitle":"Tick tock...","url":"/2016/04/tick-tock#registering-an-application","content":" I've already registered a personal application with Fitbit, so I can go ahead and retrieve things like the client secret from a file. # Import all the OAuth secret information from a local file from secrets import CLIENT_SECRET, CLIENT_ID, CALLBACK_URL ","version":null,"tagName":"h2"},{"title":"Handling OAuth 2","type":1,"pageTitle":"Tick tock...","url":"/2016/04/tick-tock#handling-oauth-2","content":" So, all the people that know what OAuth 2 is know what's coming next. For those who don't: OAuth is how people allow applications to access their data without the application ever having to know their password. Essentially the dialog goes like this: Application: I've got a user here who wants to use my application, but I need their data. Fitbit: OK, what data do you need access to, and for how long? 
Application: I need all of these scopes, and for this amount of time. Fitbit: OK, let me check with the user to make sure they really want to do this. Fitbit: User, do you really want to let this application have your data? User: I do! And to prove it, here's my password. Fitbit: OK, everything checks out. I'll let the application access your data. Fitbit: Application, you can access the user's data. Use this special value whenever you need to request data from me. Application: Thank you, now give me all the data. Effectively, this allows an application to gain access to a user's data without ever needing to know the user's password. That way, even if the other application is hacked, the user's original data remains safe. Plus, the user can let the data service know to stop providing the application access any time they want. All in all, very secure. It does make handling small requests a bit challenging, but I'll go through the steps here. We'll be using the Implicit Grant workflow, as it requires fewer steps in processing. First, we need to set up the URL the user would visit to authenticate: import urllib.parse FITBIT_URI = 'https://www.fitbit.com/oauth2/authorize' params = { # If we need more than one scope, must be a CSV string 'scope': 'heartrate', 'response_type': 'token', 'expires_in': 86400, # 1 day 'redirect_uri': CALLBACK_URL, 'client_id': CLIENT_ID } request_url = FITBIT_URI + '?' + urllib.parse.urlencode(params) Now, here you would print out the request URL, go visit it, and get the full URL that it sends you back to. Because that is very sensitive information (specifically containing my CLIENT_ID that I'd really rather not share on the internet), I've skipped that step in the code here, but it happens in the background. # The `response_url` variable contains the full URL that # FitBit sent back to us, but most importantly, # contains the token we need for authorization. 
access_token = dict(urllib.parse.parse_qsl(response_url))['access_token'] ","version":null,"tagName":"h3"},{"title":"Requesting the data","type":1,"pageTitle":"Tick tock...","url":"/2016/04/tick-tock#requesting-the-data","content":" Now that we've actually set up our access via the access_token, it's time to get the actual heart rate data. I'll be using data from January 1, 2016 through March 31, 2016, and extrapolating wildly from that. Fitbit only lets us fetch intraday data one day at a time, so I'll create a date range using pandas and iterate through that to pull down all the data. from requests_oauthlib import OAuth2Session import pandas as pd from datetime import datetime session = OAuth2Session(token={ 'access_token': access_token, 'token_type': 'Bearer' }) format_str = '%Y-%m-%d' start_date = datetime(2016, 1, 1) end_date = datetime(2016, 3, 31) dr = pd.date_range(start_date, end_date) url = 'https://api.fitbit.com/1/user/-/activities/heart/date/{0}/1d/1min.json' hr_responses = [session.get(url.format(d.strftime(format_str))) for d in dr] def record_to_df(record): if 'activities-heart' not in record: return None date_str = record['activities-heart'][0]['dateTime'] df = pd.DataFrame(record['activities-heart-intraday']['dataset']) df.index = df['time'].apply( lambda x: datetime.strptime(date_str + ' ' + x, '%Y-%m-%d %H:%M:%S')) return df hr_dataframes = [record_to_df(record.json()) for record in hr_responses] hr_df_concat = pd.concat(hr_dataframes) # There are some minutes with missing data, so we need to correct that full_daterange = pd.date_range(hr_df_concat.index[0], hr_df_concat.index[-1], freq='min') hr_df_full = hr_df_concat.reindex(full_daterange, method='nearest') print("Heartbeats from {} to {}: {}".format(hr_df_full.index[0], hr_df_full.index[-1], hr_df_full['value'].sum())) Heartbeats from 2016-01-01 00:00:00 to 2016-03-31 23:59:00: 8139060 And now we've retrieved all the available heart rate data for January 1st through March 31st! 
Let's get to the actual analysis. ","version":null,"tagName":"h3"},{"title":"Wild Extrapolations from Small Data","type":1,"pageTitle":"Tick tock...","url":"/2016/04/tick-tock#wild-extrapolations-from-small-data","content":" A fundamental issue of this data is that it's pretty small. I'm using 3 months of data to make predictions about my entire life. But, purely as an exercise, I'll move forward. ","version":null,"tagName":"h2"},{"title":"How many heartbeats so far?","type":1,"pageTitle":"Tick tock...","url":"/2016/04/tick-tock#how-many-heartbeats-so-far","content":" The first step is figuring out how many of the 2.5 billion heartbeats I've used so far. We're going to try and work backward from the present day to when I was born to get that number. The easy part comes first: going back to January 1st, 1992. That's because I can generalize how many 3-month increments there were between now and then, account for leap years, and call that section done. Between January 1992 and January 2016 there were 96 quarters, and 6 leap days. The quarter I measured contains a leap day, so multiplying it by the number of quarters counts one leap day per quarter; the leap days that never actually happened have to be subtracted back out. The number we're looking for is: \\begin{equation*} hr_q \\cdot n - hr_d \\cdot (n-m) \\end{equation*} $hr_q$: Number of heartbeats per quarter; $hr_d$: Number of heartbeats on leap day; $n$: Number of quarters, in this case 96; $m$: Number of leap days, in this case 6. quarterly_count = hr_df_full['value'].sum() leap_day_count = hr_df_full[(hr_df_full.index.month == 2) & (hr_df_full.index.day == 29)]['value'].sum() num_quarters = 96 leap_days = 6 jan_92_jan_16 = quarterly_count * num_quarters - leap_day_count * (num_quarters - leap_days) jan_92_jan_16 773609400 So between January 1992 and January 2016 I've used $\\approx$ 774 million heartbeats. Now, I need to go back to my exact birthday. I'm going to first find on average how many heartbeats I use in a minute, and multiply that by the number of minutes between my birthday and January 1992. 
For privacy purposes I'll put the code here that I'm using, but without any identifying information: minute_mean = hr_df_full['value'].mean() # Don't you wish you knew? # birthday_minutes = ??? birthday_heartbeats = birthday_minutes * minute_mean heartbeats_until_2016 = int(birthday_heartbeats + jan_92_jan_16) remaining_2016 = total_heartbeats - heartbeats_until_2016 print("Heartbeats so far: {}".format(heartbeats_until_2016)) print("Remaining heartbeats: {}".format(remaining_2016)) Heartbeats so far: 775804660 Remaining heartbeats: 1724195340 It would appear that my heart has beaten 775,804,660 times between my moment of birth and January 1st 2016, and that I have 1.72 billion left. ","version":null,"tagName":"h3"},{"title":"How many heartbeats longer?","type":1,"pageTitle":"Tick tock...","url":"/2016/04/tick-tock#how-many-heartbeats-longer","content":" Now comes the tricky bit. I know how many heart beats I've used so far, and how many I have remaining, so I'd like to come up with a (relatively) accurate estimate of when exactly my heart should give out. We'll do this in a few steps, increasing in granularity. First step, how many heartbeats do I use in a 4-year period? I have data for a single quarter including leap day, so I want to know: \\begin{equation*} hr_q \\cdot n - hr_d \\cdot (n - m) \\end{equation*} $hr_q$: Heartbeats per quarter; $hr_d$: Heartbeats per leap day; $n$: Number of quarters = 16; $m$: Number of leap days = 1. heartbeats_4year = quarterly_count * 16 - leap_day_count * (16 - 1) heartbeats_4year 128934900 Now, I can fast forward from 2016 the number of periods of 4 years I have left. 
four_year_periods = remaining_2016 // heartbeats_4year remaining_4y = remaining_2016 - four_year_periods * heartbeats_4year print("Four year periods remaining: {}".format(four_year_periods)) print("Remaining heartbeats after 4 year periods: {}".format(remaining_4y)) Four year periods remaining: 13 Remaining heartbeats after 4 year periods: 48041640 Given that there are 13 four-year periods left, I can move from 2016 all the way to 2068, and find that I will have 48 million heart beats left. Let's drop down to figuring out how many quarters that is. I know that 2068 will have a leap day (unless someone finally decides to get rid of them), so I'll subtract that out first. Then, I'm left to figure out how many quarters exactly are left. remaining_leap = remaining_4y - leap_day_count # Ignore leap day in the data set heartbeats_quarter = hr_df_full[(hr_df_full.index.month != 2) & (hr_df_full.index.day != 29)]['value'].sum() quarters_left = remaining_leap // heartbeats_quarter remaining_year = remaining_leap - quarters_left * heartbeats_quarter print("Quarters left starting 2068: {}".format(quarters_left)) print("Remaining heartbeats after that: {}".format(remaining_year)) Quarters left starting 2068: 8 Remaining heartbeats after that: 4760716 So, that analysis gets me through until January 1st 2070. Final step, using that minute estimate to figure out how many minutes past that I'm predicted to have: from datetime import timedelta base = datetime(2070, 1, 1) minutes_left = remaining_year // minute_mean kaput = timedelta(minutes=minutes_left) base + kaput datetime.datetime(2070, 2, 23, 5, 28) According to this, I've got until February 23rd, 2070 at 5:28 AM before my heart gives out. ","version":null,"tagName":"h3"},{"title":"Summary","type":1,"pageTitle":"Tick tock...","url":"/2016/04/tick-tock#summary","content":" Well, that's kind of a creepy date to know. As I said at the top though, this number is totally useless in any medical context. 
It ignores the rate at which we continue to get better at making people live longer, and is extrapolating from 3 months' worth of data the rest of my life. Additionally, throughout my time developing this post I made many minor mistakes. I think they're all fixed now, but it's easy to mix a number up here or there and the analysis gets thrown off by a couple years. Even still, I think philosophically humans have a desire to know how much time we have left in the world. Man is but a breath, and it's scary to think just how quickly that date may be coming up. This analysis asks an important question though: what are you going to do with the time you have left? Thanks for sticking with me on this one, I promise it will be much less depressing next time! ","version":null,"tagName":"h2"},{"title":"The unfair casino","type":0,"sectionRef":"#","url":"/2016/05/the-unfair-casino","content":"","keywords":"","version":null},{"title":"Proving we can detect cheating","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#proving-we-can-detect-cheating","content":" My first question is simply, is this possible? There's a lot of trivial cases that make it obvious that there's cheating going on. But there are some edge cases that might give us more difficulty. First though, let's get a picture of what the fair distribution looks like. In principle, we can only detect cheating if the distribution of the fair die differs from the distribution of the loaded die. import numpy as np import pandas as pd import matplotlib.pyplot as plt %matplotlib inline fair_1 = np.random.randint(1, 7, 10000) fair_2 = np.random.randint(1, 7, 10000) pd.Series(fair_1 + fair_2).plot(kind='hist', bins=11); plt.title('Fair Distribution'); This distribution makes sense: there are many ways to make a 7 (the most frequent observed value) and very few ways to make a 12 or 2; an important symmetry. 
As a special note, you can notice that the sum of two fair dice is a discrete case of the Triangle Distribution, which is itself a special case of the Irwin-Hall Distribution. ","version":null,"tagName":"h2"},{"title":"The Edge Cases","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#the-edge-cases","content":" Given that we understand how the results of two fair dice are distributed, let's see some of the interesting edge cases that come up. This will give us assurance that when a casino is cheating, it is detectable (given sufficient data). To make this as hard as possible, we will think of scenarios where the expected value of the sum of loaded dice is the same as the expected value of the sum of fair dice. ","version":null,"tagName":"h2"},{"title":"Edge Case 1","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#edge-case-1","content":" What happens when one die is biased low, and one die is biased high? That is, where: D1={1w.p.1/32w.p.1/33w.p.1/124w.p.1/125w.p.1/126w.p.1/12D2={1w.p.1/122w.p.1/123w.p.1/124w.p.1/125w.p.1/36w.p.1/3E[D1]=2.5E[D2]=4.5E[D1+D2]=7=E[Dfair+Dfair]\\begin{align*} \\begin{array}{cc} D_1 = \\left\\{ \\begin{array}{lr} 1 & w.p. 1/3\\\\ 2 & w.p. 1/3\\\\ 3 & w.p. 1/12\\\\ 4 & w.p. 1/12\\\\ 5 & w.p. 1/12\\\\ 6 & w.p. 1/12 \\end{array} \\right. & D_2 = \\left\\{ \\begin{array}{lr} 1 & w.p. 1/12\\\\ 2 & w.p. 1/12\\\\ 3 & w.p. 1/12\\\\ 4 & w.p. 1/12\\\\ 5 & w.p. 1/3\\\\ 6 & w.p. 1/3 \\end{array} \\right. 
\\\\ \\mathbb{E}[D_1] = 2.5 & \\mathbb{E}[D_2] = 4.5 \\end{array}\\\\ \\mathbb{E}[D_1 + D_2] = 7 = \\mathbb{E}[D_{fair} + D_{fair}] \\end{align*}D1=⎩⎨⎧123456w.p.1/3w.p.1/3w.p.1/12w.p.1/12w.p.1/12w.p.1/12E[D1]=2.5D2=⎩⎨⎧123456w.p.1/12w.p.1/12w.p.1/12w.p.1/12w.p.1/3w.p.1/3E[D2]=4.5E[D1+D2]=7=E[Dfair+Dfair] def unfair_die(p_vals, n): x = np.random.multinomial(1, p_vals, n) return x.nonzero()[1] + 1 d1 = [1/3, 1/3, 1/12, 1/12, 1/12, 1/12] d2 = [1/12, 1/12, 1/12, 1/12, 1/3, 1/3] x1 = unfair_die(d1, 10000) x2 = unfair_die(d2, 10000) pd.Series(x1 + x2).plot(kind='hist', bins=11); plt.title('$D_1$ biased low, $D_2$ biased high'); We can see that while the 7 value remains the most likely (as expected), the distribution is not so nicely shaped any more. ","version":null,"tagName":"h3"},{"title":"Edge Case 2","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#edge-case-2","content":" When one die is loaded low, and one is loaded high, we've seen how we can detect them. How about when two die are loaded both low and high? That is, we have the following distribution: D1={1w.p.1/32w.p.1/123w.p.1/124w.p.1/125w.p.1/126w.p.1/3D2={1w.p.1/32w.p.1/123w.p.1/124w.p.1/125w.p.1/126w.p.1/3E[D1]=3.5E[D2]=3.5E[D1+D2]=7=E[Dfair+Dfair]\\begin{align*} \\begin{array}{cc} D_1 = \\left\\{ \\begin{array}{lr} 1 & w.p. 1/3\\\\ 2 & w.p. 1/12\\\\ 3 & w.p. 1/12\\\\ 4 & w.p. 1/12\\\\ 5 & w.p. 1/12\\\\ 6 & w.p. 1/3 \\end{array} \\right. & D_2 = \\left\\{ \\begin{array}{lr} 1 & w.p. 1/3\\\\ 2 & w.p. 1/12\\\\ 3 & w.p. 1/12\\\\ 4 & w.p. 1/12\\\\ 5 & w.p. 1/12\\\\ 6 & w.p. 1/3 \\end{array} \\right. 
\\\\ \\mathbb{E}[D_1] = 3.5 & \\mathbb{E}[D_2] = 3.5 \\end{array}\\\\ \\mathbb{E}[D_1 + D_2] = 7 = \\mathbb{E}[D_{fair} + D_{fair}] \\end{align*} We can even see that the expected value of each individual die is the same as the fair die! However, the distribution (if we are doing this correctly) should still be skewed: d1 = [1/3, 1/12, 1/12, 1/12, 1/12, 1/3] d2 = d1 x1 = unfair_die(d1, 10000) x2 = unfair_die(d2, 10000) pd.Series(x1 + x2).plot(kind='hist', bins=11) plt.title("$D_1$ and $D_2$ biased to 1 and 6"); In a very un-subtle way, we have of course made the values 2 and 12 far more likely. ","version":null,"tagName":"h3"},{"title":"Detection Conclusion","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#detection-conclusion","content":" There are some trivial examples of cheating that are easy to detect: whenever the expected value of the sum of the casino's two dice deviates from the expected value for the sum of two fair dice, we can immediately conclude that there is cheating at play. The interesting edge cases occur when the expected value of the sum of loaded dice matches the expected value of the sum of fair dice. Considering the above examples (and a couple more I ran through in developing this), we have seen that in every circumstance having two unfair dice leads to a distribution of results different from the fair results. We can thus finally state: just by looking at the distribution of results from this game, we can immediately conclude whether there is cheating. ","version":null,"tagName":"h2"},{"title":"Simulated Annealing","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#simulated-annealing","content":" What we really would like to do though, is see if there is any way to determine how exactly the dice are loaded. 
This is significantly more complicated, but we can borrow some algorithms from Machine Learning to figure out exactly how to perform this process. I'm using the Simulated Annealing algorithm, and I discuss why this works and why I chose it over some of the alternatives in the justification. If you don't care about how I set up the model and just want to see the code, check out the actual code. Simulated Annealing is a variation of the Metropolis-Hastings Algorithm, but the important thing for us is: Simulated Annealing allows us to quickly optimize high-dimensional problems. But what exactly are we trying to optimize? Ideally, we want a function that can tell us whether one distribution for the dice better explains the results than another distribution. This is known as the likelihood function. ","version":null,"tagName":"h2"},{"title":"Deriving the Likelihood function","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#deriving-the-likelihood-function","content":" To derive our likelihood function, we want to know: what is the probability of seeing a specific result given those hidden parameters? This is actually a surprisingly difficult problem. While we can do a lot of calculations by hand, we need a more general solution since we will be working with some very interesting die distributions. We first note that the sum of two dice can take on 11 different values - 2 through 12. This implies that each individual sum follows a Categorical distribution. That is: \\begin{align*} \\mathcal{L}(x) = \\left\\{ \\begin{array}{lr} p_2 & x = 2\\\\ p_3 & x = 3\\\\ \\ldots & \\\\ p_{11} & x = 11\\\\ p_{12} & x = 12 \\end{array} \\right. \\end{align*} where each $p_i$ is the probability of seeing that specific result. However, we need to calculate what each probability is! I'll save you the details, but this author explains how to do it. 
Now, we would like to know the likelihood of our entire data-set. This is trivial: \\begin{align*} \\mathcal{L}(\\mathbf{X}) &= \\prod_{i=1}^n \\mathcal{L}(x_i) \\end{align*} However, it's typically much easier to work with the $\\log(\\mathcal{L})$ function instead. This is critically important from a computational perspective: when you multiply so many small numbers together (i.e. the product of $\\mathcal{L}(x_i)$ terms) the result underflows what floating-point numbers can represent; if we don't control for this, we will find that no matter the distributions we choose for each die, the "likelihood" will be close to zero because the computer is not precise enough. \\begin{align*} \\log(\\mathcal{L}(\\mathbf{X})) &= \\sum_{i=1}^n \\log(\\mathcal{L}(x_i)) \\end{align*} ","version":null,"tagName":"h3"},{"title":"The process of Simulated Annealing","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#the-process-of-simulated-annealing","content":" The means by which we optimize our likelihood function is the simulated annealing algorithm. The way it works is as follows: Start with a random guess for the parameters we are trying to optimize. In our case we are trying to guess the distribution of two dice, and so we "optimize" until we have a distribution that matches the data. For each iteration of the algorithm: Generate a new "proposed" set of parameters based on the current parameters - i.e. slightly modify the current parameters to get a new set of parameters. Calculate the value of $\\log(\\mathcal{L})$ for each set of parameters. 
If the function value for the proposed parameter set is higher than for the current, automatically switch to the new parameter set and continue the next iteration. Given the new parameter set performs worse, determine a probability of switching to the new parameter set anyways: $\\mathcal{P}(p_{current}, p_{proposed})$. Switch to the new parameter set with probability $\\mathcal{P}$. If you fail to switch, begin the next iteration. The algorithm is complete after we fail to make a transition $n$ times in a row. If everything goes according to plan, we will have a value that is close to the true distribution of each die. ","version":null,"tagName":"h3"},{"title":"The actual code","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#the-actual-code","content":" We start by defining the score function. This will tell us how well the proposed die densities actually explain the results. import numpy as np from numpy import polynomial def density_coef(d1_density, d2_density): # Calculating the probabilities of each outcome was taken # from this author: http://math.stackexchange.com/a/1710392/320784 d1_p = polynomial.Polynomial(d1_density) d2_p = polynomial.Polynomial(d2_density) coefs = (d1_p * d2_p).coef return coefs def score(x, d1_density, d2_density): # We've now got the probabilities of each event, but we need # to shift the array a bit so we can use the x values to actually # index into it. This will allow us to do all the calculations # incredibly quickly coefs = density_coef(d1_density, d2_density) coefs = np.hstack((0, 0, coefs)) return np.log(coefs[x]).sum() Afterward, we need to write something to permute the proposal densities. We make random modifications, and eventually the best one survives. def permute(d1_density, d2_density): # To ensure we have legitimate densities, we will randomly # increase one die face probability by `change`, # and decrease one by `change`. 
# This means there are something less than (1/`change`)^12 possibilities # we are trying to search over. change = .01 d1_index1, d1_index2 = np.random.randint(0, 6, 2) d2_index1, d2_index2 = np.random.randint(0, 6, 2) # Also make sure to copy. I've had some weird aliasing issues # in the past that made everything blow up. new_d1 = np.float64(np.copy(d1_density)) new_d2 = np.float64(np.copy(d2_density)) # While this doesn't account for the possibility that some # values go negative, in practice this never happens new_d1[d1_index1] += change new_d1[d1_index2] -= change new_d2[d2_index1] += change new_d2[d2_index2] -= change return new_d1, new_d2 Now we've got the main algorithm code to do. This is what brings all the pieces together. def optimize(data, conv_count=10, max_iter=1e4): switch_failures = 0 iter_count = 0 # Start with guessing fair dice cur_d1 = np.repeat(1/6, 6) cur_d2 = np.repeat(1/6, 6) cur_score = score(data, cur_d1, cur_d2) # Keep track of our best guesses - may not be # what we end with max_score = cur_score max_d1 = cur_d1 max_d2 = cur_d2 # Optimization stops when we have failed to switch `conv_count` # times (presumably because we have a great guess), or we reach # the maximum number of iterations. while switch_failures < conv_count and iter_count < max_iter: iter_count += 1 if iter_count % (max_iter / 10) == 0: print('Iteration: {}; Current score (higher is better): {}'.format( iter_count, cur_score)) new_d1, new_d2 = permute(cur_d1, cur_d2) new_score = score(data, new_d1, new_d2) if new_score > max_score: max_score = new_score max_d1 = new_d1 max_d2 = new_d2 if new_score > cur_score: # If the new permutation beats the old one, # automatically select it. cur_score = new_score cur_d1 = new_d1 cur_d2 = new_d2 switch_failures = 0 else: # We didn't beat the current score, but allow # for possibly switching anyways. 
accept_prob = np.exp(new_score - cur_score) coin_toss = np.random.rand() if coin_toss < accept_prob: # We randomly switch to the new distribution cur_score = new_score cur_d1 = new_d1 cur_d2 = new_d2 switch_failures = 0 else: switch_failures += 1 # Return both our best guess, and the ending guess return max_d1, max_d2, cur_d1, cur_d2 And now we have finished the hard work! ","version":null,"tagName":"h2"},{"title":"Catching the Casino","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#catching-the-casino","content":" Let's go through a couple of scenarios and see if we can catch the casino cheating with some loaded dice. In every scenario we start with an assumption of fair dice, and then try our hand to figure out what the actual distribution was. ","version":null,"tagName":"h2"},{"title":"Attempt 1","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#attempt-1","content":" The casino is using two dice that are both biased low. How well can we recover the distribution? 
import time def simulate_casino(d1_dist, d2_dist, n=10000): d1_vals = unfair_die(d1_dist, n) d2_vals = unfair_die(d2_dist, n) start = time.perf_counter() max_d1, max_d2, final_d1, final_d2 = optimize(d1_vals + d2_vals) end = time.perf_counter() print("Simulated Annealing time: {:.02f}s".format(end - start)) coef_range = np.arange(2, 13) - .5 plt.subplot(221) plt.bar(coef_range, density_coef(d1_dist, d2_dist), width=1) plt.title('True Distribution') plt.subplot(222) plt.hist(d1_vals + d2_vals, bins=11) plt.title('Empirical Distribution') plt.subplot(223) plt.bar(coef_range, density_coef(max_d1, max_d2), width=1) plt.title('Recovered Distribution') plt.gcf().set_size_inches(10, 10) simulate_casino([2/9, 2/9, 2/9, 1/9, 1/9, 1/9], [2/9, 2/9, 2/9, 1/9, 1/9, 1/9]) Iteration: 1000; Current score (higher is better): -22147.004400281654 Simulated Annealing time: 0.30s ","version":null,"tagName":"h3"},{"title":"Attempt 2","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#attempt-2","content":" The casino now uses dice that are both biased towards 1 and 6. simulate_casino([1/3, 1/12, 1/12, 1/12, 1/12, 1/3], [1/3, 1/12, 1/12, 1/12, 1/12, 1/3]) Simulated Annealing time: 0.08s ","version":null,"tagName":"h3"},{"title":"Attempt 3","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#attempt-3","content":" The casino will now use one die biased towards 1 and 6, and one die towards 3 and 4. simulate_casino([1/3, 1/12, 1/12, 1/12, 1/12, 1/3], [1/12, 1/12, 1/3, 1/3, 1/12, 1/12]) Simulated Annealing time: 0.09s ","version":null,"tagName":"h3"},{"title":"Attempt 4","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#attempt-4","content":" We'll now finally go to a fair casino to make sure that we can still recognize a positive result. 
simulate_casino(np.repeat(1/6, 6), np.repeat(1/6, 6)) Simulated Annealing time: 0.02s ","version":null,"tagName":"h3"},{"title":"Attempt 5","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#attempt-5","content":" We've so far been working with a large amount of data - 10,000 data points. Can we now scale things back to only 250 throws? We'll start with two dice biased high. simulate_casino([1/9, 1/9, 1/9, 2/9, 2/9, 2/9], [1/9, 1/9, 1/9, 2/9, 2/9, 2/9], n=250) Iteration: 1000; Current score (higher is better): -551.6995384525453 Iteration: 2000; Current score (higher is better): -547.7803673440676 Iteration: 3000; Current score (higher is better): -547.9805613193807 Iteration: 4000; Current score (higher is better): -546.7574874775273 Iteration: 5000; Current score (higher is better): -549.5798007672656 Iteration: 6000; Current score (higher is better): -545.0354060154496 Iteration: 7000; Current score (higher is better): -550.1134504086606 Iteration: 8000; Current score (higher is better): -549.9306537114975 Iteration: 9000; Current score (higher is better): -550.7075182119111 Iteration: 10000; Current score (higher is better): -549.400679551826 Simulated Annealing time: 1.94s The results are surprisingly good. While the actual optimization process took much longer to finish than in the other examples, we still have a very good guess. As a caveat though: the recovered distribution tends to overfit the data. That is, if the data doesn't fit the underlying distribution well, the model will also fail. ","version":null,"tagName":"h3"},{"title":"Conclusion","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#conclusion","content":" Given the results above, we can see that we have indeed come up with a very good algorithm to determine the distribution of two dice given their results. 
As a benefit, we have even seen that results come back very quickly; it's not uncommon for the optimization to converge within a tenth of a second. Additionally, we have seen that the algorithm can intuit the distribution even when there is not much data. While the final example shows that we can 'overfit' on the dataset, we can still get valuable information from a relatively small dataset. We can declare at long last: the mathematicians have again triumphed over the casino. ","version":null,"tagName":"h2"},{"title":"Justification of Simulated Annealing","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#justification-of-simulated-annealing","content":" ","version":null,"tagName":"h2"},{"title":"Why Simulated Annealing?","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#why-simulated-annealing","content":" So why even use an algorithm with a fancy title like Simulated Annealing? First of all, because the title is sexy. Second of all, because this is a reasonably complicated problem to try and solve. We have a parameter space where each value $p_{ij} \\in (0, 1); i \\in \\{1, 2\\}, j \\in \\{1, \\ldots, 6\\}$ (die $i$, face $j$), for a total of 12 different variables we are trying to optimize over. Additionally, given a 12-dimensional function we are trying to maximize, simulated annealing helps keep us from getting stuck in a local maximum. ","version":null,"tagName":"h3"},{"title":"Why not something else?","type":1,"pageTitle":"The unfair casino","url":"/2016/05/the-unfair-casino#why-not-something-else","content":" This is a fair question. There are two classes of algorithms that can also be used to solve this problem: Non-linear optimization methods, and the EM algorithm. I chose not to use non-linear optimization simply because I'm a bit concerned that it will trap me in a local maximum. 
Instead of running multiple different optimizations from different starting points, I can just use simulated annealing to take that into account. In addition, throughout the course of testing the simulated annealing code converged incredibly quickly - far more quickly than any non-linear solver would be able to accomplish. The EM Algorithm was originally what I intended to write this blog post with. Indeed, the post was inspired by the crooked casino example which uses the EM algorithm to solve it. However, after modeling the likelihood function I realized that the algebra would very quickly get out of hand. Trying to compute all the polynomial terms would not be fun, which would be needed to actually optimize for each parameter. So while the EM algorithm would likely be much faster in raw speed terms, the amount of time needed to program and verify it meant that I was far better off using a different method for optimization. ","version":null,"tagName":"h3"},{"title":"Event studies and earnings releases","type":0,"sectionRef":"#","url":"/2016/06/event-studies-and-earnings-releases","content":"","keywords":"","version":null},{"title":"The Market Just Knew","type":1,"pageTitle":"Event studies and earnings releases","url":"/2016/06/event-studies-and-earnings-releases#the-market-just-knew","content":" I recently saw two examples of stock charts that have kept me thinking for a while. And now that the semester is complete, I finally have enough time to really look at them and give them the treatment they deserve. 
The first is good old Apple: Code from secrets import QUANDL_KEY import matplotlib.pyplot as plt from matplotlib.dates import date2num from matplotlib.finance import candlestick_ohlc from matplotlib.dates import DateFormatter, WeekdayLocator,\\ DayLocator, MONDAY import quandl from datetime import datetime import pandas as pd %matplotlib inline def fetch_ticker(ticker, start, end): # Quandl is currently giving me issues with returning # the entire dataset and not slicing server-side. # So instead, we'll do it client-side! q_format = '%Y-%m-%d' ticker_data = quandl.get('YAHOO/' + ticker, start_date=start.strftime(q_format), end_date=end.strftime(q_format), authtoken=QUANDL_KEY) return ticker_data def ohlc_dataframe(data, ax=None): # Much of this code re-used from: # http://matplotlib.org/examples/pylab_examples/finance_demo.html if ax is None: f, ax = plt.subplots() vals = [(date2num(date), *(data.loc[date])) for date in data.index] candlestick_ohlc(ax, vals) mondays = WeekdayLocator(MONDAY) alldays = DayLocator() weekFormatter = DateFormatter('%b %d') ax.xaxis.set_major_locator(mondays) ax.xaxis.set_minor_locator(alldays) ax.xaxis.set_major_formatter(weekFormatter) return ax AAPL = fetch_ticker('AAPL', datetime(2016, 3, 1), datetime(2016, 5, 1)) ax = ohlc_dataframe(AAPL) plt.vlines(date2num(datetime(2016, 4, 26, 12)), ax.get_ylim()[0], ax.get_ylim()[1], color='b', label='Earnings Release') plt.legend(loc=3) plt.title("Apple Price 3/1/2016 - 5/1/2016"); The second chart is from Facebook: FB = fetch_ticker('FB', datetime(2016, 3, 1), datetime(2016, 5, 5)) ax = ohlc_dataframe(FB) plt.vlines(date2num(datetime(2016, 4, 27, 12)), ax.get_ylim()[0], ax.get_ylim()[1], color='b', label='Earnings Release') plt.title('Facebook Price 3/5/2016 - 5/5/2016') plt.legend(loc=2); These two charts demonstrate two very specific phenomena: how the market prepares for earnings releases. Let's look at those charts again, but with some extra information. 
As we're about to see, the market "knew" in advance that Apple was going to perform poorly. The market expected that Facebook was going to perform poorly, and instead Facebook shot the lights out. Let's see that trend in action: Code def plot_hilo(ax, start, end, data): ax.plot([date2num(start), date2num(end)], [data.loc[start]['High'], data.loc[end]['High']], color='b') ax.plot([date2num(start), date2num(end)], [data.loc[start]['Low'], data.loc[end]['Low']], color='b') f, axarr = plt.subplots(1, 2) ax_aapl = axarr[0] ax_fb = axarr[1] # Plot the AAPL trend up and down ohlc_dataframe(AAPL, ax=ax_aapl) plot_hilo(ax_aapl, datetime(2016, 3, 1), datetime(2016, 4, 15), AAPL) plot_hilo(ax_aapl, datetime(2016, 4, 18), datetime(2016, 4, 26), AAPL) ax_aapl.vlines(date2num(datetime(2016, 4, 26, 12)), ax_aapl.get_ylim()[0], ax_aapl.get_ylim()[1], color='g', label='Earnings Release') ax_aapl.legend(loc=2) ax_aapl.set_title('AAPL Price History') # Plot the FB trend down and up ohlc_dataframe(FB, ax=ax_fb) plot_hilo(ax_fb, datetime(2016, 3, 30), datetime(2016, 4, 27), FB) plot_hilo(ax_fb, datetime(2016, 4, 28), datetime(2016, 5, 5), FB) ax_fb.vlines(date2num(datetime(2016, 4, 27, 12)), ax_fb.get_ylim()[0], ax_fb.get_ylim()[1], color='g', label='Earnings Release') ax_fb.legend(loc=2) ax_fb.set_title('FB Price History') f.set_size_inches(18, 6) As we can see above, the market broke a prevailing trend on Apple in order to go down, and ultimately predict the earnings release. For Facebook, the opposite happened. While the trend was down, the earnings were fantastic and the market corrected itself much higher. ","version":null,"tagName":"h2"},{"title":"Formulating the Question","type":1,"pageTitle":"Event studies and earnings releases","url":"/2016/06/event-studies-and-earnings-releases#formulating-the-question","content":" While these are two specific examples, there are plenty of other examples you could cite one way or another. 
Even if the preponderance of evidence shows that the market correctly predicts earnings releases, we need not accuse people of collusion; for a company like Apple with many suppliers we can generally forecast how Apple has done based on those same suppliers. The question then, is this: how well does the market predict the earnings releases? It's an incredibly broad question that I want to dissect in a couple of different ways: Given a stock that has been trending down over the past N days before an earnings release, how likely does it continue downward after the release? Given a stock trending up, how likely does it continue up? Is there a difference in accuracy between large- and small-cap stocks? How often, and for how long, do markets trend before an earnings release? I want to especially thank Alejandro Saltiel for helping me retrieve the data. He's great. And now for all of the interesting bits. ","version":null,"tagName":"h2"},{"title":"Event Studies","type":1,"pageTitle":"Event studies and earnings releases","url":"/2016/06/event-studies-and-earnings-releases#event-studies","content":" Before we go too much further, I want to introduce the actual event study. Each chart intends to capture a lot of information and present an easy-to-understand pattern: Code import numpy as np import pandas as pd from pandas.tseries.holiday import USFederalHolidayCalendar from pandas.tseries.offsets import CustomBusinessDay from datetime import datetime, timedelta # If you remove rules, it removes them from *all* calendars # To ensure we don't pop rules we don't want to, first make # sure to fully copy the object trade_calendar = USFederalHolidayCalendar() trade_calendar.rules.pop(6) # Remove Columbus day trade_calendar.rules.pop(7) # Remove Veteran's day TradeDay = lambda days: CustomBusinessDay(days, calendar=trade_calendar) def plot_study(array): # Given a 2-d array, we assume the event happens at index `lookback`, # and create all of our summary statistics from there. 
lookback = int((array.shape[1] - 1) / 2) norm_factor = np.repeat(array[:,lookback].reshape(-1, 1), array.shape[1], axis=1) centered_data = array / norm_factor - 1 lookforward = centered_data.shape[1] - lookback means = centered_data.mean(axis=0) lookforward_data = centered_data[:,lookforward:] std_dev = np.hstack([0, lookforward_data.std(axis=0)]) maxes = lookforward_data.max(axis=0) mins = lookforward_data.min(axis=0) f, axarr = plt.subplots(1, 2) range_begin = -lookback range_end = lookforward axarr[0].plot(range(range_begin, range_end), means) axarr[1].plot(range(range_begin, range_end), means) axarr[0].fill_between(range(0, range_end), means[-lookforward:] + std_dev, means[-lookforward:] - std_dev, alpha=.5, label="$\\pm$ 1 s.d.") axarr[1].fill_between(range(0, range_end), means[-lookforward:] + std_dev, means[-lookforward:] - std_dev, alpha=.5, label="$\\pm$ 1 s.d.") max_err = maxes - means[-lookforward+1:] min_err = means[-lookforward+1:] - mins axarr[0].errorbar(range(1, range_end), means[-lookforward+1:], yerr=[min_err, max_err], label='Max & Min') axarr[0].legend(loc=2) axarr[1].legend(loc=2) axarr[0].set_xlim((-lookback-1, lookback+1)) axarr[1].set_xlim((-lookback-1, lookback+1)) def plot_study_small(array): # Given a 2-d array, we assume the event happens at index `lookback`, # and create all of our summary statistics from there. 
lookback = int((array.shape[1] - 1) / 2) norm_factor = np.repeat(array[:,lookback].reshape(-1, 1), array.shape[1], axis=1) centered_data = array / norm_factor - 1 lookforward = centered_data.shape[1] - lookback means = centered_data.mean(axis=0) lookforward_data = centered_data[:,lookforward:] std_dev = np.hstack([0, lookforward_data.std(axis=0)]) maxes = lookforward_data.max(axis=0) mins = lookforward_data.min(axis=0) range_begin = -lookback range_end = lookforward plt.plot(range(range_begin, range_end), means) plt.fill_between(range(0, range_end), means[-lookforward:] + std_dev, means[-lookforward:] - std_dev, alpha=.5, label="$\\pm$ 1 s.d.") max_err = maxes - means[-lookforward+1:] min_err = means[-lookforward+1:] - mins plt.errorbar(range(1, range_end), means[-lookforward+1:], yerr=[min_err, max_err], label='Max & Min') plt.legend(loc=2) plt.xlim((-lookback-1, lookback+1)) def fetch_event_data(ticker, events, horizon=5): # Use horizon+1 to account for including the day of the event, # and half-open interval - that is, for a horizon of 5, # we should be including 11 events. Additionally, using the # CustomBusinessDay means we automatically handle issues if # for example a company reports Friday afternoon - the date # calculator will turn this into a "Saturday" release, but # we effectively shift that to Monday with the logic below. 
td_back = TradeDay(horizon+1) td_forward = TradeDay(horizon+1) start_date = min(events) - td_back end_date = max(events) + td_forward total_data = fetch_ticker(ticker, start_date, end_date) event_data = [total_data.ix[event-td_back:event+td_forward]\\ [0:horizon*2+1]\\ ['Adjusted Close'] for event in events] return np.array(event_data) # Generate a couple of random events event_dates = [datetime(2016, 5, 27) - timedelta(days=1) - TradeDay(x*20) for x in range(1, 40)] data = fetch_event_data('CELG', event_dates) plot_study_small(data) plt.legend(loc=3) plt.gcf().set_size_inches(12, 6); plt.annotate('Mean price for days leading up to each event', (-5, -.01), (-4.5, .025), arrowprops=dict(facecolor='black', shrink=0.05)) plt.annotate('', (-.1, .005), (-.5, .02), arrowprops={'facecolor': 'black', 'shrink': .05}) plt.annotate('$\\pm$ 1 std. dev. each day', (5, .055), (2.5, .085), arrowprops={'facecolor': 'black', 'shrink': .05}) plt.annotate('Min/Max each day', (.9, -.07), (-1, -.1), arrowprops={'facecolor': 'black', 'shrink': .05}); And as a quick textual explanation as well: The blue line represents the mean price for each day, represented as a percentage of the price on the '0-day'. For example, if we defined an 'event' as whenever the stock price dropped for three days, we would see a decreasing blue line to the left of the 0-day.The blue shaded area represents one standard deviation above and below the mean price for each day following an event. This is intended to give us an idea of what the stock price does in general following an event.The green bars are the minimum and maximum price for each day following an event. This instructs us as to how much it's possible for the stock to move. 
","version":null,"tagName":"h2"},{"title":"Event Type 1: Trending down over the past N days","type":1,"pageTitle":"Event studies and earnings releases","url":"/2016/06/event-studies-and-earnings-releases#event-type-1-trending-down-over-the-past-n-days","content":" The first type of event I want to study is how stocks perform when they've been trending down over the past couple of days prior to a release. However, we need to clarify what exactly is meant by "trending down." To do so, we'll use the following metric: the midpoint between each day's opening and closing price goes down over a period of N days. It's probably helpful to have an example: Code f, axarr = plt.subplots(1, 2) f.set_size_inches(18, 6) FB_plot = axarr[0] ohlc_dataframe(FB[datetime(2016, 4, 18):], FB_plot) FB_truncated = FB[datetime(2016, 4, 18):datetime(2016, 4, 27)] midpoint = FB_truncated['Open']/2 + FB_truncated['Close']/2 FB_plot.plot(FB_truncated.index, midpoint, label='Midpoint') FB_plot.vlines(date2num(datetime(2016, 4, 27, 12)), ax_fb.get_ylim()[0], ax_fb.get_ylim()[1], color='g', label='Earnings Release') FB_plot.legend(loc=2) FB_plot.set_title('FB Midpoint Plot') AAPL_plot = axarr[1] ohlc_dataframe(AAPL[datetime(2016, 4, 10):], AAPL_plot) AAPL_truncated = AAPL[datetime(2016, 4, 10):datetime(2016, 4, 26)] midpoint = AAPL_truncated['Open']/2 + AAPL_truncated['Close']/2 AAPL_plot.plot(AAPL_truncated.index, midpoint, label='Midpoint') AAPL_plot.vlines(date2num(datetime(2016, 4, 26, 12)), ax_aapl.get_ylim()[0], ax_aapl.get_ylim()[1], color='g', label='Earnings Release') AAPL_plot.legend(loc=3) AAPL_plot.set_title('AAPL Midpoint Plot'); Given these charts, we can see that FB was trending down for the four days preceding the earnings release, and AAPL was trending down for a whopping 8 days (we don't count the peak day). This will define the methodology that we will use for the study. So what are the results? For a given horizon, how well does the market actually perform? 
Code # Read in the events for each stock; # The file was created using the first code block in the Appendix import yaml from dateutil.parser import parse from progressbar import ProgressBar data_str = open('earnings_dates.yaml', 'r').read() # Need to remove invalid lines filtered = filter(lambda x: '{' not in x, data_str.split('\\n')) earnings_data = yaml.load('\\n'.join(filtered)) # Convert our earnings data into a list of (ticker, date) pairs # to make it easy to work with. # This is horribly inefficient, but should get us what we need ticker_dates = [] for ticker, date_list in earnings_data.items(): for iso_str in date_list: ticker_dates.append((ticker, parse(iso_str))) def does_trend_down(ticker, event, horizon): # Figure out if the `event` has a downtrend for # the `horizon` days preceding it # As an interpretation note: it is assumed that # the closing price of day `event` is the reference # point, and we want `horizon` days before that. # The price_data.hdf was created in the second appendix code block try: ticker_data = pd.read_hdf('price_data.hdf', ticker) data = ticker_data[event-TradeDay(horizon):event] midpoints = data['Open']/2 + data['Close']/2 # Shift dates one forward into the future and subtract # Effectively: do we trend down over all days? 
elems = midpoints - midpoints.shift(1) return len(elems)-1 == len(elems.dropna()[elems <= 0]) except KeyError: # If the stock doesn't exist, it doesn't qualify as trending down # Mostly this is here to make sure the entire analysis doesn't # blow up if there were issues in data retrieval return False def study_trend(horizon, trend_function): five_day_events = np.zeros((1, horizon*2 + 1)) invalid_events = [] for ticker, event in ProgressBar()(ticker_dates): if trend_function(ticker, event, horizon): ticker_data = pd.read_hdf('price_data.hdf', ticker) event_data = ticker_data[event-TradeDay(horizon):event+TradeDay(horizon)]['Close'] try: five_day_events = np.vstack([five_day_events, event_data]) except ValueError: # Sometimes we don't get exactly the right number of values due to calendar # issues. I've fixed most everything I can, and the few issues that are left # I assume don't systemically bias the results (i.e. data could be missing # because it doesn't exist, etc.). After running through, ~1% of events get # discarded this way invalid_events.append((ticker, event)) # Remove our initial zero row five_day_events = five_day_events[1:,:] plot_study(five_day_events) plt.gcf().suptitle('Action over {} days: {} events' .format(horizon,five_day_events.shape[0])) plt.gcf().set_size_inches(18, 6) # Start with a 5 day study study_trend(5, does_trend_down) 100% (47578 of 47578) |###########################################################| Elapsed Time: 0:21:38 Time: 0:21:38 When a stock has been trending down for 5 days, once the earnings are announced it really doesn't move on average. However, the variability is incredible. This implies two important things: The market is just as often wrong about an earnings announcement before it happens as it is correct. The incredible width of the min/max bars and standard deviation area tells us that the market reacts violently after the earnings are released. Let's repeat the same study, but over a time horizon of 8 days and 3 days. 
Presumably if a stock has been going down for 8 days at a time before the earnings, the market should be more accurate. Code # 8 day study next study_trend(8, does_trend_down) 100% (47578 of 47578) |###########################################################| Elapsed Time: 0:20:29 Time: 0:20:29 However, looking only at stocks that trended down for 8 days prior to a release, the same pattern emerges: on average, the stock doesn't move, but the market reaction is often incredibly violent. Code # 3 day study after that study_trend(3, does_trend_down) 100% (47578 of 47578) |###########################################################| Elapsed Time: 0:26:26 Time: 0:26:26 Finally, when we look at a 3-day horizon, we start getting some incredible outliers. Stocks have a potential to move over ~300% up, and the standard deviation width is again, incredible. The results for a 3-day horizon follow the same pattern we've seen in the 5- and 8-day horizons. ","version":null,"tagName":"h2"},{"title":"Event Type 2: Trending up for N days","type":1,"pageTitle":"Event studies and earnings releases","url":"/2016/06/event-studies-and-earnings-releases#event-type-2-trending-up-for-n-days","content":" We're now going to repeat the analysis, but do it for uptrends instead. That is, instead of looking at stocks that have been trending down over the past number of days, we focus only on stocks that have been trending up. Code def does_trend_up(ticker, event, horizon): # Figure out if the `event` has an uptrend for # the `horizon` days preceding it # As an interpretation note: it is assumed that # the closing price of day `event` is the reference # point, and we want `horizon` days before that. 
# The price_data.hdf was created in the second appendix code block try: ticker_data = pd.read_hdf('price_data.hdf', ticker) data = ticker_data[event-TradeDay(horizon):event] midpoints = data['Open']/2 + data['Close']/2 # Shift dates one forward into the future and subtract # Effectively: do we trend up over all days? elems = midpoints - midpoints.shift(1) return len(elems)-1 == len(elems.dropna()[elems >= 0]) except KeyError: # If the stock doesn't exist, it doesn't qualify as trending up # Mostly this is here to make sure the entire analysis doesn't # blow up if there were issues in data retrieval return False study_trend(5, does_trend_up) 100% (47578 of 47578) |###########################################################| Elapsed Time: 0:22:51 Time: 0:22:51 The patterns here are very similar. With the exception of noting that stocks can go to nearly 400% after an earnings announcement (most likely this included a takeover announcement, etc.), we still see large min/max bars and wide standard deviation of returns. We'll repeat the pattern for stocks going up for both 8 and 3 days straight, but at this point, the results should be very predictable: Code study_trend(8, does_trend_up) 100% (47578 of 47578) |###########################################################| Elapsed Time: 0:20:51 Time: 0:20:51 Code study_trend(3, does_trend_up) 100% (47578 of 47578) |###########################################################| Elapsed Time: 0:26:56 Time: 0:26:56 ","version":null,"tagName":"h2"},{"title":"Conclusion and Summary","type":1,"pageTitle":"Event studies and earnings releases","url":"/2016/06/event-studies-and-earnings-releases#conclusion-and-summary","content":" I guess the most important thing to summarize with is this: looking at the entire market, stock performance prior to an earnings release has no bearing on the stock's performance afterward. 
Honestly: given the huge variability of returns after an earnings release, even when the stock has been trending for a long time, you're best off divesting before an earnings release and letting the market sort itself out. However, there is a big caveat. These results are taken when we look at the entire market. So while we can say that the market as a whole knows nothing and just reacts violently, I want to take a closer look into this data. Does the market typically perform poorly on large-cap/high liquidity stocks? Do smaller companies have investors that know them better and can thus predict performance better? Are specific market sectors better at prediction? Presumably technology stocks are more volatile than the industrials. So there are some more interesting questions I still want to ask with this data. Knowing that the hard work of data processing is largely already done, it should be fairly simple to continue this analysis and get much more refined with it. Until next time. Appendix Export event data for Russell 3000 companies: Code import pandas as pd from html.parser import HTMLParser from datetime import datetime, timedelta import requests import re from dateutil import parser import progressbar from concurrent import futures import yaml class EarningsParser(HTMLParser): store_dates = False earnings_offset = None dates = [] def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.dates = [] def handle_starttag(self, tag, attrs): if tag == 'table': self.store_dates = True def handle_data(self, data): if self.store_dates: match = re.match(r'\\d+/\\d+/\\d+', data) if match: self.dates.append(match.group(0)) # If a company reports before the bell, record the earnings date # being at midnight the day before. 
Ex: WMT reports 5/19/2016, # but we want the reference point to be the closing price on 5/18/2016 if 'After Close' in data: self.earnings_offset = timedelta(days=0) elif 'Before Open' in data: self.earnings_offset = timedelta(days=-1) def handle_endtag(self, tag): if tag == 'table': self.store_dates = False def earnings_releases(ticker): #print("Looking up ticker {}".format(ticker)) user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) '\\ 'Gecko/20100101 Firefox/46.0' headers = {'user-agent': user_agent} base_url = 'http://www.streetinsider.com/ec_earnings.php?q={}'\\ .format(ticker) e = EarningsParser() s = requests.Session() a = requests.adapters.HTTPAdapter(max_retries=0) s.mount('http://', a) e.feed(str(s.get(base_url, headers=headers).content)) if e.earnings_offset is not None: dates = map(lambda x: parser.parse(x) + e.earnings_offset, e.dates) past = filter(lambda x: x < datetime.now(), dates) return list(map(lambda d: d.isoformat(), past)) # Use a Russell-3000 ETF tracker (ticker IWV) to get a list of holdings r3000 = pd.read_csv('https://www.ishares.com/us/products/239714/' 'ishares-russell-3000-etf/1449138789749.ajax?' 
'fileType=csv&fileName=IWV_holdings&dataType=fund', header=10) r3000_equities = r3000[(r3000['Exchange'] == 'NASDAQ') | (r3000['Exchange'] == 'New York Stock Exchange Inc.')] dates_file = open('earnings_dates.yaml', 'w') with futures.ThreadPoolExecutor(max_workers=8) as pool: fs = {pool.submit(earnings_releases, r3000_equities.ix[t]['Ticker']): t for t in r3000_equities.index} pbar = progressbar.ProgressBar(term_width=80, max_value=r3000_equities.index.max()) for future in futures.as_completed(fs): i = fs[future] pbar.update(i) dates_file.write(yaml.dump({r3000_equities.ix[i]['Ticker']: future.result()})) Downloading stock price data needed for the event studies: Code from secrets import QUANDL_KEY import pandas as pd import yaml from dateutil.parser import parse from datetime import timedelta import quandl from progressbar import ProgressBar def fetch_ticker(ticker, start, end): # Quandl is currently giving me issues with returning # the entire dataset and not slicing server-side. # So instead, we'll do it client-side! q_format = '%Y-%m-%d' ticker_data = quandl.get('YAHOO/' + ticker, start_date=start.strftime(q_format), end_date=end.strftime(q_format), authtoken=QUANDL_KEY) return ticker_data data_str = open('earnings_dates.yaml', 'r').read() # Need to remove invalid lines filtered = filter(lambda x: '{' not in x, data_str.split('\\n')) earnings_data = yaml.load('\\n'.join(filtered)) # Get the first 1500 keys - split up into two statements # because of Quandl rate limits tickers = list(earnings_data.keys()) price_dict = {} invalid_tickers = [] for ticker in ProgressBar()(tickers[0:1500]): try: # Replace '.' 
with '-' in name for some tickers fixed = ticker.replace('.', '-') event_strs = earnings_data[ticker] events = [parse(event) for event in event_strs] td = timedelta(days=20) price_dict[ticker] = fetch_ticker(fixed, min(events)-td, max(events)+td) except quandl.NotFoundError: invalid_tickers.append(ticker) # Execute this after 10 minutes have passed for ticker in ProgressBar()(tickers[1500:]): try: # Replace '.' with '-' in name for some tickers fixed = ticker.replace('.', '-') event_strs = earnings_data[ticker] events = [parse(event) for event in event_strs] td = timedelta(days=20) price_dict[ticker] = fetch_ticker(fixed, min(events)-td, max(events)+td) except quandl.NotFoundError: invalid_tickers.append(ticker) prices_store = pd.HDFStore('price_data.hdf') for ticker, prices in price_dict.items(): prices_store[ticker] = prices ","version":null,"tagName":"h2"},{"title":"A Rustic re-podcasting server","type":0,"sectionRef":"#","url":"/2016/10/rustic-repodcasting","content":"","keywords":"","version":null},{"title":"The Setup","type":1,"pageTitle":"A Rustic re-podcasting server","url":"/2016/10/rustic-repodcasting#the-setup","content":" We'll be using the iron library to handle the server, and hyper to fetch the data we need from elsewhere on the interwebs. HTML5Ever allows us to ingest the content that will be coming from Bassdrive, and finally, output is done with handlebars-rust. It will ultimately be interesting to see how much more work must be done to actually get this working over another language like Python. Coming from a dynamic state of mind it's super easy to just chain stuff together, ship it out, and call it a day. I think I'm going to end up getting much dirtier trying to write all of this out. ","version":null,"tagName":"h2"},{"title":"Issue 1: Strings","type":1,"pageTitle":"A Rustic re-podcasting server","url":"/2016/10/rustic-repodcasting#issue-1-strings","content":" Strings in Rust are hard. 
I acknowledge Python can get away with some things that make strings super easy (and Python 3 has gotten better at cracking down on some bad cases, str <-> bytes specifically), but Rust is hard. Let's take for example the 404 error handler I'm trying to write. The result should be incredibly simple: All I want is to echo back Didn't find URL: <url>. Shouldn't be that hard right? In Python I'd just do something like: def echo_handler(request): return "You're visiting: {}".format(request.uri) And we'd call it a day. Rust isn't so simple. Let's start with the trivial examples people post online: fn hello_world(req: &mut Request) -> IronResult<Response> { Ok(Response::with((status::Ok, "You found the server!"))) } Doesn't look too bad right? In fact, it's essentially the same as the Python version! All we need to do is just send back a string of some form. So, we look up the documentation for Request and see a url field that will contain what we want. Let's try the first iteration: fn hello_world(req: &mut Request) -> IronResult<Response> { Ok(Response::with((status::Ok, "You found the URL: " + req.url))) } Which yields the error: error[E0369]: binary operation `+` cannot be applied to type `&'static str` OK, what's going on here? Time to start Googling for "concatenate strings in Rust". That's what we want to do right? Concatenate a static string and the URL. After Googling, we come across a helpful concat! macro that looks really nice! Let's try that one: fn hello_world(req: &mut Request) -> IronResult<Response> { Ok(Response::with((status::Ok, concat!("You found the URL: ", req.url)))) } And the error: error: expected a literal Turns out Rust actually blows up because the concat! macro expects us to know at compile time what req.url is. Which, in my outsider opinion, is a bit strange. println! and format!, etc., all handle values they don't know at compile time. Why can't concat!? In any case, we need a new plan of attack. How about we try formatting strings? 
fn hello_world(req: &mut Request) -> IronResult<Response> { Ok(Response::with((status::Ok, format!("You found the URL: {}", req.url)))) } And at long last, it works. Onwards! ","version":null,"tagName":"h2"},{"title":"Issue 2: Fighting with the borrow checker","type":1,"pageTitle":"A Rustic re-podcasting server","url":"/2016/10/rustic-repodcasting#issue-2-fighting-with-the-borrow-checker","content":" Rust's single coolest feature is how the compiler can guarantee safety in your program. As long as you don't use unsafe pointers in Rust, you're guaranteed safety. And not having truly manual memory management is really cool; I'm totally OK with never having to write malloc() again. That said, even the Rust documentation makes a specific note: Many new users to Rust experience something we like to call ‘fighting with the borrow checker’, where the Rust compiler refuses to compile a program that the author thinks is valid. If you have to put it in the documentation, it's not a helpful note: it's hazing. So now that we have a handler which works with information from the request, we want to start making something that looks like an actual web application. The router provided by iron isn't terribly difficult so I won't cover it. Instead, the thing that had me stumped for a couple hours was trying to dynamically create routes. The unfortunate thing with Rust (in my limited experience at the moment) is that there is a severe lack of non-trivial examples. Using the router is easy when you want to give an example of a static function. But how do you start working on things that are a bit more complex? We're going to cover that here. Our first try: creating a function which returns other functions. This is a principle called currying. We set up a function that allows us to keep some data in scope for another function to come later. 
fn build_handler(message: String) -> Fn(&mut Request) -> IronResult<Response> { move |_: &mut Request| { Ok(Response::with((status::Ok, message))) } } We've simply set up a function that returns another anonymous function with the message parameter scoped in. If you compile this, you get not 1, not 2, but 5 new errors. 4 of them are the same though: error[E0277]: the trait bound `for<'r, 'r, 'r> std::ops::Fn(&'r mut iron::Request<'r, 'r>) -> std::result::Result<iron::Response, iron::IronError> + 'static: std::marker::Sized` is not satisfied ...oookay. I for one, am not going to spend time trying to figure out what's going on there. And it is here that I will save the audience many hours of frustrated effort. At this point, I decided to switch from iron to pure hyper since using hyper would give me a much simpler API. All I would have to do is build a function that took two parameters as input, and we're done. That said, it ultimately posed many more issues because I started getting into a weird fight with the 'static lifetime and being a Rust newbie I just gave up on trying to understand it. Instead, we will abandon (mostly) the curried function attempt, and instead take advantage of something Rust actually intends us to use: struct and trait. Remember when I talked about a lack of non-trivial examples on the Internet? This is what I was talking about. I could only find one example of this available online, and it was incredibly complex and contained code we honestly don't need or care about. There was no documentation of how to build routes that didn't use static functions, etc. But, I'm assuming you don't really care about my whining, so let's get to it. The iron documentation mentions the Handler trait as being something we can implement. Does the function signature for that handle() method look familiar? It's what we've been working with so far. 
The principle is that we need to define a new struct to hold our data, then implement that handle() method to return the result. Something that looks like this might do: struct EchoHandler { message: String } impl Handler for EchoHandler { fn handle(&self, _: &mut Request) -> IronResult<Response> { Ok(Response::with((status::Ok, self.message))) } } // Later in the code when we set up the router... let echo = EchoHandler { message: "Is it working yet?" } router.get("/", echo.handle, "index"); We attempt to build a struct, and give its handle method off to the router so the router knows what to do. You guessed it, more errors: error: attempted to take value of method `handle` on type `EchoHandler` Now, the Rust compiler is actually a really nice fellow, and offers us help: help: maybe a `()` to call it is missing? If not, try an anonymous function We definitely don't want to call that function, so maybe try an anonymous function as it recommends? router.get("/", |req: &mut Request| echo.handle(req), "index"); Another error: error[E0373]: closure may outlive the current function, but it borrows `echo`, which is owned by the current function Another helpful message: help: to force the closure to take ownership of `echo` (and any other referenced variables), use the `move` keyword We're getting closer though! Let's implement this change: router.get("/", move |req: &mut Request| echo.handle(req), "index"); And here's where things get strange: error[E0507]: cannot move out of borrowed content --> src/main.rs:18:40 | 18 | Ok(Response::with((status::Ok, self.message))) | ^^^^ cannot move out of borrowed content Now, this took me another couple hours to figure out. I'm going to explain it, but keep this in mind: Rust only allows one reference at a time (exceptions apply of course). When we attempt to use self.message as it has been created in the earlier struct, we essentially are trying to give it away to another piece of code. 
Rust's semantics then state that we may no longer access it unless it is returned to us (which iron's code does not do). There are two ways to fix this: Only give away references (i.e. &self.message instead of self.message) instead of transferring ownership. Make a copy of the underlying value which will be safe to give away. I didn't know these were the two options originally, so I hope this helps the audience out. Because iron won't accept a reference, we are forced into the second option: making a copy. To do so, we just need to change the function to look like this: Ok(Response::with((status::Ok, self.message.clone()))) Not so bad, huh? My only complaint is that it took so long to figure out exactly what was going on. And now we have a small server that we can configure dynamically. At long last. Final sidenote: You can actually do this without anonymous functions. Just change the router line to: router.get("/", echo, "index"); Rust's type system seems to figure out that we want to use the handle() method. ","version":null,"tagName":"h2"},{"title":"Conclusion","type":1,"pageTitle":"A Rustic re-podcasting server","url":"/2016/10/rustic-repodcasting#conclusion","content":" After a good long day's work, we now have the routing functionality set up on our application. We should be able to scale this pretty well in the future: the RSS content we need to deliver in the future can be treated as a string, so the building blocks are in place. There are two important things I learned starting with Rust today: Rust is a new language, and while the code is high-quality, the mindshare is coming. I'm a terrible programmer. Number 1 is pretty obvious and not surprising to anyone. Number 2 caught me off guard. I've gotten used to having either a garbage collector (Java, Python, etc.) or playing a little fast and loose with scoping rules (C, C++). You don't have to worry about object lifetime there. 
With Rust, it's forcing me to fully understand and use well the memory in my applications. In the final mistake I fixed (using .clone()) I would have been fine in C++ to just give away that reference and never use it again. I wouldn't have run into a "use-after-free" error, but I would have potentially been leaking memory. Rust forced me to be incredibly precise about how I use it. All said I'm excited for using Rust more. I think it's super cool, it's just going to take me a lot longer to do this than I originally thought. ","version":null,"tagName":"h2"},{"title":"PCA audio compression","type":0,"sectionRef":"#","url":"/2016/11/pca-audio-compression","content":"","keywords":"","version":null},{"title":"Towards a new (and pretty poor) compression scheme","type":1,"pageTitle":"PCA audio compression","url":"/2016/11/pca-audio-compression#towards-a-new-and-pretty-poor-compression-scheme","content":" I'm going to be working with some audio data for a while as I get prepared for a term project this semester. I'll be working (with a partner) to design a system for separating voices from music. Given my total lack of experience with Digital Signal Processing I figured that now was as good a time as ever to work on a couple of fun projects that would get me back up to speed. The first project I want to work on: Designing a new compression scheme for audio data. ","version":null,"tagName":"h2"},{"title":"A Brief Introduction to Audio Compression","type":1,"pageTitle":"PCA audio compression","url":"/2016/11/pca-audio-compression#a-brief-introduction-to-audio-compression","content":" Audio files when uncompressed (files ending with .wav) are huge. Like, 10.5 Megabytes per minute huge. Storage is cheap these days, but that's still an incredible amount of data that we don't really need. Instead, we'd like to compress that data so that it's not taking up so much space. 
There are broadly two ways to accomplish this: Lossless compression - Formats like FLAC, ALAC, and Monkey's Audio (.ape) all go down this route. The idea is that when you compress and uncompress a file, you get exactly the same as what you started with. Lossy compression - Formats like MP3, Ogg, and AAC (.m4a) are far more popular, but make a crucial tradeoff: We can reduce the file size even more during compression, but the decompressed file won't be the same. There is a fundamental tradeoff at stake: Using lossy compression sacrifices some of the integrity of the resulting file to save on storage space. Most people (I personally believe it's everybody) can't hear the difference, so this is an acceptable tradeoff. You have files that take up a 10th of the space, and nobody can tell there's a difference in audio quality. ","version":null,"tagName":"h2"},{"title":"A PCA-based Compression Scheme","type":1,"pageTitle":"PCA audio compression","url":"/2016/11/pca-audio-compression#a-pca-based-compression-scheme","content":" What I want to try out is a PCA approach to encoding audio. The PCA technique comes from Machine Learning, where it is used for a process called Dimensionality Reduction. Put simply, the idea is the same as lossy compression: if we can find a way that represents the data well enough, we can save on space. There are a lot of theoretical concerns that lead me to believe this compression style will not end well, but I'm interested to try it nonetheless. PCA works as follows: Given a dataset with a number of features, I find a way to approximate those original features using some "new features" that are statistically as close as possible to the original ones. This is comparable to a scheme like MP3: Given an original signal, I want to find a way of representing it that gets approximately close to what the original was. The difference is that PCA is designed for statistical data, and not signal data. But we won't let that stop us. 
The idea is as follows: Given a signal, reshape it into 1024 columns by however many rows are needed (zero-padded if necessary). Run the PCA algorithm, and do dimensionality reduction with a couple different settings. The number of components I choose determines the quality: If I use 1024 components, I will essentially be using the original signal. If I use a smaller number of components, I start losing some of the data that was in the original file. This will give me an idea of whether it's possible to actually build an encoding scheme off of this, or whether I'm wasting my time. ","version":null,"tagName":"h2"},{"title":"Running the Algorithm","type":1,"pageTitle":"PCA audio compression","url":"/2016/11/pca-audio-compression#running-the-algorithm","content":" The audio I will be using comes from the song Tabulasa, by Broke for Free. I'll be loading in the audio signal to Python and using Scikit-Learn to actually run the PCA algorithm. We first need to convert the FLAC file I have to a WAV: !ffmpeg -hide_banner -loglevel panic -i "Broke For Free/XXVII/01 Tabulasa.flac" "Tabulasa.wav" -c wav Then, let's go ahead and load a small sample so you can hear what is going on. from IPython.display import Audio from scipy.io import wavfile samplerate, tabulasa = wavfile.read('Tabulasa.wav') start = samplerate * 14 # 14 seconds in end = start + samplerate * 10 # 10 second duration Audio(data=tabulasa[start:end, 0], rate=samplerate) Next, we'll define the code we will be using to do PCA. It's very short, as the PCA algorithm is very simple. 
from sklearn.decomposition import PCA import numpy as np def pca_reduce(signal, n_components, block_size=1024): # First, zero-pad the signal so that it is divisible by the block_size samples = len(signal) hanging = block_size - np.mod(samples, block_size) padded = np.lib.pad(signal, (0, hanging), 'constant', constant_values=0) # Reshape the signal to have 1024 dimensions reshaped = padded.reshape((len(padded) // block_size, block_size)) # Second, do the actual PCA process pca = PCA(n_components=n_components) pca.fit(reshaped) transformed = pca.transform(reshaped) reconstructed = pca.inverse_transform(transformed).reshape((len(padded))) return pca, transformed, reconstructed Now that we've got our functions set up, let's try actually running something. First, we'll use n_components == block_size, which implies that we should end up with the same signal we started with. tabulasa_left = tabulasa[:,0] _, _, reconstructed = pca_reduce(tabulasa_left, 1024, 1024) Audio(data=reconstructed[start:end], rate=samplerate) OK, that does indeed sound like what we originally had. Let's drastically cut down the number of components we're doing this with as a sanity check: the audio quality should become incredibly poor. _, _, reconstructed = pca_reduce(tabulasa_left, 32, 1024) Audio(data=reconstructed[start:end], rate=samplerate) As expected, our reconstructed audio does sound incredibly poor! But there's something else very interesting going on here under the hood. Did you notice that the bassline comes across very well, but that there's no midrange or treble? The drums are almost entirely gone. ","version":null,"tagName":"h2"},{"title":"Drop the (Treble)","type":1,"pageTitle":"PCA audio compression","url":"/2016/11/pca-audio-compression#drop-the-treble","content":" It will help to understand PCA more fully when trying to read this part, but I'll do my best to break it down. PCA tries to find a way to best represent the dataset using "components." 
Think of each "component" as containing some of the information you need in order to reconstruct the full audio. For example, you might have a "low frequency" component that contains all the information you need in order to hear the bassline. There might be other components that explain the high frequency things like singers, or melodies, that you also need. What makes PCA interesting is that it attempts to find the "most important" components in explaining the signal. In a signal processing world, this means that PCA is trying to find the signal amongst the noise in your data. In our case, this means that PCA, when forced to work with small numbers of components, will chuck out the noisy components first. It's doing its best to reconstruct the signal, but it has to make sacrifices somewhere. So I've mentioned that PCA identifies the "noisy" components in our dataset. This is equivalent to saying that PCA removes the "high frequency" components in this case: it's very easy to represent a low-frequency signal like a bassline. It's far more difficult to represent a high-frequency signal because it's changing all the time. When you force PCA to make a tradeoff by using a small number of components, the best it can hope to do is replicate the low-frequency sections and skip the high-frequency things. This is a very interesting insight, and it also has echoes (pardon the pun) of how humans understand music in general. Other encoding schemes (like MP3, etc.) typically chop off a lot of the high-frequency range as well. There is typically a lot of high-frequency noise in audio that is nearly impossible to hear, so it's easy to remove it without anyone noticing. PCA ends up doing something similar, and while that certainly wasn't the intention, it is an interesting effect. 
","version":null,"tagName":"h2"},{"title":"A More Realistic Example","type":1,"pageTitle":"PCA audio compression","url":"/2016/11/pca-audio-compression#a-more-realistic-example","content":" So we've seen the edge cases so far: Using a large number of components results in audio very close to the original, and using a small number of components acts as a low-pass filter. How about we develop something that sounds "good enough" in practice, that we can use as a benchmark for size? We'll use ourselves as judges of audio quality, and build another function to help us estimate how much space we need to store everything in. from bz2 import compress import pandas as pd def raw_estimate(transformed, pca): # We assume that we'll be storing things as 16-bit WAV, # meaning two bytes per sample signal_bytes = transformed.tobytes() # PCA stores the components as floating point, we'll assume # that means 32-bit floats, so 4 bytes per element component_bytes = transformed.tobytes() # Return a result in megabytes return (len(signal_bytes) + len(component_bytes)) / (2**20) # Do an estimate for lossless compression applied on top of our # PCA reduction def bz2_estimate(transformed, pca): bytestring = transformed.tobytes() + b';' + pca.components_.tobytes() compressed = compress(bytestring) return len(compressed) / (2**20) compression_attempts = [ (1, 1), (1, 2), (1, 4), (4, 32), (16, 256), (32, 256), (64, 256), (128, 1024), (256, 1024), (512, 1024), (128, 2048), (256, 2048), (512, 2048), (1024, 2048) ] def build_estimates(signal, n_components, block_size): pca, transformed, recon = pca_reduce(tabulasa_left, n_components, block_size) raw_pca_estimate = raw_estimate(transformed, pca) bz2_pca_estimate = bz2_estimate(transformed, pca) raw_size = len(recon.tobytes()) / (2**20) return raw_size, raw_pca_estimate, bz2_pca_estimate pca_compression_results = pd.DataFrame([ build_estimates(tabulasa_left, n, bs) for n, bs in compression_attempts ]) pca_compression_results.columns = ["Raw", 
"PCA", "PCA w/ BZ2"] pca_compression_results.index = compression_attempts pca_compression_results \tRaw\tPCA\tPCA w/ BZ2(1, 1)\t69.054298\t138.108597\t16.431797 (1, 2)\t69.054306\t69.054306\t32.981380 (1, 4)\t69.054321\t34.527161\t16.715032 (4, 32)\t69.054443\t17.263611\t8.481735 (16, 256)\t69.054688\t8.631836\t4.274846 (32, 256)\t69.054688\t17.263672\t8.542909 (64, 256)\t69.054688\t34.527344\t17.097543 (128, 1024)\t69.054688\t17.263672\t9.430644 (256, 1024)\t69.054688\t34.527344\t18.870387 (512, 1024)\t69.054688\t69.054688\t37.800940 (128, 2048)\t69.062500\t8.632812\t6.185015 (256, 2048)\t69.062500\t17.265625\t12.366942 (512, 2048)\t69.062500\t34.531250\t24.736506 (1024, 2048)\t69.062500\t69.062500\t49.517493 As we can see, there are a couple of instances where we do nearly 20 times better on storage space than the uncompressed file. Let's here what that sounds like: _, _, reconstructed = pca_reduce(tabulasa_left, 16, 256) Audio(data=reconstructed[start:end], rate=samplerate) It sounds incredibly poor though. Let's try something that's a bit more realistic: _, _, reconstructed = pca_reduce(tabulasa_left, 1, 4) Audio(data=reconstructed[start:end], rate=samplerate) And just out of curiosity, we can try something that has the same ratio of components to block size. This should be close to an apples-to-apples comparison. _, _, reconstructed = pca_reduce(tabulasa_left, 64, 256) Audio(data=reconstructed[start:end], rate=samplerate) The smaller block size definitely has better high-end response, but I personally think the larger block size sounds better overall. ","version":null,"tagName":"h2"},{"title":"Conclusions","type":1,"pageTitle":"PCA audio compression","url":"/2016/11/pca-audio-compression#conclusions","content":" So, what do I think about audio compression using PCA? Strangely enough, it actually works pretty well relative to what I expected. That said, it's a terrible idea in general. First off, you don't really save any space. 
The component matrix needed to actually run the PCA algorithm takes up a lot of space on its own, so it's very difficult to save space without sacrificing a huge amount of audio quality. And even then, codecs like AAC sound very nice even at bitrates that this PCA method could only dream of. Second, there's the issue of audio streaming. PCA relies on two components: the datastream, and a matrix used to reconstruct the original signal. While it is easy to stream the data, you can't stream that matrix. And even if you divided the stream up into small blocks to give you a small matrix, you must guarantee that the matrix arrives; if you don't have that matrix, the data stream will make no sense whatsoever. All said, this was an interesting experiment. It's really cool seeing PCA used for signal analysis where I haven't seen it applied before, but I don't think it will lead to any practical results. Look forward to more signal processing stuff in the future! ","version":null,"tagName":"h2"},{"title":"Captain's Cookbook: Project setup","type":0,"sectionRef":"#","url":"/2018/01/captains-cookbook-part-1","content":"","keywords":"","version":null},{"title":"Step 1: Installing capnp","type":1,"pageTitle":"Captain's Cookbook: Project setup","url":"/2018/01/captains-cookbook-part-1#step-1-installing-capnp","content":" The capnp binary itself is needed for taking the schema files you write and turning them into a format that can be used by the code generation libraries. Don't ask me what that actually means, I just know that you need to make sure this is installed. I'll refer you to Cap'N Proto's installation instructions here. As a quick TLDR though: Linux users will likely have a binary shipped by their package manager - On Ubuntu, apt install capnproto is enough. OS X users can use Homebrew as an easy install path. Just brew install capnp. Things are a bit more complicated for Windows users. If you're using Chocolatey, there's a package available. 
If that doesn't work, however, you need to download a release zip and make sure that the capnp.exe binary is in your %PATH% environment variable. The way you know you're done with this step is if the following command works in your shell: capnp id ","version":null,"tagName":"h2"},{"title":"Step 2: Starting a Cap'N Proto Rust project","type":1,"pageTitle":"Captain's Cookbook: Project setup","url":"/2018/01/captains-cookbook-part-1#step-2-starting-a-capn-proto-rust-project","content":" After the capnp binary is set up, it's time to actually create our Rust project. Nothing terribly complex here, just a simple mkdir capnp_cookbook_1 cd capnp_cookbook_1 cargo init --bin We'll put the following content into Cargo.toml: [package] name = "capnp_cookbook_1" version = "0.1.0" authors = ["Bradlee Speice <bspeice@kcg.com>"] [build-dependencies] capnpc = "0.8" # 1 [dependencies] capnp = "0.8" # 2 This sets up: 1. The Rust code generator (CAPNProto Compiler) 2. The Cap'N Proto runtime library (CAPNProto runtime) We've now got everything prepared that we need for writing a Cap'N Proto project. ","version":null,"tagName":"h2"},{"title":"Step 3: Writing a basic schema","type":1,"pageTitle":"Captain's Cookbook: Project setup","url":"/2018/01/captains-cookbook-part-1#step-3-writing-a-basic-schema","content":" We're going to start with writing a pretty trivial data schema that we can extend later. This is just intended to make sure you get familiar with how to start from a basic project. First, we're going to create a top-level directory for storing the schema files in: # Assuming we're starting from the `capnp_cookbook_1` directory created earlier mkdir schema cd schema Now, we're going to put the following content in point.capnp: @0xab555145c708dad2; struct Point { x @0 :Int32; y @1 :Int32; } Pretty easy, we've now got a structure for an object we'll be able to quickly encode in a binary format. 
","version":null,"tagName":"h2"},{"title":"Step 4: Setting up the build process","type":1,"pageTitle":"Captain's Cookbook: Project setup","url":"/2018/01/captains-cookbook-part-1#step-4-setting-up-the-build-process","content":" Now it's time to actually set up the build process to make sure that Cap'N Proto generates the Rust code we'll eventually be using. This is typically done through a build.rs file to invoke the schema compiler. In the same folder as your Cargo.toml file, please put the following content in build.rs: extern crate capnpc; fn main() { ::capnpc::CompilerCommand::new() .src_prefix("schema") // 1 .file("schema/point.capnp") // 2 .run().expect("compiling schema"); } This sets up the protocol compiler (capnpc from earlier) to compile the schema we've built so far. Because Cap'N Proto schema files can re-use types specified in other files, the src_prefix() tells the compiler where to look for those extra files at.We specify the schema file we're including by hand. In a much larger project, you could presumably build the CompilerCommanddynamically, but we won't worry too much about that one for now. ","version":null,"tagName":"h2"},{"title":"Step 5: Running the build","type":1,"pageTitle":"Captain's Cookbook: Project setup","url":"/2018/01/captains-cookbook-part-1#step-5-running-the-build","content":" If you've done everything correctly so far, you should be able to actually build the project and see the auto-generated code. Run a cargo build command, and if you don't see cargo complaining, you're doing just fine! So where exactly does the generated code go to? I think it's critically important for people to be able to see what the generated code looks like, because you need to understand what you're actually programming against. The short answer is: the generated code lives somewhere in the target/ directory. 
The long answer is that you're best off running a find command to get the actual file path: # Assuming we're running from the capnp_cookbook_1 project folder find . -name point_capnp.rs Alternately, if the find command isn't available, the path will look something like: ./target/debug/build/capnp_cookbook_1-c6e2990393c32fe6/out/point_capnp.rs See if there are any paths in your target directory that look similar. Now, the file content looks pretty nasty. I've included an example here if you aren't following along at home. There are a couple things I'll try and point out though so you can get an idea of how the schema we wrote for the "Point" message is tied to the generated code. First, the Cap'N Proto library splits things up into Builder and Reader structs. These are best thought of the same way Rust separates mut from non-mut code. Builders are mut versions of your message, and Readers are immutable versions. For example, the Builder impl for point defines get_x(), set_x(), get_y(), and set_y() methods. In comparison, the Reader impl only defines get_x() and get_y() methods. So now we know that there are some get and set methods available for our x and y coordinates; but what do we actually do with those? ","version":null,"tagName":"h2"},{"title":"Step 6: Making a point","type":1,"pageTitle":"Captain's Cookbook: Project setup","url":"/2018/01/captains-cookbook-part-1#step-6-making-a-point","content":" So we've installed Cap'N Proto, gotten a project set up, and can generate schema code now. It's time to actually start building Cap'N Proto messages! I'm going to put the code you need here because it's small, and put some extra long comments inline. This code should go in src/main.rs: // Note that we use `capnp` here, NOT `capnpc` extern crate capnp; // We create a module here to define how we are to access the code // being included. 
pub mod point_capnp { // The environment variable OUT_DIR is set by Cargo, and // is the location of all the code that was built as part // of the codegen step. // point_capnp.rs is the actual file to include include!(concat!(env!("OUT_DIR"), "/point_capnp.rs")); } fn main() { // The process of building a Cap'N Proto message is a bit tedious. // We start by creating a generic Builder; it acts as the message // container that we'll later be filling with content of our `Point` let mut builder = capnp::message::Builder::new_default(); // Because we need a mutable reference to the `builder` later, // we fence off this part of the code to allow sequential mutable // borrows. As I understand it, non-lexical lifetimes: // https://github.com/rust-lang/rust-roadmap/issues/16 // will make this no longer necessary { // And now we can set up the actual message we're trying to create let mut point_msg = builder.init_root::<point_capnp::point::Builder>(); // Stuff our message with some content point_msg.set_x(12); point_msg.set_y(14); } // It's now time to serialize our message to binary. Let's set up a buffer for that: let mut buffer = Vec::new(); // And actually fill that buffer with our data capnp::serialize::write_message(&mut buffer, &builder).unwrap(); // Finally, let's deserialize the data let deserialized = capnp::serialize::read_message( &mut buffer.as_slice(), capnp::message::ReaderOptions::new() ).unwrap(); // `deserialized` is currently a generic reader; it understands // the content of the message we gave it (i.e. that there are two // int32 values) but doesn't really know what they represent (the Point). // This is where we map the generic data back into our schema. let point_reader = deserialized.get_root::<point_capnp::point::Reader>().unwrap(); // We can now get our x and y values back, and make sure they match assert_eq!(point_reader.get_x(), 12); assert_eq!(point_reader.get_y(), 14); } And with that, we've now got a functioning project. 
Here's the content I'm planning to go over next as we build up some practical examples of Cap'N Proto in action: ","version":null,"tagName":"h2"},{"title":"Captain's Cookbook: Practical usage","type":0,"sectionRef":"#","url":"/2018/01/captains-cookbook-part-2","content":"","keywords":"","version":null},{"title":"Attempt 1: Move the reference","type":1,"pageTitle":"Captain's Cookbook: Practical usage","url":"/2018/01/captains-cookbook-part-2#attempt-1-move-the-reference","content":" As a first attempt, we're going to try and let Rust move the reference. Our code will look something like: fn main() { // ...assume that we own a `buffer: Vec<u8>` containing the binary message content from // somewhere else let deserialized = capnp::serialize::read_message( &mut buffer.as_slice(), capnp::message::ReaderOptions::new() ).unwrap(); let point_reader = deserialized.get_root::<point_capnp::point::Reader>().unwrap(); // By using `point_reader` inside the new thread, we're hoping that Rust can // safely move the reference and invalidate the original thread's usage. // Since the original thread doesn't use `point_reader` again, this should // be safe, right? let handle = std::thread::spawn(move || { assert_eq!(point_reader.get_x(), 12); assert_eq!(point_reader.get_y(), 14); }); handle.join().unwrap(); } Well, the Rust compiler doesn't really like this. 
We get four distinct errors back: error[E0277]: the trait bound `*const u8: std::marker::Send` is not satisfied in `[closure@src/main.rs:31:37: 36:6 point_reader:point_capnp::point::Reader<'_>]` --> src/main.rs:31:18 | 31 | let handle = std::thread::spawn(move || { | ^^^^^^^^^^^^^^^^^^ `*const u8` cannot be sent between threads safely | error[E0277]: the trait bound `*const capnp::private::layout::WirePointer: std::marker::Send` is not satisfied in `[closure@src/main.rs:31:37: 36:6 point_reader:point_capnp::point::Reader<'_>]` --> src/main.rs:31:18 | 31 | let handle = std::thread::spawn(move || { | ^^^^^^^^^^^^^^^^^^ `*const capnp::private::layout::WirePointer` cannot be sent between threads safely | error[E0277]: the trait bound `capnp::private::arena::ReaderArena: std::marker::Sync` is not satisfied --> src/main.rs:31:18 | 31 | let handle = std::thread::spawn(move || { | ^^^^^^^^^^^^^^^^^^ `capnp::private::arena::ReaderArena` cannot be shared between threads safely | error[E0277]: the trait bound `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>: std::marker::Send` is not satisfied in `[closure@src/main.rs:31:37: 36:6 point_reader:point_capnp::point::Reader<'_>]` --> src/main.rs:31:18 | 31 | let handle = std::thread::spawn(move || { | ^^^^^^^^^^^^^^^^^^ `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>` cannot be sent between threads safely | error: aborting due to 4 previous errors Note, I've removed the help text for brevity, but suffice to say that these errors are intimidating. Pay attention to the text that keeps on getting repeated though: XYZ cannot be sent between threads safely. This is a bit frustrating: we own the buffer from which all the content was derived, and we don't have any unsafe accesses in our code. 
We guarantee that we wait for the child thread to stop first, so there's no possibility of the pointer becoming invalid because the original thread exits before the child thread does. So why is Rust preventing us from doing something that really should be legal? This is what is known as fighting the borrow checker. Let our crusade begin. ","version":null,"tagName":"h2"},{"title":"Attempt 2: Put the Reader in a Box","type":1,"pageTitle":"Captain's Cookbook: Practical usage","url":"/2018/01/captains-cookbook-part-2#attempt-2-put-the-reader-in-a-box","content":" The Box type allows us to convert a pointer we have (in our case the point_reader) into an "owned" value, which should be easier to send across threads. Our next attempt looks something like this: fn main() { // ...assume that we own a `buffer: Vec<u8>` containing the binary message content // from somewhere else let deserialized = capnp::serialize::read_message( &mut buffer.as_slice(), capnp::message::ReaderOptions::new() ).unwrap(); let point_reader = deserialized.get_root::<point_capnp::point::Reader>().unwrap(); let boxed_reader = Box::new(point_reader); // Now that the reader is `Box`ed, we've proven ownership, and Rust can // move the ownership to the new thread, right? let handle = std::thread::spawn(move || { assert_eq!(boxed_reader.get_x(), 12); assert_eq!(boxed_reader.get_y(), 14); }); handle.join().unwrap(); } Spoiler alert: still doesn't work. Same errors still show up. 
error[E0277]: the trait bound `*const u8: std::marker::Send` is not satisfied in `point_capnp::point::Reader<'_>` --> src/main.rs:33:18 | 33 | let handle = std::thread::spawn(move || { | ^^^^^^^^^^^^^^^^^^ `*const u8` cannot be sent between threads safely | error[E0277]: the trait bound `*const capnp::private::layout::WirePointer: std::marker::Send` is not satisfied in `point_capnp::point::Reader<'_>` --> src/main.rs:33:18 | 33 | let handle = std::thread::spawn(move || { | ^^^^^^^^^^^^^^^^^^ `*const capnp::private::layout::WirePointer` cannot be sent between threads safely | error[E0277]: the trait bound `capnp::private::arena::ReaderArena: std::marker::Sync` is not satisfied --> src/main.rs:33:18 | 33 | let handle = std::thread::spawn(move || { | ^^^^^^^^^^^^^^^^^^ `capnp::private::arena::ReaderArena` cannot be shared between threads safely | error[E0277]: the trait bound `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>: std::marker::Send` is not satisfied in `point_capnp::point::Reader<'_>` --> src/main.rs:33:18 | 33 | let handle = std::thread::spawn(move || { | ^^^^^^^^^^^^^^^^^^ `*const std::vec::Vec<std::option::Option<std::boxed::Box<capnp::private::capability::ClientHook + 'static>>>` cannot be sent between threads safely | error: aborting due to 4 previous errors Let's be a little bit smarter about the errors this time though. What is that std::marker::Send thing the compiler keeps telling us about? The documentation is pretty clear; Send is used to denote: Types that can be transferred across thread boundaries. In our case, we are seeing the error messages for two reasons: Pointers (*const u8) are not safe to send across thread boundaries. While we're nice in our code making sure that we wait on the child thread to finish before closing down, the Rust compiler can't make that assumption, and so complains that we're not using this in a safe manner. 
The point_capnp::point::Reader type is itself not safe to send across threads because it doesn't implement the Send trait. Which is to say, the things that make up a Reader are themselves not thread-safe, so the Reader is also not thread-safe. So, how are we to actually transfer a parsed Cap'N Proto message between threads? ","version":null,"tagName":"h2"},{"title":"Attempt 3: The TypedReader","type":1,"pageTitle":"Captain's Cookbook: Practical usage","url":"/2018/01/captains-cookbook-part-2#attempt-3-the-typedreader","content":" The TypedReader is a new API implemented in the Cap'N Proto Rust code. We're interested in it here for two reasons: It allows us to define an object where the object owns the underlying data. In previous attempts, the current context owned the data, but the Reader itself had no such control. We can compose the TypedReader using objects that are safe to Send across threads, guaranteeing that we can transfer parsed messages across threads. The actual type info for the TypedReader is a bit complex. And to be honest, I'm still really not sure what the whole point of the PhantomData thing is either. My impression is that it lets us enforce type safety when we know what the underlying Cap'N Proto message represents. That is, technically the only thing we're storing is the untyped binary message; PhantomData just enforces the principle that the binary represents some specific object that has been parsed. 
Either way, we can carefully construct something which is safe to move between threads: fn main() { // ...assume that we own a `buffer: Vec<u8>` containing the binary message content from somewhere else let deserialized = capnp::serialize::read_message( &mut buffer.as_slice(), capnp::message::ReaderOptions::new() ).unwrap(); let point_reader: capnp::message::TypedReader<capnp::serialize::OwnedSegments, point_capnp::point::Owned> = capnp::message::TypedReader::new(deserialized); // Because the point_reader is now working with OwnedSegments (which are owned vectors) and an Owned message // (which is 'static lifetime), this is now safe let handle = std::thread::spawn(move || { // The point_reader owns its data, and we use .get() to retrieve the actual point_capnp::point::Reader // object from it let point_root = point_reader.get().unwrap(); assert_eq!(point_root.get_x(), 12); assert_eq!(point_root.get_y(), 14); }); handle.join().unwrap(); } And while we've left Rust to do the dirty work of actually moving the point_reader into the new thread, we could also use things like mpsc channels to achieve a similar effect. So now we're able to define basic Cap'N Proto messages, and send them all around our programs. ","version":null,"tagName":"h2"},{"title":"Hello!","type":0,"sectionRef":"#","url":"/2018/05/hello","content":"I'll do what I can to keep this short, there's plenty of other things we both should be doing right now. If you're here for the bread pics, and to marvel in some other culinary side projects, I've got you covered: And no, I'm not posting pictures of earlier attempts that ended up turning into rocks in the oven. 
Okay, just one: Thanks, and keep it amazing.","keywords":"","version":null},{"title":"What I learned porting dateutil to Rust","type":0,"sectionRef":"#","url":"/2018/06/dateutil-parser-to-rust","content":"","keywords":"","version":null},{"title":"Slow down, what?","type":1,"pageTitle":"What I learned porting dateutil to Rust","url":"/2018/06/dateutil-parser-to-rust#slow-down-what","content":" OK, fine, I guess I should start with why someone would do this. Dateutil is a Python library for handling dates. The standard library support for time in Python is kinda dope, but there are a lot of extras that go into making it useful beyond just the datetime module. dateutil.parser specifically is code to take all the super-weird time formats people come up with and turn them into something actually useful. Date/time parsing, it turns out, is just like everything else involving computers and time: it feels like it shouldn't be that difficult to do, until you try to do it, and you realize that people suck and this is why we can't have nice things. But alas, we'll try and make contemporary art out of the rubble and give it a pretentious name like Time. What makes dateutil.parser great is that there's a single function with a single argument that drives what programmers interact with: parse(timestr). It takes in the time as a string, and gives you back a reasonable "look, this is the best anyone can possibly do to make sense of your input" value. It doesn't expect much of you. And now it's in Rust. ","version":null,"tagName":"h2"},{"title":"Lost in Translation","type":1,"pageTitle":"What I learned porting dateutil to Rust","url":"/2018/06/dateutil-parser-to-rust#lost-in-translation","content":" Having worked at a bulge-bracket bank watching Java programmers try to be Python programmers, I'm admittedly hesitant to publish Python code that's trying to be Rust. Interestingly, Rust code can actually do a great job of mimicking Python. 
It's certainly not idiomatic Rust, but I've had better experiences than this guy who attempted the same thing for D. These are the actual take-aways: When transcribing code, stay as close to the original library as possible. I'm talking about using the same variable names, same access patterns, the whole shebang. It's way too easy to make a couple of typos, and all of a sudden your code blows up in new and exciting ways. Having a reference manual for verbatim what your code should be means that you don't spend that long debugging complicated logic, you're more looking for typos. Also, don't use nice Rust things like enums. While one time it worked out OK for me, I also managed to shoot myself in the foot a couple times because dateutil stores AM/PM as a boolean and I mixed up which was true, and which was false (side note: AM is false, PM is true). In general, writing nice code should not be a first-pass priority when you're just trying to recreate the same functionality. Exceptions are a pain. Make peace with it. Python code is just allowed to skip stack frames. So when a co-worker told me "Rust is getting try-catch syntax" I properly freaked out. Turns out he's not quite right, and I'm OK with that. And while dateutil is pretty well-behaved about not skipping multiple stack frames, 130-line try-catch blocks take a while to verify. As another Python quirk, be very careful about long nested if-elif-else blocks. I used to think that Python's whitespace was just there to get you to format your code correctly. I think that no longer. It's way too easy to close a block too early and have incredibly weird issues in the logic. Make sure you use an editor that displays indentation levels so you can keep things straight. Rust macros are not free. I originally had the main test body wrapped up in a macro using pyo3. It took two minutes to compile. After moving things to a function, compile times dropped down to ~5 seconds. 
Turns out 150 lines * 100 tests = a lot of redundant code to be compiled. My new rule of thumb is that any macros longer than 10-15 lines are actually functions that need to be liberated, man. Finally, I really miss list comprehensions and dictionary comprehensions. As a quick comparison, see this dateutil code and the implementation in Rust. I probably wrote it wrong, and I'm sorry. Ultimately though, I hope that these comprehensions can be added through macros or syntax extensions. Either way, they're expressive, save typing, and are super-readable. Let's get more of that. ","version":null,"tagName":"h2"},{"title":"Using a young language","type":1,"pageTitle":"What I learned porting dateutil to Rust","url":"/2018/06/dateutil-parser-to-rust#using-a-young-language","content":" Now, Rust is exciting and new, which means that there's opportunity to make a substantive impact. On more than one occasion though, I've had issues navigating the Rust ecosystem. What I'll call the "canonical library" is still being built. In Python, if you need datetime parsing, you use dateutil. If you want decimal types, it's already in the standard library. While I might've gotten away with f64, dateutil uses decimals, and I wanted to follow the principle of staying as close to the original library as possible. Thus began my quest to find a decimal library in Rust. What I quickly found was summarized in a comment: Writing a BigDecimal is easy. Writing a good BigDecimal is hard. -cmr In practice, this means that there are at least 4 different implementations available. And that's a lot of decisions to worry about when all I'm thinking is "why can't calendar reform be a thing" and I'm forced to dig through a couple different threads to figure out if the library I'm looking at is dead or just stable. And even when the "canonical library" exists, there's no guarantees that it will be well-maintained. Chrono is the de facto date/time library in Rust, and just released version 0.4.4 like two days ago. 
Meanwhile, chrono-tz appears to be dead in the water even though there are people happy to help maintain it. I know relatively little about it, but it appears that most of the release process is automated; keeping that up to date should be a no-brainer. ","version":null,"tagName":"h2"},{"title":"Trial Maintenance Policy","type":1,"pageTitle":"What I learned porting dateutil to Rust","url":"/2018/06/dateutil-parser-to-rust#trial-maintenance-policy","content":" Specifically given "maintenance" being an oft-discussed issue, I'm going to try out the following policy to keep things moving on dtparse: Issues/PRs needing maintainer feedback will be updated at least weekly. I want to make sure nobody's blocking on me. To keep issues/PRs needing contributor feedback moving, I'm going to (kindly) ask the contributor to check in after two weeks, and close the issue without resolution if I hear nothing back after a month. The second point I think has the potential to be a bit controversial, so I'm happy to receive feedback on that. And if a contributor responds with "hey, still working on it, had a kid and I'm running on 30 seconds of sleep a night," then first: congratulations on sustaining human life. And second: I don't mind keeping those requests going indefinitely. I just want to try and balance keeping things moving with giving people the necessary time they need. I should also note that I'm still getting some best practices in place - CONTRIBUTING and CONTRIBUTORS files need to be added, as well as issue/PR templates. In progress. None of us are perfect. ","version":null,"tagName":"h2"},{"title":"Roadmap and Conclusion","type":1,"pageTitle":"What I learned porting dateutil to Rust","url":"/2018/06/dateutil-parser-to-rust#roadmap-and-conclusion","content":" So if I've now built a dateutil-compatible parser, we're done, right? Of course not! That's not nearly ambitious enough. 
Ultimately, I'd love to have a library that's capable of parsing everything the Linux date command can do (and not date on OSX, because seriously, BSD coreutils are the worst). I know Rust has a coreutils rewrite going on, and dtparse would potentially be an interesting candidate since it doesn't bring in a lot of extra dependencies. humantime could help pick up some of the (current) slack in dtparse, so maybe we can share and care with each other? All in all, I'm mostly hoping that nobody's already done this and I haven't spent a bit over a month on redundant code. So if it exists, tell me. I need to know, but be nice about it, because I'm going to take it hard. And in the meantime, I'm looking forward to building more. Onwards. ","version":null,"tagName":"h2"},{"title":"Isomorphic desktop apps with Rust","type":0,"sectionRef":"#","url":"/2018/09/isomorphic-apps","content":"I both despise Javascript and am stunned by its success doing some really cool things. It's this duality that's led me to a couple of (very) late nights over the past weeks trying to reconcile myself as I bootstrap a simple desktop application. See, as much as Webassembly isn't trying to replace Javascript, I want Javascript gone. There are plenty of people who don't share my views, and they are probably nicer and more fun at parties. But I cringe every time "Webpack" is mentioned, and I think it's hilarious that the language specification dramatically outpaces anyone's actual implementation. The answer to this conundrum is of course to recompile code from newer versions of the language to older versions of the same language before running. At least Babel is a nice tongue-in-cheek reference. Yet for as much hate as Electron receives, it does a stunningly good job at solving a really hard problem: how the hell do I put a button on the screen and react when the user clicks it? GUI programming is hard, straight up. 
But if browsers are already able to run everywhere, why don't we take advantage of someone else solving the hard problems for us? I don't like that I have to use Javascript for it, but I really don't feel inclined to whip out good ol' wxWidgets. Now there are other native solutions (libui-rs, conrod, oh hey wxWidgets again!), but those also have their own issues with distribution, styling, etc. With Electron, I can yarn create electron-app my-app and just get going, knowing that packaging/upgrades/etc. are built in. My question is: given recent innovations with WASM, are we Electron yet? No, not really. Instead, what would it take to get to a point where we can skip Javascript in Electron apps? Truth is, WASM/Webassembly is a pretty new technology and I'm a total beginner in this area. There may already be solutions to the issues I discuss, but I'm totally unaware of them, so I'm going to try and organize what I did manage to discover. I should also mention that the content and things I'm talking about here are not intended to be prescriptive, but more "if someone else is interested, what do we already know doesn't work?" I expect everything in this post to be obsolete within two months. Even over the course of writing this, a separate blog post had to be modified because upstream changes broke a Rust tool the post tried to use. The post ultimately got updated, but all this happened within the span of a week. Things are moving quickly. I'll also note that we're going to skip asm.js and emscripten. Truth be told, I couldn't get either of these to output anything, and so I'm just going to say here be dragons. Everything I'm discussing here uses the wasm32-unknown-unknown target. The code that I did get running is available over here. Feel free to use it as a starting point, but I'm mostly including the link as a reference for the things that were attempted. 
An Example Running Application So, I did technically get a running application: ...which you can also try out if you want: git clone https://github.com/speice-io/isomorphic-rust.git cd isomorphic_rust/percy yarn install && yarn start ...but I wouldn't really call it a "high quality" starting point to base future work on. It's mostly there to prove this is possible in the first place. And that's something to be proud of! There's a huge amount of engineering that went into showing a window with the text "It's alive!". There's also a lot of usability issues that prevent me from recommending anyone try Electron and WASM apps at the moment, and I think that's the more important thing to discuss. Issue the First: Complicated Toolchains I quickly established that wasm-bindgen was necessary to "link" my Rust code to Javascript. At that point you've got an Electron app that starts an HTML page which ultimately fetches your WASM blob. To keep things simple, the goal was to package everything using webpack so that I could just load a bundle.js file on the page. That decision was to be the last thing that kinda worked in this process. The first issue I ran into while attempting to bundle everything via webpack is a detail in the WASM spec: This function accepts a Response object, or a promise for one, and ... [if it] does not match the application/wasm MIME type, the returned promise will be rejected with a TypeError; WebAssembly - Additional Web Embedding API Specifically, if you try and load a WASM blob without the MIME type set, you'll get an error. On the web this isn't a huge issue, as the server can set MIME types when delivering the blob. 
With Electron, you're resolving things with a file:// URL and thus can't control the MIME type: There are a couple of solutions depending on how far into the deep end you care to venture: Embed a static file server in your Electron application; use a custom protocol and custom protocol handler; host your WASM blob on a website that you resolve at runtime. But all these are pretty bad solutions and defeat the purpose of using WASM in the first place. Instead, my workaround was to open a PR with webpack and use regex to remove calls to instantiateStreaming in the build script: cargo +nightly build --target=wasm32-unknown-unknown && \\ wasm-bindgen "$WASM_DIR/debug/$WASM_NAME.wasm" --out-dir "$APP_DIR" --no-typescript && \\ # Have to use --mode=development so we can patch out the call to instantiateStreaming "$DIR/node_modules/webpack-cli/bin/cli.js" --mode=development "$APP_DIR/app_loader.js" -o "$APP_DIR/bundle.js" && \\ sed -i 's/.*instantiateStreaming.*//g' "$APP_DIR/bundle.js" Once that lands, the build process becomes much simpler: cargo +nightly build --target=wasm32-unknown-unknown && \\ wasm-bindgen "$WASM_DIR/debug/$WASM_NAME.wasm" --out-dir "$APP_DIR" --no-typescript && \\ "$DIR/node_modules/webpack-cli/bin/cli.js" --mode=production "$APP_DIR/app_loader.js" -o "$APP_DIR/bundle.js" But we're not done yet! After we compile Rust into WASM and link WASM to Javascript (via wasm-bindgen and webpack), we still have to make an Electron app. For this purpose I used a starter app from Electron Forge, and then a prestart script to actually handle starting the application. The final toolchain looks something like this: yarn start triggers the prestart script; prestart checks for missing tools (wasm-bindgen-cli, etc.) 
and then: uses cargo to compile the Rust code into WASM; uses wasm-bindgen to link the WASM blob into a Javascript file with exported symbols; uses webpack to bundle the page start script with the Javascript we just generated (using babel under the hood to compile the wasm-bindgen code down from ES6 into something browser-compatible); the start script runs an Electron Forge handler to do some sanity checks; Electron actually starts ...which is complicated. I think more work needs to be done to either build a high-quality starter app that can manage these steps, or another tool that "just handles" the complexity of linking a compiled WASM file into something the Electron browser can run. Issue the Second: WASM tools in Rust For as much as I didn't enjoy the Javascript tooling needed to interface with Rust, the Rust-only bits aren't any better at the moment. I get it, a lot of projects are just starting off, and that leads to a fragmented ecosystem. Here's what I can recommend as a starting point: Don't check in your Cargo.lock files to version control. If there's a disagreement between the version of wasm-bindgen-cli you have installed and the wasm-bindgen you're compiling with in Cargo.lock, you get a nasty error: it looks like the Rust project used to create this wasm file was linked against a different version of wasm-bindgen than this binary: rust wasm file: 0.2.21 this binary: 0.2.17 Currently the bindgen format is unstable enough that these two version must exactly match, so it's required that these two version are kept in sync by either updating the wasm-bindgen dependency or this binary. Not that I ever managed to run into this myself (coughs nervously). There are two projects attempting to be "application frameworks": percy and yew. 
Between those, I managed to get two examples running using percy, but was unable to get an example running with yew because of issues with "missing modules" during the webpack step: ERROR in ./dist/electron_yew_wasm_bg.wasm Module not found: Error: Can't resolve 'env' in '/home/bspeice/Development/isomorphic_rust/yew/dist' @ ./dist/electron_yew_wasm_bg.wasm @ ./dist/electron_yew_wasm.js @ ./dist/app.js @ ./dist/app_loader.js If you want to work with the browser APIs directly, your choices are percy-webapis or stdweb (or eventually web-sys). See above for my percy examples, but when I tried an example with stdweb, I was unable to get it running: ERROR in ./dist/stdweb_electron_bg.wasm Module not found: Error: Can't resolve 'env' in '/home/bspeice/Development/isomorphic_rust/stdweb/dist' @ ./dist/stdweb_electron_bg.wasm @ ./dist/stdweb_electron.js @ ./dist/app_loader.js At this point I'm pretty convinced that stdweb is causing issues for yew as well, but can't prove it. I did also get a minimal example running that doesn't depend on any tools besides wasm-bindgen. However, it requires manually writing "extern C" blocks for everything you need from the browser. Es no bueno. Finally, from a tools and platform view, there are two up-and-coming packages that should be mentioned: js-sys and web-sys. Their purpose is to be fundamental building blocks that expose the browser's APIs to Rust. If you're interested in building an app framework from scratch, these should give you the most flexibility. I didn't touch either in my research, though I expect them to be essential long-term. So there's a lot in play from the Rust side of things, and it's just going to take some time to figure out what works and what doesn't. Issue the Third: Known Unknowns Alright, so after I managed to get an application started, I stopped there. 
It was a good deal of effort to chain together even a proof of concept, and at this point I'd rather learn Typescript than keep trying to maintain an incredibly brittle pipeline. Blasphemy, I know... The important point I want to make is that there's a lot unknown about how any of this holds up outside proofs of concept. Things I didn't attempt: Testing; Packaging; Updates; Literally anything related to why I wanted to use Electron in the first place. What it Would Take Much as I don't like Javascript, the tools are too shaky for me to recommend mixing Electron and WASM at the moment. There's a lot of innovation happening, so who knows? Someone might have an application in production a couple months from now. But at the moment, I'm personally going to stay away. Let's finish with a wishlist then - here are the things that I think need to happen before Electron/WASM/Rust can become a thing: Webpack still needs some updates. The necessary work is in progress, but hasn't landed yet (#7983). Browser API libraries (web-sys and stdweb) need to make sure they can support running in Electron (see module error above). Projects need to stabilize. There's talk of stdweb being turned into a Rust API on top of web-sys, and percy moving to web-sys, both of which are big changes. wasm-bindgen is great, but still in the "move fast and break things" phase. A good "boilerplate" app would dramatically simplify the start-up costs; electron-react-boilerplate comes to mind as a good project to imitate. More blog posts/contributors! 
I think Electron + Rust could be cool, but I have no idea what I'm doing","keywords":"","version":null},{"title":"Primitives in Rust are weird (and cool)","type":0,"sectionRef":"#","url":"/2018/09/primitives-in-rust-are-weird","content":"","keywords":"","version":null},{"title":"Defining primitives (Java)","type":1,"pageTitle":"Primitives in Rust are weird (and cool)","url":"/2018/09/primitives-in-rust-are-weird#defining-primitives-java","content":" The reason I'm using the name primitive comes from how much of my life is Java right now. For the most part I like Java, but I digress. In Java, there's a special name for some specific types of values: boolean char byte short int long float double They are referred to as primitives. And relative to the other bits of Java, they have two unique features. First, they don't have to worry about the billion-dollar mistake; primitives in Java can never be null. Second: they can't have instance methods. Remember that Rust program from earlier? Java has no idea what to do with it: class Main { public static void main(String[] args) { int x = 8; System.out.println(x.toString()); // Triggers a compiler error } } The error is: Main.java:5: error: int cannot be dereferenced System.out.println(x.toString()); ^ 1 error Specifically, Java's Object and things that inherit from it are pointers under the hood, and we have to dereference them before the fields and methods they define can be used. In contrast, primitive types are just values - there's nothing to be dereferenced. In memory, they're just a sequence of bits. If we really want, we can turn the int into an Integer and then dereference it, but it's a bit wasteful: class Main { public static void main(String[] args) { int x = 8; Integer y = Integer.valueOf(x); System.out.println(y.toString()); } } This creates the variable y of type Integer (which inherits Object), and at run time we dereference y to locate the toString() function and call it. 
Rust obviously handles things a bit differently, but we have to dig into the low-level details to see it in action. ","version":null,"tagName":"h2"},{"title":"Low Level Handling of Primitives (C)","type":1,"pageTitle":"Primitives in Rust are weird (and cool)","url":"/2018/09/primitives-in-rust-are-weird#low-level-handling-of-primitives-c","content":" We first need to build a foundation for reading and understanding the assembly code the final answer requires. Let's begin with showing how the C language (and your computer) thinks about "primitive" values in memory: void my_function(int num) {} int main() { int x = 8; my_function(x); } The compiler explorer gives us an easy way of showing off the assembly-level code that's generated: whose output has been lightly edited main: push rbp mov rbp, rsp sub rsp, 16 ; We assign the value `8` to `x` here mov DWORD PTR [rbp-4], 8 ; And copy the bits making up `x` to a location ; `my_function` can access (`edi`) mov eax, DWORD PTR [rbp-4] mov edi, eax ; Call `my_function` and give it control call my_function mov eax, 0 leave ret my_function: push rbp mov rbp, rsp ; Copy the bits out of the pre-determined location (`edi`) ; to somewhere we can use mov DWORD PTR [rbp-4], edi nop pop rbp ret At a really low level of memory, we're copying bits around using the mov instruction; nothing crazy. But to show how similar Rust is, let's take a look at our program translated from C to Rust: fn my_function(x: i32) {} fn main() { let x = 8; my_function(x) } And the assembly generated when we stick it in the compiler explorer: again, lightly edited example::main: push rax ; Look familiar? 
We're copying bits to a location for `my_function` ; The compiler just optimizes out holding `x` in memory mov edi, 8 ; Call `my_function` and give it control call example::my_function pop rax ret example::my_function: sub rsp, 4 ; And copying those bits again, just like in C mov dword ptr [rsp], edi add rsp, 4 ret The generated Rust assembly is functionally pretty close to the C assembly: When working with primitives, we're just dealing with bits in memory. In Java we have to dereference a pointer to call its functions; in Rust, there's no pointer to dereference. So what exactly is going on with this .to_string() function call? ","version":null,"tagName":"h2"},{"title":"impl primitive (and Python)","type":1,"pageTitle":"Primitives in Rust are weird (and cool)","url":"/2018/09/primitives-in-rust-are-weird#impl-primitive-and-python","content":" Now it's time to reveal my trap card... I mean, show the revelation that tied all this together: Rust has implementations for its primitive types. That's right, impl blocks aren't only for structs and traits, primitives get them too. Don't believe me? Check out u32, f64 and char as examples. But the really interesting bit is how Rust turns those impl blocks into assembly. Let's break out the compiler explorer once again: pub fn main() { 8.to_string(); } And the interesting bits in the assembly: heavily trimmed down example::main: sub rsp, 24 mov rdi, rsp lea rax, [rip + .Lbyte_str.u] mov rsi, rax ; Cool stuff right here call <T as alloc::string::ToString>::to_string@PLT mov rdi, rsp call core::ptr::drop_in_place add rsp, 24 ret Now, this assembly is a bit more complicated, but here's the big revelation: we're calling to_string() as a function that exists all on its own, and giving it the instance of 8. Instead of thinking of the value 8 as an instance of i32 and then peeking in to find the location of the function we want to call (like Java), we have a function that exists outside of the instance and just give that function the value 8. 
This is an incredibly technical detail, but the interesting idea I had was this: if to_string() is a static function, can I refer to the unbound function and give it an instance? Better explained in code (and a compiler explorer link because I seriously love this thing): struct MyVal { x: u32 } impl MyVal { fn to_string(&self) -> String { self.x.to_string() } } pub fn main() { let my_val = MyVal { x: 8 }; // THESE ARE THE SAME my_val.to_string(); MyVal::to_string(&my_val); } Rust is totally fine "binding" the function call to the instance, and also as a static. MIND == BLOWN. Python does the same thing where I can both call functions bound to their instances and also call as an unbound function where I give it the instance: class MyClass(): x = 24 def my_function(self): print(self.x) m = MyClass() m.my_function() MyClass.my_function(m) And Python tries to make you think that primitives can have instance methods... >>> dir(8) ['__abs__', '__add__', '__and__', '__class__', '__cmp__', '__coerce__', '__delattr__', '__div__', '__divmod__', '__doc__', '__float__', '__floordiv__', ... '__setattr__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', ...] >>> # Theoretically `8.__str__()` should exist, but: >>> 8.__str__() File "<stdin>", line 1 8.__str__() ^ SyntaxError: invalid syntax >>> # It will run if we assign it first though: >>> x = 8 >>> x.__str__() '8' ...but in practice it's a bit complicated. So while Python handles binding instance methods in a way similar to Rust, it's still not able to run the example we started with. ","version":null,"tagName":"h2"},{"title":"Conclusion","type":1,"pageTitle":"Primitives in Rust are weird (and cool)","url":"/2018/09/primitives-in-rust-are-weird#conclusion","content":" This was a super-roundabout way of demonstrating it, but the way Rust handles incredibly minor details like primitives leads to really cool effects. 
Primitives are optimized like C in how they have a space-efficient memory layout, yet the language still has a lot of features I enjoy in Python (like both instance and late binding). And when you put it together, there are areas where Rust does cool things nobody else can; as a quirky feature of Rust's type system, 8.to_string() is actually valid code. Now go forth and fool your friends into thinking you know assembly. This is all I've got. ","version":null,"tagName":"h2"},{"title":"A case study in heaptrack","type":0,"sectionRef":"#","url":"/2018/10/case-study-optimization","content":"","keywords":"","version":null},{"title":"Curiosity","type":1,"pageTitle":"A case study in heaptrack","url":"/2018/10/case-study-optimization#curiosity","content":" When I first started building the dtparse crate, my intention was to mirror as closely as possible the equivalent Python library. Python, as you may know, is garbage collected. Very rarely is memory usage considered in Python, and I likewise wasn't paying too much attention when dtparse was first being built. This lackadaisical approach to memory works well enough, and I'm not planning on making dtparse hyper-efficient. But every so often, I've wondered: "what exactly is going on in memory?" With the advent of Rust 1.28 and the Global Allocator trait, I had a really great idea: build a custom allocator that allows you to track your own allocations. That way, you can do things like writing tests for both correct results and correct memory usage. I gave it a shot, but learned very quickly: never write your own allocator. It went from "fun weekend project" to "I have literally no idea what my computer is doing" at breakneck speed. Instead, I'll highlight a separate path I took to make sense of my memory usage: heaptrack. 
","version":null,"tagName":"h2"},{"title":"Turning on the System Allocator","type":1,"pageTitle":"A case study in heaptrack","url":"/2018/10/case-study-optimization#turning-on-the-system-allocator","content":" This is the hardest part of the post. Because Rust uses its own allocator by default, heaptrack is unable to properly record unmodified Rust code. To remedy this, we'll make use of the #[global_allocator] attribute. Specifically, in lib.rs or main.rs, add this: use std::alloc::System; #[global_allocator] static GLOBAL: System = System; ...and that's it. Everything else comes essentially for free. ","version":null,"tagName":"h2"},{"title":"Running heaptrack","type":1,"pageTitle":"A case study in heaptrack","url":"/2018/10/case-study-optimization#running-heaptrack","content":" Assuming you've installed heaptrack (Homebrew in Mac, package manager in Linux, ??? in Windows), all that's left is to fire up your application: heaptrack my_application It's that easy. After the program finishes, you'll see a file in your local directory with a name like heaptrack.my_application.XXXX.gz. If you load that up in heaptrack_gui, you'll see something like this: And even these pretty colors: ","version":null,"tagName":"h2"},{"title":"Reading Flamegraphs","type":1,"pageTitle":"A case study in heaptrack","url":"/2018/10/case-study-optimization#reading-flamegraphs","content":" To make sense of our memory usage, we're going to focus on that last picture - it's called a "flamegraph". These charts are typically used to show how much time your program spends executing each function, but they're used here to show how much memory was allocated during those functions instead. 
For example, we can see that all executions happened during the main function: ...and within that, all allocations happened during dtparse::parse: ...and within that, allocations happened in two different places: Now I apologize that it's hard to see, but there's one area specifically that stuck out as an issue: what the heck is the Default thing doing? ","version":null,"tagName":"h2"},{"title":"Optimizing dtparse","type":1,"pageTitle":"A case study in heaptrack","url":"/2018/10/case-study-optimization#optimizing-dtparse","content":" See, I knew that there were some allocations during calls to dtparse::parse, but I was totally wrong about where the bulk of allocations occurred in my program. Let me post the code and see if you can spot the mistake: /// Main entry point for using `dtparse`. pub fn parse(timestr: &str) -> ParseResult<(NaiveDateTime, Option<FixedOffset>)> { let res = Parser::default().parse( timestr, None, None, false, false, None, false, &HashMap::new(), )?; Ok((res.0, res.1)) } Because Parser::parse requires a mutable reference to itself, I have to create a new Parser::default every time it receives a string. This is excessive! We'd rather have an immutable parser that can be re-used, and avoid allocating memory in the first place. Armed with that information, I put some time in to make the parser immutable. Now that I can re-use the same parser over and over, the allocations disappear: In total, we went from requiring 2 MB of memory in version 1.0.2: All the way down to 300KB in version 1.0.3: ","version":null,"tagName":"h2"},{"title":"Conclusion","type":1,"pageTitle":"A case study in heaptrack","url":"/2018/10/case-study-optimization#conclusion","content":" In the end, you don't need to write a custom allocator to be efficient with memory, great tools already exist to help you understand what your program is doing. Use them. Given that Moore's Law is dead, we've all got to do our part to take back what Microsoft stole. 
","version":null,"tagName":"h2"},{"title":"QADAPT - debug_assert! for allocations","type":0,"sectionRef":"#","url":"/2018/12/allocation-safety","content":"","keywords":"","version":null},{"title":"Why an Allocator?","type":1,"pageTitle":"QADAPT - debug_assert! for allocations","url":"/2018/12/allocation-safety#why-an-allocator","content":" So why, after complaining about allocators, would I still want to write one? There are three reasons for that: Allocation/dropping is slow; it's difficult to know exactly when Rust will allocate or drop, especially when using code that you did not write; I want automated tools to verify behavior, instead of inspecting by hand. When I say "slow," it's important to define the terms. If you're writing web applications, you'll spend orders of magnitude more time waiting for the database than you will the allocator. However, there's still plenty of code where micro- or nano-seconds matter; think finance, real-time audio, self-driving cars, and networking. In these situations it's simply unacceptable for you to spend time doing things that are not your program, and waiting on the allocator is not cool. As I continue to learn Rust, it's difficult for me to predict where exactly allocations will happen. So, I propose we play a quick trivia game: Does this code invoke the allocator? ","version":null,"tagName":"h2"},{"title":"Example 1","type":1,"pageTitle":"QADAPT - debug_assert! for allocations","url":"/2018/12/allocation-safety#example-1","content":" fn my_function() { let v: Vec<u8> = Vec::new(); } No: Rust knows how big the Vec type is, and reserves a fixed amount of memory on the stack for the v vector. However, if we wanted to reserve extra space (using Vec::with_capacity) the allocator would get invoked. ","version":null,"tagName":"h3"},{"title":"Example 2","type":1,"pageTitle":"QADAPT - debug_assert! 
for allocations","url":"/2018/12/allocation-safety#example-2","content":" fn my_function() { let v: Box<Vec<u8>> = Box::new(Vec::new()); } Yes: Because Boxes allow us to work with things that are of unknown size, they have to allocate on the heap. While the Box is unnecessary in this snippet (release builds will optimize out the allocation), reserving heap space more generally is needed to pass a dynamically sized type to another function. ","version":null,"tagName":"h3"},{"title":"Example 3","type":1,"pageTitle":"QADAPT - debug_assert! for allocations","url":"/2018/12/allocation-safety#example-3","content":" fn my_function(mut v: Vec<u8>) { v.push(5); } Maybe: Depending on whether the Vec we were given has space available, we may or may not allocate. Especially when dealing with code that you did not author, it's difficult to verify that things behave as you expect them to. ","version":null,"tagName":"h3"},{"title":"Blowing Things Up","type":1,"pageTitle":"QADAPT - debug_assert! for allocations","url":"/2018/12/allocation-safety#blowing-things-up","content":" So, how exactly does QADAPT solve these problems? Whenever an allocation or drop occurs in code marked allocation-safe, QADAPT triggers a thread panic. We don't want to let the program continue as if nothing strange happened; we want things to explode. However, you don't want code to panic in production because of circumstances you didn't predict. Just like debug_assert!, QADAPT will strip out its own code when building in release mode to guarantee no panics and no performance impact. Finally, there are three ways to have QADAPT check that your code will not invoke the allocator: ","version":null,"tagName":"h2"},{"title":"Using a procedural macro","type":1,"pageTitle":"QADAPT - debug_assert! 
for allocations","url":"/2018/12/allocation-safety#using-a-procedural-macro","content":" The easiest method: watch an entire function for allocator invocation: use qadapt::no_alloc; use qadapt::QADAPT; #[global_allocator] static Q: QADAPT = QADAPT; #[no_alloc] fn push_vec(v: &mut Vec<u8>) { // This triggers a panic if v.len() == v.capacity() v.push(5); } fn main() { let mut v = Vec::with_capacity(1); // This will *not* trigger a panic push_vec(&mut v); // This *will* trigger a panic push_vec(&mut v); } ","version":null,"tagName":"h3"},{"title":"Using a regular macro","type":1,"pageTitle":"QADAPT - debug_assert! for allocations","url":"/2018/12/allocation-safety#using-a-regular-macro","content":" For times when you need more precision: use qadapt::assert_no_alloc; use qadapt::QADAPT; #[global_allocator] static Q: QADAPT = QADAPT; fn main() { let mut v = Vec::with_capacity(1); // No allocations here, we already have space reserved assert_no_alloc!(v.push(5)); // Even though we remove an item, it doesn't trigger a drop // because it's a scalar. If it were a `Box<_>` type, // a drop would trigger. assert_no_alloc!({ v.pop().unwrap(); }); } ","version":null,"tagName":"h3"},{"title":"Using function calls","type":1,"pageTitle":"QADAPT - debug_assert! for allocations","url":"/2018/12/allocation-safety#using-function-calls","content":" Both the most precise and most tedious: use qadapt::enter_protected; use qadapt::exit_protected; use qadapt::QADAPT; #[global_allocator] static Q: QADAPT = QADAPT; fn main() { // This triggers an allocation (on non-release builds) let mut v = Vec::with_capacity(1); enter_protected(); // This does not trigger an allocation because we've reserved size v.push(0); exit_protected(); // This triggers an allocation because we ran out of size, // but doesn't panic because we're no longer protected. v.push(1); } ","version":null,"tagName":"h3"},{"title":"Caveats","type":1,"pageTitle":"QADAPT - debug_assert! 
for allocations","url":"/2018/12/allocation-safety#caveats","content":" It's important to point out that QADAPT code is synchronous, so please be careful when mixing in asynchronous functions: use futures::future::Future; use futures::future::ok; #[no_alloc] fn async_capacity() -> impl Future<Item=Vec<u8>, Error=()> { ok(12).and_then(|e| Ok(Vec::with_capacity(e))) } fn main() { // This doesn't trigger a panic because the `and_then` closure // wasn't run during the function call. async_capacity(); // Still no panic assert_no_alloc!(async_capacity()); // This will panic because the allocation happens during `unwrap` // in the `assert_no_alloc!` macro assert_no_alloc!(async_capacity().poll().unwrap()); } ","version":null,"tagName":"h3"},{"title":"Conclusion","type":1,"pageTitle":"QADAPT - debug_assert! for allocations","url":"/2018/12/allocation-safety#conclusion","content":" While there's a lot more to writing high-performance code than managing your usage of the allocator, it's critical that you do use the allocator correctly. QADAPT will verify that your code is doing what you expect. It's usable even on stable Rust from version 1.31 onward, which isn't the case for most allocators. Version 1.0 was released today, and you can check it out over at crates.io or on github. I'm hoping to write more about high-performance Rust in the future, and I expect that QADAPT will help guide that. If there are topics you're interested in, let me know in the comments below! 
","version":null,"tagName":"h2"},{"title":"More \"what companies really mean\"","type":0,"sectionRef":"#","url":"/2018/12/what-small-business-really-means","content":"","keywords":"","version":null},{"title":"How do you feel about production support?","type":1,"pageTitle":"More \"what companies really mean\"","url":"/2018/12/what-small-business-really-means#how-do-you-feel-about-production-support","content":" Translation: We're a fairly small team, and when things break on an evening/weekend/Christmas Day, can we call on you to be there? I've met decidedly few people in my life who truly enjoy the "ops" side of "devops". They're incredibly good at taking an impossible problem, pre-existing knowledge of arcane arts, and turning that into a functioning system at the end. And if they all left for lunch, we probably wouldn't make it out the door before the zombie apocalypse. Larger organizations (in my experience, 500+ person organizations) have the luxury of hiring people who either enjoy that, or play along nicely enough that our systems keep working. Small teams have no such luck. If you're interviewing at a small company, especially as a "data scientist" or other somesuch position, be aware that systems can and do spontaneously combust at the most inopportune moments. Terrible-but-popular answers include: It's a part of the job, and I'm happy to contribute. ","version":null,"tagName":"h2"},{"title":"Allocations in Rust: Compiler optimizations","type":0,"sectionRef":"#","url":"/2019/02/08/compiler-optimizations","content":"","keywords":"","version":null},{"title":"The Case of the Disappearing Box","type":1,"pageTitle":"Allocations in Rust: Compiler optimizations","url":"/2019/02/08/compiler-optimizations#the-case-of-the-disappearing-box","content":" Our first optimization comes when LLVM can reason that the lifetime of an object is sufficiently short that heap allocations aren't necessary. In these cases, LLVM will move the allocation to the stack instead! 
The way this interacts with #[inline] attributes is a bit opaque, but the important part is that LLVM can sometimes do better than the baseline Rust language: use std::alloc::{GlobalAlloc, Layout, System}; use std::sync::atomic::{AtomicBool, Ordering}; pub fn cmp(x: u32) { // Turn on panicking if we allocate on the heap DO_PANIC.store(true, Ordering::SeqCst); // The compiler is able to see through the constant `Box` // and directly compare `x` to 24 - assembly line 73 let y = Box::new(24); let equals = x == *y; // This call to drop is eliminated drop(y); // Need to mark the comparison result as volatile so that // LLVM doesn't strip out all the code. If `y` is marked // volatile instead, allocation will be forced. unsafe { std::ptr::read_volatile(&equals) }; // Turn off panicking, as there are some deallocations // when we exit main. DO_PANIC.store(false, Ordering::SeqCst); } fn main() { cmp(12) } #[global_allocator] static A: PanicAllocator = PanicAllocator; static DO_PANIC: AtomicBool = AtomicBool::new(false); struct PanicAllocator; unsafe impl GlobalAlloc for PanicAllocator { unsafe fn alloc(&self, layout: Layout) -> *mut u8 { if DO_PANIC.load(Ordering::SeqCst) { panic!("Unexpected allocation."); } System.alloc(layout) } unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) { if DO_PANIC.load(Ordering::SeqCst) { panic!("Unexpected deallocation."); } System.dealloc(ptr, layout); } } -- Compiler Explorer -- Rust Playground ","version":null,"tagName":"h2"},{"title":"Dr. Array or: how I learned to love the optimizer","type":1,"pageTitle":"Allocations in Rust: Compiler optimizations","url":"/2019/02/08/compiler-optimizations#dr-array-or-how-i-learned-to-love-the-optimizer","content":" Finally, this isn't so much about LLVM figuring out different memory behavior, but LLVM stripping out code that doesn't do anything. Optimizations of this type have a lot of nuance to them; if you're not careful, they can make your benchmarks lookimpossibly good. 
In Rust, the black_box function (implemented in both libtest and criterion) will tell the compiler to disable this kind of optimization. But if you let LLVM remove unnecessary code, you can end up running programs that previously caused errors: #[derive(Default)] struct TwoFiftySix { _a: [u64; 32] } #[derive(Default)] struct EightK { _a: [TwoFiftySix; 32] } #[derive(Default)] struct TwoFiftySixK { _a: [EightK; 32] } #[derive(Default)] struct EightM { _a: [TwoFiftySixK; 32] } pub fn main() { // Normally this blows up because we can't reserve size on stack // for the `EightM` struct. But because the compiler notices we // never do anything with `_x`, it optimizes out the stack storage // and the program completes successfully. let _x = EightM::default(); } -- Compiler Explorer -- Rust Playground ","version":null,"tagName":"h2"},{"title":"Allocations in Rust: Dynamic memory","type":0,"sectionRef":"#","url":"/2019/02/a-heaping-helping","content":"","keywords":"","version":null},{"title":"Smart pointers","type":1,"pageTitle":"Allocations in Rust: Dynamic memory","url":"/2019/02/a-heaping-helping#smart-pointers","content":" The first thing to note is the "smart pointer" types. When you have data that must outlive the scope in which it is declared, or your data is of unknown or dynamic size, you'll make use of these types. The term smart pointer comes from C++, and while it's closely linked to a general design pattern of "Resource Acquisition Is Initialization", we'll use it here specifically to describe objects that are responsible for managing ownership of data allocated on the heap. The smart pointers available in the alloc crate should look mostly familiar: Box, Rc, Arc, Cow. The standard library also defines some smart pointers to manage heap objects, though more than can be covered here. Some examples are: RwLock, Mutex. Finally, there is one "gotcha": cell types (like RefCell) look and behave similarly, but don't involve heap allocation. The core::cell docs have more information. 
When a smart pointer is created, the data it is given is placed in heap memory and the location of that data is recorded in the smart pointer. Once the smart pointer has determined it's safe to deallocate that memory (when a Box has gone out of scope or a reference count goes to zero), the heap space is reclaimed. We can prove these types use heap memory by looking at code: use std::rc::Rc; use std::sync::Arc; use std::borrow::Cow; pub fn my_box() { // Drop at assembly line 1640 Box::new(0); } pub fn my_rc() { // Drop at assembly line 1650 Rc::new(0); } pub fn my_arc() { // Drop at assembly line 1660 Arc::new(0); } pub fn my_cow() { // Drop at assembly line 1672 Cow::from("drop"); } -- Compiler Explorer ","version":null,"tagName":"h2"},{"title":"Collections","type":1,"pageTitle":"Allocations in Rust: Dynamic memory","url":"/2019/02/a-heaping-helping#collections","content":" Collection types use heap memory because their contents have dynamic size; they will request more memory when needed, and can release memory when it's no longer necessary. This dynamic property forces Rust to heap allocate everything they contain. In a way, collections are smart pointers for many objects at a time. Common types that fall under this umbrella are Vec, HashMap, and String (not str). While collections store the objects they own in heap memory, creating new collections will not allocate on the heap. 
This is a bit weird; if we call Vec::new(), the assembly shows a corresponding call to real_drop_in_place: pub fn my_vec() { // Drop in place at line 481 Vec::<u8>::new(); } -- Compiler Explorer But because the vector has no elements to manage, no calls to the allocator will ever be dispatched: use std::alloc::{GlobalAlloc, Layout, System}; use std::sync::atomic::{AtomicBool, Ordering}; fn main() { // Turn on panicking if we allocate on the heap DO_PANIC.store(true, Ordering::SeqCst); // Interesting bit happens here let x: Vec<u8> = Vec::new(); drop(x); // Turn panicking back off, some deallocations occur // after main as well. DO_PANIC.store(false, Ordering::SeqCst); } #[global_allocator] static A: PanicAllocator = PanicAllocator; static DO_PANIC: AtomicBool = AtomicBool::new(false); struct PanicAllocator; unsafe impl GlobalAlloc for PanicAllocator { unsafe fn alloc(&self, layout: Layout) -> *mut u8 { if DO_PANIC.load(Ordering::SeqCst) { panic!("Unexpected allocation."); } System.alloc(layout) } unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) { if DO_PANIC.load(Ordering::SeqCst) { panic!("Unexpected deallocation."); } System.dealloc(ptr, layout); } } --Rust Playground Other standard library types follow the same behavior; make sure to check outHashMap::new(), and String::new(). ","version":null,"tagName":"h2"},{"title":"Heap Alternatives","type":1,"pageTitle":"Allocations in Rust: Dynamic memory","url":"/2019/02/a-heaping-helping#heap-alternatives","content":" While it is a bit strange to speak of the stack after spending time with the heap, it's worth pointing out that some heap-allocated objects in Rust have stack-based counterparts provided by other crates. If you have need of the functionality, but want to avoid allocating, there are typically alternatives available. When it comes to some standard library smart pointers (RwLock andMutex), stack-based alternatives are provided in crates like parking_lot andspin. 
You can check out lock_api::RwLock, lock_api::Mutex, and spin::Once if you're in need of synchronization primitives. thread_id may be necessary if you're implementing an allocator because thread::current().id() uses a thread_local! structure that needs heap allocation. ","version":null,"tagName":"h2"},{"title":"Tracing Allocators","type":1,"pageTitle":"Allocations in Rust: Dynamic memory","url":"/2019/02/a-heaping-helping#tracing-allocators","content":" When writing performance-sensitive code, there's no alternative to measuring your code. If you didn't write a benchmark, you don't care about its performance. You should never rely on your instincts when a microsecond is an eternity. Similarly, there's great work going on in Rust with allocators that keep track of what they're doing (like alloc_counter). When it comes to tracking heap behavior, it's easy to make mistakes; please write tests and make sure you have tools to guard against future issues. ","version":null,"tagName":"h2"},{"title":"Allocations in Rust: Summary","type":0,"sectionRef":"#","url":"/2019/02/summary","content":"While there's a lot of interesting detail captured in this series, it's often helpful to have a document that answers some "yes/no" questions. You may not care about what an Iterator looks like in assembly; you just need to know whether it allocates an object on the heap or not. And while Rust will prioritize the fastest behavior it can, here are the rules for each memory type: Global Allocation: const is a fixed value; the compiler is allowed to copy it wherever useful. static is a fixed reference; the compiler will guarantee it is unique. 
Stack Allocation: Everything not using a smart pointer will be allocated on the stack. Structs, enums, iterators, arrays, and closures are all stack allocated. Cell types (RefCell) behave like smart pointers, but are stack-allocated. Inlining (#[inline]) will not affect allocation behavior for better or worse. Types that are marked Copy are guaranteed to have their contents stack-allocated. Heap Allocation: Smart pointers (Box, Rc, Mutex, etc.) allocate their contents in heap memory. Collections (HashMap, Vec, String, etc.) allocate their contents in heap memory. Some smart pointers in the standard library have counterparts in other crates that don't need heap memory. If possible, use those. -- Raph Levien","keywords":"","version":null},{"title":"Allocations in Rust: Foreword","type":0,"sectionRef":"#","url":"/2019/02/understanding-allocations-in-rust","content":"There's an alchemy of distilling complex technical topics into articles and videos that change the way programmers see the tools they interact with on a regular basis. I knew what a linker was, but there's a staggering amount of complexity in between the OS and main(). Rust programmers use the Box type all the time, but there's a rich history of the Rust language itself wrapped up in how special it is. In a similar vein, this series attempts to look at code and understand how memory is used; the complex choreography of operating system, compiler, and program that frees you to focus on functionality far-flung from frivolous book-keeping. The Rust compiler relieves a great deal of the cognitive burden associated with memory management, but we're going to step into its world for a while. Let's learn a bit about memory in Rust. Rust's three defining features of Performance, Reliability, and Productivity are all driven to a great degree by how the Rust compiler understands memory usage. 
Unlike managed memory languages (Java, Python), Rust doesn't really garbage collect; instead, it uses an ownership system to reason about how long objects will last in your program. In some cases, if the life of an object is fairly transient, Rust can make use of a very fast region called the "stack." When that's not possible, Rust uses dynamic (heap) memory and the ownership system to ensure you can't accidentally corrupt memory. It's not as fast, but it is important to have available. That said, there are specific situations in Rust where you'd never need to worry about the stack/heap distinction! If you: Never use unsafe; Never use #![feature(alloc)] or the alloc crate ...then it's not possible for you to use dynamic memory! For some uses of Rust, typically embedded devices, these constraints are OK. They have very limited memory, and the program binary size itself may significantly affect what's available! There's no operating system able to manage this "virtual memory" thing, but that's not an issue because there's only one running application. The embedonomicon is ever in mind, and interacting with the "real world" through extra peripherals is accomplished by reading and writing to specific memory addresses. Most Rust programs find these requirements overly burdensome though. C++ developers would struggle without access to std::vector (except those hardcore no-STL people), and Rust developers would struggle without std::vec. But with the constraints above, std::vec is actually a part of the alloc crate, and thus off-limits. Box, Rc, etc., are also unusable for the same reason. Whether writing code for embedded devices or not, the important thing in both situations is how much you know before your application starts about what its memory usage will look like. In embedded devices, there's a small, fixed amount of memory to use. In a browser, you have no idea how large google.com's home page is until you start trying to download it. 
The compiler uses this knowledge (or lack thereof) to optimize how memory is used; put simply, your code runs faster when the compiler can guarantee exactly how much memory your program needs while it's running. This series is all about understanding how the compiler reasons about your program, with an emphasis on the implications for performance. Now let's address some conditions and caveats before going much further: We'll focus on "safe" Rust only; unsafe lets you use platform-specific allocation APIs (malloc) that we'll ignore. We'll assume a "debug" build of Rust code (what you get with cargo run and cargo test) and address (pun intended) release mode at the end (cargo run --release and cargo test --release). All content will be run using Rust 1.32, as that's the highest currently supported in the Compiler Explorer. As such, we'll avoid upcoming innovations like compile-time evaluation of static that are available in nightly. Because of the nature of the content, being able to read assembly is helpful. We'll keep it simple, but I found a refresher on the push and pop instructions was helpful while writing this. I've tried to be precise in saying only what I can prove using the tools (ASM, docs) that are available, but if there's something said in error it will be corrected expeditiously. Please let me know at bradlee@speice.io Finally, I'll do what I can to flag potential future changes but the Rust docs have a notice worth repeating: Rust does not currently have a rigorously and formally defined memory model. -- the docs","keywords":"","version":null},{"title":"Making bread","type":0,"sectionRef":"#","url":"/2019/05/making-bread","content":"Having recently started my "gardening leave" between positions, I have some more personal time available. I'm planning to stay productive, contributing to some open-source projects, but it also occurred to me that despite talking about bread pics, this blog has been purely technical. 
Maybe I'll change the site title from "The Old Speice Guy" to "Bites and Bytes"? Either way, I'm baking a little bit again, and figured it was worth taking a quick break to focus on some lighter material. I recently learned two critically important lessons: first, the temperature of the dough when you put the yeast in makes a huge difference. Previously, when I wasn't paying attention to dough temperature: Compared with what happens when I put the dough in the microwave for a defrost cycle because the water I used wasn't warm enough: I mean, just look at the bubbles! After shaping the dough, I've got two loaves ready: Now, the recipe normally calls for a Dutch Oven to bake the bread because it keeps the dough from drying out in the oven. Because I don't own a Dutch Oven, I typically put a casserole dish on the bottom rack and fill it with water so there's still some moisture in the oven. This time, I forgot to add the water and learned my second lesson: never add room-temperature water to a glass dish that's currently at 500 degrees. Needless to say, trying to pull out sharp glass from an incredibly hot oven is not what I expected to be doing during my garden leave. In the end, the bread crust wasn't great, but the bread itself turned out pretty alright: I've been writing a lot more during this break, so I'm looking forward to sharing that in the future. 
In the meantime, I'm planning on making a sandwich.","keywords":"","version":null},{"title":"Binary format shootout","type":0,"sectionRef":"#","url":"/2019/09/binary-format-shootout","content":"","keywords":"","version":null},{"title":"Prologue: Binary Parsing with Nom","type":1,"pageTitle":"Binary format shootout","url":"/2019/09/binary-format-shootout#prologue-binary-parsing-with-nom","content":" Our benchmark system will be a simple data processor; given depth-of-book market data from IEX, serialize each message into the schema format, read it back, and calculate total size of stock traded and the lowest/highest quoted prices. This test isn't complex, but is representative of the project I need a binary format for. But before we make it to that point, we have to actually read in the market data. To do so, I'm using a library called nom. Version 5.0 was recently released and brought some big changes, so this was an opportunity to build a non-trivial program and get familiar. If you don't already know about nom, it's a "parser combinator" library. By combining different smaller parsers, you can assemble a parser to handle complex structures without writing tedious code by hand. 
For example, when parsingPCAP files: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +---------------------------------------------------------------+ 0 | Block Type = 0x00000006 | +---------------------------------------------------------------+ 4 | Block Total Length | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 8 | Interface ID | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 12 | Timestamp (High) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 16 | Timestamp (Low) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 20 | Captured Len | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 24 | Packet Len | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Packet Data | | ... | ...you can build a parser in nom that looks likethis: const ENHANCED_PACKET: [u8; 4] = [0x06, 0x00, 0x00, 0x00]; pub fn enhanced_packet_block(input: &[u8]) -> IResult<&[u8], &[u8]> { let ( remaining, ( block_type, block_len, interface_id, timestamp_high, timestamp_low, captured_len, packet_len, ), ) = tuple(( tag(ENHANCED_PACKET), le_u32, le_u32, le_u32, le_u32, le_u32, le_u32, ))(input)?; let (remaining, packet_data) = take(captured_len)(remaining)?; Ok((remaining, packet_data)) } While this example isn't too interesting, more complex formats (like IEX market data) are wherenom really shines. Ultimately, because the nom code in this shootout was the same for all formats, we're not too interested in its performance. Still, it's worth mentioning that building the market data parser was actually fun; I didn't have to write tons of boring code by hand. ","version":null,"tagName":"h2"},{"title":"Cap'n Proto","type":1,"pageTitle":"Binary format shootout","url":"/2019/09/binary-format-shootout#capn-proto","content":" Now it's time to get into the meaty part of the story. 
Cap'n Proto was the first format I tried because of how long it has supported Rust (thanks to dwrensha for maintaining the Rust port since 2014!). However, I had a ton of performance concerns once I started using it. To serialize new messages, Cap'n Proto uses a "builder" object. This builder allocates memory on the heap to hold the message content, but because builders can't be re-used, we have to allocate a new buffer for every single message. I was able to work around this with a special builder that could re-use the buffer, but it required reading through Cap'n Proto's benchmarks to find an example, and used std::mem::transmute to bypass Rust's borrow checker. The process of reading messages was better, but still had issues. Cap'n Proto has two message encodings: a "packed" representation, and an "unpacked" version. When reading "packed" messages, we need a buffer to unpack the message into before we can use it; Cap'n Proto allocates a new buffer for each message we unpack, and I wasn't able to figure out a way around that. In contrast, the unpacked message format should be where Cap'n Proto shines; its main selling point is that there's no decoding step. However, accomplishing zero-copy deserialization required code in the private API (since fixed), and we allocate a vector on every read for the segment table. In the end, I put in significant work to make Cap'n Proto as fast as possible, but there were too many issues for me to feel comfortable using it long-term. ","version":null,"tagName":"h2"},{"title":"Flatbuffers","type":1,"pageTitle":"Binary format shootout","url":"/2019/09/binary-format-shootout#flatbuffers","content":" This is the new kid on the block. After a first attempt didn't pan out, official support was recently launched. Flatbuffers intends to address the same problems as Cap'n Proto: high-performance, polyglot, binary messaging. The difference is that Flatbuffers claims to have a simpler wire format and more flexibility. 
On the whole, I enjoyed using Flatbuffers; the tooling is nice, and unlike Cap'n Proto, parsing messages was actually zero-copy and zero-allocation. However, there were still some issues. First, Flatbuffers (at least in Rust) can't handle nested vectors. This is a problem for formats like the following: table Message { symbol: string; } table MultiMessage { messages:[Message]; } We want to create a MultiMessage which contains a vector of Message, and each Message itself contains a vector (the string type). I was able to work around this by caching Message elements in a SmallVec before building the final MultiMessage, but it was a painful process that I believe contributed to poor serialization performance. Second, streaming support in Flatbuffers seems to be something of an afterthought. Where Cap'n Proto in Rust handles reading messages from a stream as part of the API, Flatbuffers just sticks a u32 at the front of each message to indicate the size. Not specifically a problem, but calculating message size without that tag is nigh on impossible. Ultimately, I enjoyed using Flatbuffers, and had to do significantly less work to make it perform well. ","version":null,"tagName":"h2"},{"title":"Simple Binary Encoding","type":1,"pageTitle":"Binary format shootout","url":"/2019/09/binary-format-shootout#simple-binary-encoding","content":" Support for SBE was added by the author of one of my favorite Rust blog posts. I've talked previously about how important variance is in high-performance systems, so it was encouraging to read about a format that directly addressed my concerns. SBE has by far the simplest binary format, but it does make some tradeoffs. Both Cap'n Proto and Flatbuffers use message offsets to handle variable-length data, unions, and various other features. In contrast, messages in SBE are essentially just structs; variable-length data is supported, but there's no union type. 
As mentioned in the beginning, the Rust port of SBE works well, but is essentially unmaintained. However, if you don't need union types, and can accept that schemas are XML documents, it's still worth using. SBE's implementation had the best streaming support of all formats I tested, and doesn't trigger allocation during de/serialization. ","version":null,"tagName":"h2"},{"title":"Results","type":1,"pageTitle":"Binary format shootout","url":"/2019/09/binary-format-shootout#results","content":" After building a test harness for each format, it was time to actually take them for a spin. I used this script to run the benchmarks, and the raw results are here. All data reported below is the average of 10 runs on a single day of IEX data. Results were validated to make sure that each format parsed the data correctly. ","version":null,"tagName":"h2"},{"title":"Serialization","type":1,"pageTitle":"Binary format shootout","url":"/2019/09/binary-format-shootout#serialization","content":" This test measures, on a per-message basis, how long it takes to serialize the IEX message into the desired format and write to a pre-allocated buffer. Schema\tMedian\t99th Pctl\t99.9th Pctl\tTotal Cap'n Proto Packed\t413ns\t1751ns\t2943ns\t14.80s Cap'n Proto Unpacked\t273ns\t1828ns\t2836ns\t10.65s Flatbuffers\t355ns\t2185ns\t3497ns\t14.31s SBE\t91ns\t1535ns\t2423ns\t3.91s ","version":null,"tagName":"h3"},{"title":"Deserialization","type":1,"pageTitle":"Binary format shootout","url":"/2019/09/binary-format-shootout#deserialization","content":" This test measures, on a per-message basis, how long it takes to read the previously-serialized message and perform some basic aggregation. The aggregation code is the same for each format, so any performance differences are due solely to the format implementation. 
Schema\tMedian\t99th Pctl\t99.9th Pctl\tTotal Cap'n Proto Packed\t539ns\t1216ns\t2599ns\t18.92s Cap'n Proto Unpacked\t366ns\t737ns\t1583ns\t12.32s Flatbuffers\t173ns\t421ns\t1007ns\t6.00s SBE\t116ns\t286ns\t659ns\t4.05s ","version":null,"tagName":"h3"},{"title":"Conclusion","type":1,"pageTitle":"Binary format shootout","url":"/2019/09/binary-format-shootout#conclusion","content":" Building a benchmark turned out to be incredibly helpful in making a decision; because a "union" type isn't important to me, I can be confident that SBE best addresses my needs. While SBE was the fastest in terms of both median and worst-case performance, its worst-case performance was proportionately far higher than any other format. It seems that de/serialization time scales with message size, but I'll need to do some more research to understand what exactly is going on. ","version":null,"tagName":"h2"},{"title":"On building high performance systems","type":0,"sectionRef":"#","url":"/2019/06/high-performance-systems","content":"","keywords":"","version":null},{"title":"Language-specific","type":1,"pageTitle":"On building high performance systems","url":"/2019/06/high-performance-systems#language-specific","content":" Garbage Collection: How often does garbage collection happen? When is it triggered? What are the impacts? In Python, individual objects are collected if the reference count reaches 0, and each generation is collected if num_alloc - num_dealloc > gc_threshold whenever an allocation happens. The GIL is acquired for the duration of generational collection. Java has many different collection algorithms to choose from, each with different characteristics. The default algorithms (Parallel GC in Java 8, G1 in Java 9) freeze the JVM while collecting, while more recent algorithms (ZGC and Shenandoah) are designed to keep "stop the world" to a minimum by doing collection work in parallel. 
Allocation: Every language has a different way of interacting with "heap" memory, but the principle is the same: running the allocator to allocate/deallocate memory takes time that can often be put to better use. Understanding when your language interacts with the allocator is crucial, and not always obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java does (meaning potential GC pauses). Take time to understand heap behavior (I made a guide for Rust), and look into alternative allocators (jemalloc, tcmalloc) that might run faster than the operating system default. Data Layout: How your data is arranged in memory matters; data-oriented design and cache locality can have huge impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't make. Cachegrind and kernel perf counters are both great for understanding how performance relates to memory layout. Just-In-Time Compilation: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are great because they optimize your program for how it's actually being used, rather than how a compiler expects it to be used. However, there's a variance problem if the program stops executing while waiting for translation from VM bytecode to native code. As a remedy, many languages support ahead-of-time compilation in addition to the JIT versions (CoreRT in C# and GraalVM in Java). On the other hand, LLVM supports Profile Guided Optimization, which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT compiler kicked in. Programming Tricks: These won't make or break performance, but can be useful in specific circumstances. For example, C++ can use templates instead of branches in critical sections. 
","version":null,"tagName":"h2"},{"title":"Kernel","type":1,"pageTitle":"On building high performance systems","url":"/2019/06/high-performance-systems#kernel","content":" Code you wrote is almost certainly not the only code running on your hardware. There are many ways the operating system interacts with your program, from interrupts to system calls, that are important to watch for. These are written from a Linux perspective, but Windows does typically have equivalent functionality. Scheduling: The kernel is normally free to schedule any process on any core, so it's important to reserve CPU cores exclusively for the important programs. There are a few parts to this: first, limit the CPU cores that non-critical processes are allowed to run on by excluding cores from scheduling (isolcpus kernel command-line option), or by setting the init process CPU affinity (systemd example). Second, set critical processes to run on the isolated cores by setting the processor affinity using taskset. Finally, use NO_HZ or chrt to disable scheduling interrupts. Turning off hyper-threading is also likely beneficial. System calls: Reading from a UNIX socket? Writing to a file? In addition to not knowing how long the I/O operation takes, these all trigger expensive system calls (syscalls). To handle these, the CPU must context switch to the kernel, let the kernel operation complete, then context switch back to your program. We'd rather keep these to a minimum (see timestamp 18:20). Strace is your friend for understanding when and where syscalls happen. Signal Handling: Far less likely to be an issue, but signals do trigger a context switch if your code has a handler registered. This will be highly dependent on the application, but you can block signals if it's an issue. Interrupts: System interrupts are how devices connected to your computer notify the CPU that something has happened. The CPU will then choose a processor core to pause and context switch to the OS to handle the interrupt. 
Make sure that SMP affinity is set so that interrupts are handled on a CPU core not running the program you care about. NUMA: While NUMA is good at making multi-cell systems transparent, there are variance implications; if the kernel moves a process across nodes, future memory accesses must wait for the controller on the original node. Use numactl to handle memory-/cpu-cell pinning so this doesn't happen. ","version":null,"tagName":"h2"},{"title":"Hardware","type":1,"pageTitle":"On building high performance systems","url":"/2019/06/high-performance-systems#hardware","content":" CPU Pipelining/Speculation: Speculative execution in modern processors gave us vulnerabilities like Spectre, but it also gave us performance improvements like branch prediction. And if the CPU mis-speculates your code, there's variance associated with rewind and replay. While the compiler knows a lot about how your CPU pipelines instructions, code can be structured to help the branch predictor. Paging: For most systems, virtual memory is incredible. Applications live in their own worlds, and the CPU/MMU figures out the details. However, there's a variance penalty associated with memory paging and caching; if you access more memory pages than the TLB can store, you'll have to wait for the page walk. Kernel perf tools are necessary to figure out if this is an issue, but using huge pages can reduce TLB burdens. Alternately, running applications in a hypervisor like Jailhouse allows one to skip virtual memory entirely, but this is probably more work than the benefits are worth. Network Interfaces: When more than one computer is involved, variance can go up dramatically. Tuning kernel network parameters may be helpful, but modern systems more frequently opt to skip the kernel altogether with a technique called kernel bypass. This typically requires specialized hardware and drivers, but even industries like telecom are finding the benefits. 
","version":null,"tagName":"h2"},{"title":"Networks","type":1,"pageTitle":"On building high performance systems","url":"/2019/06/high-performance-systems#networks","content":" Routing: There's a reason financial firms are willing to pay millions of euros for rights to a small plot of land - having a straight-line connection from point A to point B means the path their data takes is the shortest possible. In contrast, there are currently 6 computers in between me and Google, but that may change at any moment if my ISP realizes a more efficient route is available. Whether it's using research-quality equipment for shortwave radio, or just making sure there's no data inadvertently going between data centers, routing matters. Protocol: TCP as a network protocol is awesome: guaranteed and in-order delivery, flow control, and congestion control all built in. But these attributes make the most sense when networking infrastructure is lossy; for systems that expect nearly all packets to be delivered correctly, the setup handshaking and packet acknowledgment are just overhead. Using UDP (unicast or multicast) may make sense in these contexts as it avoids the chatter needed to track connection state, and gap-fill strategies can handle the rest. Switching: Many routers/switches handle packets using "store-and-forward" behavior: wait for the whole packet, validate checksums, and then send to the next device. In variance terms, the time needed to move data between two nodes is proportional to the size of that data; the switch must "store" all data before it can calculate checksums and "forward" to the next node. With "cut-through" designs, switches will begin forwarding data as soon as they know where the destination is, checksums be damned. This means there's a fixed cost (at the switch) for network traffic, no matter the size. 
","version":null,"tagName":"h2"},{"title":"Final Thoughts","type":1,"pageTitle":"On building high performance systems","url":"/2019/06/high-performance-systems#final-thoughts","content":" High-performance systems, regardless of industry, are not magical. They do require extreme precision and attention to detail, but they're designed, built, and operated by regular people, using a lot of tools that are publicly available. Interested in seeing how context switching affects performance of your benchmarks? taskset should be installed in all modern Linux distributions, and can be used to make sure the OS never migrates your process. Curious how often garbage collection triggers during a crucial operation? Your language of choice will typically expose details of its operations (Python, Java). Want to know how hard your program is stressing the TLB? Use perf record and look for dtlb_load_misses.miss_causes_a_walk. Two final guiding questions, then: first, before attempting to apply some of the technology above to your own systems, can you first identify where/when you care about "high-performance"? As an example, if parts of a system rely on humans pushing buttons, CPU pinning won't have any measurable effect. Humans are already far too slow to react in time. Second, if you're using benchmarks, are they being designed in a way that's actually helpful? Tools like Criterion (also in Rust) and Google's Benchmark output not only average run time, but variance as well; your benchmarking environment is subject to the same concerns your production environment is. Finally, I believe high-performance systems are a matter of philosophy, not necessarily technique. Rigorous focus on variance is the first step, and there are plenty of ways to measure and mitigate it; once that's at an acceptable level, then optimize for speed. 
","version":null,"tagName":"h2"},{"title":"Allocations in Rust: Global memory","type":0,"sectionRef":"#","url":"/2019/02/the-whole-world","content":"","keywords":"","version":null},{"title":"const values","type":1,"pageTitle":"Allocations in Rust: Global memory","url":"/2019/02/the-whole-world#const-values","content":" When a value is guaranteed to be unchanging in your program (where "value" may be scalars, structs, etc.), you can declare it const. This tells the compiler that it's safe to treat the value as never changing, and enables some interesting optimizations; not only is there no initialization cost to creating the value (it is loaded at the same time as the executable parts of your program), but the compiler can also copy the value around if it speeds up the code. The points we need to address when talking about const are: Const values are stored in read-only memory - it's impossible to modify. Values resulting from calling a const fn are materialized at compile-time. The compiler may (or may not) copy const values wherever it chooses. ","version":null,"tagName":"h2"},{"title":"Read-Only","type":1,"pageTitle":"Allocations in Rust: Global memory","url":"/2019/02/the-whole-world#read-only","content":" The first point is a bit strange - "read-only memory." The Rust book mentions in a couple places that using mut with constants is illegal, but it's also important to demonstrate just how immutable they are. Typically in Rust you can use interior mutability to modify things that aren't declared mut. RefCell provides an example of this pattern in action: use std::cell::RefCell; fn my_mutator(cell: &RefCell<u8>) { // Even though we're given an immutable reference, // the `replace` method allows us to modify the inner value. 
cell.replace(14); } fn main() { let cell = RefCell::new(25); // Prints out 25 println!("Cell: {:?}", cell); my_mutator(&cell); // Prints out 14 println!("Cell: {:?}", cell); } --Rust Playground When const is involved though, interior mutability is impossible: use std::cell::RefCell; const CELL: RefCell<u8> = RefCell::new(25); fn my_mutator(cell: &RefCell<u8>) { cell.replace(14); } fn main() { // First line prints 25 as expected println!("Cell: {:?}", &CELL); my_mutator(&CELL); // Second line *still* prints 25 println!("Cell: {:?}", &CELL); } --Rust Playground And a second example using Once: use std::sync::Once; const SURPRISE: Once = Once::new(); fn main() { // This is how `Once` is supposed to be used SURPRISE.call_once(|| println!("Initializing...")); // Because `Once` is a `const` value, we never record it // having been initialized the first time, and this closure // will also execute. SURPRISE.call_once(|| println!("Initializing again???")); } --Rust Playground When the const specification refers to "rvalues", this behavior is what they refer to. Clippy will treat this as an error, but it's still something to be aware of. ","version":null,"tagName":"h3"},{"title":"Initialization","type":1,"pageTitle":"Allocations in Rust: Global memory","url":"/2019/02/the-whole-world#initialization","content":" The next thing to mention is that const values are loaded into memory as part of your program binary. Because of this, any const values declared in your program will be "realized" at compile-time; accessing them may trigger a main-memory lookup (with a fixed address, so your CPU may be able to prefetch the value), but that's it. use std::cell::RefCell; const CELL: RefCell<u32> = RefCell::new(24); pub fn multiply(value: u32) -> u32 { // CELL is stored at `.L__unnamed_1` value * (*CELL.get_mut()) } -- Compiler Explorer The compiler creates one RefCell, uses it everywhere, and never needs to call the RefCell::new function. 
","version":null,"tagName":"h3"},{"title":"Copying","type":1,"pageTitle":"Allocations in Rust: Global memory","url":"/2019/02/the-whole-world#copying","content":" If it's helpful though, the compiler can choose to copy const values. const FACTOR: u32 = 1000; pub fn multiply(value: u32) -> u32 { // See assembly line 4 for the `mov edi, 1000` instruction value * FACTOR } pub fn multiply_twice(value: u32) -> u32 { // See assembly lines 22 and 29 for `mov edi, 1000` instructions value * FACTOR * FACTOR } -- Compiler Explorer In this example, the FACTOR value is turned into the mov edi, 1000 instruction in both the multiply and multiply_twice functions; the "1000" value is never "stored" anywhere, as it's small enough to inline into the assembly instructions. Finally, getting the address of a const value is possible, but not guaranteed to be unique (because the compiler can choose to copy values). I was unable to get non-unique pointers in my testing (even using different crates), but the specifications are clear enough: don't rely on pointers to const values being consistent. To be frank, caring about locations for const values is almost certainly a code smell. ","version":null,"tagName":"h3"},{"title":"static values","type":1,"pageTitle":"Allocations in Rust: Global memory","url":"/2019/02/the-whole-world#static-values","content":" Static variables are related to const variables, but take a slightly different approach. When we declare that a reference is unique for the life of a program, you have a static variable (unrelated to the 'static lifetime). Because of the reference/value distinction with const/static, static variables behave much more like typical "global" variables. 
But to understand static, here's what we'll look at: static variables are globally unique locations in memory. Like const, static variables are loaded at the same time as your program being read into memory. All static variables must implement the Sync marker trait. Interior mutability is safe and acceptable when using static variables. ","version":null,"tagName":"h2"},{"title":"Memory Uniqueness","type":1,"pageTitle":"Allocations in Rust: Global memory","url":"/2019/02/the-whole-world#memory-uniqueness","content":" The single biggest difference between const and static is the guarantees provided about uniqueness. Where const variables may or may not be copied in code, static variables are guaranteed to be unique. If we take a previous const example and change it to static, the difference should be clear: static FACTOR: u32 = 1000; pub fn multiply(value: u32) -> u32 { // The assembly to `mul dword ptr [rip + example::FACTOR]` is how FACTOR gets used value * FACTOR } pub fn multiply_twice(value: u32) -> u32 { // The assembly to `mul dword ptr [rip + example::FACTOR]` is how FACTOR gets used value * FACTOR * FACTOR } -- Compiler Explorer Where previously there were plenty of references to multiplying by 1000, the new assembly refers to FACTOR as a named memory location instead. No initialization work needs to be done, but the compiler can no longer prove the value never changes during execution. ","version":null,"tagName":"h3"},{"title":"Initialization","type":1,"pageTitle":"Allocations in Rust: Global memory","url":"/2019/02/the-whole-world#initialization-1","content":" Next, let's talk about initialization. 
The simplest case is initializing static variables with either scalar or struct notation: #[derive(Debug)] struct MyStruct { x: u32 } static MY_STRUCT: MyStruct = MyStruct { // You can even reference other statics // declared later x: MY_VAL }; static MY_VAL: u32 = 24; fn main() { println!("Static MyStruct: {:?}", MY_STRUCT); } --Rust Playground Things can get a bit weirder when using const fn though. In most cases, it just works: #[derive(Debug)] struct MyStruct { x: u32 } impl MyStruct { const fn new() -> MyStruct { MyStruct { x: 24 } } } static MY_STRUCT: MyStruct = MyStruct::new(); fn main() { println!("const fn Static MyStruct: {:?}", MY_STRUCT); } --Rust Playground However, there's a caveat: you're currently not allowed to use const fn to initialize static variables of types that aren't marked Sync. For example, RefCell::new() is a const fn, but because RefCell isn't Sync, you'll get an error at compile time: use std::cell::RefCell; // error[E0277]: `std::cell::RefCell<u8>` cannot be shared between threads safely static MY_LOCK: RefCell<u8> = RefCell::new(0); --Rust Playground It's likely that this will change in the future though. ","version":null,"tagName":"h3"},{"title":"The Sync marker","type":1,"pageTitle":"Allocations in Rust: Global memory","url":"/2019/02/the-whole-world#the-sync-marker","content":" Which leads well to the next point: static variable types must implement the Sync marker. Because they're globally unique, it must be safe for you to access static variables from any thread at any time. Most struct definitions automatically implement the Sync trait because they contain only elements which themselves implement Sync (read more in the Nomicon). This is why earlier examples could get away with initializing statics, even though we never included an impl Sync for MyStruct in the code. 
To demonstrate this property, Rust refuses to compile our earlier example if we add a non-Sync element to the struct definition: use std::cell::RefCell; struct MyStruct { x: u32, y: RefCell<u8>, } // error[E0277]: `std::cell::RefCell<u8>` cannot be shared between threads safely static MY_STRUCT: MyStruct = MyStruct { x: 8, y: RefCell::new(8) }; --Rust Playground ","version":null,"tagName":"h3"},{"title":"Interior mutability","type":1,"pageTitle":"Allocations in Rust: Global memory","url":"/2019/02/the-whole-world#interior-mutability","content":" Finally, while static mut variables are allowed, mutating them is an unsafe operation. If we want to stay in safe Rust, we can use interior mutability to accomplish similar goals: use std::sync::Once; // This example adapted from https://doc.rust-lang.org/std/sync/struct.Once.html#method.call_once static INIT: Once = Once::new(); fn main() { // Note that while `INIT` is declared immutable, we're still allowed // to mutate its interior INIT.call_once(|| println!("Initializing...")); // This code won't panic, as the interior of INIT was modified // as part of the previous `call_once` INIT.call_once(|| panic!("INIT was called twice!")); } --Rust Playground ","version":null,"tagName":"h3"},{"title":"Release the GIL","type":0,"sectionRef":"#","url":"/2019/12/release-the-gil","content":"","keywords":"","version":null},{"title":"Cython","type":1,"pageTitle":"Release the GIL","url":"/2019/12/release-the-gil#cython","content":" Put simply, Cython is a programming language that looks a lot like Python, gets transpiled to C/C++, and integrates well with the CPython API. It's great for building Python wrappers to C and C++ libraries, writing optimized code for numerical processing, and tons more. 
And when it comes to managing the GIL, there are two special features: The nogil function annotation asserts that a Cython function is safe to use without the GIL, and compilation will fail if it interacts with Python in an unsafe manner. The with nogil context manager explicitly unlocks the CPython GIL while active. Whenever Cython code runs inside a with nogil block on a separate thread, the Python interpreter is unblocked and allowed to continue work elsewhere. We'll define a "busy work" function that demonstrates this principle in action: %%cython # Annotating a function with `nogil` indicates only that it is safe # to call in a `with nogil` block. It *does not* release the GIL. cdef unsigned long fibonacci(unsigned long n) nogil: if n <= 1: return n cdef unsigned long a = 0, b = 1, c = 0 c = a + b for _i in range(2, n): a = b b = c c = a + b return c def cython_nogil(unsigned long n): # Explicitly release the GIL while running `fibonacci` with nogil: value = fibonacci(n) return value def cython_gil(unsigned long n): # Because the GIL is not explicitly released, it implicitly # remains acquired when running the `fibonacci` function return fibonacci(n) First, let's time how long it takes Cython to calculate the billionth Fibonacci number: %%time _ = cython_gil(N); CPU times: user 365 ms, sys: 0 ns, total: 365 ms Wall time: 372 ms %%time _ = cython_nogil(N); CPU times: user 381 ms, sys: 0 ns, total: 381 ms Wall time: 388 ms Both versions (with and without GIL) take effectively the same amount of time to run. 
Even when running this calculation in parallel on separate threads, it is expected that the run time will double because only one thread can be active at a time: %%time from threading import Thread # Create the two threads to run on t1 = Thread(target=cython_gil, args=[N]) t2 = Thread(target=cython_gil, args=[N]) # Start the threads t1.start(); t2.start() # Wait for the threads to finish t1.join(); t2.join() CPU times: user 641 ms, sys: 5.62 ms, total: 647 ms Wall time: 645 ms However, if the first thread releases the GIL, the second thread is free to acquire it and run in parallel: %%time t1 = Thread(target=cython_nogil, args=[N]) t2 = Thread(target=cython_gil, args=[N]) t1.start(); t2.start() t1.join(); t2.join() CPU times: user 717 ms, sys: 372 µs, total: 718 ms Wall time: 358 ms Because user time represents the sum of processing time on all threads, it doesn't change much. The "wall time" has been cut roughly in half because each function is running simultaneously. Keep in mind that the order in which threads are started makes a difference! %%time # Note that the GIL-locked version is started first t1 = Thread(target=cython_gil, args=[N]) t2 = Thread(target=cython_nogil, args=[N]) t1.start(); t2.start() t1.join(); t2.join() CPU times: user 667 ms, sys: 0 ns, total: 667 ms Wall time: 672 ms Even though the second thread releases the GIL while running, it can't start until the first has completed. Thus, the overall runtime is effectively the same as running two GIL-locked threads. 
Finally, be aware that attempting to unlock the GIL from a thread that doesn't own it will crash the interpreter, not just the thread attempting the unlock: %%cython cdef int cython_recurse(int n) nogil: if n <= 0: return 0 with nogil: return cython_recurse(n - 1) cython_recurse(2) Fatal Python error: PyEval_SaveThread: NULL tstate Thread 0x00007f499effd700 (most recent call first): File "/home/bspeice/.virtualenvs/release-the-gil/lib/python3.7/site-packages/ipykernel/parentpoller.py", line 39 in run File "/usr/lib/python3.7/threading.py", line 926 in _bootstrap_inner File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap In practice, avoiding this issue is simple. First, nogil functions probably shouldn't contain with nogil blocks. Second, Cython can conditionally acquire/release the GIL, so these conditions can be used to synchronize access. Finally, Cython's documentation for external C code contains more detail on how to safely manage the GIL. To conclude: use Cython's nogil annotation to assert that functions are safe for calling when the GIL is unlocked, and with nogil to actually unlock the GIL and run those functions. ","version":null,"tagName":"h2"},{"title":"Numba","type":1,"pageTitle":"Release the GIL","url":"/2019/12/release-the-gil#numba","content":" Like Cython, Numba is a "compiled Python." Where Cython works by compiling a Python-like language to C/C++, Numba compiles Python bytecode directly to machine code at runtime. Behavior is controlled with a special @jit decorator; calling a decorated function first compiles it to machine code before running. Calling the function a second time re-uses that machine code unless the argument types have changed. Numba works best when a nopython=True argument is added to the @jit decorator; functions compiled in nopython mode avoid the CPython API and have performance comparable to C. Further, adding nogil=True to the @jit decorator unlocks the GIL while that function is running. 
Note that nogil and nopython are separate arguments; while it is necessary for code to be compiled in nopython mode in order to release the lock, the GIL will remain locked if nogil=False (the default). Let's repeat the same experiment, this time using Numba instead of Cython: # The `int` type annotation is only for humans and is ignored # by Numba. @jit(nopython=True, nogil=True) def numba_nogil(n: int) -> int: if n <= 1: return n a = 0 b = 1 c = a + b for _i in range(2, n): a = b b = c c = a + b return c # Run using `nopython` mode to receive a performance boost, # but GIL remains locked due to `nogil=False` by default. @jit(nopython=True) def numba_gil(n: int) -> int: if n <= 1: return n a = 0 b = 1 c = a + b for _i in range(2, n): a = b b = c c = a + b return c # Call each function once to force compilation; we don't want # the timing statistics to include how long it takes to compile. numba_nogil(N) numba_gil(N); We'll perform the same tests as above; first, figure out how long it takes the function to run: %%time _ = numba_gil(N) CPU times: user 253 ms, sys: 258 µs, total: 253 ms Wall time: 251 ms Aside: it's not immediately clear why Numba takes ~20% less time to run than Cython for code that should be effectively identical after compilation. 
When running two GIL-locked threads, the result (as expected) takes around twice as long to compute: %%time t1 = Thread(target=numba_gil, args=[N]) t2 = Thread(target=numba_gil, args=[N]) t1.start(); t2.start() t1.join(); t2.join() CPU times: user 541 ms, sys: 3.96 ms, total: 545 ms Wall time: 541 ms But if the GIL-unlocking thread starts first, both threads run in parallel: %%time t1 = Thread(target=numba_nogil, args=[N]) t2 = Thread(target=numba_gil, args=[N]) t1.start(); t2.start() t1.join(); t2.join() CPU times: user 551 ms, sys: 7.77 ms, total: 559 ms Wall time: 279 ms Just like Cython, starting the GIL-locked thread first leads to poor performance: %%time t1 = Thread(target=numba_gil, args=[N]) t2 = Thread(target=numba_nogil, args=[N]) t1.start(); t2.start() t1.join(); t2.join() CPU times: user 524 ms, sys: 0 ns, total: 524 ms Wall time: 522 ms Finally, unlike Cython, Numba will unlock the GIL if and only if it is currently acquired; recursively calling @jit(nogil=True) functions is perfectly safe: from numba import jit @jit(nopython=True, nogil=True) def numba_recurse(n: int) -> int: if n <= 0: return 0 return numba_recurse(n - 1) numba_recurse(2); ","version":null,"tagName":"h2"},{"title":"Conclusion","type":1,"pageTitle":"Release the GIL","url":"/2019/12/release-the-gil#conclusion","content":" Before finishing, it's important to address pain points that will show up if these techniques are used in a more realistic project: First, code running in a GIL-free context will likely also need non-trivial data structures; GIL-free functions aren't useful if they're constantly interacting with Python objects whose access requires the GIL. Cython provides extension types and Numba provides a @jitclass decorator to address this need. Second, building and distributing applications that make use of Cython/Numba can be complicated. Cython packages require running the compiler, (potentially) linking/packaging external dependencies, and distributing a binary wheel. 
Numba is generally simpler because the code being distributed is pure Python, but can be tricky since errors aren't detected until runtime. Finally, while unlocking the GIL is often a solution in search of a problem, both Cython and Numba provide tools to directly manage the GIL when appropriate. This enables true parallelism (not just concurrency) that is impossible in vanilla Python. ","version":null,"tagName":"h2"},{"title":"Allocations in Rust: Fixed memory","type":0,"sectionRef":"#","url":"/2019/02/stacking-up","content":"","keywords":"","version":null},{"title":"Structs","type":1,"pageTitle":"Allocations in Rust: Fixed memory","url":"/2019/02/stacking-up#structs","content":" The simplest case comes first. When creating vanilla struct objects, we use stack memory to hold their contents: struct Point { x: u64, y: u64, } struct Line { a: Point, b: Point, } pub fn make_line() { // `origin` is stored in the first 16 bytes of memory // starting at location `rsp` let origin = Point { x: 0, y: 0 }; // `point` makes up the next 16 bytes of memory let point = Point { x: 1, y: 2 }; // When creating `ray`, we just move the content out of // `origin` and `point` into the next 32 bytes of memory let ray = Line { a: origin, b: point }; } -- Compiler Explorer Note that while some extra-fancy instructions are used for memory manipulation in the assembly, the sub rsp, 64 instruction indicates we're still working with the stack. ","version":null,"tagName":"h2"},{"title":"Function arguments","type":1,"pageTitle":"Allocations in Rust: Fixed memory","url":"/2019/02/stacking-up#function-arguments","content":" Have you ever wondered how functions communicate with each other? Like, once the variables are given to you, everything's fine. But how do you "give" those variables to another function? How do you get the results back afterward? The answer: the compiler arranges memory and assembly instructions using a pre-determined calling convention. 
This convention governs the rules around where arguments needed by a function will be located (either in memory offsets relative to the stack pointer rsp, or in other registers), and where the results can be found once the function has finished. And when multiple languages agree on what the calling conventions are, you can do things like having Go call Rust code! Put simply: it's the compiler's job to figure out how to call other functions, and you can assume that the compiler is good at its job. We can see this in action using a simple example: struct Point { x: i64, y: i64, } // We use integer division operations to keep // the assembly clean, understanding the result // isn't accurate. fn distance(a: &Point, b: &Point) -> i64 { // Immediately subtract from `rsp` the bytes needed // to hold all the intermediate results - this is // the stack allocation step // The compiler used the `rdi` and `rsi` registers // to pass our arguments, so read them in let x1 = a.x; let x2 = b.x; let y1 = a.y; let y2 = b.y; // Do the actual math work let x_pow = (x1 - x2) * (x1 - x2); let y_pow = (y1 - y2) * (y1 - y2); let squared = x_pow + y_pow; squared / squared // Our final result will be stored in the `rax` register // so that our caller knows where to retrieve it. // Finally, add back to `rsp` the stack memory that is // now ready to be used by other functions. } pub fn total_distance() { let start = Point { x: 1, y: 2 }; let middle = Point { x: 3, y: 4 }; let end = Point { x: 5, y: 6 }; let _dist_1 = distance(&start, &middle); let _dist_2 = distance(&middle, &end); } -- Compiler Explorer As a consequence of function arguments never using heap memory, we can also infer that functions using the #[inline] attributes also do not heap allocate. 
But better than inferring, we can look at the assembly to prove it: struct Point { x: i64, y: i64, } // Note that there is no `distance` function in the assembly output, // and the total line count goes from 229 with inlining off // to 306 with inline on. Even still, no heap allocations occur. #[inline(always)] fn distance(a: &Point, b: &Point) -> i64 { let x1 = a.x; let x2 = b.x; let y1 = a.y; let y2 = b.y; let x_pow = (a.x - b.x) * (a.x - b.x); let y_pow = (a.y - b.y) * (a.y - b.y); let squared = x_pow + y_pow; squared / squared } pub fn total_distance() { let start = Point { x: 1, y: 2 }; let middle = Point { x: 3, y: 4 }; let end = Point { x: 5, y: 6 }; let _dist_1 = distance(&start, &middle); let _dist_2 = distance(&middle, &end); } -- Compiler Explorer Finally, passing by value (arguments with type Copy) and passing by reference (either moving ownership or passing a pointer) may have slightly different layouts in assembly, but will still use either stack memory or CPU registers: pub struct Point { x: i64, y: i64, } // Moving values pub fn distance_moved(a: Point, b: Point) -> i64 { let x1 = a.x; let x2 = b.x; let y1 = a.y; let y2 = b.y; let x_pow = (x1 - x2) * (x1 - x2); let y_pow = (y1 - y2) * (y1 - y2); let squared = x_pow + y_pow; squared / squared } // Borrowing values has two extra `mov` instructions on lines 21 and 22 pub fn distance_borrowed(a: &Point, b: &Point) -> i64 { let x1 = a.x; let x2 = b.x; let y1 = a.y; let y2 = b.y; let x_pow = (x1 - x2) * (x1 - x2); let y_pow = (y1 - y2) * (y1 - y2); let squared = x_pow + y_pow; squared / squared } -- Compiler Explorer ","version":null,"tagName":"h2"},{"title":"Enums","type":1,"pageTitle":"Allocations in Rust: Fixed memory","url":"/2019/02/stacking-up#enums","content":" If you've ever worried that wrapping your types in Option or Result would finally make them large enough that Rust decides to use heap allocation instead, fear no longer: enum and union types don't use heap allocation: enum MyEnum { Small(u8), 
Large(u64) } struct MyStruct { x: MyEnum, y: MyEnum, } pub fn enum_compare() { let x = MyEnum::Small(0); let y = MyEnum::Large(0); let z = MyStruct { x, y }; let opt = Option::Some(z); } -- Compiler Explorer Because the size of an enum is the size of its largest element plus a flag, the compiler can predict how much memory is used no matter which variant of an enum is currently stored in a variable. Thus, enums and unions have no need of heap allocation. There's unfortunately not a great way to show this in assembly, so I'll instead point you to the core::mem::size_of documentation. ","version":null,"tagName":"h2"},{"title":"Arrays","type":1,"pageTitle":"Allocations in Rust: Fixed memory","url":"/2019/02/stacking-up#arrays","content":" The array type is guaranteed to be stack allocated, which is why the array size must be declared. Interestingly enough, this can be used to cause safe Rust programs to crash: // 256 bytes #[derive(Default)] struct TwoFiftySix { _a: [u64; 32] } // 8 kilobytes #[derive(Default)] struct EightK { _a: [TwoFiftySix; 32] } // 256 kilobytes #[derive(Default)] struct TwoFiftySixK { _a: [EightK; 32] } // 8 megabytes - exceeds space typically provided for the stack, // though the kernel can be instructed to allocate more. // On Linux, you can check stack size using `ulimit -s` #[derive(Default)] struct EightM { _a: [TwoFiftySixK; 32] } fn main() { // Because we already have things in stack memory // (like the current function call stack), allocating another // eight megabytes of stack memory crashes the program let _x = EightM::default(); } -- Rust Playground There aren't any security implications of this (no memory corruption occurs), but it's good to note that the Rust compiler won't move arrays into heap memory even if they can be reasonably expected to overflow the stack. 
","version":null,"tagName":"h2"},{"title":"Closures","type":1,"pageTitle":"Allocations in Rust: Fixed memory","url":"/2019/02/stacking-up#closures","content":" Rules for how anonymous functions capture their arguments are typically language-specific. In Java,Lambda Expressions are actually objects created on the heap that capture local primitives by copying, and capture local non-primitives as (final) references.Python andJavaScriptboth bind everything by reference normally, but Python can alsocapture values and JavaScript hasArrow functions. In Rust, arguments to closures are the same as arguments to other functions; closures are simply functions that don't have a declared name. Some weird ordering of the stack may be required to handle them, but it's the compiler's responsiblity to figure that out. Each example below has the same effect, but a different assembly implementation. In the simplest case, we immediately run a closure returned by another function. Because we don't store a reference to the closure, the stack memory needed to store the captured values is contiguous: fn my_func() -> impl FnOnce() { let x = 24; // Note that this closure in assembly looks exactly like // any other function; you even use the `call` instruction // to start running it. move || { x; } } pub fn immediate() { my_func()(); my_func()(); } -- Compiler Explorer, 25 total assembly instructions If we store a reference to the closure, the Rust compiler keeps values it needs in the stack memory of the original function. 
Getting the details right is a bit harder, so the instruction count goes up even though this code is functionally equivalent to our original example: pub fn simple_reference() { let x = my_func(); let y = my_func(); y(); x(); } -- Compiler Explorer, 55 total assembly instructions Even things like variable order can make a difference in instruction count: pub fn complex() { let x = my_func(); let y = my_func(); x(); y(); } -- Compiler Explorer, 70 total assembly instructions In every circumstance though, the compiler ensured that no heap allocations were necessary. ","version":null,"tagName":"h2"},{"title":"Generics","type":1,"pageTitle":"Allocations in Rust: Fixed memory","url":"/2019/02/stacking-up#generics","content":" Traits in Rust come in two broad forms: static dispatch (monomorphization, impl Trait) and dynamic dispatch (trait objects, dyn Trait). While dynamic dispatch is often associated with trait objects being stored in the heap, dynamic dispatch can be used with stack allocated objects as well: trait GetInt { fn get_int(&self) -> u64; } // vtable stored at section L__unnamed_1 struct WhyNotU8 { x: u8 } impl GetInt for WhyNotU8 { fn get_int(&self) -> u64 { self.x as u64 } } // vtable stored at section L__unnamed_2 struct ActualU64 { x: u64 } impl GetInt for ActualU64 { fn get_int(&self) -> u64 { self.x } } // `&dyn` declares that we want to use dynamic dispatch // rather than monomorphization, so there is only one // `retrieve_int` function that shows up in the final assembly. // If we used generics, there would be one implementation of // `retrieve_int` for each type that implements `GetInt`. pub fn retrieve_int(u: &dyn GetInt) { // In the assembly, we just call an address given to us // in the `rsi` register and hope that it was set up // correctly when this function was invoked. 
let x = u.get_int(); } pub fn do_call() { // Note that even though the vtable for `WhyNotU8` and // `ActualU64` includes a pointer to // `core::ptr::real_drop_in_place`, it is never invoked. let a = WhyNotU8 { x: 0 }; let b = ActualU64 { x: 0 }; retrieve_int(&a); retrieve_int(&b); } -- Compiler Explorer It's hard to imagine practical situations where dynamic dispatch would be used for objects that aren't heap allocated, but it technically can be done. ","version":null,"tagName":"h2"},{"title":"Copy types","type":1,"pageTitle":"Allocations in Rust: Fixed memory","url":"/2019/02/stacking-up#copy-types","content":" Understanding move semantics and copy semantics in Rust is weird at first. The Rust docs go into detail far better than can be addressed here, so I'll leave them to do the job. From a memory perspective though, their guideline is reasonable: if your type can implement Copy, it should. While there are potential speed tradeoffs to benchmark when discussing Copy (move semantics for stack objects vs. copying stack pointers vs. copying stack structs), it's impossible for Copy to introduce a heap allocation. But why is this the case? Fundamentally, it's because the language controls what Copy means - "the behavior of Copy is not overloadable" because it's a marker trait. From there we'll note that a type can implement Copy if (and only if) its components implement Copy, and that no heap-allocated types implement Copy. Thus, assignments involving heap types are always move semantics, and new heap allocations won't occur because of implicit operator behavior. 
#[derive(Clone)] struct Cloneable { x: Box<u64> } // error[E0204]: the trait `Copy` may not be implemented for this type #[derive(Copy, Clone)] struct NotCopyable { x: Box<u64> } -- Compiler Explorer ","version":null,"tagName":"h2"},{"title":"Iterators","type":1,"pageTitle":"Allocations in Rust: Fixed memory","url":"/2019/02/stacking-up#iterators","content":" In managed memory languages (like Java), there's a subtle difference between these two code samples: public static long sum_for(List<Long> vals) { long sum = 0; // Regular for loop for (int i = 0; i < vals.size(); i++) { sum += vals.get(i); } return sum; } public static long sum_foreach(List<Long> vals) { long sum = 0; // "Foreach" loop - uses iteration for (Long l : vals) { sum += l; } return sum; } In the sum_for function, nothing terribly interesting happens. In sum_foreach, an object of type Iterator is allocated on the heap, and will eventually be garbage-collected. This isn't a great design; iterators are often transient objects that you need during a function and can discard once the function ends. Sounds exactly like the issue stack-allocated objects address, no? In Rust, iterators are allocated on the stack. The objects to iterate over are almost certainly in heap memory, but the iterator itself (Iter) doesn't need to use the heap. In each of the examples below we iterate over a collection, but never use heap allocation: use std::collections::HashMap; // There's a lot of assembly generated, but if you search in the text, // there are no references to `real_drop_in_place` anywhere. 
pub fn sum_vec(x: &Vec<u32>) { let mut s = 0; // Basic iteration over vectors doesn't need allocation for y in x { s += y; } } pub fn sum_enumerate(x: &Vec<u32>) { let mut s = 0; // More complex iterators are just fine too for (_i, y) in x.iter().enumerate() { s += y; } } pub fn sum_hm(x: &HashMap<u32, u32>) { let mut s = 0; // And it's not just Vec, all types will allocate the iterator // on stack memory for y in x.values() { s += y; } } -- Compiler Explorer ","version":null,"tagName":"h2"}],"options":{"id":"default"}} |