So far, the plot() function has been fairly simple: map a fractal flame coordinate to a specific pixel,
and color in that pixel. This works well for simple function systems (like Sierpinski's Gasket),
but more complex systems (like the reference parameters) produce grainy images.
In this post, we'll refine the image quality and add color to really make things shine.
This post covers sections 4 and 5 of the Fractal Flame Algorithm paper
One problem with the current chaos game algorithm is that we waste work because pixels are either "on" (opaque) or "off" (transparent). If the chaos game encounters the same pixel twice, nothing changes.
To demonstrate how much work is wasted, we'll count each time the chaos game visits a pixel while iterating. This gives us a kind of image "histogram":
import { randomBiUnit } from "../src/randomBiUnit";
import { randomChoice } from "../src/randomChoice";
import { Props as ChaosGameFinalProps } from "../2-transforms/chaosGameFinal";
import { camera, histIndex } from "../src/camera";
const quality = 10;
const step = 100_000;
type Props = ChaosGameFinalProps & {
paint: (
width: number,
height: number,
histogram: number[]
) => ImageData;
}
export function* chaosGameHistogram(
{
width,
height,
transforms,
final,
paint
}: Props
) {
const pixels = width * height;
const iterations = quality * pixels;
const hist = Array<number>(pixels)
.fill(0);
const plotHist = (
x: number,
y: number
) => {
const [pixelX, pixelY] =
camera(x, y, width);
if (
pixelX < 0 ||
pixelX >= width ||
pixelY < 0 ||
pixelY >= height
)
return;
const hIndex =
histIndex(pixelX, pixelY, width, 1);
hist[hIndex] += 1;
};
let [x, y] = [
randomBiUnit(),
randomBiUnit()
];
for (let i = 0; i < iterations; i++) {
const [_, transform] =
randomChoice(transforms);
[x, y] = transform(x, y);
const [finalX, finalY] = final(x, y);
if (i > 20) {
plotHist(finalX, finalY);
}
if (i % step === 0)
yield paint(width, height, hist);
}
yield paint(width, height, hist);
}
When the chaos game finishes, we find the pixel encountered most often. Finally, we "paint" the image by setting each pixel's alpha (transparency) value to the ratio of its visit count to that maximum:
export function paintLinear(
width: number,
height: number,
hist: number[]
) {
const img =
new ImageData(width, height);
let hMax = 0;
for (let value of hist) {
hMax = Math.max(hMax, value);
}
for (let i = 0; i < hist.length; i++) {
const pixelIndex = i * 4;
img.data[pixelIndex] = 0;
img.data[pixelIndex + 1] = 0;
img.data[pixelIndex + 2] = 0;
const alpha = hist[i] / hMax * 0xff;
img.data[pixelIndex + 3] = alpha;
}
return img;
}
While using a histogram reduces the "graining," it also leads to some parts vanishing entirely. In the reference parameters, the outer circle is still there, but the interior is gone!
To fix this, we'll introduce the second major innovation of the fractal flame algorithm: tone mapping. This is a technique used in computer graphics to compensate for differences in how computers represent brightness, and how people actually see brightness.
As a concrete example, high-dynamic-range (HDR) photography uses this technique to capture scenes with a wide range of brightnesses. To take a picture of something dark, you need a long exposure time. However, long exposures lead to "hot spots" (sections that are pure white). By taking multiple pictures with different exposure times, we can combine them to create a final image where everything is visible.
In fractal flames, this "tone map" is accomplished by scaling brightness according to the logarithm of how many times we encounter a pixel. This way, "cold spots" (pixels the chaos game visits infrequently) are still visible, and "hot spots" (pixels the chaos game visits frequently) won't wash out.
As mentioned in the paper:
Where one branch of the fractal crosses another, one may appear to occlude the other if their densities are different enough because the lesser density is inconsequential in sum. For example, branches of densities 1000 and 100 might have brightnesses of 30 and 20. Where they cross the density is 1100, whose brightness is 30.4, which is hardly distinguishable from 30.
export function paintLogarithmic(
width: number,
height: number,
hist: number[]
) {
const img =
new ImageData(width, height);
const histLog = hist.map(Math.log);
let hLogMax = -Infinity;
for (let value of histLog) {
hLogMax = Math.max(hLogMax, value);
}
for (let i = 0; i < hist.length; i++) {
const pixelIndex = i * 4;
img.data[pixelIndex] = 0; // red
img.data[pixelIndex + 1] = 0; // green
img.data[pixelIndex + 2] = 0; // blue
const alpha =
histLog[i] / hLogMax * 0xff;
img.data[pixelIndex + 3] = alpha;
}
return img;
}
Now we'll introduce the last innovation of the fractal flame algorithm: color. By including a third coordinate in the chaos game, we can illustrate which transforms are responsible for each part of the image.
Color in a fractal flame is continuous on the range [0, 1]. We'll give each transform a color value in that range, and the final transform gets a value too. Then, at each step in the chaos game, we'll set the current color by blending it with the previous color.
Color speed isn't introduced in the Fractal Flame Algorithm paper.
It is included here because flam3
implements it,
and because it's fun to play with.
Next, we'll add a parameter to each transform that controls how much it changes the current color. This is known as the "color speed":
export function mixColor(
color1: number,
color2: number,
colorSpeed: number
) {
return color1 * (1 - colorSpeed) +
color2 * colorSpeed;
}
Color speed values work just like transform weights. A value of 1 means we take the transform color and ignore the previous color state. A value of 0 means we keep the current color state and ignore the transform color.
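As a quick check of that behavior (the color values below are arbitrary examples, not parameters from the reference image):
import { mixColor } from "./mixColor";

mixColor(0.25, 0.8, 1);   // 0.8   - the transform color wins completely
mixColor(0.25, 0.8, 0);   // 0.25  - the current color is unchanged
mixColor(0.25, 0.8, 0.5); // 0.525 - halfway between the two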
Now, we need to map the color coordinate to a pixel color. Fractal flames typically use 256 colors (each color has 3 values - red, green, blue) to define a palette. The color coordinate then becomes an index into the palette.
There's one small complication: the color coordinate is continuous, but the palette uses discrete colors. How do we handle situations where the color coordinate is "in between" the colors of our palette?
One way to handle this is a step function. In the code below, we multiply the color coordinate by the number of colors in the palette, then truncate that value. This gives us a discrete index:
export function colorFromPalette(
palette: number[],
colorIndex: number
): [number, number, number] {
const numColors = palette.length / 3;
const paletteIndex = Math.floor(
colorIndex * (numColors)
) * 3;
return [
palette[paletteIndex], // red
palette[paletteIndex + 1], // green
palette[paletteIndex + 2] // blue
];
}
Alternatively, you could interpolate between colors in the palette; for example, flam3 uses linear interpolation.
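Here's a sketch of what linear interpolation could look like; this is my own illustration of the idea, not flam3's actual implementation:
export function colorFromPaletteLerp(
  palette: number[],
  colorIndex: number
): [number, number, number] {
  const numColors = palette.length / 3;

  // Position in "palette space": 0 maps to the first color,
  // 1 maps to the last color
  const position = colorIndex * (numColors - 1);
  const lowIndex = Math.floor(position);
  const highIndex = Math.min(lowIndex + 1, numColors - 1);

  // Fractional distance between the two neighboring palette entries
  const t = position - lowIndex;
  const lerp = (a: number, b: number) => a * (1 - t) + b * t;

  return [
    lerp(palette[lowIndex * 3], palette[highIndex * 3]),         // red
    lerp(palette[lowIndex * 3 + 1], palette[highIndex * 3 + 1]), // green
    lerp(palette[lowIndex * 3 + 2], palette[highIndex * 3 + 2])  // blue
  ];
}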
In the diagram below, each color in the palette is plotted on a small vertical strip. Putting the strips side by side shows the full palette used by the reference parameters:
We're now ready to plot our coordinates. This time, we'll use a histogram for each color channel (red, green, blue, alpha). After translating the color coordinate into an RGB value, we add it to the histograms:
import { Props as ChaosGameFinalProps } from "../2-transforms/chaosGameFinal";
import { randomBiUnit } from "../src/randomBiUnit";
import { randomChoice } from "../src/randomChoice";
import { camera, histIndex } from "../src/camera";
import { colorFromPalette } from "./colorFromPalette";
import { mixColor } from "./mixColor";
import { paintColor } from "./paintColor";
const quality = 15;
const step = 100_000;
export type TransformColor = {
color: number;
colorSpeed: number;
}
export type Props = ChaosGameFinalProps & {
palette: number[];
colors: TransformColor[];
finalColor: TransformColor;
}
export function* chaosGameColor(
{
width,
height,
transforms,
final,
palette,
colors,
finalColor
}: Props
) {
const pixels = width * height;
const imgRed = Array<number>(pixels)
.fill(0);
const imgGreen = Array<number>(pixels)
.fill(0);
const imgBlue = Array<number>(pixels)
.fill(0);
const imgAlpha = Array<number>(pixels)
.fill(0);
const plotColor = (
x: number,
y: number,
c: number
) => {
const [pixelX, pixelY] =
camera(x, y, width);
if (
pixelX < 0 ||
pixelX >= width ||
pixelY < 0 ||
pixelY >= height
)
return;
const hIndex =
histIndex(pixelX, pixelY, width, 1);
const [r, g, b] =
colorFromPalette(palette, c);
imgRed[hIndex] += r;
imgGreen[hIndex] += g;
imgBlue[hIndex] += b;
imgAlpha[hIndex] += 1;
}
let [x, y] = [
randomBiUnit(),
randomBiUnit()
];
let c = Math.random();
const iterations = quality * pixels;
for (let i = 0; i < iterations; i++) {
const [transformIndex, transform] =
randomChoice(transforms);
[x, y] = transform(x, y);
const transformColor =
colors[transformIndex];
c = mixColor(
c,
transformColor.color,
transformColor.colorSpeed
);
const [finalX, finalY] = final(x, y);
const finalC = mixColor(
c,
finalColor.color,
finalColor.colorSpeed
);
if (i > 20)
plotColor(
finalX,
finalY,
finalC
)
if (i % step === 0)
yield paintColor(
width,
height,
imgRed,
imgGreen,
imgBlue,
imgAlpha
);
}
yield paintColor(
width,
height,
imgRed,
imgGreen,
imgBlue,
imgAlpha
);
}
Finally, painting the image. With tone mapping, logarithms scale the image brightness to match how it is perceived. With color, we use a similar method, but scale each color channel by the alpha channel:
export function paintColor(
width: number,
height: number,
red: number[],
green: number[],
blue: number[],
alpha: number[]
): ImageData {
const pixels = width * height;
const img =
new ImageData(width, height);
for (let i = 0; i < pixels; i++) {
const scale =
Math.log10(alpha[i]) /
(alpha[i] * 1.5);
const pixelIndex = i * 4;
const rVal = red[i] * scale * 0xff;
img.data[pixelIndex] = rVal;
const gVal = green[i] * scale * 0xff;
img.data[pixelIndex + 1] = gVal;
const bVal = blue[i] * scale * 0xff;
img.data[pixelIndex + 2] = bVal;
const aVal = alpha[i] * scale * 0xff;
img.data[pixelIndex + 3] = aVal;
}
return img;
}
And now, at long last, a full-color fractal flame:
Tone mapping is the second major innovation of the fractal flame algorithm. By tracking how often the chaos game encounters each pixel, we can adjust brightness/transparency to reduce the visual "graining" of previous images.
Next, introducing a third coordinate to the chaos game makes color images possible, the third major innovation of the fractal flame algorithm. Using a continuous color scale and color palette adds a splash of excitement to the image.
The Fractal Flame Algorithm paper goes on to describe more techniques not covered here. For example, image quality can be improved with density estimation and filtering. New parameters can be generated by "mutating" existing fractal flames. And fractal flames can even be animated to produce videos!
That said, I think this is a good place to wrap up. We went from an introduction to the mathematics of fractal systems all the way to generating full-color images. Fractal flames are a challenging topic, but it's extremely rewarding to learn about how they work.
This post uses reference parameters to demonstrate the fractal flame algorithm. If you're interested in tweaking the parameters, or creating your own, Apophysis can load the parameter file.
This post covers section 3 of the Fractal Flame Algorithm paper
We previously introduced transforms as the "functions" of an "iterated function system," and showed how playing the chaos game gives us an image of Sierpinski's Gasket. Even though we used simple functions, the image it generates is intriguing. But what would happen if we used something more complex?
This leads us to the first big innovation of the fractal flame algorithm: adding non-linear functions after the affine transform. These functions are called "variations":
export type Variation = (
x: number,
y: number
) => [number, number];
Just like transforms, variations are functions that take in coordinates
and give back new coordinates.
However, the sky is the limit for what happens between input and output.
The Fractal Flame paper lists 49 variation functions,
and the official flam3
implementation supports 98 different variations.
To draw our reference image, we'll focus on just four:
The linear variation is dead simple: return the x and y coordinates as-is.
import {Variation} from "./variation"
export const linear: Variation =
(x, y) => [x, y];
In a way, we've already been using this variation! The transforms that define Sierpinski's Gasket apply the affine coefficients to the input point and use that as the output.
The Julia variation is a good example of a non-linear function. It uses both trigonometry and probability to produce interesting shapes:
import { Variation } from "./variation";
const omega =
() => Math.random() > 0.5 ? 0 : Math.PI;
export const julia: Variation =
(x, y) => {
const x2 = Math.pow(x, 2);
const y2 = Math.pow(y, 2);
const r = Math.sqrt(x2 + y2);
const theta = Math.atan2(x, y);
const sqrtR = Math.sqrt(r);
const thetaVal = theta / 2 + omega();
return [
sqrtR * Math.cos(thetaVal),
sqrtR * Math.sin(thetaVal)
];
};
Some variations rely on knowing the transform's affine coefficients; they're called "dependent variations." The popcorn variation uses c and f:
import { Coefs } from "./transform";
import { Variation } from "./variation";
export const popcorn =
({ c, f }: Coefs): Variation =>
(x, y) => [
x + c * Math.sin(Math.tan(3 * y)),
y + f * Math.sin(Math.tan(3 * x))
];
Some variations have extra parameters we can choose; they're called "parametric variations." For the PDJ variation, there are four extra parameters:
import { Variation } from './variation'
export type PdjParams = {
a: number,
b: number,
c: number,
d: number
};
export const pdj =
({a, b, c, d}: PdjParams): Variation =>
(x, y) => [
Math.sin(a * y) - Math.cos(b * x),
Math.sin(c * x) - Math.cos(d * y)
]
Now, one variation is fun, but we can also combine variations in a process called "blending." Each variation receives the same x and y inputs, and we add together each variation's x and y outputs. We'll also give each variation a weight that changes how much it contributes to the result:
In symbols, blending is just a weighted sum of each variation's output: (x', y') = Σ_j w_j · V_j(x, y). The formula looks intimidating, but it's not hard to implement:
import { Variation } from "./variation";
export type Blend = [number, Variation][];
export function blend(
x: number,
y: number,
varFns: Blend
): [number, number] {
let [outX, outY] = [0, 0];
for (const [weight, varFn] of varFns) {
const [varX, varY] = varFn(x, y);
outX += weight * varX;
outY += weight * varY;
}
return [outX, outY];
}
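As a usage sketch (the import paths and the weights here are my own example values, not the reference parameters), a transform might lean mostly on linear with a touch of julia:
import { blend, Blend } from "./blend";
import { linear } from "./linear";
import { julia } from "./julia";

// 80% linear, 20% julia
const exampleBlend: Blend = [
  [0.8, linear],
  [0.2, julia]
];

const [outX, outY] = blend(0.5, -0.25, exampleBlend);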
With that in place, we have enough to render a fractal flame. We'll use the same chaos game as before, but the new transforms and variations produce a dramatically different image:
Try using the variation weights to figure out which parts of the image each transform controls.
Next, we'll introduce a second affine transform applied after variation blending. This is called a "post transform."
We'll use some new variables, but the post transform should look familiar:
import { applyCoefs, Coefs, Transform } from "../src/transform";
export const transformPost = (
transform: Transform,
coefs: Coefs
): Transform =>
(x, y) => {
[x, y] = transform(x, y);
return applyCoefs(x, y, coefs);
}
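As a usage sketch (the import path and coefficient values are illustrative), wrapping an existing transform shifts whatever it produces after the variation blend has run:
import { Transform } from "../src/transform";
import { transformPost } from "./post";

const baseTransform: Transform = (x, y) => [x / 2, y / 2];

// Shift the base transform's output right by 0.5 after it runs
const shifted = transformPost(baseTransform, {
  a: 1, b: 0, c: 0.5,
  d: 0, e: 1, f: 0
});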
The image below uses the same transforms/variations as the previous fractal flame, but allows changing the post-transform coefficients:
The last step is to introduce a "final transform" that is applied regardless of which regular transform the chaos game selects. It's just like a normal transform (composition of affine transform, variation blend, and post transform), but it doesn't affect the chaos game state.
After adding the final transform, our chaos game algorithm looks like this:
import { randomBiUnit } from "../src/randomBiUnit";
import { randomChoice } from "../src/randomChoice";
import { plotBinary as plot } from "../src/plotBinary";
import { Transform } from "../src/transform";
import { Props as WeightedProps } from "../1-introduction/chaosGameWeighted";
const quality = 0.5;
const step = 1000;
export type Props = WeightedProps & {
final: Transform,
}
export function* chaosGameFinal(
{
width,
height,
transforms,
final
}: Props
) {
let img =
new ImageData(width, height);
let [x, y] = [
randomBiUnit(),
randomBiUnit()
];
const pixels = width * height;
const iterations = quality * pixels;
for (let i = 0; i < iterations; i++) {
const [_, transform] =
randomChoice(transforms);
[x, y] = transform(x, y);
const [finalX, finalY] = final(x, y);
if (i > 20)
plot(finalX, finalY, img);
if (i % step === 0)
yield img;
}
yield img;
}
This image uses the same normal/post transforms as above, but allows modifying the coefficients and variations of the final transform:
Variations are the fractal flame algorithm's first major innovation. By blending variation functions and post/final transforms, we generate unique images.
However, these images are grainy and unappealing. In the next post, we'll clean up the image quality and add some color.
Fractal flames are "a member of the iterated function system class of fractals."
It's tedious, but technically correct. I choose to think of them a different way: beauty in mathematics.
I don't remember when exactly I first learned about fractal flames, but I do remember being entranced by the images they created. I also remember their unique appeal to my young engineering mind; this was an art form I could participate in.
The Fractal Flame Algorithm paper describing their structure was too much for me to handle at the time (I was ~12 years old), so I was content to play around and enjoy the pictures. But the desire to understand it stuck around. Now, with a graduate degree under my belt, I wanted to revisit it.
This guide is my attempt to explain how fractal flames work so that younger me — and others interested in the art — can understand without too much prior knowledge.
This post covers section 2 of the Fractal Flame Algorithm paper
As mentioned, fractal flames are a type of "iterated function system," or IFS. The formula for an IFS is short, but takes some time to work through: S = ⋃_i F_i(S)
First, S: the set of points in two dimensions (in math terms, S ⊆ ℝ²) that represent a "solution" of some kind to our equation. Our goal is to find all the points in S, plot them, and display that image.
For example, if we say S contains just three specific points, then there are three points to plot:
With fractal flames, rather than listing individual points, we use functions to describe the solution. This means there are an infinite number of points, but if we find enough points to plot, we get a nice picture. And if the functions change, the solution also changes, and we get something new.
Second, the functions F_i, also known as "transforms." Each transform takes in a 2-dimensional point and gives a new point back (in math terms, F_i: ℝ² → ℝ²). While you could theoretically use any function, we'll focus on a specific kind of function called an "affine transformation." Every transform uses the same formula: F_i(x, y) = (a_i·x + b_i·y + c_i, d_i·x + e_i·y + f_i)
export type Transform =
(x: number, y: number) =>
[number, number];
export interface Coefs {
a: number,
b: number,
c: number,
d: number,
e: number,
f: number
}
export function applyCoefs(
x: number,
y: number,
coefs: Coefs
): [number, number] {
return [
(x * coefs.a + y * coefs.b + coefs.c),
(x * coefs.d + y * coefs.e + coefs.f)
];
}
The parameters (a, b, and so on) are values we choose. For example, we can define a "shift" function like this:
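(The post's original coefficients for the shift aren't preserved here, so the values below, along with the import path, are my own illustrative choice: a translation by (1, 1).)
import { applyCoefs, Coefs } from "./transform";

// a = e = 1 keeps x and y unchanged; c = f = 1 then shifts the point by (1, 1)
const shiftCoefs: Coefs = {
  a: 1, b: 0, c: 1,
  d: 0, e: 1, f: 1
};

export const shift = (x: number, y: number) =>
  applyCoefs(x, y, shiftCoefs);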
Applying this transform to the original points gives us a new set of points:
Fractal flames use more complex functions, but they all start with this structure.
With those definitions in place, let's revisit the initial problem: S = ⋃_i F_i(S)
Or, in English, we might say:
Our solution, S, is the union of all sets produced by applying each function, F_i, to points in the solution.
There's just one small problem: to find the solution, we must already know which points are in the solution. What?
John E. Hutchinson provides an explanation in the original paper defining the mathematics of iterated function systems:
Furthermore, S is compact and is the closure of the set of fixed points of finite compositions of members of F.
Before your eyes glaze over, let's unpack this:
Thus, by applying the functions to fixed points of our system, we will find the other points we care about.
There are also some extra details I've glossed over so far.
First, the Hutchinson paper requires that the functions be contractive for the solution set to exist. That is, applying the function to a point must bring it closer to other points. However, as the fractal flame algorithm demonstrates, we only need functions to be contractive on average. At worst, the system will degenerate and produce a bad image.
Second, we're focused on ℝ² because we're generating images, but the math allows for arbitrary dimensions; you could also have 3-dimensional fractal flames.
Finally, there's a close relationship between fractal flames and attractors. Specifically, the fixed points of F act as attractors for the chaos game (explained below).
This is still a bit vague, so let's work through an example.
The Fractal Flame paper gives three functions to use for a first IFS:
Now, how do we find the "fixed points" mentioned earlier? The paper lays out an algorithm called the "chaos game" that gives us points in the solution:
The chaos game algorithm is effectively the "finite compositions of F" mentioned earlier.
Let's turn this into code, one piece at a time.
To start, we need to generate some random numbers. The "bi-unit square" is the range [-1, 1] for both x and y, and we can do this using an existing API:
export function randomBiUnit() {
return Math.random() * 2 - 1;
}
Next, we need to choose a random integer from min (inclusive) to max (exclusive):
export function randomInteger(
min: number,
max: number
) {
let v = Math.random() * (max - min);
return Math.floor(v) + min;
}
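For example, picking one of three transforms (the bound here is arbitrary):
// Returns 0, 1, or 2, each with equal probability
const index = randomInteger(0, 3);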
Finally, implementing the plot
function. This blog series is interactive,
so everything displays directly in the browser. As an alternative,
software like flam3
and Apophysis can "plot" by saving an image to disk.
To see the results, we'll use the Canvas API. This allows us to manipulate individual pixels in an image and show it on screen.
First, we need to convert from fractal flame coordinates to pixel coordinates. To simplify things, we'll assume that we're plotting a square image with range [0, 1] for both x and y:
export function camera(
x: number,
y: number,
size: number
): [number, number] {
return [
Math.floor(x * size),
Math.floor(y * size)
];
}
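For example, with an arbitrary 400x400 image, the point (0.25, 0.75) maps to pixel (100, 300):
// Math.floor(0.25 * 400) = 100, Math.floor(0.75 * 400) = 300
const [pixelX, pixelY] = camera(0.25, 0.75, 400);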
Next, we'll store the pixel data in an ImageData
object.
Each pixel on screen has a corresponding index in the data
array.
To plot a point, we set that pixel to be black:
import { camera } from "./cameraGasket";
function imageIndex(
x: number,
y: number,
width: number
) {
return y * (width * 4) + x * 4;
}
export function plot(
x: number,
y: number,
img: ImageData
) {
let [pixelX, pixelY] =
camera(x, y, img.width);
// Skip coordinates outside the display
if (
pixelX < 0 ||
pixelX >= img.width ||
pixelY < 0 ||
pixelY >= img.height
)
return;
const i = imageIndex(
pixelX,
pixelY,
img.width
);
// Set the pixel to black by setting
// the first three elements to 0
// (red, green, and blue, respectively),
// and 255 to the last element (alpha)
img.data[i] = 0;
img.data[i + 1] = 0;
img.data[i + 2] = 0;
img.data[i + 3] = 0xff;
}
Putting it all together, we have our first image:
// Hint: try changing the iteration count
const iterations = 100000;

// Hint: negating `x` and `y` creates some cool images
const xforms = [
  (x, y) => [x / 2, y / 2],
  (x, y) => [(x + 1) / 2, y / 2],
  (x, y) => [x / 2, (y + 1) / 2]
];

function* chaosGame({ width, height }) {
  let img = new ImageData(width, height);
  let [x, y] = [
    randomBiUnit(),
    randomBiUnit()
  ];

  for (let i = 0; i < iterations; i++) {
    const index = randomInteger(0, xforms.length);
    [x, y] = xforms[index](x, y);

    if (i > 20)
      plot(x, y, img);

    if (i % 1000 === 0)
      yield img;
  }

  yield img;
}

render(<Gasket f={chaosGame} />);
The image here is slightly different than in the paper. I think the paper has an error, so I'm plotting the image like the reference implementation.
There's one last step before we finish the introduction. So far, each transform has the same chance of being picked in the chaos game. We can change that by giving them a "weight" instead:
export function randomChoice<T>(
choices: [number, T][]
): [number, T] {
const weightSum = choices.reduce(
(sum, [weight, _]) => sum + weight,
0
);
let choice = Math.random() * weightSum;
for (const entry of choices.entries()) {
const [idx, elem] = entry;
const [weight, t] = elem;
if (choice < weight) {
return [idx, t];
}
choice -= weight;
}
const index = choices.length - 1;
return [index, choices[index][1]];
}
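As a hypothetical usage (the labels and weights are made up), weighting one choice twice as heavily as another:
import { randomChoice } from "../src/randomChoice";

// "b" has twice the weight of "a", so it should be returned
// roughly two-thirds of the time
const choices: [number, string][] = [
  [1, "a"],
  [2, "b"]
];

const [index, value] = randomChoice(choices);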
If we let the chaos game run forever, these weights wouldn't matter. But because the iteration count is limited, changing the weights means we don't plot some parts of the image:
import { randomBiUnit } from "../src/randomBiUnit";
import { randomChoice } from "../src/randomChoice";
import { plot } from "./plot";
import { Transform } from "../src/transform";
const quality = 0.5;
const step = 1000;
export type Props = {
width: number,
height: number,
transforms: [number, Transform][]
}
export function* chaosGameWeighted(
{ width, height, transforms }: Props
) {
let img =
new ImageData(width, height);
let [x, y] = [
randomBiUnit(),
randomBiUnit()
];
const pixels = width * height;
const iterations = quality * pixels;
for (let i = 0; i < iterations; i++) {
const [_, xform] =
randomChoice(transforms);
[x, y] = xform(x, y);
if (i > 20)
plot(x, y, img);
if (i % step === 0)
yield img;
}
yield img;
}
Double-click the image if you want to save a copy!
Studying the foundations of fractal flames is challenging, but we now have an understanding of the mathematics and the implementation of iterated function systems.
In the next post, we'll look at the first innovation of the fractal flame algorithm: variations.
The project was soon derailed trying to sort out technical issues unrelated to the original purpose. Finding a resolution was a frustrating journey, and it's still not clear whether those problems were my fault. As a result, I'm writing this to try making sense of it, as a case study/reference material, and to salvage something from the process.
The sole starting requirement was to write everything in TypeScript. Not because of project scale, but because guardrails help with unfamiliar territory. Keeping that in mind, the first question was: how does one start a new project? All I actually need is "compile TypeScript, show it in a browser."
Create React App (CRA) came to the rescue and the rest of that evening was a joy. My TypeScript/JavaScript skills were rusty, but the online documentation was helpful. I had never understood the appeal of JSX (why put a DOM in JavaScript?) until it made connecting an onEvent
handler and a function easy.
Some quick dimensional analysis later and there was a sine wave oscillator playing A=440 through the speakers. I specifically remember thinking "modern browsers are magical."
Now comes the first mistake: I began to worry about "scale" before encountering an actual problem. Rather than rendering audio in the main thread, why not use audio worklets and render in a background thread instead?
The first sign something was amiss came from the TypeScript compiler errors showing the audio worklet API was missing. After searching out Github issues and (unsuccessfully) tweaking the .tsconfig
settings, I settled on installing a package and moving on.
The next problem came from actually using the API. Worklets must load from separate "modules," but it wasn't clear how to guarantee the worklet code stayed separate from the application. I saw recommendations to use new URL(<local path>, import.meta.url)
and it worked! Well, kind of:
That file has the audio processor code, so why does it get served with Content-Type: video/mp2t
?
Now comes the second mistake: even though I didn't understand the error, I ignored recommendations to just use JavaScript and stuck by the original TypeScript requirement.
I tried different project structures. Moving the worklet code to a new folder didn't help, nor did setting up a monorepo and placing it in a new package.
I tried three different CRA tools - react-app-rewired
, craco
, customize-react-app
- but got the same problem. Each has varying levels of compatibility with recent CRA versions, so it wasn't clear if I had the right solution but implemented it incorrectly. After attempting to eject the application and panicking after seeing the configuration, I abandoned that as well.
I tried changing the webpack configuration: using new loaders, setting asset rules, even changing how webpack detects worker resources. In hindsight, entry points may have been the answer. But because CRA actively resists attempts to change its webpack configuration, and I couldn't find audio worklet examples in any other framework, I gave up.
I tried so many application frameworks. Next.js looked like a good candidate, but added its own bespoke webpack complexity to the existing confusion. Astro had the best "getting started" experience, but I refuse to install an IDE-specific plugin. I first used Deno while exploring Lume, but it couldn't import the audio worklet types (maybe because of module compatibility?). Each framework was unique in its own way (shout-out to SvelteKit) but I couldn't figure out how to make them work.
I ended up using Vite and vite-plugin-react-pages to handle both "build the app" and "bundle worklets," but the specific tool choice isn't important. Instead, the focus should be on lessons learned.
For myself:
For the tools:
In the end, learning new systems is fun, but a focus on tools that "just work" can leave users out in the cold if they break down.
Still, wouldn't it be nice to have more than a single active interpreter thread? In an age of asynchronicity and M:N threading, Python seems lacking. The ideal scenario is to take advantage of both Python's productivity and the modern CPU's parallel capabilities.
Presented below are two strategies for releasing the GIL's icy grip without giving up on what makes Python a nice language to start with. Bear in mind: these are just the tools; no claim is made about whether it's a good idea to use them. Very often, unlocking the GIL is an XY problem; you want application performance, and the GIL seems like an obvious bottleneck. Remember that any gains from running code in parallel come at the expense of project complexity; messing with the GIL is ultimately messing with Python's memory model.
%load_ext Cython
from numba import jit
N = 1_000_000_000
Put simply, Cython is a programming language that looks a lot like Python, gets transpiled to C/C++, and integrates well with the CPython API. It's great for building Python wrappers to C and C++ libraries, writing optimized code for numerical processing, and tons more. And when it comes to managing the GIL, there are two special features:
The nogil function annotation asserts that a Cython function is safe to use without the GIL, and compilation will fail if it interacts with Python in an unsafe manner.
The with nogil context manager explicitly unlocks the CPython GIL while active.
Whenever Cython code runs inside a with nogil block on a separate thread, the Python interpreter is unblocked and allowed to continue work elsewhere. We'll define a "busy work" function that demonstrates this principle in action:
%%cython
# Annotating a function with `nogil` indicates only that it is safe
# to call in a `with nogil` block. It *does not* release the GIL.
cdef unsigned long fibonacci(unsigned long n) nogil:
if n <= 1:
return n
cdef unsigned long a = 0, b = 1, c = 0
c = a + b
for _i in range(2, n):
a = b
b = c
c = a + b
return c
def cython_nogil(unsigned long n):
# Explicitly release the GIL while running `fibonacci`
with nogil:
value = fibonacci(n)
return value
def cython_gil(unsigned long n):
# Because the GIL is not explicitly released, it implicitly
# remains acquired when running the `fibonacci` function
return fibonacci(n)
First, let's time how long it takes Cython to calculate the billionth Fibonacci number:
%%time
_ = cython_gil(N);
CPU times: user 365 ms, sys: 0 ns, total: 365 ms Wall time: 372 ms
%%time
_ = cython_nogil(N);
CPU times: user 381 ms, sys: 0 ns, total: 381 ms Wall time: 388 ms
Both versions (with and without GIL) take effectively the same amount of time to run. Even when running this calculation in parallel on separate threads, it is expected that the run time will double because only one thread can be active at a time:
%%time
from threading import Thread
# Create the two threads to run on
t1 = Thread(target=cython_gil, args=[N])
t2 = Thread(target=cython_gil, args=[N])
# Start the threads
t1.start(); t2.start()
# Wait for the threads to finish
t1.join(); t2.join()
CPU times: user 641 ms, sys: 5.62 ms, total: 647 ms Wall time: 645 ms
However, if the first thread releases the GIL, the second thread is free to acquire it and run in parallel:
%%time
t1 = Thread(target=cython_nogil, args=[N])
t2 = Thread(target=cython_gil, args=[N])
t1.start(); t2.start()
t1.join(); t2.join()
CPU times: user 717 ms, sys: 372 µs, total: 718 ms Wall time: 358 ms
Because user
time represents the sum of processing time on all threads, it doesn't change much.
The "wall time" has been cut roughly in half
because each function is running simultaneously.
Keep in mind that the order in which threads are started makes a difference!
%%time
# Note that the GIL-locked version is started first
t1 = Thread(target=cython_gil, args=[N])
t2 = Thread(target=cython_nogil, args=[N])
t1.start(); t2.start()
t1.join(); t2.join()
CPU times: user 667 ms, sys: 0 ns, total: 667 ms Wall time: 672 ms
Even though the second thread releases the GIL while running, it can't start until the first has completed. Thus, the overall runtime is effectively the same as running two GIL-locked threads.
Finally, be aware that attempting to unlock the GIL from a thread that doesn't own it will crash the interpreter, not just the thread attempting the unlock:
%%cython
cdef int cython_recurse(int n) nogil:
if n <= 0:
return 0
with nogil:
return cython_recurse(n - 1)
cython_recurse(2)
Fatal Python error: PyEval_SaveThread: NULL tstate
Thread 0x00007f499effd700 (most recent call first):
  File "/home/bspeice/.virtualenvs/release-the-gil/lib/python3.7/site-packages/ipykernel/parentpoller.py", line 39 in run
  File "/usr/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap
In practice, avoiding this issue is simple. First, nogil
functions probably shouldn't contain
with nogil
blocks. Second, Cython can
conditionally acquire/release
the GIL, so these conditions can be used to synchronize access. Finally, Cython's documentation for
external C code
contains more detail on how to safely manage the GIL.
To conclude: use Cython's nogil
annotation to assert that functions are safe for calling when the
GIL is unlocked, and with nogil
to actually unlock the GIL and run those functions.
Like Cython, Numba is a "compiled Python." Where Cython works by
compiling a Python-like language to C/C++, Numba compiles Python bytecode directly to machine code
at runtime. Behavior is controlled with a special @jit
decorator; calling a decorated function
first compiles it to machine code before running. Calling the function a second time re-uses that
machine code unless the argument types have changed.
Numba works best when a nopython=True
argument is added to the @jit
decorator; functions
compiled in nopython
mode
avoid the CPython API and have performance comparable to C. Further, adding nogil=True
to the
@jit
decorator unlocks the GIL while that function is running. Note that nogil
and nopython
are separate arguments; while it is necessary for code to be compiled in nopython
mode in order to
release the lock, the GIL will remain locked if nogil=False
(the default).
Let's repeat the same experiment, this time using Numba instead of Cython:
# The `int` type annotation is only for humans and is ignored
# by Numba.
@jit(nopython=True, nogil=True)
def numba_nogil(n: int) -> int:
if n <= 1:
return n
a = 0
b = 1
c = a + b
for _i in range(2, n):
a = b
b = c
c = a + b
return c
# Run using `nopython` mode to receive a performance boost,
# but GIL remains locked due to `nogil=False` by default.
@jit(nopython=True)
def numba_gil(n: int) -> int:
if n <= 1:
return n
a = 0
b = 1
c = a + b
for _i in range(2, n):
a = b
b = c
c = a + b
return c
# Call each function once to force compilation; we don't want
# the timing statistics to include how long it takes to compile.
numba_nogil(N)
numba_gil(N);
We'll perform the same tests as above; first, figure out how long it takes the function to run:
%%time
_ = numba_gil(N)
CPU times: user 253 ms, sys: 258 µs, total: 253 ms Wall time: 251 ms
Aside: it's not immediately clear why Numba takes ~20% less time to run than Cython for code that should be effectively identical after compilation.
When running two GIL-locked threads, the result (as expected) takes around twice as long to compute:
%%time
t1 = Thread(target=numba_gil, args=[N])
t2 = Thread(target=numba_gil, args=[N])
t1.start(); t2.start()
t1.join(); t2.join()
CPU times: user 541 ms, sys: 3.96 ms, total: 545 ms Wall time: 541 ms
But if the GIL-unlocking thread starts first, both threads run in parallel:
%%time
t1 = Thread(target=numba_nogil, args=[N])
t2 = Thread(target=numba_gil, args=[N])
t1.start(); t2.start()
t1.join(); t2.join()
CPU times: user 551 ms, sys: 7.77 ms, total: 559 ms Wall time: 279 ms
Just like Cython, starting the GIL-locked thread first leads to poor performance:
%%time
t1 = Thread(target=numba_gil, args=[N])
t2 = Thread(target=numba_nogil, args=[N])
t1.start(); t2.start()
t1.join(); t2.join()
CPU times: user 524 ms, sys: 0 ns, total: 524 ms Wall time: 522 ms
Finally, unlike Cython, Numba will unlock the GIL if and only if it is currently acquired;
recursively calling @jit(nogil=True)
functions is perfectly safe:
from numba import jit
@jit(nopython=True, nogil=True)
def numba_recurse(n: int) -> int:
if n <= 0:
return 0
return numba_recurse(n - 1)
numba_recurse(2);
Before finishing, it's important to address pain points that will show up if these techniques are used in a more realistic project:
First, code running in a GIL-free context will likely also need non-trivial data structures;
GIL-free functions aren't useful if they're constantly interacting with Python objects whose access
requires the GIL. Cython provides
extension types and Numba
provides a @jitclass
decorator to
address this need.
Second, building and distributing applications that make use of Cython/Numba can be complicated. Cython packages require running the compiler, (potentially) linking/packaging external dependencies, and distributing a binary wheel. Numba is generally simpler because the code being distributed is pure Python, but can be tricky since errors aren't detected until runtime.
Finally, while unlocking the GIL is often a solution in search of a problem, both Cython and Numba provide tools to directly manage the GIL when appropriate. This enables true parallelism (not just concurrency) that is impossible in vanilla Python.
So let's say you're in need of a binary serialization format. Data will be going over the network, not just in memory, so having a schema document and code generation is a must. Performance is crucial, so formats that support zero-copy de/serialization are given priority. And the more languages supported, the better; I use Rust, but can't predict what other languages this could interact with.
Given these requirements, the candidates I could find were:
Any one of these will satisfy the project requirements: easy to transmit over a network, reasonably fast, and polyglot support. But how do you actually pick one? It's impossible to know what issues will follow that choice, so I tend to avoid commitment until the last possible moment.
Still, a choice must be made. Instead of worrying about which is "the best," I decided to build a small proof-of-concept system in each format and pit them against each other. All code can be found in the repository for this post.
We'll discuss the results in more detail below, but as a quick preview: SBE was the fastest format I tested, and the one I ultimately chose.
Our benchmark system will be a simple data processor; given depth-of-book market data from IEX, serialize each message into the schema format, read it back, and calculate total size of stock traded and the lowest/highest quoted prices. This test isn't complex, but is representative of the project I need a binary format for.
But before we make it to that point, we have to actually read in the market data. To do so, I'm
using a library called nom
. Version 5.0 was recently released and
brought some big changes, so this was an opportunity to build a non-trivial program and get
familiar.
If you don't already know about nom
, it's a "parser combinator" library. By combining different smaller
parsers, you can assemble a parser to handle complex structures without writing tedious code by
hand. For example, when parsing
PCAP files:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------------------------------------------------------+
0 | Block Type = 0x00000006 |
+---------------------------------------------------------------+
4 | Block Total Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
8 | Interface ID |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
12 | Timestamp (High) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16 | Timestamp (Low) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
20 | Captured Len |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
24 | Packet Len |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Packet Data |
| ... |
...you can build a parser in nom
that looks like
this:
const ENHANCED_PACKET: [u8; 4] = [0x06, 0x00, 0x00, 0x00];
pub fn enhanced_packet_block(input: &[u8]) -> IResult<&[u8], &[u8]> {
let (
remaining,
(
block_type,
block_len,
interface_id,
timestamp_high,
timestamp_low,
captured_len,
packet_len,
),
) = tuple((
tag(ENHANCED_PACKET),
le_u32,
le_u32,
le_u32,
le_u32,
le_u32,
le_u32,
))(input)?;
let (remaining, packet_data) = take(captured_len)(remaining)?;
Ok((remaining, packet_data))
}
While this example isn't too interesting, more complex formats (like IEX market data) are where
nom
really shines.
Ultimately, because the nom
code in this shootout was the same for all formats, we're not too
interested in its performance. Still, it's worth mentioning that building the market data parser was
actually fun; I didn't have to write tons of boring code by hand.
Now it's time to get into the meaty part of the story. Cap'n Proto was the first format I tried because of how long it has supported Rust (thanks to dwrensha for maintaining the Rust port since 2014!). However, I had a ton of performance concerns once I started using it.
To serialize new messages, Cap'n Proto uses a "builder" object. This builder allocates memory on the
heap to hold the message content, but because builders
can't be re-used, we have to allocate a
new buffer for every single message. I was able to work around this with a
special builder
that could re-use the buffer, but it required reading through Cap'n Proto's
benchmarks
to find an example, and used
std::mem::transmute
to bypass Rust's borrow
checker.
The process of reading messages was better, but still had issues. Cap'n Proto has two message encodings: a "packed" representation, and an "unpacked" version. When reading "packed" messages, we need a buffer to unpack the message into before we can use it; Cap'n Proto allocates a new buffer for each message we unpack, and I wasn't able to figure out a way around that. In contrast, the unpacked message format should be where Cap'n Proto shines; its main selling point is that there's no decoding step. However, accomplishing zero-copy deserialization required code in the private API (since fixed), and we allocate a vector on every read for the segment table.
In the end, I put in significant work to make Cap'n Proto as fast as possible, but there were too many issues for me to feel comfortable using it long-term.
This is the new kid on the block. After a first attempt didn't pan out, official support was recently launched. Flatbuffers intends to address the same problems as Cap'n Proto: high-performance, polyglot, binary messaging. The difference is that Flatbuffers claims to have a simpler wire format and more flexibility.
On the whole, I enjoyed using Flatbuffers; the tooling is nice, and unlike Cap'n Proto, parsing messages was actually zero-copy and zero-allocation. However, there were still some issues.
First, Flatbuffers (at least in Rust) can't handle nested vectors. This is a problem for formats like the following:
table Message {
symbol: string;
}
table MultiMessage {
messages:[Message];
}
We want to create a MultiMessage
which contains a vector of Message
, and each Message
itself
contains a vector (the string
type). I was able to work around this by
caching Message
elements
in a SmallVec
before building the final MultiMessage
, but it was a painful process that I
believe contributed to poor serialization performance.
Second, streaming support in Flatbuffers seems to be something of an
afterthought. Where Cap'n Proto in Rust handles
reading messages from a stream as part of the API, Flatbuffers just sticks a u32
at the front of
each message to indicate the size. Not specifically a problem, but calculating message size without
that tag is nigh on impossible.
Ultimately, I enjoyed using Flatbuffers, and had to do significantly less work to make it perform well.
Support for SBE was added by the author of one of my favorite Rust blog posts. I've talked previously about how important variance is in high-performance systems, so it was encouraging to read about a format that directly addressed my concerns. SBE has by far the simplest binary format, but it does make some tradeoffs.
Both Cap'n Proto and Flatbuffers use message offsets to handle variable-length data, unions, and various other features. In contrast, messages in SBE are essentially just structs; variable-length data is supported, but there's no union type.
As mentioned in the beginning, the Rust port of SBE works well, but is essentially unmaintained. However, if you don't need union types, and can accept that schemas are XML documents, it's still worth using. SBE's implementation had the best streaming support of all formats I tested, and doesn't trigger allocation during de/serialization.
After building a test harness for each format, it was time to actually take them for a spin. I used this script to run the benchmarks, and the raw results are here. All data reported below is the average of 10 runs on a single day of IEX data. Results were validated to make sure that each format parsed the data correctly.
This test measures, on a per-message basis, how long it takes to serialize the IEX message into the desired format and write to a pre-allocated buffer.
Schema | Median | 99th Pctl | 99.9th Pctl | Total |
---|---|---|---|---|
Cap'n Proto Packed | 413ns | 1751ns | 2943ns | 14.80s |
Cap'n Proto Unpacked | 273ns | 1828ns | 2836ns | 10.65s |
Flatbuffers | 355ns | 2185ns | 3497ns | 14.31s |
SBE | 91ns | 1535ns | 2423ns | 3.91s |
This test measures, on a per-message basis, how long it takes to read the previously-serialized message and perform some basic aggregation. The aggregation code is the same for each format, so any performance differences are due solely to the format implementation.
Schema | Median | 99th Pctl | 99.9th Pctl | Total |
---|---|---|---|---|
Cap'n Proto Packed | 539ns | 1216ns | 2599ns | 18.92s |
Cap'n Proto Unpacked | 366ns | 737ns | 1583ns | 12.32s |
Flatbuffers | 173ns | 421ns | 1007ns | 6.00s |
SBE | 116ns | 286ns | 659ns | 4.05s |
Building a benchmark turned out to be incredibly helpful in making a decision; because a "union" type isn't important to me, I can be confident that SBE best addresses my needs.
While SBE was the fastest in terms of both median and worst-case performance, its worst-case performance was proportionately far higher than any other format's. It seems that de/serialization time scales with message size, but I'll need to do some more research to understand what exactly is going on.
How I assumed HFT people learn their secret techniques
How else do you explain people working on systems that complete the round trip of market data in to orders out (a.k.a. tick-to-trade) consistently within 750-800 nanoseconds? In roughly the time it takes a computer to access main memory 8 times, trading systems are capable of reading the market data packets, deciding what orders to send, doing risk checks, creating new packets for exchange-specific protocols, and putting those packets on the wire.
Having now worked in the trading industry, I can confirm the developers aren't super-human; I've made some simple mistakes at the very least. Instead, what shows up in public discussions is that philosophy, not technique, separates high-performance systems from everything else. Performance-critical systems don't rely on "this one cool C++ optimization trick" to make code fast (though micro-optimizations have their place); there's a lot more to worry about than just the code written for the project.
The framework I'd propose is this: If you want to build high-performance systems, focus first on reducing performance variance (reducing the gap between the fastest and slowest runs of the same code), and only look at average latency once variance is at an acceptable level.
Don't get me wrong, I'm a much happier person when things are fast. Computer goes from booting in 20 seconds down to 10 because I installed a solid-state drive? Awesome. But if every fifth day it takes a full minute to boot because of corrupted sectors? Not so great. Average speed over the course of a week is the same in each situation, but you're painfully aware of that minute when it happens. When it comes to code, the principle is the same: speeding up a function by an average of 10 milliseconds doesn't mean much if there's a 100ms difference between your fastest and slowest runs. When performance matters, you need to respond quickly every time, not just in aggregate. High-performance systems should first optimize for time variance. Once you're consistent at the time scale you care about, then focus on improving average time.
This focus on variance shows up all the time in industry too (emphasis added in all quotes below):
In marketing materials for NASDAQ's matching engine, the most performance-sensitive component of the exchange, dependability is highlighted in addition to instantaneous metrics:
Able to consistently sustain an order rate of over 100,000 orders per second at sub-40 microsecond average latency
The Aeron message bus has this to say about performance:
Performance is the key focus. Aeron is designed to be the highest throughput with the lowest and most predictable latency possible of any messaging system
The company PolySync, which is working on autonomous vehicles, mentions why they picked their specific messaging format:
In general, high performance is almost always desirable for serialization. But in the world of autonomous vehicles, steady timing performance is even more important than peak throughput. This is because safe operation is sensitive to timing outliers. Nobody wants the system that decides when to slam on the brakes to occasionally take 100 times longer than usual to encode its commands.
Solarflare, which makes highly-specialized network hardware, points out variance (jitter) as a big concern for electronic trading:
The high stakes world of electronic trading, investment banks, market makers, hedge funds and exchanges demand the lowest possible latency and jitter while utilizing the highest bandwidth and return on their investment.
And to further clarify: we're not discussing total run-time, but variance of total run-time. There are situations where it's not reasonably possible to make things faster, and you'd much rather be consistent. For example, trading firms use wireless networks because the speed of light through air is faster than through fiber-optic cables. There's still at absolute minimum a ~33.76 millisecond delay required to send data between, say, Chicago and Tokyo. If a trading system in Chicago calls the function for "send order to Tokyo" and waits to see if a trade occurs, there's a physical limit to how long that will take. In this situation, the focus is on keeping variance of additional processing to a minimum, since speed of light is the limiting factor.
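As a rough sanity check on that number: assuming a great-circle distance of roughly 10,100 km between the two cities, 10,100 km ÷ 299,792 km/s ≈ 33.7 ms, so even light taking the shortest possible path needs about that long for a one-way trip.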
So how does one go about looking for and eliminating performance variance? To tell the truth, I don't think a systematic answer or flow-chart exists. There's no substitute for (A) building a deep understanding of the entire technology stack, and (B) actually measuring system performance (though (C) watching a lot of CppCon videos for inspiration never hurt). Even then, every project cares about performance to a different degree; you may need to build an entire replica production system to accurately benchmark at nanosecond precision, or you may be content to simply avoid garbage collection in your Java code.
Even though everyone has different needs, there are still common things to look for when trying to isolate and eliminate variance. In no particular order, these are my focus areas when thinking about high-performance systems:
Update 2019-09-21: Added notes on isolcpus
and systemd
affinity.
Garbage Collection: How often does garbage collection happen? When is it triggered? What are the impacts?
In Python, for example, generational collection runs when num_alloc - num_dealloc > gc_threshold at the time of an allocation, and the GIL is acquired for the duration of generational collection.
Allocation: Every language has a different way of interacting with "heap" memory, but the principle is the same: running the allocator to allocate/deallocate memory takes time that can often be put to better use. Understanding when your language interacts with the allocator is crucial, and not always obvious. For example: C++ and Rust don't allocate heap memory for iterators, but Java does (meaning potential GC pauses). Take time to understand heap behavior (I made a guide for Rust), and look into alternative allocators (jemalloc, tcmalloc) that might run faster than the operating system default.
Data Layout: How your data is arranged in memory matters; data-oriented design and cache locality can have huge impacts on performance. The C family of languages (C, value types in C#, C++) and Rust all have guarantees about the shape every object takes in memory that others (e.g. Java and Python) can't make. Cachegrind and kernel perf counters are both great for understanding how performance relates to memory layout.
Just-In-Time Compilation: Languages that are compiled on the fly (LuaJIT, C#, Java, PyPy) are great because they optimize your program for how it's actually being used, rather than how a compiler expects it to be used. However, there's a variance problem if the program stops executing while waiting for translation from VM bytecode to native code. As a remedy, many languages support ahead-of-time compilation in addition to the JIT versions (CoreRT in C# and GraalVM in Java). On the other hand, LLVM supports Profile Guided Optimization, which theoretically brings JIT benefits to non-JIT languages. Finally, be careful to avoid comparing apples and oranges during benchmarks; you don't want your code to suddenly speed up because the JIT compiler kicked in.
Programming Tricks: These won't make or break performance, but can be useful in specific circumstances. For example, C++ can use templates instead of branches in critical sections.
Code you wrote is almost certainly not the only code running on your hardware. There are many ways the operating system interacts with your program, from interrupts to system calls, that are important to watch for. These are written from a Linux perspective, but Windows does typically have equivalent functionality.
Scheduling: The kernel is normally free to schedule any process on any core, so it's important
to reserve CPU cores exclusively for the important programs. There are a few parts to this: first,
limit the CPU cores that non-critical processes are allowed to run on by excluding cores from
scheduling
(isolcpus
kernel command-line option), or by setting the init
process CPU affinity
(systemd
example). Second, set critical processes
to run on the isolated cores by setting the
processor affinity using
taskset. Finally, use
NO_HZ
or
chrt
to disable scheduling interrupts. Turning off
hyper-threading is also likely beneficial.
System calls: Reading from a UNIX socket? Writing to a file? In addition to not knowing how long the I/O operation takes, these all trigger expensive system calls (syscalls). To handle these, the CPU must context switch to the kernel, let the kernel operation complete, then context switch back to your program. We'd rather keep these to a minimum (see timestamp 18:20). Strace is your friend for understanding when and where syscalls happen.
Signal Handling: Far less likely to be an issue, but signals do trigger a context switch if your code has a handler registered. This will be highly dependent on the application, but you can block signals if it's an issue.
Interrupts: System interrupts are how devices connected to your computer notify the CPU that something has happened. The CPU will then choose a processor core to pause and context switch to the OS to handle the interrupt. Make sure that SMP affinity is set so that interrupts are handled on a CPU core not running the program you care about.
NUMA: While NUMA is good at making multi-cell systems transparent, there are variance implications; if the kernel moves a process across nodes, future memory accesses must wait for the controller on the original node. Use numactl to handle memory-/cpu-cell pinning so this doesn't happen.
CPU Pipelining/Speculation: Speculative execution in modern processors gave us vulnerabilities like Spectre, but it also gave us performance improvements like branch prediction. And if the CPU mis-predicts a branch in your code, there's variance associated with discarding and replaying the speculated work. While the compiler knows a lot about how your CPU pipelines instructions, code can be structured to help the branch predictor.
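As a hedged illustration of "structuring code to help the branch predictor" (a sketch, not a benchmark): the same filter runs over unsorted and sorted copies of the data. With sorted input the branch is taken in long, predictable runs, so mispredictions (and the variance they cause) drop. Measure with a real harness before drawing conclusions on your own workload.

```rust
use std::time::Instant;

// Sum the elements above a threshold; the `if` is the branch in question.
fn sum_over_threshold(data: &[u32], threshold: u32) -> u64 {
    let mut sum = 0u64;
    for &x in data {
        if x >= threshold {
            sum += u64::from(x);
        }
    }
    sum
}

fn main() {
    // Deterministic "random-looking" values in 0..256, no external crates.
    let unsorted: Vec<u32> = (0..1_000_000u32)
        .map(|i| i.wrapping_mul(2_654_435_761) % 256)
        .collect();
    let mut sorted = unsorted.clone();
    sorted.sort_unstable();

    for (name, data) in [("unsorted", &unsorted), ("sorted", &sorted)] {
        let start = Instant::now();
        let s = sum_over_threshold(data, 128);
        // Printing the sum keeps the loop from being optimized away.
        println!("{}: sum={}, elapsed={:?}", name, s, start.elapsed());
    }
}
```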
Paging: For most systems, virtual memory is incredible. Applications live in their own worlds, and the CPU/MMU figures out the details. However, there's a variance penalty associated with memory paging and caching; if you access more memory pages than the TLB can store, you'll have to wait for the page walk. Kernel perf tools are necessary to figure out if this is an issue, but using huge pages can reduce TLB burdens. Alternately, running applications in a hypervisor like Jailhouse allows one to skip virtual memory entirely, but this is probably more work than the benefits are worth.
Network Interfaces: When more than one computer is involved, variance can go up dramatically. Tuning kernel network parameters may be helpful, but modern systems more frequently opt to skip the kernel altogether with a technique called kernel bypass. This typically requires specialized hardware and drivers, but even industries like telecom are finding the benefits.
Routing: There's a reason financial firms are willing to pay millions of euros for rights to a small plot of land - having a straight-line connection from point A to point B means the path their data takes is the shortest possible. In contrast, there are currently 6 computers in between me and Google, but that may change at any moment if my ISP realizes a more efficient route is available. Whether it's using research-quality equipment for shortwave radio, or just making sure there's no data inadvertently going between data centers, routing matters.
Protocol: TCP as a network protocol is awesome: guaranteed and in-order delivery, flow control, and congestion control all built in. But these attributes make the most sense when networking infrastructure is lossy; for systems that expect nearly all packets to be delivered correctly, the setup handshaking and packet acknowledgment are just overhead. Using UDP (unicast or multicast) may make sense in these contexts as it avoids the chatter needed to track connection state, and gap-fill strategies can handle the rest.
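For a sense of how little ceremony UDP involves compared to TCP, here's a minimal sketch using only std::net. The socket sends a datagram to itself, so there's no handshake, no connection state, and no acknowledgment anywhere; a real system would layer its own sequencing and gap-fill on top.

```rust
use std::io;
use std::net::UdpSocket;

fn main() -> io::Result<()> {
    // Bind to an ephemeral local port; no connection setup happens.
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    let addr = socket.local_addr()?;

    // Fire-and-forget: the protocol itself never acknowledges this.
    socket.send_to(b"tick", addr)?;

    // Whatever arrives, arrives; ordering and delivery are our problem.
    let mut buf = [0u8; 64];
    let (len, from) = socket.recv_from(&mut buf)?;
    println!("received {} bytes from {}", len, from);
    Ok(())
}
```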
Switching: Many routers/switches handle packets using "store-and-forward" behavior: wait for the whole packet, validate checksums, and then send to the next device. In variance terms, the time needed to move data between two nodes is proportional to the size of that data; the switch must "store" all data before it can calculate checksums and "forward" to the next node. With "cut-through" designs, switches will begin forwarding data as soon as they know where the destination is, checksums be damned. This means there's a fixed cost (at the switch) for network traffic, no matter the size.
High-performance systems, regardless of industry, are not magical. They do require extreme precision and attention to detail, but they're designed, built, and operated by regular people, using a lot of tools that are publicly available. Interested in seeing how context switching affects performance of your benchmarks? taskset should be installed in all modern Linux distributions, and can be used to make sure the OS never migrates your process. Curious how often garbage collection triggers during a crucial operation? Your language of choice will typically expose details of its operations (Python, Java). Want to know how hard your program is stressing the TLB? Use perf record and look for dtlb_load_misses.miss_causes_a_walk.
Two final guiding questions, then: first, before attempting to apply some of the technology above to your own systems, can you first identify where/when you care about "high-performance"? As an example, if parts of a system rely on humans pushing buttons, CPU pinning won't have any measurable effect. Humans are already far too slow to react in time. Second, if you're using benchmarks, are they being designed in a way that's actually helpful? Tools like Criterion (also in Rust) and Google's Benchmark output not only average run time, but variance as well; your benchmarking environment is subject to the same concerns your production environment is.
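As a sketch of what "benchmarks that report variance" look like in practice, here's the rough shape of a Criterion benchmark in Rust. It assumes criterion is a dev-dependency and the file lives under benches/; the function being measured is a stand-in.

```rust
// benches/example.rs - a minimal Criterion benchmark sketch.
// Criterion reports mean, median, and outlier/variance information,
// not just a single timing number.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn sum_to(n: u64) -> u64 {
    (0..n).sum()
}

fn bench_sum(c: &mut Criterion) {
    c.bench_function("sum_to 10k", |b| {
        // `black_box` keeps the compiler from optimizing the work away.
        b.iter(|| sum_to(black_box(10_000)))
    });
}

criterion_group!(benches, bench_sum);
criterion_main!(benches);
```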
Finally, I believe high-performance systems are a matter of philosophy, not necessarily technique. Rigorous focus on variance is the first step, and there are plenty of ways to measure and mitigate it; once that's at an acceptable level, then optimize for speed.
Either way, I'm baking a little bit again, and figured it was worth taking a quick break to focus on some lighter material. I recently learned two critically important lessons: first, the temperature of the dough when you put the yeast in makes a huge difference.
Previously, when I wasn't paying attention to dough temperature:
Compared with what happens when I put the dough in the microwave for a defrost cycle because the water I used wasn't warm enough:
I mean, just look at the bubbles!
After shaping the dough, I've got two loaves ready:
Now, the recipe normally calls for a Dutch Oven to bake the bread because it keeps the dough from drying out in the oven. Because I don't own a Dutch Oven, I typically put a casserole dish on the bottom rack and fill it with water so there's still some moisture in the oven. This time, I forgot to add the water and learned my second lesson: never add room-temperature water to a glass dish that's currently at 500 degrees.
Needless to say, trying to pull out sharp glass from an incredibly hot oven is not what I expected to be doing during my garden leave.
In the end, the bread crust wasn't great, but the bread itself turned out pretty alright:
I've been writing a lot more during this break, so I'm looking forward to sharing that in the future. In the meantime, I'm planning on making a sandwich.
You don't need to know what an Iterator looks like in assembly; you just need to know whether it allocates an object on the heap or not. And while Rust will prioritize the fastest behavior it can, here are the rules for each memory type:

Global Allocation: const is a fixed value; the compiler is allowed to copy it wherever useful. static is a fixed reference; the compiler will guarantee it is unique.

Stack Allocation: Cell types (like RefCell) behave like smart pointers, but are stack-allocated. Inlining functions (#[inline]) will not affect allocation behavior for better or worse. Types marked Copy are guaranteed to have their contents stack-allocated.

Heap Allocation: Smart pointers (Box, Rc, Mutex, etc.) allocate their contents in heap memory. Collections (HashMap, Vec, String, etc.) allocate their contents in heap memory.

-- Raph Levien
]]>Throughout the series so far, we've put a handicap on the code. In the name of consistent and understandable results, we've asked the compiler to pretty please leave the training wheels on. Now is the time where we throw out all the rules and take off the kid gloves. As it turns out, both the Rust compiler and the LLVM optimizers are incredibly sophisticated, and we'll step back and let them do their job.
Similar to "What Has My Compiler Done For Me Lately?", we're focusing on interesting things the Rust language (and LLVM!) can do with memory management. We'll still be looking at assembly code to understand what's going on, but it's important to mention again: please use automated tools like alloc-counter to double-check memory behavior if it's something you care about. It's far too easy to mis-read assembly in large code sections, you should always verify behavior if you care about memory usage.
The guiding principal as we move forward is this: optimizing compilers won't produce worse programs than we started with. There won't be any situations where stack allocations get moved to heap allocations. There will, however, be an opera of optimization.
Update 2019-02-10: When debugging a related issue, it was discovered that the original code worked because LLVM optimized out the entire function, rather than just the allocation segments. The code has been updated with proper use of read_volatile, and a previous section on vector capacity has been removed.
Our first optimization comes when LLVM can reason that the lifetime of an object is sufficiently short that heap allocations aren't necessary. In these cases, LLVM will move the allocation to the stack instead! The way this interacts with #[inline] attributes is a bit opaque, but the important part is that LLVM can sometimes do better than the baseline Rust language:
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};
pub fn cmp(x: u32) {
// Turn on panicking if we allocate on the heap
DO_PANIC.store(true, Ordering::SeqCst);
// The compiler is able to see through the constant `Box`
// and directly compare `x` to 24 - assembly line 73
let y = Box::new(24);
let equals = x == *y;
// This call to drop is eliminated
drop(y);
// Need to mark the comparison result as volatile so that
// LLVM doesn't strip out all the code. If `y` is marked
// volatile instead, allocation will be forced.
unsafe { std::ptr::read_volatile(&equals) };
// Turn off panicking, as there are some deallocations
// when we exit main.
DO_PANIC.store(false, Ordering::SeqCst);
}
fn main() {
cmp(12)
}
#[global_allocator]
static A: PanicAllocator = PanicAllocator;
static DO_PANIC: AtomicBool = AtomicBool::new(false);
struct PanicAllocator;
unsafe impl GlobalAlloc for PanicAllocator {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
if DO_PANIC.load(Ordering::SeqCst) {
panic!("Unexpected allocation.");
}
System.alloc(layout)
}
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
if DO_PANIC.load(Ordering::SeqCst) {
panic!("Unexpected deallocation.");
}
System.dealloc(ptr, layout);
}
}
Finally, this isn't so much about LLVM figuring out different memory behavior, but LLVM stripping out code that doesn't do anything. Optimizations of this type have a lot of nuance to them; if you're not careful, they can make your benchmarks look impossibly good. In Rust, the black_box function (implemented in both libtest and criterion) will tell the compiler to disable this kind of optimization. But if you let LLVM remove unnecessary code, you can end up running programs that previously caused errors:
#[derive(Default)]
struct TwoFiftySix {
_a: [u64; 32]
}
#[derive(Default)]
struct EightK {
_a: [TwoFiftySix; 32]
}
#[derive(Default)]
struct TwoFiftySixK {
_a: [EightK; 32]
}
#[derive(Default)]
struct EightM {
_a: [TwoFiftySixK; 32]
}
pub fn main() {
// Normally this blows up because we can't reserve size on stack
// for the `EightM` struct. But because the compiler notices we
// never do anything with `_x`, it optimizes out the stack storage
// and the program completes successfully.
let _x = EightM::default();
}
The heap is used in two situations; when the compiler is unable to predict either the total size of memory needed, or how long the memory is needed for, it allocates space in the heap.
This happens pretty frequently; if you want to download the Google home page, you won't know how large it is until your program runs. And when you're finished with Google, you deallocate the memory so it can be used to store other webpages. If you're interested in a slightly longer explanation of the heap, check out The Stack and the Heap in Rust's documentation.
We won't go into detail on how the heap is managed; the ownership documentation does a phenomenal job explaining both the "why" and "how" of memory management. Instead, we're going to focus on understanding "when" heap allocations occur in Rust.
To start off, take a guess for how many allocations happen in the program below:
fn main() {}
It's obviously a trick question; while no heap allocations occur as a result of that code, the setup needed to call main does allocate on the heap. Here's a way to show it:
#![feature(integer_atomics)]
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};
static ALLOCATION_COUNT: AtomicU64 = AtomicU64::new(0);
struct CountingAllocator;
unsafe impl GlobalAlloc for CountingAllocator {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
ALLOCATION_COUNT.fetch_add(1, Ordering::SeqCst);
System.alloc(layout)
}
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
System.dealloc(ptr, layout);
}
}
#[global_allocator]
static A: CountingAllocator = CountingAllocator;
fn main() {
let x = ALLOCATION_COUNT.fetch_add(0, Ordering::SeqCst);
println!("There were {} allocations before calling main!", x);
}
As of the time of writing, there are five allocations that happen before main is ever called.
But when we want to understand more practically where heap allocation happens, we'll follow this guide: smart pointers and collections are the types responsible for heap allocation, and we'll look at each in turn.
Finally, there are two "addendum" issues that are important to address when discussing Rust and the heap: stack-based alternatives to some standard library types, and tracking heap behavior with custom allocators.
The first thing to note are the "smart pointer" types. When you have data that must outlive the scope in which it is declared, or your data is of unknown or dynamic size, you'll make use of these types.
The term smart pointer comes from C++, and while it's closely linked to a general design pattern of "Resource Acquisition Is Initialization", we'll use it here specifically to describe objects that are responsible for managing ownership of data allocated on the heap. The smart pointers available in the alloc crate should look mostly familiar: Box, Rc, Arc, and Cow.
The standard library also defines some smart pointers to manage heap objects (RwLock and Mutex among them), though more than can be covered here.
Finally, there is one "gotcha": cell types
(like RefCell
) look and behave
similarly, but don't involve heap allocation. The
core::cell
docs have more information.
When a smart pointer is created, the data it is given is placed in heap memory and the location of that data is recorded in the smart pointer. Once the smart pointer has determined it's safe to deallocate that memory (when a Box has gone out of scope or a reference count goes to zero), the heap space is reclaimed. We can prove these types use heap memory by looking at code:
use std::rc::Rc;
use std::sync::Arc;
use std::borrow::Cow;
pub fn my_box() {
// Drop at assembly line 1640
Box::new(0);
}
pub fn my_rc() {
// Drop at assembly line 1650
Rc::new(0);
}
pub fn my_arc() {
// Drop at assembly line 1660
Arc::new(0);
}
pub fn my_cow() {
// Drop at assembly line 1672
Cow::from("drop");
}
Collection types use heap memory because their contents have dynamic size; they will request more memory when needed, and can release memory when it's no longer necessary. This dynamic property forces Rust to heap allocate everything they contain. In a way, collections are smart pointers for many objects at a time. Common types that fall under this umbrella are Vec, HashMap, and String (not str).
While collections store the objects they own in heap memory, creating new collections will not allocate on the heap. This is a bit weird; if we call Vec::new(), the assembly shows a corresponding call to real_drop_in_place:
pub fn my_vec() {
// Drop in place at line 481
Vec::<u8>::new();
}
But because the vector has no elements to manage, no calls to the allocator will ever be dispatched:
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};
fn main() {
// Turn on panicking if we allocate on the heap
DO_PANIC.store(true, Ordering::SeqCst);
// Interesting bit happens here
let x: Vec<u8> = Vec::new();
drop(x);
// Turn panicking back off, some deallocations occur
// after main as well.
DO_PANIC.store(false, Ordering::SeqCst);
}
#[global_allocator]
static A: PanicAllocator = PanicAllocator;
static DO_PANIC: AtomicBool = AtomicBool::new(false);
struct PanicAllocator;
unsafe impl GlobalAlloc for PanicAllocator {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
if DO_PANIC.load(Ordering::SeqCst) {
panic!("Unexpected allocation.");
}
System.alloc(layout)
}
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
if DO_PANIC.load(Ordering::SeqCst) {
panic!("Unexpected deallocation.");
}
System.dealloc(ptr, layout);
}
}
Other standard library types follow the same behavior; make sure to check out HashMap::new() and String::new().
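To convince yourself, you can drop those constructors into the same panicking-allocator harness used above. This is a sketch, and the claim is simply that neither constructor should reserve heap space until elements are actually inserted, so the program should run to completion without tripping the panic:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};

fn main() {
    // Turn on panicking if we allocate on the heap
    DO_PANIC.store(true, Ordering::SeqCst);
    // Neither constructor reserves heap space until elements arrive.
    let map: HashMap<u32, u32> = HashMap::new();
    let text = String::new();
    drop(map);
    drop(text);
    // Turn panicking back off; some deallocations occur after main.
    DO_PANIC.store(false, Ordering::SeqCst);
}

#[global_allocator]
static A: PanicAllocator = PanicAllocator;
static DO_PANIC: AtomicBool = AtomicBool::new(false);
struct PanicAllocator;

unsafe impl GlobalAlloc for PanicAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if DO_PANIC.load(Ordering::SeqCst) {
            panic!("Unexpected allocation.");
        }
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        if DO_PANIC.load(Ordering::SeqCst) {
            panic!("Unexpected deallocation.");
        }
        System.dealloc(ptr, layout);
    }
}
```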
While it is a bit strange to speak of the stack after spending time with the heap, it's worth pointing out that some heap-allocated objects in Rust have stack-based counterparts provided by other crates. If you have need of the functionality, but want to avoid allocating, there are typically alternatives available.
When it comes to some standard library smart pointers (RwLock and Mutex), stack-based alternatives are provided in crates like parking_lot and spin. You can check out lock_api::RwLock, lock_api::Mutex, and spin::Once if you're in need of synchronization primitives.
thread_id may be necessary if you're implementing an allocator, because thread::current().id() uses a thread_local! structure that needs heap allocation.
When writing performance-sensitive code, there's no alternative to measuring your code. If you didn't write a benchmark, you don't care about its performance. You should never rely on your instincts when a microsecond is an eternity.
Similarly, there's great work going on in Rust with allocators that keep track of what they're doing (like alloc_counter). When it comes to tracking heap behavior, it's easy to make mistakes; please write tests and make sure you have tools to guard against future issues.
const and static are perfectly fine, but it's relatively rare that we know at compile-time about either values or references that will be the same for the duration of our program. Put another way, it's not often the case that either you or your compiler knows how much memory your entire program will ever need.
However, there are still some optimizations the compiler can do if it knows how much memory individual functions will need. Specifically, the compiler can make use of "stack" memory (as opposed to "heap" memory) which can be managed far faster in both the short- and long-term.
When requesting memory, the push instruction can typically complete in 1 or 2 cycles (<1ns on modern CPUs). Contrast that to heap memory, which requires an allocator (specialized software to track what memory is in use) to reserve space. When you're finished with stack memory, the pop instruction runs in 1-3 cycles, as opposed to an allocator needing to worry about memory fragmentation and other issues with the heap. All sorts of incredibly sophisticated techniques have been used to design allocators (jemalloc and tcmalloc among them).
But no matter how fast your allocator is, the principle remains: the fastest allocator is the one you never use. As such, we're not going to discuss how exactly the push and pop instructions work, but we'll focus instead on the conditions that enable the Rust compiler to use faster stack-based allocation for variables.
So, how do we know when Rust will or will not use stack allocation for objects we create?
Looking at other languages, it's often easy to delineate between stack and heap. Managed memory languages (Python, Java, C#) place everything on the heap. JIT compilers (PyPy, HotSpot) may optimize some heap allocations away, but you should never assume it will happen. C makes things clear with calls to special functions (like malloc(3)) needed to access heap memory. Old C++ has the new keyword, though modern C++/C++11 is more complicated with RAII.
For Rust, we can summarize as follows: stack allocation will be used for everything that doesn't involve "smart pointers" and collections. We'll skip over a precise definition of the term "smart pointer" for now, and instead discuss what we should watch for to understand when stack and heap memory regions are used:
Stack manipulation instructions (push, pop, and add/sub of the rsp register) indicate allocation of stack memory:
pub fn stack_alloc(x: u32) -> u32 {
// Space for `y` is allocated by subtracting from `rsp`,
// and then populated
let y = [1u8, 2, 3, 4];
// Space for `y` is deallocated by adding back to `rsp`
x
}
Tracking when exactly heap allocation calls occur is difficult. It's typically easier to watch for a call to core::ptr::real_drop_in_place, and infer that a heap allocation happened in the recent past:
pub fn heap_alloc(x: usize) -> usize {
// Space for elements in a vector has to be allocated
// on the heap, and is then de-allocated once the
// vector goes out of scope
let y: Vec<u8> = Vec::with_capacity(x);
x
}
-- Compiler Explorer (real_drop_in_place happens on line 1317)
Note: While the Drop trait is called for stack-allocated objects, the Rust standard library only defines Drop implementations for types that involve heap allocation.
If you don't want to inspect the assembly, use a custom allocator that's able to track and alert when heap allocations occur. Crates like alloc_counter are designed for exactly this purpose.
With all that in mind, let's talk about situations in which we're guaranteed to use stack memory:
Using the #[inline] attribute will not change the memory region used.
Copy types are guaranteed to be stack-allocated, and copying them will be done in stack memory.
Iterators in the standard library are stack-allocated even when iterating over heap-based collections.

The simplest case comes first. When creating vanilla struct objects, we use stack memory to hold their contents:
struct Point {
x: u64,
y: u64,
}
struct Line {
a: Point,
b: Point,
}
pub fn make_line() {
// `origin` is stored in the first 16 bytes of memory
// starting at location `rsp`
let origin = Point { x: 0, y: 0 };
// `point` makes up the next 16 bytes of memory
let point = Point { x: 1, y: 2 };
// When creating `ray`, we just move the content out of
// `origin` and `point` into the next 32 bytes of memory
let ray = Line { a: origin, b: point };
}
Note that while some extra-fancy instructions are used for memory manipulation in the assembly, the sub rsp, 64 instruction indicates we're still working with the stack.
Have you ever wondered how functions communicate with each other? Like, once the variables are given to you, everything's fine. But how do you "give" those variables to another function? How do you get the results back afterward? The answer: the compiler arranges memory and assembly instructions using a pre-determined calling convention. This convention governs the rules around where arguments needed by a function will be located (either in memory offsets relative to the stack pointer rsp, or in other registers), and where the results can be found once the function has finished. And when multiple languages agree on what the calling conventions are, you can do things like having Go call Rust code!
Put simply: it's the compiler's job to figure out how to call other functions, and you can assume that the compiler is good at its job.
We can see this in action using a simple example:
struct Point {
x: i64,
y: i64,
}
// We use integer division operations to keep
// the assembly clean, understanding the result
// isn't accurate.
fn distance(a: &Point, b: &Point) -> i64 {
// Immediately subtract from `rsp` the bytes needed
// to hold all the intermediate results - this is
// the stack allocation step
// The compiler used the `rdi` and `rsi` registers
// to pass our arguments, so read them in
let x1 = a.x;
let x2 = b.x;
let y1 = a.y;
let y2 = b.y;
// Do the actual math work
let x_pow = (x1 - x2) * (x1 - x2);
let y_pow = (y1 - y2) * (y1 - y2);
let squared = x_pow + y_pow;
squared / squared
// Our final result will be stored in the `rax` register
// so that our caller knows where to retrieve it.
// Finally, add back to `rsp` the stack memory that is
// now ready to be used by other functions.
}
pub fn total_distance() {
let start = Point { x: 1, y: 2 };
let middle = Point { x: 3, y: 4 };
let end = Point { x: 5, y: 6 };
let _dist_1 = distance(&start, &middle);
let _dist_2 = distance(&middle, &end);
}
As a consequence of function arguments never using heap memory, we can also infer that functions using the #[inline] attribute do not heap allocate either. But better than inferring, we can look at the assembly to prove it:
struct Point {
x: i64,
y: i64,
}
// Note that there is no `distance` function in the assembly output,
// and the total line count goes from 229 with inlining off
// to 306 with inline on. Even still, no heap allocations occur.
#[inline(always)]
fn distance(a: &Point, b: &Point) -> i64 {
let x1 = a.x;
let x2 = b.x;
let y1 = a.y;
let y2 = b.y;
let x_pow = (a.x - b.x) * (a.x - b.x);
let y_pow = (a.y - b.y) * (a.y - b.y);
let squared = x_pow + y_pow;
squared / squared
}
pub fn total_distance() {
let start = Point { x: 1, y: 2 };
let middle = Point { x: 3, y: 4 };
let end = Point { x: 5, y: 6 };
let _dist_1 = distance(&start, &middle);
let _dist_2 = distance(&middle, &end);
}
Finally, passing by value (arguments with type Copy) and passing by reference (either moving ownership or passing a pointer) may have slightly different layouts in assembly, but will still use either stack memory or CPU registers:
pub struct Point {
x: i64,
y: i64,
}
// Moving values
pub fn distance_moved(a: Point, b: Point) -> i64 {
let x1 = a.x;
let x2 = b.x;
let y1 = a.y;
let y2 = b.y;
let x_pow = (x1 - x2) * (x1 - x2);
let y_pow = (y1 - y2) * (y1 - y2);
let squared = x_pow + y_pow;
squared / squared
}
// Borrowing values has two extra `mov` instructions on lines 21 and 22
pub fn distance_borrowed(a: &Point, b: &Point) -> i64 {
let x1 = a.x;
let x2 = b.x;
let y1 = a.y;
let y2 = b.y;
let x_pow = (x1 - x2) * (x1 - x2);
let y_pow = (y1 - y2) * (y1 - y2);
let squared = x_pow + y_pow;
squared / squared
}
If you've ever worried that wrapping your types in Option or Result would finally make them large enough that Rust decides to use heap allocation instead, fear no longer: enum and union types don't use heap allocation:
enum MyEnum {
Small(u8),
Large(u64)
}
struct MyStruct {
x: MyEnum,
y: MyEnum,
}
pub fn enum_compare() {
let x = MyEnum::Small(0);
let y = MyEnum::Large(0);
let z = MyStruct { x, y };
let opt = Option::Some(z);
}
Because the size of an enum is the size of its largest element plus a flag, the compiler can predict how much memory is used no matter which variant of an enum is currently stored in a variable. Thus, enums and unions have no need of heap allocation. There's unfortunately not a great way to show this in assembly, so I'll instead point you to the core::mem::size_of documentation.
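That said, a quick sketch with std::mem::size_of can at least confirm the "largest variant plus a tag" intuition; the exact numbers printed depend on the target and compiler version, so treat them as illustrative:

```rust
use std::mem::size_of;

#[allow(dead_code)]
enum MyEnum {
    Small(u8),
    Large(u64),
}

#[allow(dead_code)]
struct MyStruct {
    x: MyEnum,
    y: MyEnum,
}

fn main() {
    // The enum needs room for its largest variant (a u64) plus a tag,
    // rounded up to the alignment of u64 on typical 64-bit targets.
    println!("MyEnum:           {} bytes", size_of::<MyEnum>());
    // Two enums side by side; the size is known statically.
    println!("MyStruct:         {} bytes", size_of::<MyStruct>());
    // Wrapping in Option may or may not add space, depending on whether
    // the compiler can reuse unused tag values (the "niche" optimization).
    println!("Option<MyStruct>: {} bytes", size_of::<Option<MyStruct>>());
}
```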
The array type is guaranteed to be stack allocated, which is why the array size must be declared. Interestingly enough, this can be used to cause safe Rust programs to crash:
// 256 bytes
#[derive(Default)]
struct TwoFiftySix {
_a: [u64; 32]
}
// 8 kilobytes
#[derive(Default)]
struct EightK {
_a: [TwoFiftySix; 32]
}
// 256 kilobytes
#[derive(Default)]
struct TwoFiftySixK {
_a: [EightK; 32]
}
// 8 megabytes - exceeds space typically provided for the stack,
// though the kernel can be instructed to allocate more.
// On Linux, you can check stack size using `ulimit -s`
#[derive(Default)]
struct EightM {
_a: [TwoFiftySixK; 32]
}
fn main() {
// Because we already have things in stack memory
// (like the current function call stack), allocating another
// eight megabytes of stack memory crashes the program
let _x = EightM::default();
}
There aren't any security implications of this (no memory corruption occurs), but it's good to note that the Rust compiler won't move arrays into heap memory even if they can be reasonably expected to overflow the stack.
Rules for how anonymous functions capture their arguments are typically language-specific. In Java, Lambda Expressions are actually objects created on the heap that capture local primitives by copying, and capture local non-primitives as (final) references. Python and JavaScript both bind everything by reference normally, but Python can also capture values and JavaScript has Arrow functions.
In Rust, arguments to closures are the same as arguments to other functions; closures are simply functions that don't have a declared name. Some weird ordering of the stack may be required to handle them, but it's the compiler's responsibility to figure that out.
Each example below has the same effect, but a different assembly implementation. In the simplest case, we immediately run a closure returned by another function. Because we don't store a reference to the closure, the stack memory needed to store the captured values is contiguous:
fn my_func() -> impl FnOnce() {
let x = 24;
// Note that this closure in assembly looks exactly like
// any other function; you even use the `call` instruction
// to start running it.
move || { x; }
}
pub fn immediate() {
my_func()();
my_func()();
}
-- Compiler Explorer, 25 total assembly instructions
If we store a reference to the closure, the Rust compiler keeps values it needs in the stack memory of the original function. Getting the details right is a bit harder, so the instruction count goes up even though this code is functionally equivalent to our original example:
pub fn simple_reference() {
let x = my_func();
let y = my_func();
y();
x();
}
-- Compiler Explorer, 55 total assembly instructions
Even things like variable order can make a difference in instruction count:
pub fn complex() {
let x = my_func();
let y = my_func();
x();
y();
}
-- Compiler Explorer, 70 total assembly instructions
In every circumstance though, the compiler ensured that no heap allocations were necessary.
Traits in Rust come in two broad forms: static dispatch (monomorphization, impl Trait) and dynamic dispatch (trait objects, dyn Trait). While dynamic dispatch is often associated with trait objects being stored in the heap, dynamic dispatch can be used with stack-allocated objects as well:
trait GetInt {
fn get_int(&self) -> u64;
}
// vtable stored at section L__unnamed_1
struct WhyNotU8 {
x: u8
}
impl GetInt for WhyNotU8 {
fn get_int(&self) -> u64 {
self.x as u64
}
}
// vtable stored at section L__unnamed_2
struct ActualU64 {
x: u64
}
impl GetInt for ActualU64 {
fn get_int(&self) -> u64 {
self.x
}
}
// `&dyn` declares that we want to use dynamic dispatch
// rather than monomorphization, so there is only one
// `retrieve_int` function that shows up in the final assembly.
// If we used generics, there would be one implementation of
// `retrieve_int` for each type that implements `GetInt`.
pub fn retrieve_int(u: &dyn GetInt) {
// In the assembly, we just call an address given to us
// in the `rsi` register and hope that it was set up
// correctly when this function was invoked.
let x = u.get_int();
}
pub fn do_call() {
// Note that even though the vtable for `WhyNotU8` and
// `ActualU64` includes a pointer to
// `core::ptr::real_drop_in_place`, it is never invoked.
let a = WhyNotU8 { x: 0 };
let b = ActualU64 { x: 0 };
retrieve_int(&a);
retrieve_int(&b);
}
It's hard to imagine practical situations where dynamic dispatch would be used for objects that aren't heap allocated, but it technically can be done.
Understanding move semantics and copy semantics in Rust is weird at first. The Rust docs go into detail far better than can be addressed here, so I'll leave them to do the job. From a memory perspective though, their guideline is reasonable: if your type can implement Copy, it should. While there are potential speed tradeoffs to benchmark when discussing Copy (move semantics for stack objects vs. copying stack pointers vs. copying stack structs), it's impossible for Copy to introduce a heap allocation.
But why is this the case? Fundamentally, it's because the language controls what Copy means - "the behavior of Copy is not overloadable" - because it's a marker trait. From there we'll note that a type can implement Copy if (and only if) its components implement Copy, and that no heap-allocated types implement Copy. Thus, assignments involving heap types are always move semantics, and new heap allocations won't occur because of implicit operator behavior.
#[derive(Clone)]
struct Cloneable {
x: Box<u64>
}
// error[E0204]: the trait `Copy` may not be implemented for this type
#[derive(Copy, Clone)]
struct NotCopyable {
x: Box<u64>
}
In managed memory languages (like Java), there's a subtle difference between these two code samples:
public static long sum_for(List<Long> vals) {
long sum = 0;
// Regular for loop
for (int i = 0; i < vals.size(); i++) {
sum += vals.get(i);
}
return sum;
}
public static long sum_foreach(List<Long> vals) {
long sum = 0;
// "Foreach" loop - uses iteration
for (Long l : vals) {
sum += l;
}
return sum;
}
In the sum_for function, nothing terribly interesting happens. In sum_foreach, an object of type Iterator is allocated on the heap, and will eventually be garbage-collected. This isn't a great design; iterators are often transient objects that you need during a function and can discard once the function ends. Sounds exactly like the issue stack-allocated objects address, no?
In Rust, iterators are allocated on the stack. The objects to iterate over are almost certainly in heap memory, but the iterator itself (Iter) doesn't need to use the heap. In each of the examples below we iterate over a collection, but never use heap allocation:
use std::collections::HashMap;
// There's a lot of assembly generated, but if you search in the text,
// there are no references to `real_drop_in_place` anywhere.
pub fn sum_vec(x: &Vec<u32>) {
let mut s = 0;
// Basic iteration over vectors doesn't need allocation
for y in x {
s += y;
}
}
pub fn sum_enumerate(x: &Vec<u32>) {
let mut s = 0;
// More complex iterators are just fine too
for (_i, y) in x.iter().enumerate() {
s += y;
}
}
pub fn sum_hm(x: &HashMap<u32, u32>) {
let mut s = 0;
// And it's not just Vec, all types will allocate the iterator
// on stack memory
for y in x.values() {
s += y;
}
}
When a value is fixed for the life of a program (const), and when a reference is unique for the life of a program (static as a declaration, not 'static as a lifetime), we can make use of global memory. This special section of data is embedded directly in the program binary so that variables are ready to go once the program loads; no additional computation is necessary.
Understanding the value/reference distinction is important for reasons we'll go into below, and while the full specification for these two keywords is available, we'll take a hands-on approach to the topic.
const values

When a value is guaranteed to be unchanging in your program (where "value" may be scalars, structs, etc.), you can declare it const. This tells the compiler that it's safe to treat the value as never changing, and enables some interesting optimizations; not only is there no initialization cost to creating the value (it is loaded at the same time as the executable parts of your program), but the compiler can also copy the value around if it speeds up the code.

The points we need to address when talking about const are:
Const values are stored in read-only memory - it's impossible to modify.
Values resulting from calling a const fn are materialized at compile-time.
The compiler is allowed to copy const values wherever it chooses.

The first point is a bit strange - "read-only memory." The Rust book mentions in a couple places that using mut with constants is illegal, but it's also important to demonstrate just how immutable they are. Typically in Rust you can use interior mutability to modify things that aren't declared mut. RefCell provides an example of this pattern in action:
use std::cell::RefCell;
fn my_mutator(cell: &RefCell<u8>) {
// Even though we're given an immutable reference,
// the `replace` method allows us to modify the inner value.
cell.replace(14);
}
fn main() {
let cell = RefCell::new(25);
// Prints out 25
println!("Cell: {:?}", cell);
my_mutator(&cell);
// Prints out 14
println!("Cell: {:?}", cell);
}
When const is involved though, interior mutability is impossible:
use std::cell::RefCell;
const CELL: RefCell<u8> = RefCell::new(25);
fn my_mutator(cell: &RefCell<u8>) {
cell.replace(14);
}
fn main() {
// First line prints 25 as expected
println!("Cell: {:?}", &CELL);
my_mutator(&CELL);
// Second line *still* prints 25
println!("Cell: {:?}", &CELL);
}
And a second example using Once:
use std::sync::Once;
const SURPRISE: Once = Once::new();
fn main() {
// This is how `Once` is supposed to be used
SURPRISE.call_once(|| println!("Initializing..."));
// Because `Once` is a `const` value, we never record it
// having been initialized the first time, and this closure
// will also execute.
SURPRISE.call_once(|| println!("Initializing again???"));
}
When the const specification refers to "rvalues", this behavior is what they refer to. Clippy will treat this as an error, but it's still something to be aware of.
The next thing to mention is that const values are loaded into memory as part of your program binary. Because of this, any const values declared in your program will be "realized" at compile-time; accessing them may trigger a main-memory lookup (with a fixed address, so your CPU may be able to prefetch the value), but that's it.
use std::cell::RefCell;
const CELL: RefCell<u32> = RefCell::new(24);
pub fn multiply(value: u32) -> u32 {
// CELL is stored at `.L__unnamed_1`
value * (*CELL.get_mut())
}
The compiler creates one RefCell, uses it everywhere, and never needs to call the RefCell::new function.
If it's helpful though, the compiler can choose to copy const values.
const FACTOR: u32 = 1000;
pub fn multiply(value: u32) -> u32 {
// See assembly line 4 for the `mov edi, 1000` instruction
value * FACTOR
}
pub fn multiply_twice(value: u32) -> u32 {
// See assembly lines 22 and 29 for `mov edi, 1000` instructions
value * FACTOR * FACTOR
}
In this example, the FACTOR value is turned into the mov edi, 1000 instruction in both the multiply and multiply_twice functions; the "1000" value is never "stored" anywhere, as it's small enough to inline into the assembly instructions.
Finally, getting the address of a const value is possible, but not guaranteed to be unique (because the compiler can choose to copy values). I was unable to get non-unique pointers in my testing (even using different crates), but the specifications are clear enough: don't rely on pointers to const values being consistent. To be frank, caring about locations for const values is almost certainly a code smell.
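If you want to poke at this yourself, a quick sketch like the one below prints the addresses the compiler happened to pick for two separate uses of the same const. They may well compare equal (as in the testing described above), but the point is that nothing requires them to:

```rust
const FACTOR: u32 = 1000;

fn address_elsewhere() -> *const u32 {
    // Each use of a `const` materializes a value; the compiler may merge
    // these materializations or give each one its own storage.
    &FACTOR as *const u32
}

fn main() {
    let a = &FACTOR as *const u32;
    let b = address_elsewhere();
    // Whether these print the same address is not something to rely on.
    println!("a = {:p}, b = {:p}, equal = {}", a, b, a == b);
}
```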
static values

Static variables are related to const variables, but take a slightly different approach. When you declare that a reference is unique for the life of a program, you have a static variable (unrelated to the 'static lifetime). Because of the reference/value distinction with const/static, static variables behave much more like typical "global" variables.
But to understand static, here's what we'll look at:
static variables are globally unique locations in memory.
Like const, static variables are loaded at the same time as your program is read into memory.
static variables must implement the Sync marker trait.
Interior mutability is safe and acceptable when using static variables.

The single biggest difference between const and static is the guarantees provided about uniqueness. Where const variables may or may not be copied in code, static variables are guaranteed to be unique. If we take a previous const example and change it to static, the difference should be clear:
static FACTOR: u32 = 1000;
pub fn multiply(value: u32) -> u32 {
// The assembly to `mul dword ptr [rip + example::FACTOR]` is how FACTOR gets used
value * FACTOR
}
pub fn multiply_twice(value: u32) -> u32 {
// The assembly to `mul dword ptr [rip + example::FACTOR]` is how FACTOR gets used
value * FACTOR * FACTOR
}
Where previously there were plenty of references to multiplying by 1000, the new assembly refers to FACTOR as a named memory location instead. No initialization work needs to be done, but the compiler can no longer prove the value never changes during execution.
Next, let's talk about initialization. The simplest case is initializing static variables with either scalar or struct notation:
#[derive(Debug)]
struct MyStruct {
x: u32
}
static MY_STRUCT: MyStruct = MyStruct {
// You can even reference other statics
// declared later
x: MY_VAL
};
static MY_VAL: u32 = 24;
fn main() {
println!("Static MyStruct: {:?}", MY_STRUCT);
}
Things can get a bit weirder when using const fn though. In most cases, it just works:
#[derive(Debug)]
struct MyStruct {
x: u32
}
impl MyStruct {
const fn new() -> MyStruct {
MyStruct { x: 24 }
}
}
static MY_STRUCT: MyStruct = MyStruct::new();
fn main() {
println!("const fn Static MyStruct: {:?}", MY_STRUCT);
}
However, there's a caveat: you're currently not allowed to use const fn to initialize static variables of types that aren't marked Sync. For example, RefCell::new() is a const fn, but because RefCell isn't Sync, you'll get an error at compile time:
use std::cell::RefCell;
// error[E0277]: `std::cell::RefCell<u8>` cannot be shared between threads safely
static MY_LOCK: RefCell<u8> = RefCell::new(0);
It's likely that this will change in the future though.
The Sync marker

Which leads well to the next point: static variable types must implement the Sync marker. Because they're globally unique, it must be safe for you to access static variables from any thread at any time. Most struct definitions automatically implement the Sync trait because they contain only elements which themselves implement Sync (read more in the Nomicon). This is why earlier examples could get away with initializing statics, even though we never included an impl Sync for MyStruct in the code. To demonstrate this property, Rust refuses to compile our earlier example if we add a non-Sync element to the struct definition:
use std::cell::RefCell;
struct MyStruct {
x: u32,
y: RefCell<u8>,
}
// error[E0277]: `std::cell::RefCell<u8>` cannot be shared between threads safely
static MY_STRUCT: MyStruct = MyStruct {
x: 8,
y: RefCell::new(8)
};
Finally, while static mut variables are allowed, mutating them is an unsafe operation. If we want to stay in safe Rust, we can use interior mutability to accomplish similar goals:
use std::sync::Once;
// This example adapted from https://doc.rust-lang.org/std/sync/struct.Once.html#method.call_once
static INIT: Once = Once::new();
fn main() {
// Note that while `INIT` is declared immutable, we're still allowed
// to mutate its interior
INIT.call_once(|| println!("Initializing..."));
// This code won't panic, as the interior of INIT was modified
// as part of the previous `call_once`
INIT.call_once(|| panic!("INIT was called twice!"));
}
Rust programmers use the Box type all the time, but there's a rich history of the Rust language itself wrapped up in how special it is.
In a similar vein, this series attempts to look at code and understand how memory is used; the complex choreography of operating system, compiler, and program that frees you to focus on functionality far-flung from frivolous book-keeping. The Rust compiler relieves a great deal of the cognitive burden associated with memory management, but we're going to step into its world for a while.
Let's learn a bit about memory in Rust.
Rust's three defining features of Performance, Reliability, and Productivity are all driven to a great degree by how the Rust compiler understands memory usage. Unlike managed memory languages (Java, Python), Rust doesn't really garbage collect; instead, it uses an ownership system to reason about how long objects will last in your program. In some cases, if the life of an object is fairly transient, Rust can make use of a very fast region called the "stack." When that's not possible, Rust uses dynamic (heap) memory and the ownership system to ensure you can't accidentally corrupt memory. It's not as fast, but it is important to have available.
That said, there are specific situations in Rust where you'd never need to worry about the stack/heap distinction! If you:
never use unsafe
never use #![feature(alloc)] or the alloc crate
...then it's not possible for you to use dynamic memory!
For some uses of Rust, typically embedded devices, these constraints are OK. They have very limited memory, and the program binary size itself may significantly affect what's available! There's no operating system able to manage this "virtual memory" thing, but that's not an issue because there's only one running application. The embedonomicon is ever in mind, and interacting with the "real world" through extra peripherals is accomplished by reading and writing to specific memory addresses.
Most Rust programs find these requirements overly burdensome though. C++ developers would struggle without access to std::vector (except those hardcore no-STL people), and Rust developers would struggle without std::vec. But with the constraints above, std::vec is actually a part of the alloc crate, and thus off-limits. Box, Rc, etc., are also unusable for the same reason.
Whether writing code for embedded devices or not, the important thing in both situations is how much you know before your application starts about what its memory usage will look like. In embedded devices, there's a small, fixed amount of memory to use. In a browser, you have no idea how large google.com's home page is until you start trying to download it. The compiler uses this knowledge (or lack thereof) to optimize how memory is used; put simply, your code runs faster when the compiler can guarantee exactly how much memory your program needs while it's running. This series is all about understanding how the compiler reasons about your program, with an emphasis on the implications for performance.
Now let's address some conditions and caveats before going much further:
We'll focus on "safe" Rust; unsafe lets you use platform-specific allocation API's (malloc) that we'll ignore.
We'll assume a "debug" build of code (what you get with cargo run and cargo test) and address (pun intended) release mode at the end (cargo run --release and cargo test --release).
We'll avoid upcoming innovations (like compile-time evaluation of static) that are available in nightly.
Being able to read assembly is helpful; a refresher on the push and pop instructions was helpful while writing this.

Finally, I'll do what I can to flag potential future changes, but the Rust docs have a notice worth repeating:
Rust does not currently have a rigorously and formally defined memory model.
-- the docs
I had a really great idea: build a custom allocator that allows you to track your own allocations. I gave it a shot, but learned very quickly: never write your own allocator.
-- me
I proceeded to ignore it, because we never really learn from our mistakes.
There's another part of the human condition that derives joy from seeing things explode.
And that's the part I'm going to focus on.
So why, after complaining about allocators, would I still want to write one? There are three reasons for that:
When I say "slow," it's important to define the terms. If you're writing web applications, you'll spend orders of magnitude more time waiting for the database than you will the allocator. However, there's still plenty of code where micro- or nano-seconds matter; think finance, real-time audio, self-driving cars, and networking. In these situations it's simply unacceptable for you to spend time doing things that are not your program, and waiting on the allocator is not cool.
As I continue to learn Rust, it's difficult for me to predict where exactly allocations will happen. So, I propose we play a quick trivia game: Does this code invoke the allocator?
fn my_function() {
let v: Vec<u8> = Vec::new();
}
No: Rust knows how big the Vec type is, and reserves a fixed amount of memory on the stack for the v vector. However, if we wanted to reserve extra space (using Vec::with_capacity), the allocator would get invoked.
fn my_function() {
let v: Box<Vec<u8>> = Box::new(Vec::new());
}
Yes: Because a Box allows us to work with things of unknown size, it has to allocate on the heap. While the Box is unnecessary in this snippet (release builds will optimize out the allocation), reserving heap space more generally is needed to pass a dynamically sized type to another function.
fn my_function(v: Vec<u8>) {
v.push(5);
}
Maybe: Depending on whether the Vector we were given has space available, we may or may not allocate. Especially when dealing with code that you did not author, it's difficult to verify that things behave as you expect them to.
So, how exactly does QADAPT solve these problems? Whenever an allocation or drop occurs in code marked allocation-safe, QADAPT triggers a thread panic. We don't want to let the program continue as if nothing strange happened, we want things to explode.
However, you don't want code to panic in production because of circumstances you didn't predict. Just like debug_assert!, QADAPT will strip out its own code when building in release mode to guarantee no panics and no performance impact.
Finally, there are three ways to have QADAPT check that your code will not invoke the allocator:
The easiest method, watch an entire function for allocator invocation:
use qadapt::no_alloc;
use qadapt::QADAPT;
#[global_allocator]
static Q: QADAPT = QADAPT;
#[no_alloc]
fn push_vec(v: &mut Vec<u8>) {
// This triggers a panic if v.len() == v.capacity()
v.push(5);
}
fn main() {
let mut v = Vec::with_capacity(1);
// This will *not* trigger a panic
push_vec(&mut v);
// This *will* trigger a panic
push_vec(&mut v);
}
For times when you need more precision:
use qadapt::assert_no_alloc;
use qadapt::QADAPT;
#[global_allocator]
static Q: QADAPT = QADAPT;
fn main() {
let mut v = Vec::with_capacity(1);
// No allocations here, we already have space reserved
assert_no_alloc!(v.push(5));
// Even though we remove an item, it doesn't trigger a drop
// because it's a scalar. If it were a `Box<_>` type,
// a drop would trigger.
assert_no_alloc!({
v.pop().unwrap();
});
}
Both the most precise and most tedious:
use qadapt::enter_protected;
use qadapt::exit_protected;
use qadapt::QADAPT;
#[global_allocator]
static Q: QADAPT = QADAPT;
fn main() {
// This triggers an allocation (on non-release builds)
let mut v = Vec::with_capacity(1);
enter_protected();
// This does not trigger an allocation because we've reserved size
v.push(0);
exit_protected();
// This triggers an allocation because we ran out of size,
// but doesn't panic because we're no longer protected.
v.push(1);
}
It's important to point out that QADAPT code is synchronous, so please be careful when mixing in asynchronous functions:
use futures::future::Future;
use futures::future::ok;
use qadapt::assert_no_alloc;
use qadapt::no_alloc;
use qadapt::QADAPT;
#[global_allocator]
static Q: QADAPT = QADAPT;
#[no_alloc]
fn async_capacity() -> impl Future<Item=Vec<u8>, Error=()> {
ok(12).and_then(|e| Ok(Vec::with_capacity(e)))
}
fn main() {
// This doesn't trigger a panic because the `and_then` closure
// wasn't run during the function call.
async_capacity();
// Still no panic
assert_no_alloc!(async_capacity());
// This will panic because the allocation happens during `unwrap`
// in the `assert_no_alloc!` macro
assert_no_alloc!(async_capacity().poll().unwrap());
}
While there's a lot more to writing high-performance code than managing your usage of the allocator, it's critical that you do use the allocator correctly. QADAPT will verify that your code is doing what you expect. It's usable even on stable Rust from version 1.31 onward, which isn't the case for most allocators. Version 1.0 was released today, and you can check it out over at crates.io or on github.
I'm hoping to write more about high-performance Rust in the future, and I expect that QADAPT will help guide that. If there are topics you're interested in, let me know in the comments below!
Let me also make note of one more question/euphemism I've come across:
Translation: We're a fairly small team, and when things break on an evening/weekend/Christmas Day, can we call on you to be there?
I've met decidedly few people in my life who truly enjoy the "ops" side of "devops". They're incredibly good at taking an impossible problem, pre-existing knowledge of arcane arts, and turning that into a functioning system at the end. And if they all left for lunch, we probably wouldn't make it out the door before the zombie apocalypse.
Larger organizations (in my experience, 500+ person organizations) have the luxury of hiring people who either enjoy that, or play along nicely enough that our systems keep working.
Small teams have no such luck. If you're interviewing at a small company, especially as a "data scientist" or other somesuch position, be aware that systems can and do spontaneously combust at the most inopportune moments.
Terrible-but-popular answers include: "It's a part of the job, and I'm happy to contribute."
Programmers have it too easy these days. They should learn to develop in low memory environments and be more efficient.
...though it's not like the first code I wrote was for a graphing calculator packing a whole 24KB of RAM.
But the principle remains: be efficient with the resources you have, because what Intel giveth, Microsoft taketh away.
My professional work is focused on this kind of efficiency; low-latency financial markets demand that you understand at a deep level exactly what your code is doing. As I continue experimenting with Rust for personal projects, it's exciting to bring a utilitarian mindset with me: there's flexibility for the times I pretend to have a garbage collector, and flexibility for the times that I really care about how memory is used.
This post is a (small) case study in how I went from the former to the latter. And ultimately, it's intended to be a starting toolkit to empower analysis of your own code.
When I first started building the dtparse crate, my intention was to mirror as closely as possible the equivalent Python library. Python, as you may know, is garbage collected. Very rarely is memory usage considered in Python, and I likewise wasn't paying too much attention when dtparse was first being built.
This lackadaisical approach to memory works well enough, and I'm not planning on making dtparse hyper-efficient. But every so often, I've wondered: "what exactly is going on in memory?" With the advent of Rust 1.28 and the Global Allocator trait, I had a really great idea: build a custom allocator that allows you to track your own allocations. That way, you can do things like writing tests for both correct results and correct memory usage. I gave it a shot, but learned very quickly: never write your own allocator. It went from "fun weekend project" to "I have literally no idea what my computer is doing" at breakneck speed.
Instead, I'll highlight a separate path I took to make sense of my memory usage: heaptrack.
This is the hardest part of the post. Because Rust uses its own allocator by default, heaptrack is unable to properly record unmodified Rust code. To remedy this, we'll make use of the #[global_allocator] attribute.
Specifically, in lib.rs or main.rs, add this:
use std::alloc::System;
#[global_allocator]
static GLOBAL: System = System;
...and that's it. Everything else comes essentially for free.
Assuming you've installed heaptrack (Homebrew in Mac, package manager in Linux, ??? in Windows), all that's left is to fire up your application:
heaptrack my_application
It's that easy. After the program finishes, you'll see a file in your local directory with a name like heaptrack.my_application.XXXX.gz. If you load that up in heaptrack_gui, you'll see something like this:
And even these pretty colors:
To make sense of our memory usage, we're going to focus on that last picture - it's called a "flamegraph". These charts are typically used to show how much time your program spends executing each function, but they're used here to show how much memory was allocated during those functions instead.
For example, we can see that all executions happened during the main function:
...and within that, all allocations happened during dtparse::parse:
...and within that, allocations happened in two different places:
Now I apologize that it's hard to see, but there's one area specifically that stuck out as an issue: what the heck is the Default thing doing?
See, I knew that there were some allocations during calls to dtparse::parse, but I was totally wrong about where the bulk of allocations occurred in my program. Let me post the code and see if you can spot the mistake:
/// Main entry point for using `dtparse`.
pub fn parse(timestr: &str) -> ParseResult<(NaiveDateTime, Option<FixedOffset>)> {
let res = Parser::default().parse(
timestr, None, None, false, false,
None, false,
&HashMap::new(),
)?;
Ok((res.0, res.1))
}
Because Parser::parse
requires a mutable reference to itself, I have to create a new
Parser::default
every time it receives a string. This is excessive! We'd rather have an immutable
parser that can be re-used, and avoid allocating memory in the first place.
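To make that concrete, here's a minimal, hypothetical sketch of the idea; the real dtparse Parser is far more involved, and the names below are invented for illustration only.
use std::collections::HashMap;

// Hypothetical stand-in for the real Parser, just to illustrate the pattern.
struct Parser {
    // Imagine dateutil-style lookup tables living here.
    month_names: HashMap<String, u32>,
}

impl Default for Parser {
    fn default() -> Self {
        // Building these tables is where the per-call allocations come from.
        let mut month_names = HashMap::new();
        month_names.insert("jan".to_string(), 1);
        month_names.insert("feb".to_string(), 2);
        Parser { month_names }
    }
}

impl Parser {
    // Taking `&self` instead of `&mut self` lets one Parser serve every call.
    fn parse(&self, timestr: &str) -> Option<u32> {
        self.month_names.get(&timestr.to_lowercase()).copied()
    }
}

fn main() {
    // Build the parser once...
    let parser = Parser::default();
    // ...and reuse it, instead of calling Parser::default() per string.
    for t in ["Jan", "Feb", "Mar"] {
        println!("{} -> {:?}", t, parser.parse(t));
    }
}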
Armed with that information, I put some time in to make the parser immutable. Now that I can re-use the same parser over and over, the allocations disappear:
In total, we went from requiring 2 MB of memory in version 1.0.2:
All the way down to 300 KB in version 1.0.3:
In the end, you don't need to write a custom allocator to be efficient with memory, great tools already exist to help you understand what your program is doing.
Use them.
Given that Moore's Law is dead, we've all got to do our part to take back what Microsoft stole.
See, as much as Webassembly isn't trying to replace Javascript, I want Javascript gone. There are plenty of people who don't share my views, and they are probably nicer and more fun at parties. But I cringe every time "Webpack" is mentioned, and I think it's hilarious that the language specification dramatically outpaces anyone's actual implementation. The answer to this conundrum is of course to recompile code from newer versions of the language to older versions of the same language before running. At least Babel is a nice tongue-in-cheek reference.
Yet for as much hate as Electron receives, it does a stunningly good job at solving a really hard problem: how the hell do I put a button on the screen and react when the user clicks it? GUI programming is hard, straight up. But if browsers are already able to run everywhere, why don't we take advantage of someone else solving the hard problems for us? I don't like that I have to use Javascript for it, but I really don't feel inclined to whip out good ol' wxWidgets.
Now there are other native solutions (libui-rs, conrod, oh hey wxWidgets again!), but
those also have their own issues with distribution, styling, etc. With Electron, I can
yarn create electron-app my-app
and just get going, knowing that packaging/upgrades/etc. are built
in.
My question is: given recent innovations with WASM, are we Electron yet?
No, not really.
Instead, what would it take to get to a point where we can skip Javascript in Electron apps?
Truth is, WASM/Webassembly is a pretty new technology and I'm a total beginner in this area. There may already be solutions to the issues I discuss, but I'm totally unaware of them, so I'm going to try and organize what I did manage to discover.
I should also mention that the content and things I'm talking about here are not intended to be prescriptive, but more "if someone else is interested, what do we already know doesn't work?" I expect everything in this post to be obsolete within two months. Even over the course of writing this, a separate blog post had to be modified because upstream changes broke a Rust tool the post tried to use. The post ultimately got updated, but all this happened within the span of a week. Things are moving quickly.
I'll also note that we're going to skip asm.js and emscripten. Truth be told, I couldn't get
either of these to output anything, and so I'm just going to say
here be dragons. Everything I'm discussing here
uses the wasm32-unknown-unknown
target.
The code that I did get running is available over here. Feel free to use it as a starting point, but I'm mostly including the link as a reference for the things that were attempted.
So, I did technically get a running application:
...which you can also try out if you want:
git clone https://github.com/speice-io/isomorphic-rust.git
cd isomorphic_rust/percy
yarn install && yarn start
...but I wouldn't really call it a "high quality" starting point to base future work on. It's mostly there to prove this is possible in the first place. And that's something to be proud of! There's a huge amount of engineering that went into showing a window with the text "It's alive!".
There's also a lot of usability issues that prevent me from recommending anyone try Electron and WASM apps at the moment, and I think that's the more important thing to discuss.
I quickly established that wasm-bindgen was necessary to "link" my Rust code to Javascript. At
that point you've got an Electron app that starts an HTML page which ultimately fetches your WASM
blob. To keep things simple, the goal was to package everything using webpack so that I could just
load a bundle.js
file on the page. That decision was to be the last thing that kinda worked in
this process.
The first issue
I ran into
while attempting to bundle everything via webpack
is a detail in the WASM spec:
This function accepts a Response object, or a promise for one, and ... [if it] does not match the application/wasm MIME type, the returned promise will be rejected with a TypeError;
Specifically, if you try and load a WASM blob without the MIME type set, you'll get an error. On the
web this isn't a huge issue, as the server can set MIME types when delivering the blob. With
Electron, you're resolving things with a file://
URL and thus can't control the MIME type:
There are a couple of solutions depending on how far into the deep end you care to venture:
But all these are pretty bad solutions and defeat the purpose of using WASM in the first place.
Instead, my workaround was to
open a PR with webpack
and use regex to remove
calls to instantiateStreaming
in the
build script:
cargo +nightly build --target=wasm32-unknown-unknown && \
wasm-bindgen "$WASM_DIR/debug/$WASM_NAME.wasm" --out-dir "$APP_DIR" --no-typescript && \
# Have to use --mode=development so we can patch out the call to instantiateStreaming
"$DIR/node_modules/webpack-cli/bin/cli.js" --mode=development "$APP_DIR/app_loader.js" -o "$APP_DIR/bundle.js" && \
sed -i 's/.*instantiateStreaming.*//g' "$APP_DIR/bundle.js"
Once that lands, the build process becomes much simpler:
cargo +nightly build --target=wasm32-unknown-unknown && \
wasm-bindgen "$WASM_DIR/debug/$WASM_NAME.wasm" --out-dir "$APP_DIR" --no-typescript && \
"$DIR/node_modules/webpack-cli/bin/cli.js" --mode=production "$APP_DIR/app_loader.js" -o "$APP_DIR/bundle.js"
But we're not done yet! After we compile Rust into WASM and link WASM to Javascript (via wasm-bindgen and webpack), we still have to make an Electron app. For this purpose I used a starter app from Electron Forge, and then a prestart script to actually handle starting the application.
The final toolchain looks something like this:
- yarn start triggers the prestart script
- prestart checks for missing tools (wasm-bindgen-cli, etc.) and then runs:
  - cargo to compile the Rust code into WASM
  - wasm-bindgen to link the WASM blob into a Javascript file with exported symbols
  - webpack to bundle the page start script with the Javascript we just generated (using babel under the hood to compile the wasm-bindgen code down from ES6 into something browser-compatible)
- the start script runs an Electron Forge handler to do some sanity checks
...which is complicated. I think more work needs to be done to either build a high-quality starter app that can manage these steps, or another tool that "just handles" the complexity of linking a compiled WASM file into something the Electron browser can run.
For as much as I didn't enjoy the Javascript tooling needed to interface with Rust, the Rust-only bits aren't any better at the moment. I get it, a lot of projects are just starting off, and that leads to a fragmented ecosystem. Here's what I can recommend as a starting point:
Don't check in your Cargo.lock files to version control. If there's a disagreement between the version of wasm-bindgen-cli you have installed and the wasm-bindgen you're compiling with in Cargo.lock, you get a nasty error:
it looks like the Rust project used to create this wasm file was linked against
a different version of wasm-bindgen than this binary:
rust wasm file: 0.2.21
this binary: 0.2.17
Currently the bindgen format is unstable enough that these two version must
exactly match, so it's required that these two version are kept in sync by
either updating the wasm-bindgen dependency or this binary.
Not that I ever managed to run into this myself (coughs nervously).
There are two projects attempting to be "application frameworks": percy and yew. Between those,
I managed to get two
examples running
using percy
, but was unable to get an
example running with yew
because
of issues with "missing modules" during the webpack
step:
ERROR in ./dist/electron_yew_wasm_bg.wasm
Module not found: Error: Can't resolve 'env' in '/home/bspeice/Development/isomorphic_rust/yew/dist'
@ ./dist/electron_yew_wasm_bg.wasm
@ ./dist/electron_yew_wasm.js
@ ./dist/app.js
@ ./dist/app_loader.js
If you want to work with the browser APIs directly, your choices are percy-webapis or stdweb (or eventually web-sys). See above for my percy examples, but when I tried an example with stdweb, I was unable to get it running:
ERROR in ./dist/stdweb_electron_bg.wasm
Module not found: Error: Can't resolve 'env' in '/home/bspeice/Development/isomorphic_rust/stdweb/dist'
@ ./dist/stdweb_electron_bg.wasm
@ ./dist/stdweb_electron.js
@ ./dist/app_loader.js
At this point I'm pretty convinced that stdweb
is causing issues for yew
as well, but can't
prove it.
I did also get a minimal example running that doesn't depend on any tools besides wasm-bindgen. However, it requires manually writing "extern C" blocks for everything you need from the browser. Es no bueno.
Finally, from a tools and platform view, there are two up-and-coming packages that should be mentioned: js-sys and web-sys. Their purpose is to be fundamental building blocks that expose the browser's APIs to Rust. If you're interested in building an app framework from scratch, these should give you the most flexibility. I didn't touch either in my research, though I expect them to be essential long-term.
So there's a lot in play from the Rust side of things, and it's just going to take some time to figure out what works and what doesn't.
Alright, so after I managed to get an application started, I stopped there. It was a good deal of effort to chain together even a proof of concept, and at this point I'd rather learn Typescript than keep trying to maintain an incredibly brittle pipeline. Blasphemy, I know...
The important point I want to make is that there's a lot unknown about how any of this holds up outside proofs of concept. Things I didn't attempt:
Much as I don't like Javascript, the tools are too shaky for me to recommend mixing Electron and WASM at the moment. There's a lot of innovation happening, so who knows? Someone might have an application in production a couple months from now. But at the moment, I'm personally going to stay away.
Let's finish with a wishlist then - here are the things that I think need to happen before Electron/WASM/Rust can become a thing:
- web-sys and stdweb need to make sure they can support running in Electron (see module error above)
- stdweb being turned into a Rust API on top of web-sys, and percy moving to web-sys (both of which are big changes)
- wasm-bindgen is great, but still in the "move fast and break things" phase
Some time ago, I wrote a tiny Rust program that I didn't expect to work:
fn main() {
    println!("{}", 8.to_string())
}
And to my complete befuddlement, it compiled, ran, and produced a completely sensible output.
The reason I was so surprised has to do with how Rust treats a special category of things I'm going to call primitives. In the current version of the Rust book, you'll see them referred to as scalars, and in older versions they'll be called primitives, but we're going to stick with the name primitive for the time being. Explaining why this program is so cool requires talking about a number of other programming languages, and keeping a consistent terminology makes things easier.
You've been warned: this is going to be a tedious post about a relatively minor issue that involves Java, Python, C, and x86 Assembly. And also me pretending like I know what I'm talking about with assembly.
The reason I'm using the name primitive comes from how much of my life is Java right now. For the most part I like Java, but I digress. In Java, there's a special name for some specific types of values:
boolean char byte
short int long
float double
They are referred to as primitives. And relative to the other bits of Java, they have two unique features. First, they don't have to worry about the billion-dollar mistake; primitives in Java can never be null. Second: they can't have instance methods.
Remember that Rust program from earlier? Java has no idea what to do with it:
class Main {
public static void main(String[] args) {
int x = 8;
System.out.println(x.toString()); // Triggers a compiler error
}
}
The error is:
Main.java:5: error: int cannot be dereferenced
System.out.println(x.toString());
^
1 error
Specifically, Java's Object
and things that inherit from it are pointers under the hood, and we have to dereference them before
the fields and methods they define can be used. In contrast, primitive types are just values -
there's nothing to be dereferenced. In memory, they're just a sequence of bits.
If we really want, we can turn the int
into an
Integer
and then dereference
it, but it's a bit wasteful:
class Main {
public static void main(String[] args) {
int x = 8;
Integer y = Integer.valueOf(x);
System.out.println(y.toString());
}
}
This creates the variable y
of type Integer
(which inherits Object
), and at run time we
dereference y
to locate the toString()
function and call it. Rust obviously handles things a bit
differently, but we have to dig into the low-level details to see it in action.
We first need to build a foundation for reading and understanding the assembly code the final answer
requires. Let's begin with showing how the C
language (and your computer) thinks about "primitive"
values in memory:
void my_function(int num) {}
int main() {
int x = 8;
my_function(x);
}
The compiler explorer gives us an easy way of showing off the assembly-level code that's generated (output lightly edited):
main:
push rbp
mov rbp, rsp
sub rsp, 16
; We assign the value `8` to `x` here
mov DWORD PTR [rbp-4], 8
; And copy the bits making up `x` to a location
; `my_function` can access (`edi`)
mov eax, DWORD PTR [rbp-4]
mov edi, eax
; Call `my_function` and give it control
call my_function
mov eax, 0
leave
ret
my_function:
push rbp
mov rbp, rsp
; Copy the bits out of the pre-determined location (`edi`)
; to somewhere we can use
mov DWORD PTR [rbp-4], edi
nop
pop rbp
ret
At a really low level of memory, we're copying bits around using the mov
instruction;
nothing crazy. But to show how similar Rust is, let's take a look at our program translated from C
to Rust:
fn my_function(x: i32) {}
fn main() {
let x = 8;
my_function(x)
}
And the assembly generated when we stick it in the compiler explorer (again, lightly edited):
example::main:
push rax
; Look familiar? We're copying bits to a location for `my_function`
; The compiler just optimizes out holding `x` in memory
mov edi, 8
; Call `my_function` and give it control
call example::my_function
pop rax
ret
example::my_function:
sub rsp, 4
; And copying those bits again, just like in C
mov dword ptr [rsp], edi
add rsp, 4
ret
The generated Rust assembly is functionally pretty close to the C assembly: when working with primitives, we're just dealing with bits in memory.
In Java we have to dereference a pointer to call its functions; in Rust, there's no pointer to
dereference. So what exactly is going on with this .to_string()
function call?
Now it's time to reveal my trap card (er, show the revelation) that tied all this together: Rust has implementations for its primitive types. That's right, impl blocks aren't only for structs and traits; primitives get them too. Don't believe me? Check out u32, f64, and char as examples.
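If you want to see those impl blocks in action without any traits involved, inherent methods on primitives work like methods anywhere else. A quick sketch:
fn main() {
    // These all come from `impl u32`, `impl f64`, and `impl char` blocks
    // in the standard library; no boxing or wrapper types involved.
    println!("{}", 8_u32.count_ones());    // 1
    println!("{}", 2.0_f64.sqrt());        // 1.4142135623730951
    println!("{}", 'a'.is_alphabetic());   // true
}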
But the really interesting bit is how Rust turns those impl
blocks into assembly. Let's break out
the compiler explorer once again:
pub fn main() {
8.to_string();
}
And the interesting bits in the assembly (heavily trimmed down):
example::main:
sub rsp, 24
mov rdi, rsp
lea rax, [rip + .Lbyte_str.u]
mov rsi, rax
; Cool stuff right here
call <T as alloc::string::ToString>::to_string@PLT
mov rdi, rsp
call core::ptr::drop_in_place
add rsp, 24
ret
Now, this assembly is a bit more complicated, but here's the big revelation: we're calling to_string() as a function that exists all on its own, and giving it the instance of 8. Instead of thinking of the value 8 as an instance of u32 and then peeking in to find the location of the function we want to call (like Java), we have a function that exists outside of the instance, and we just give that function the value 8.
This is an incredibly technical detail, but the interesting idea I had was this: if to_string()
is a static function, can I refer to the unbound function and give it an instance?
Better explained in code (and a compiler explorer link because I seriously love this thing):
struct MyVal {
x: u32
}
impl MyVal {
fn to_string(&self) -> String {
self.x.to_string()
}
}
pub fn main() {
let my_val = MyVal { x: 8 };
// THESE ARE THE SAME
my_val.to_string();
MyVal::to_string(&my_val);
}
Rust is totally fine "binding" the function call to the instance, and also as a static.
MIND == BLOWN.
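The same trick works on the primitive example that started this post; method syntax and the fully qualified trait function are interchangeable. A small sketch:
fn main() {
    let x = 8_u32;
    // Method syntax, bound to the instance...
    let a = x.to_string();
    // ...and the same function called as a standalone item, given the instance
    let b = <u32 as ToString>::to_string(&x);
    assert_eq!(a, b);
}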
Python does the same thing where I can both call functions bound to their instances and also call as an unbound function where I give it the instance:
class MyClass():
x = 24
def my_function(self):
print(self.x)
m = MyClass()
m.my_function()
MyClass.my_function(m)
And Python tries to make you think that primitives can have instance methods...
>>> dir(8)
['__abs__', '__add__', '__and__', '__class__', '__cmp__', '__coerce__',
'__delattr__', '__div__', '__divmod__', '__doc__', '__float__', '__floordiv__',
...
'__setattr__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__',
...]
>>> # Theoretically `8.__str__()` should exist, but:
>>> 8.__str__()
File "<stdin>", line 1
8.__str__()
^
SyntaxError: invalid syntax
>>> # It will run if we assign it first though:
>>> x = 8
>>> x.__str__()
'8'
...but in practice it's a bit complicated.
So while Python handles binding instance methods in a way similar to Rust, it's still not able to run the example we started with.
This was a super-roundabout way of demonstrating it, but the way Rust handles incredibly minor details like primitives leads to really cool effects. Primitives are optimized like C in how they have a space-efficient memory layout, yet the language still has a lot of features I enjoy in Python (like both instance and late binding).
And when you put it together, there are areas where Rust does cool things nobody else can; as a
quirky feature of Rust's type system, 8.to_string()
is actually valid code.
Now go forth and fool your friends into thinking you know assembly. This is all I've got.
But I built dtparse, and you can read about my thoughts on the process. Or don't. I won't tell you what to do with your life (but you should totally keep reading).
OK, fine, I guess I should start with why someone would do this.
Dateutil is a Python library for handling dates. The
standard library support for time in Python is kinda dope, but there are a lot of extras that go
into making it useful beyond just the datetime
module. dateutil.parser
specifically is code to take all the super-weird time formats people come
up with and turn them into something actually useful.
Date/time parsing, it turns out, is just like everything else involving computers and time: it feels like it shouldn't be that difficult to do, until you try to do it, and you realize that people suck and this is why we can't have nice things. But alas, we'll try and make contemporary art out of the rubble and give it a pretentious name like Time.
What makes dateutil.parser great is that there's a single function with a single argument that drives what programmers interact with: parse(timestr).
It takes in the time as a string, and gives you back a reasonable "look, this is the best anyone can
possibly do to make sense of your input" value. It doesn't expect much of you.
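The Rust port aims for the same shape of interface: hand dtparse::parse a string and take what you get back. A minimal sketch (the example date string is arbitrary):
use dtparse::parse;

fn main() {
    // One function, one string in, best-effort (NaiveDateTime, Option<FixedOffset>) out
    match parse("January 4, 2018 12:30pm") {
        Ok((datetime, offset)) => println!("{} (offset: {:?})", datetime, offset),
        Err(e) => println!("couldn't make sense of that: {:?}", e),
    }
}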
Having worked at a bulge-bracket bank watching Java programmers try to be Python programmers, I'm admittedly hesitant to publish Python code that's trying to be Rust. Interestingly, Rust code can actually do a great job of mimicking Python. It's certainly not idiomatic Rust, but I've had better experiences than this guy who attempted the same thing for D. These are the actual take-aways:
When transcribing code, stay as close to the original library as possible. I'm talking about using the same variable names, same access patterns, the whole shebang. It's way too easy to make a couple of typos, and all of a sudden your code blows up in new and exciting ways. Having a reference manual for verbatim what your code should be means that you don't spend that long debugging complicated logic, you're more looking for typos.
Also, don't use nice Rust things like enums. While
one time it worked out OK for me,
I also managed to shoot myself in the foot a couple times because dateutil
stores AM/PM as a
boolean and I mixed up which was true, and which was false (side note: AM is false, PM is true). In
general, writing nice code should not be a first-pass priority when you're just trying to recreate
the same functionality.
Exceptions are a pain. Make peace with it. Python code is just allowed to skip stack frames. So
when a co-worker told me "Rust is getting try-catch syntax" I properly freaked out. Turns out
he's not quite right, and I'm OK with that. And while
dateutil
is pretty well-behaved about not skipping multiple stack frames,
130-line try-catch blocks
take a while to verify.
As another Python quirk, be very careful about long nested if-elif-else blocks. I used to think that Python's whitespace was just there to get you to format your code correctly. I think that no longer. It's way too easy to close a block too early and have incredibly weird issues in the logic. Make sure you use an editor that displays indentation levels so you can keep things straight.
Rust macros are not free. I originally had the main test body wrapped up in a macro using pyo3. It took two minutes to compile. After moving things to a function compile times dropped down to ~5 seconds. Turns out 150 lines * 100 tests = a lot of redundant code to be compiled. My new rule of thumb is that any macros longer than 10-15 lines are actually functions that need to be liberated, man.
Finally, I really miss list comprehensions and dictionary comprehensions. As a quick comparison, see this dateutil code and the implementation in Rust. I probably wrote it wrong, and I'm sorry. Ultimately though, I hope that these comprehensions can be added through macros or syntax extensions. Either way, they're expressive, save typing, and are super-readable. Let's get more of that.
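To make the comparison concrete, here's roughly the flavor of the difference, using a hypothetical helper rather than dtparse's actual code:
// Python:  numbers = [int(t) for t in tokens if t.isdigit()]
// Rust: the same idea, spelled as an iterator chain
fn numeric_tokens(tokens: &[&str]) -> Vec<u32> {
    tokens
        .iter()
        .filter(|t| !t.is_empty() && t.chars().all(|c| c.is_ascii_digit()))
        .filter_map(|t| t.parse().ok())
        .collect()
}

fn main() {
    println!("{:?}", numeric_tokens(&["10", "Jan", "2018"])); // [10, 2018]
}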
Now, Rust is exciting and new, which means that there's opportunity to make a substantive impact. On more than one occasion though, I've had issues navigating the Rust ecosystem.
What I'll call the "canonical library" is still being built. In Python, if you need datetime parsing, you use dateutil. If you want decimal types, it's already in the standard library. While I might've gotten away with f64, dateutil uses decimals, and I wanted to follow the principle of staying as close to the original library as possible. Thus began my quest to find a decimal library in Rust. What I quickly found was summarized in a comment:
Writing a BigDecimal is easy. Writing a good BigDecimal is hard.
In practice, this means that there are at least 4 different implementations available. And that's a lot of decisions to worry about when all I'm thinking is "why can't calendar reform be a thing" and I'm forced to dig through a couple different threads to figure out if the library I'm looking at is dead or just stable.
And even when the "canonical library" exists, there's no guarantees that it will be well-maintained. Chrono is the de facto date/time library in Rust, and just released version 0.4.4 like two days ago. Meanwhile, chrono-tz appears to be dead in the water even though there are people happy to help maintain it. I know relatively little about it, but it appears that most of the release process is automated; keeping that up to date should be a no-brainer.
Specifically given "maintenance" being an oft-discussed issue, I'm going to try out the following policy to keep things moving on dtparse:
Issues/PRs needing maintainer feedback will be updated at least weekly. I want to make sure nobody's blocking on me.
To keep issues/PRs needing contributor feedback moving, I'm going to (kindly) ask the contributor to check in after two weeks, and close the issue without resolution if I hear nothing back after a month.
The second point I think has the potential to be a bit controversial, so I'm happy to receive feedback on that. And if a contributor responds with "hey, still working on it, had a kid and I'm running on 30 seconds of sleep a night," then first: congratulations on sustaining human life. And second: I don't mind keeping those requests going indefinitely. I just want to try and balance keeping things moving with giving people the necessary time they need.
I should also note that I'm still getting some best practices in place - CONTRIBUTING and CONTRIBUTORS files need to be added, as well as issue/PR templates. In progress. None of us are perfect.
So if I've now built a dateutil
-compatible parser, we're done, right? Of course not! That's not
nearly ambitious enough.
Ultimately, I'd love to have a library that's capable of parsing everything the Linux date
command
can do (and not date
on OSX, because seriously, BSD coreutils are the worst). I know Rust has a
coreutils rewrite going on, and dtparse
would potentially be an interesting candidate since it
doesn't bring in a lot of extra dependencies. humantime
could help pick up some of the (current) slack in dtparse, so maybe we can share and care with each
other?
All in all, I'm mostly hoping that nobody's already done this and I haven't spent a bit over a month on redundant code. So if it exists, tell me. I need to know, but be nice about it, because I'm going to take it hard.
And in the mean time, I'm looking forward to building more. Onwards.