speice.io/_posts/2018-09-01-primitives-in-rust-are-weird.md

9.7 KiB

layout title description category tags
post Primitives in Rust are Weird (and Cool) but mostly weird.
rust
c
java
python
x86

I wrote a really small Rust program a while back because I was curious. I was 100% convinced it couldn't possibly run:

fn main() {
    println!("{}", 8.to_string())
}

And to my complete befuddlement, it compiled, ran, and produced a completely sensible output. The reason I was so surprised has to do with how Rust treats a special category of things I'm going to call primitives. In the current version of the Rust book, you'll see them referred to as scalars, and in older versions they'll be called primitives, but we're going to stick with the name primitive for the time being. Explaining why this program is so cool requires talking about a number of other programming languages, and keeping a consistent terminology makes things easier.

You've been warned: this is going to be a tedious post about a relatively minor issue that involves Java, Python, C, and x86 Assembly. And also me pretending like I know what I'm talking about with assembly.

Defining primitives (Java)

The reason I'm using the name primitive comes from how much of my life is Java right now. Spoiler alert: a lot of it. And for the most part I like Java, but I digress. In Java, there's a special name for some specific types of values:

bool    char    byte
short   int     long
float   double

They are referred to as [primitives][java_primitive]. And relative to the other bits of Java,
they have two unique features. First, they don't have to worry about the
[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions);
primitives in Java can never be `null`. Second: *they can't have instance methods*.
Remember that Rust program from earlier? Java has no idea what to do with it:

```java
class Main {
    public static void main(String[] args) {
        int x = 8;
        System.out.println(x.toString()); // Triggers a compiler error
    }
}

The error is:

Main.java:5: error: int cannot be dereferenced
        System.out.println(x.toString());
                            ^
1 error

Specifically, Java's Object and things that inherit from it are pointers under the hood, and we have to dereference them before the fields and methods they define can be used. In contrast, primitive types are just values - there's nothing to be dereferenced. In memory, they're just a sequence of bits.

If we really want, we can turn the int into an Integer and then dereference it, but it's a bit wasteful:

class Main {
    public static void main(String[] args) {
        int x = 8;
        Integer y = Integer.valueOf(x);
        System.out.println(y.toString());
    }
}

This creates the variable y of type Integer (which inherits Object), and at run time we dereference y to locate the toString() function and call it. Rust obviously handles things a bit differently, but we have to dig into the low-level details to see it in action.

Low Level Handling of Primitives (C)

We first need to build a foundation for reading and understanding the assembly code the final answer requires. Let's begin with showing how the C language (and your computer) thinks about "primitive" values in memory:

void my_function(int num) {}

int main() {
    int x = 8;
    my_function(x);
}

The compiler explorer gives us an easy way of showing off the assembly-level code that's generated: whose output has been lightly edited

main:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16

        ; We assign the value `8` to `x` here
        mov     DWORD PTR [rbp-4], 8

        ; And copy the bits making up `x` to a location
        ; `my_function` can access (`edi`)
        mov     eax, DWORD PTR [rbp-4]
        mov     edi, eax

        ; Call `my_function` and give it control
        call    my_function

        mov     eax, 0
        leave
        ret

my_function:
        push    rbp
        mov     rbp, rsp

        ; Copy the bits out of the pre-determined location (`edi`)
        ; to somewhere we can use
        mov     DWORD PTR [rbp-4], edi
        nop

        pop     rbp
        ret

At a really low level of memory, we're copying bits around using the mov instruction; nothing crazy. But to show how similar Rust is, let's take a look at our program translated from C to Rust:

fn my_function(x: i32) {}

fn main() {
    let x = 8;
    my_function(x)
}

And the assembly generated when we stick it in the compiler explorer: again, lightly edited

example::main:
  push rax

  ; Look familiar? We're copying bits to a location for `my_function`
  ; The compiler just optimizes out holding `x` in memory
  mov edi, 8

  ; Call `my_function` and give it control
  call example::my_function

  pop rax
  ret

example::my_function:
  sub rsp, 4

  ; And copying those bits again, just like in C
  mov dword ptr [rsp], edi

  add rsp, 4
  ret

The generated Rust assembly is functionally pretty close to the C assembly: When working with primitives, we're just dealing with bits in memory.

In Java we have to dereference a pointer to call its functions; in Rust, there's no pointer to dereference. So what exactly is going on with this .to_string() function call?

impl primitive (and Python)

Now it's time to reveal my trap card show the revelation that tied all this together: Rust has implementations for its primitive types. That's right, impl blocks aren't only for structs and traits, primitives get them too. Don't believe me? Check out u32, f64 and char as examples.

But the really interesting bit is how Rust turns those impl blocks into assembly. Let's break out the compiler explorer once again:

pub fn main() {
    8.to_string()
}

And the interesting bits in the assembly: heavily trimmed down

example::main:
  sub rsp, 24
  mov rdi, rsp
  lea rax, [rip + .Lbyte_str.u]
  mov rsi, rax

  ; Cool stuff right here
  call <T as alloc::string::ToString>::to_string@PLT

  mov rdi, rsp
  call core::ptr::drop_in_place
  add rsp, 24
  ret

Now, this assembly is a bit more complicated, but here's the big revelation: we're calling to_string() as a function that exists all on its own, and giving it the instance of 8. Instead of thinking of the value 8 as an instance of u32 and then peeking in to find the location of the function we want to call (like Java), we have a function that exists outside of the instance and just give that function the value 8.

This is an incredibly technical detail, but the interesting idea I had was this: if to_string() is a static function, can I refer to the unbound function and give it an instance?

Better explained in code (and a compiler explorer link because I seriously love this thing):

struct MyVal {
    x: u32
}

impl MyVal {
    fn to_string(&self) -> String {
        self.x.to_string()
    }
}

pub fn main() {
    let my_val = MyVal { x: 8 };

    // THESE ARE THE SAME
    my_val.to_string();
    MyVal::to_string(&my_val);
}

Rust is totally fine "binding" the function call to the instance, and also as a static.

MIND == BLOWN.

Python does the same thing where I can both call functions bound to their instances and also call as an unbound function where I give it the instance:

class MyClass():
    x = 24

    def my_function(self):
        print(self.x)

m = MyClass()

m.my_function()
MyClass.my_function(m)

And Python tries to make you think that primitives can have instance methods...

>>> dir(8)
['__abs__', '__add__', '__and__', '__class__', '__cmp__', '__coerce__',
'__delattr__', '__div__', '__divmod__', '__doc__', '__float__', '__floordiv__',
...
'__setattr__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__',
...]

>>> # Theoretically `8.__str__()` should exist, but:

>>> 8.__str__()
  File "<stdin>", line 1
    8.__str__()
             ^
SyntaxError: invalid syntax

>>> # It will run if we assign it first though:
>>> x = 8
>>> x.__str__()
'8'

...but in practice it's a bit complicated.

So while Python handles binding instance methods in a way similar to Rust, it's still not able to run the example we started with.

Conclusion

This was a super-roundabout way of demonstrating it, but the way Rust handles incredibly minor details like primitives leads to really cool effects. Primitives are optimized like C in how they have a space-efficient memory layout, yet the language still has a lot of features I enjoy in Python (like both instance and late binding).

And when you put it together, there are areas where Rust does cool things nobody else can; as a quirky feature of Rust's type system, 8.to_string() is actually valid code.

Now go forth and fool your friends into thinking you know assembly. This is all I've got.