From a69ed9f690257784ce67feb6aa433b2eeaccd89d Mon Sep 17 00:00:00 2001
From: Bradlee Speice
Date: Tue, 19 Jun 2018 01:18:09 -0400
Subject: [PATCH] First draft of dtparse post

---
 _posts/2018-06-19-dateutil-parser-to-rust.md | 154 +++++++++++++++++++
 1 file changed, 154 insertions(+)
 create mode 100644 _posts/2018-06-19-dateutil-parser-to-rust.md

diff --git a/_posts/2018-06-19-dateutil-parser-to-rust.md b/_posts/2018-06-19-dateutil-parser-to-rust.md
new file mode 100644
index 0000000..c7b07fd
--- /dev/null
+++ b/_posts/2018-06-19-dateutil-parser-to-rust.md
@@ -0,0 +1,154 @@
+---
+layout: post
+title: "What I Learned: Porting Dateutil Parser to Rust"
+description: ""
+category:
+tags: [dtparse, rust]
+---
+
+Hi. I'm Bradlee.
+
+I've mostly been a lurker in Rust for a while, making a couple small contributions here and there.
+So launching [dtparse](https://github.com/bspeice/dtparse) feels like a nice step towards becoming a
+functioning member of society. But not too much, because then you know people start asking you to
+pay bills, and ain't nobody got time for that.
+
+But I built dtparse, and you can read about my thoughts on the process. Or don't. I won't tell you
+what to do with your life (but you should totally keep reading).
+
+# Slow down, what?
+
+OK, fine, I guess I should start with *why* someone would do this.
+
+[Dateutil](https://github.com/dateutil/dateutil) is a Python library for handling dates.
+While the standard library support for time in Python is kinda dope, there are a lot of pieces
+that go into making it useful beyond just the [datetime](https://docs.python.org/3.6/library/datetime.html)
+module.
+
+Specifically, `dateutil.parser` is code to take all the super-weird time formats people
+come up with and turn them into something actually useful. Just like [everything else](https://zachholman.com/talk/utc-is-enough-for-everyone-right)
+[involving](https://i.redd.it/syw7q6gc77f01.jpg) [computers](https://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time)
+and [time](https://infiniteundo.com/post/25509354022/more-falsehoods-programmers-believe-about-time),
+it feels like it shouldn't be that difficult to do, until you try to do it,
+and you realize that people just suck and this is why we can't have nice things.
+But alas, we can still try and make contemporary art out of the rubble.
+
+What makes `dateutil.parser` great is that there's a single super-important function: `parse(time_string)`.
+It takes in the time as a string, and gives you back a reasonable "look, this is the best
+anyone can possibly do to make sense of your input" value. It doesn't expect much of you.
+Which is great. And now it's in Rust.
+
+# Lost in Translation
+
+Having worked at Bank of America and seen Java programmers try to be Python programmers,
+I'm admittedly hesitant to publish Python code that's pretending to be Rust.
+Interestingly, Rust code can actually do a great job of mimicking Python.
+It's certainly not idiomatic Rust, but [the Iterator pattern is the same](https://webcache.googleusercontent.com/search?q=cache:wkYMpktJtnUJ:https://jackstouffer.com/blog/porting_dateutil.html+&cd=3&hl=en&ct=clnk&gl=us).
+
+When transcribing code, **stay as close to the original library as possible**. I'm talking
+about using the same variable names, same access patterns, the whole shebang.
+It's way too easy to make a couple of typos, and all of a sudden
+your code blows up in new and exciting ways. Having a verbatim reference for
+what your code should be means that you don't spend that long debugging complicated logic;
+you're mostly looking for typos.
+
+Also, **don't use nice Rust things like enums**. While
+[one time it worked out OK for me](https://github.com/bspeice/dtparse/blob/7d565d3a78876dbebd9711c9720364fe9eba7915/src/lib.rs#L88-L94),
+I also managed to shoot myself in the foot a couple times because `dateutil` stores AM/PM as a boolean
+and I got mixed up on the enum trying to figure out which was AM and which was PM (side note: AM is false, PM is true).
+In general, writing nice code *should not be a first-pass priority* when you're just trying to recreate
+the same functionality.
+
+**Exceptions are a pain.** Make peace with it. Python code is just allowed to skip stack frames.
+So when a co-worker told me "Rust is getting try-catch syntax" I properly freaked out.
+Turns out [he's not quite right](https://github.com/rust-lang/rfcs/pull/243), and I'm OK with that.
+And while `dateutil` is pretty well-behaved about not skipping multiple stack frames,
+[130-line try-catch blocks](https://github.com/dateutil/dateutil/blob/16561fc99361979e88cccbd135393b06b1af7e90/dateutil/parser/_parser.py#L730-L865)
+take a while to verify.
+
+As another Python quirk, **be very careful about [long nested if-elif-else blocks](https://github.com/dateutil/dateutil/blob/16561fc99361979e88cccbd135393b06b1af7e90/dateutil/parser/_parser.py#L494-L568)**.
+I used to think that [Python's whitespace](https://www.xkcd.com/353/) was just there
+to get you to format your code correctly. I think that no longer. It's way too easy
+to close an extra block and have incredibly weird issues in the logic.
+
+**Rust macros are not free.** I originally had the
+[main test body](https://github.com/bspeice/dtparse/blob/b0e737f088eca8e83ab4244c6621a2797d247697/tests/compat.rs#L63-L217)
+wrapped up in a macro using [pyo3](https://github.com/PyO3/PyO3). It took two minutes to compile. After
+[moving things to a function](https://github.com/bspeice/dtparse/blob/e017018295c670e4b6c6ee1cfff00dbb233db47d/tests/compat.rs#L76-L205)
+compile times dropped down to ~5 seconds. Turns out 150 lines * 100 tests = a lot of redundant code.
+My new rule of thumb is that any macros longer than 10-15 lines are actually functions that need to be liberated, man.
+
+Finally, **I really miss list comprehensions and dictionary comprehensions.**
+As a quick comparison, see
+[this dateutil code](https://github.com/dateutil/dateutil/blob/16561fc99361979e88cccbd135393b06b1af7e90/dateutil/parser/_parser.py#L476)
+and [the implementation in Rust](https://github.com/bspeice/dtparse/blob/7d565d3a78876dbebd9711c9720364fe9eba7915/src/lib.rs#L619-L629).
+Ultimately, I hope that these can be added through macros, but I have a feeling that they'd actually
+need to be syntax extensions. Either way, they're expressive, save typing, and super-readable. Let's get more of that.
+
+# Using a young language
+
+Now, Rust is exciting and new, which means that there's opportunity to make a substantive impact.
+On more than one occasion I've had issues navigating the Rust ecosystem though.
+
+What I'll call the "canonical library" is still being built. In Python, if you need datetime parsing,
+you use `dateutil`. If you want [Decimal](https://docs.python.org/3.6/library/decimal.html) types,
+they're already in the standard library. Decimal is probably
+[not strictly necessary in `dateutil`](https://github.com/dateutil/dateutil/blob/16561fc99361979e88cccbd135393b06b1af7e90/dateutil/parser/_parser.py#L1242),
+but I wanted to follow the principle of **stay as close to the original library as possible**
+and thus began my quest to find a decimal library in Rust. What I quickly found was summarized
+in a comment:
+
+> Writing a BigDecimal is easy. Writing a *good* BigDecimal is hard.
+>
+> [-cmr](https://github.com/rust-lang/rust/issues/8937#issuecomment-34582794)
+
+In practice, this means that there are at least [4](https://crates.io/crates/bigdecimal)
+[different](https://crates.io/crates/rust_decimal) [implementations](https://crates.io/crates/decimal)
+[available](https://crates.io/crates/decimate). And that's a lot of decisions to worry about
+when all I'm thinking about is "I just want a reasonable Decimal library" and I'm forced to dig through a
+[couple](https://github.com/rust-lang/rust/issues/8937#issuecomment-31661916)
+[different](https://github.com/rust-lang/rfcs/issues/334)
+[threads](https://github.com/rust-num/num/issues/8) to figure out if the library I'm looking at is DOA or stable.
+
+And even when the "canonical library" exists for something like timezones ([`pytz`](https://pythonhosted.org/pytz/) and
+more recently [`dateutil.tz`](https://dateutil.readthedocs.io/en/stable/tz.html) in Python), there are no guarantees
+that it will be well-maintained. [Chrono](https://github.com/chronotope/chrono) is currently the canonical datetime
+library in Rust, and just released version 0.4.3 like a week ago. Meanwhile, [chrono-tz](https://github.com/chronotope/chrono-tz)
+appears to be dead in the water even though [there are people happy to help maintain it](https://github.com/chronotope/chrono-tz/issues/19).
+I know relatively little about it, but it appears that most of the release process is automated; keeping
+that up to date should be a no-brainer.
+
+## Trial Maintenance Policy
+
+Given that "maintenance" is an [oft-discussed](https://www.reddit.com/r/rust/comments/48540g/thoughts_on_initiators_vs_maintainers/)
+issue, I'm going to try out the following policy to keep things moving on [dtparse](https://github.com/bspeice/dtparse):
+
+1. Issues/PRs needing *maintainer* feedback will be updated at least weekly. I want to make sure nobody's blocked on me.
+
+2. To keep issues/PRs needing *contributor* feedback moving, I'm going to (kindly) ask the contributor to check in after two weeks,
+and close the issue without resolution if I hear nothing back after a month.
+
+The second point I think has the potential to be a bit controversial, so I'm happy to receive feedback on that.
+And if a contributor responds with "hey, still working on it, had a kid and I'm running on 30 seconds of sleep a night,"
+then first congratulations on sustaining human life, and second I don't mind keeping those going indefinitely.
+I just want to try and balance keeping things moving with giving people the necessary time.
+
+I should also note that I'm still getting some best practices in place - CONTRIBUTING and CONTRIBUTORS files
+need to be added, as well as issue/PR templates. In progress.
+
+# Roadmap and Conclusion
+
+So if I've now built a `dateutil`-compatible parser, we're done, right? Of course not! That's not
+nearly ambitious enough.
+
+Ultimately, I'd love to have a library that's capable of essentially everything the Linux `date`
+command can do (and not `date` on OSX, because seriously, it's the worst). I know Rust has a
+coreutils rewrite going on, and this could potentially be an interesting candidate since it
+doesn't bring in a lot of extra dependencies for the functionality it provides.
+[`humantime`](https://crates.io/crates/humantime) is also able to parse durations,
+so maybe we negotiate something to integrate it all together?
+
+All in all, I'm really hoping that nobody's already done this and that I haven't just spent a bit over a month
+on redundant code. So if it exists, tell me because I need to know, but be nice about it.
+
+And in the meantime, I'm looking forward to building more. Onwards.