The Generalization Trade-off

Published: 2023-10-19

Exploring

Recently, I have been working on my first indie app related to books and data. While building it, I have also taken the opportunity to learn new tools and different approaches to solving data problems. Through this journey, I revisited a recurring theme on engineering trade-offs and figured I’d share.

One of the new tools I have been using is a database called ClickHouse. I know what you’re thinking right now: ‘Who cares about another database? Just use Postgres.’ While I would generally agree with you, hang on, because this isn’t a Medium article touting another new database (except for when it is).

So, why is ClickHouse interesting? Well, it’s an OLAP database, making it great for analytical workloads, but most importantly, it’s blazingly fast. Before I dive in further, I should back up and explain why I am using it in the first place.

Typically, I begin working with data by starting up ol’ reliable Pandas. Pandas is great because of its flexibility and speed of setup. Do you have a CSV you want to get some quick summary stats on? No problem. Do you have a CSV of some data you want to transform? Easy. Do you have an Excel file that operations want you to combine with some other data? First, turn it into a CSV, and then do the rest with Pandas. See how flexible Pandas is? It can handle anything thrown at it including CSVs.

But Pandas starts to get tripped up around the same time Excel does: with Big Data™. So where to turn next? If Pandas is struggling, my next go-to is Polars, a quicker, more powerful library. It takes a similar approach to Pandas, using column-based DataFrames, but instead of relying on NumPy for much of the underlying computation, Polars is written from the ground up in Rust. It outperforms Pandas thanks to several underlying design principles: high core utilization, query optimization, a strict data schema, a consistent API, and the ability to handle data larger than RAM, among others. One of the best features this enables is lazy execution, which Pandas doesn’t support. Lazy execution, as opposed to eager execution, lets Polars look at the entire query plan first and optimize it for you automatically. It also allows data that doesn’t fit in memory to be processed via streaming.

For these reasons, when I start to hit performance limits in Pandas due to DataFrame size, I typically swap over to Polars. But what happens when Polars can’t handle the data?

As it turns out, there are a lot of books: tens of millions of them. When doing data-intensive operations, such as aggregations on parsed JSON records, I did start to hit some limits in Polars. These probably could have been avoided by processing the data in smaller chunks, but instead, I decided to try ClickHouse.
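For what it’s worth, the chunked route I skipped might look something like the sketch below. It is plain Python, not what I actually ran, and the column layout and the `lang` field are assumptions made up for the example. It takes the idea to its extreme by streaming one row at a time, so memory use depends on the number of distinct keys rather than the size of the file.

```python
import csv
import io
import json
from collections import Counter


def count_by_key(lines, json_column, key):
    """Stream tab-separated rows and tally one field of a JSON column."""
    counts = Counter()
    # QUOTE_NONE keeps the double quotes inside the JSON text untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        record = json.loads(row[json_column])
        counts[record.get(key, "unknown")] += 1
    return counts


# Hypothetical three-row sample standing in for the real file.
sample = io.StringIO(
    'b1\t{"lang": "en", "pages": 320}\n'
    'b2\t{"lang": "fr", "pages": 210}\n'
    'b3\t{"lang": "en"}\n'
)
language_counts = count_by_key(sample, json_column=1, key="lang")
```

On a real file, `sample` would be replaced by an open file handle, since `csv.reader` accepts any iterable of lines.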

To give a rough idea of how the three compare: one of the files I have been using is a roughly 19GB tab-separated file with 33 million rows. Pandas couldn’t load the file; I ran out of memory on my MacBook with 16GB of RAM. Polars could load the file in about 30 seconds, but it used a lot of swap and struggled, especially when modifying the structure of the DataFrame.

ClickHouse, on the other hand, was able to load and compress the file into a table in 35 seconds. It didn’t use nearly as much memory, peaking at only 0.6GB, likely because ClickHouse writes the data to disk rather than loading it entirely into memory. Once loaded, the compressed table took up only 4GB, nearly a 5x reduction in size. Queries on the table are snappy, typically taking less than a few hundred milliseconds.

Trade-offs

So, what did I learn from all of this? Should I just use ClickHouse from here on out? Well yes, but actually no.

Engineering often comes down to trade-offs, one of which is the relationship between generalization and specialization. Often more general solutions don’t perform as well as narrow approaches, and I think this relationship follows the trend detailed below.

The relationship is non-linear because the gains compound on each other. Take Pandas versus ClickHouse: Pandas is very flexible and integrates seamlessly into Python, while ClickHouse is far more rigid, with stricter rules about types, operations, and table structure. But ClickHouse accepts that rigidity in order to leverage specific algorithms and data structures that increase performance. Robert Hodges, in his talk Introducing ClickHouse, touches on the performance gains from specialized algorithms: ClickHouse has implemented 14 different group-by algorithms so that it can use the fastest one for any given query.
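As a toy illustration of that idea (and emphatically not ClickHouse’s actual implementation), here is a sketch of a group-by that dispatches between a general path and a specialized one. All function names are hypothetical. The specialized path only works for small non-negative integer keys, but in exchange it can skip hashing entirely and index straight into a flat list.

```python
from collections import defaultdict


def group_sum_generic(keys, values):
    """General path: works for any hashable key by hashing into a dict."""
    totals = defaultdict(int)
    for k, v in zip(keys, values):
        totals[k] += v
    return dict(totals)


def group_sum_small_int(keys, values, key_range=256):
    """Specialized path: only valid when every key is an int in [0, key_range)."""
    totals = [0] * key_range
    seen = [False] * key_range
    for k, v in zip(keys, values):
        totals[k] += v  # direct indexing, no hashing
        seen[k] = True
    return {k: totals[k] for k in range(key_range) if seen[k]}


def group_sum(keys, values):
    """Pick the specialized path whenever the keys allow it."""
    small_ints = all(
        isinstance(k, int) and not isinstance(k, bool) and 0 <= k < 256
        for k in keys
    )
    if small_ints:
        return group_sum_small_int(keys, values)
    return group_sum_generic(keys, values)


by_id = group_sum([1, 2, 1], [10, 20, 30])
by_name = group_sum(["a", "b", "a"], [1, 2, 3])
```

In pure Python the flat-list version won’t necessarily be faster; the point is the shape of the trade-off. Each specialized path is narrower and needs its own code, and a system tuned for speed carries many of them.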

The gains compound because of how each kind of system is built. A more generalized solution is probably architected in a way that allows additional components to be snapped on like Legos. Correspondingly, a system designed for performance minimizes the overhead associated with generality, allowing each component to be tailored to one specific goal: speed. This relationship can also create a ‘jack-of-all-trades, master of none’ limbo between the two extremes, which in this case Polars filled.

I’ll wrap this up by saying: keep an eye out for this relationship. While I covered it for this specific experience, the more I look around, the more I notice it in my everyday life.

[Image: Spider-Man meme, “Everywhere I go I see his face”, CSV]

Epilogue

As a side note, this is partly why I find electric vehicles so exciting. Because electric motors are compact and power-dense, I believe they actually flatten, if not invert, the curve compared to internal combustion vehicles. A Rivian is quicker than most ICE sports cars, and yet it has a bed, a frunk, a gear tunnel, and air suspension, and it can off-road like no other. It delivers performance and can still be your daily driver. Can’t wait to see where we go from here.

I'm on Twitter if you want to follow for more updates. Cheers!