A Tale of Two: the Lazy and the Eager
Nothing much to add to the title: lazy and eager DataFrame APIs, and what the difference actually means.
People tend to compare different DataFrame APIs, and we have tons of those comparisons:
Pandas - https://pandas.pydata.org/
Dask - https://www.dask.org/
Polars - https://www.pola.rs/
Spark - https://spark.apache.org/
Vaex - https://vaex.io/
Let’s even add DuckDB to the mix - https://duckdb.org/
So how do you avoid getting lost, and understand what is actually being measured and how these libraries work? For simplicity's sake, I will talk only about Parquet files, just because they're pretty much a standard in the data processing area.
The example file, of course, is the legendary Yellow Taxi trip data, but only the January 2018 file. It's around 124 MB compressed on disk, but once loaded into memory it's way bigger.
Eager execution
Executes all of your commands eagerly. Duh!
By eagerly, I mean it actually does everything step by step: read, aggregate, filter, and write are executed in exactly the order you call them.
DataFrame libraries that execute eagerly:
Pandas - even though it uses PyArrow to read Parquet, it still loads everything into memory immediately
Polars - read_parquet does an eager read (see the sketch below)
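A minimal sketch of what that looks like (the file name is my assumption; the column names come from the Yellow Taxi schema):

```python
import pandas as pd
import polars as pl

# Eager: the whole file is decoded and loaded into memory right here,
# before any filtering or aggregation is even mentioned.
pdf = pd.read_parquet("yellow_tripdata_2018-01.parquet")
pldf = pl.read_parquet("yellow_tripdata_2018-01.parquet")

# Each subsequent step also runs immediately, in the order written.
busy_trips = pdf[pdf["passenger_count"] > 1]
avg_amount = busy_trips["total_amount"].mean()
```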
Lazy execution
It works exactly how it sounds: the library only reads data into memory when you call an action, and it applies optimizations before doing so. Which optimizations you get depends on the library's implementation, but all of the lazy ones support predicate push-down.
Predicate push-down works by evaluating the filtering predicates in the query against metadata stored in the Parquet files; for example, row-group min/max statistics let whole chunks of the file be skipped without ever being read.
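As an illustration, here's how that surfaces in Polars' lazy API (a sketch; file and column names assumed from the taxi dataset):

```python
import polars as pl

lazy = (
    pl.scan_parquet("yellow_tripdata_2018-01.parquet")  # nothing is read yet
    .filter(pl.col("passenger_count") > 1)              # candidate for push-down
    .select(pl.col("total_amount").mean())
)

# Only collect() triggers execution; the optimizer pushes the filter
# down into the Parquet scan so non-matching row groups are skipped.
result = lazy.collect()
```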
Beyond predicate push-down, each library has its own optimization capabilities:
Spark has the Catalyst Optimizer
Polars has its own set of optimizations
Vaex - some of its optimizations are described in its documentation
Dask has an optimize method for its task graphs
DuckDB - it's a DB engine, of course it has optimizations; why do you think it's that fast 😉
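For example, here is the same lazy pattern in DuckDB (a sketch; the query and file name are my assumptions, DuckDB can read Parquet directly):

```python
import duckdb

con = duckdb.connect()

# Building the relation only plans the query; nothing is read yet.
rel = con.sql("""
    SELECT avg(total_amount)
    FROM 'yellow_tripdata_2018-01.parquet'
    WHERE passenger_count > 1
""")

# Execution, including pushing the WHERE clause down into the
# Parquet reader, happens when the result is fetched.
print(rel.fetchall())
```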
Here's how it looks with the different libraries. Look at the increment part: peak memory also depends on other things, e.g. the objects loaded by the library itself (yeah, I'm lazy too, no surprise that I'm writing this post 😅) when you do things like:
```python
import dask.dataframe as dd
```
or
```python
import vaex
```
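A minimal sketch of how such an increment can be measured (psutil and the file name are my assumptions, not necessarily the original setup):

```python
import os

import psutil
import polars as pl

proc = psutil.Process(os.getpid())

before = proc.memory_info().rss  # resident set size before the read
df = pl.read_parquet("yellow_tripdata_2018-01.parquet")
after = proc.memory_info().rss   # ... and after

print(f"increment: {(after - before) / 2**20:.1f} MiB")
```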
Nonetheless, the idea is quite easy to grasp:
DuckDB
Spark - JVM ftw, longest run compared to all other lazy libs 😅
Dask
Vaex
Polars - scan_parquet
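For reference, here are the lazy entry points behind the list above (a sketch; the file name is assumed, and each call only builds a plan or a deferred handle rather than a full in-memory copy):

```python
import dask.dataframe as dd
import duckdb
import polars as pl
import vaex
from pyspark.sql import SparkSession

path = "yellow_tripdata_2018-01.parquet"

duck = duckdb.sql(f"SELECT * FROM '{path}'")   # DuckDB: unmaterialized relation
sdf = (SparkSession.builder.getOrCreate()
       .read.parquet(path))                    # Spark: lazy DataFrame, plan only
ddf = dd.read_parquet(path)                    # Dask: task graph, no data yet
vdf = vaex.open(path)                          # Vaex: lazy, memory-mapped access
lf = pl.scan_parquet(path)                     # Polars: LazyFrame
```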
Summary
When people say, “Oh, this library read this humongous file in seconds”, question what they have in mind. If it's a simple df.read(), it might be fast simply because the underlying DataFrame is lazily executed and nothing has actually been read yet.
Always compare speeds at the point where the final result is materialized; only then is the comparison meaningful. Remember that implementations also differ; some chained methods can cause performance degradation (nice catch by Bilal Bobat here). Comparing libraries should be done using the best practices for each of them.