Distributed processing as Candy counting
Or a very simple explanation of a quite difficult topic

For me, understanding distributed processing and how it works took some time; it's not that simple a topic. Then one day I heard a straightforward example, and everything just clicked. This post uses the same theme, but takes it a bit further.
Let's start with a hypothetical example. You sit at a table and get a big box of candies; imagine any colourful candies that come in packets. Your task is a simple count-by-colour of all the candies in the box. Now, to map it to data processing, in this scenario:
You are a CPU core
A packet of candies is a partition
The box of candies is the whole dataset
The table is the RAM available
This example is a perfect explanation of how pandas works. If you work with chunks, you operate on a single packet (chunk); if you process the whole dataset at once, it's the whole bunch of candies on your desk - as much as can fit there. This process gets the job done, but it takes its time.
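To make the pandas side concrete, here is a minimal sketch of both approaches; the file name and the color column are hypothetical:

```python
import pandas as pd

# Whole-box approach: pour everything onto the table (load all rows
# into RAM) and count in one go.
all_candies = pd.read_csv("candies.csv")  # hypothetical file
print(all_candies["color"].value_counts())

# Packet-by-packet approach: only one chunk occupies the table at
# a time, and partial counts are merged as you go.
totals = pd.Series(dtype="float64")
for packet in pd.read_csv("candies.csv", chunksize=10_000):
    totals = totals.add(packet["color"].value_counts(), fill_value=0)
print(totals.astype(int))
```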
Spark Single Node Processing
Let's assume you invite a couple of friends to help you. You're now the driver, and your job is to distribute the work efficiently and track who got how many packets and whether all the responses came back. Each of your friends gets a table, and you bring them packets of candies to be processed. When a friend finishes counting, they write the numbers down on paper and deliver it to your table. When everyone is done counting, you aggregate the papers one last time to get the final results. This is how Spark single-node processing works. So, in a nutshell:
You’re the driver
Your friends are the executor cores
Tables = RAM available. This part is a bit trickier, since one process can take more RAM and leave less for the others, so technically table sizes will fluctuate over time. A minimal sketch of this setup in code follows the list below.
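In Spark terms, the whole scene above is local mode: one driver and a handful of executor cores on the same machine. A hedged sketch, assuming a hypothetical candies.csv with a color column:

```python
from pyspark.sql import SparkSession

# local[4] = you (the driver) coordinating four friends (executor
# cores), all in the same house (one machine).
spark = (SparkSession.builder
         .master("local[4]")
         .appName("candy-count")
         .getOrCreate())

candies = spark.read.csv("candies.csv", header=True)  # hypothetical file

# Each core counts its own packets (partitions); the driver gathers
# the per-packet notes into the final tally.
candies.groupBy("color").count().show()
```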
Spark Cluster
Now imagine that instead of a single box, you have an entire storage area full of boxes. You bring in more friends, and your friends' friends. To make it all more efficient, you split into different rooms (less noise and movement, so everyone can work peacefully). You, as before, are responsible for orchestrating who's doing what and what's left. People carry boxes to the rooms and deliver you slips of paper recording what they have processed, along with some metadata (packet ID, who processed it, etc.).
Since you will receive a lot of papers with aggregated information, you can also split that work across the rooms to get the information aggregated faster. Each room now represents an executor machine with multiple cores. Since the data got bigger, there will be more papers for you to aggregate, so you can push that work back down to the workers. They will then return a new note with some metadata and a partial aggregation.
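This pre-aggregation inside each room is what reduceByKey does: partial counts are combined locally before the notes ever travel to your table. A small self-contained sketch with made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()
sc = spark.sparkContext

# Eight packets (partitions) of candy colours, invented for the demo.
candies = sc.parallelize(["red", "blue", "red", "green"] * 1000, numSlices=8)

# Each room first tallies its own packets (map-side combine), then
# only the small per-room notes are shuffled and merged.
counts = (candies
          .map(lambda colour: (colour, 1))
          .reduceByKey(lambda a, b: a + b))

print(counts.collect())
```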
Why am I suddenly talking about metadata? It's a safety mechanism in case some information needs to be backtracked, or if some data in the initial partition was faulty and we need to recalculate it. By backtracking the relevant steps, we can limit the recomputation to only the partitions that actually need it.
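In Spark, this backtracking metadata is called lineage, and you can inspect it on any RDD. A hedged sketch with throwaway data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

counts = (spark.sparkContext
          .parallelize(range(100), 4)
          .map(lambda x: (x % 3, 1))
          .reduceByKey(lambda a, b: a + b))

# The lineage graph is the metadata on the notes: if a packet is
# lost or faulty, Spark re-derives just that packet from its source.
print(counts.toDebugString().decode())
```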
Networking
What are the moving parts here? Boxes, packets, and aggregation notes. The larger the distance they travel, the longer everything takes. The closer the counters are to the box with all the packets, the less time they spend fetching data; the same goes for the distance between your table and the rooms where the calculations happen.
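In Spark this idea is called data locality: the scheduler prefers to run a task where its packet already sits, and only after a short wait will it settle for a farther slot. A minimal sketch of the relevant knob:

```python
from pyspark.sql import SparkSession

# spark.locality.wait controls how long the scheduler waits for a
# slot close to the data before running the task farther away
# (3s is the default).
spark = (SparkSession.builder
         .config("spark.locality.wait", "3s")
         .getOrCreate())
```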
Spill to disk
Another interesting idea that can be explained with candies is spill to disk. Imagine that you have way too many candies on your table. What can you do? Shove some of them into boxes to aggregate later. So you handle what you can, take notes, clear the table, pour out a box's contents, and repeat. Now, the boxes can differ - very wide or very narrow - and depending on their parameters, filling a box and getting the contents back out can be faster or slower. This is a good analogy for disk types (HDD vs SSD, etc.).
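In Spark, the table size and the location of the boxes map to memory and local-disk settings. A hedged sketch; the directory path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # How much of the heap the "table" gets before candies
         # spill into boxes (0.6 is the default).
         .config("spark.memory.fraction", "0.6")
         # Where the boxes live: point this at fast disks (SSDs)
         # to speed up spilling and reading back.
         .config("spark.local.dir", "/mnt/fast-ssd/spark-tmp")  # hypothetical path
         .getOrCreate())
```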
Data Skew
One of the biggest obstacles in data processing is data skew. What is it, and how can we handle it in this example? Imagine that, on average, packets hold 40-50 candies, but for some reason the factory produced some XXL packs containing 200-300. The person who gets such a packet will take far more time to process it. How can you deal with this? Repackage it: take all the candies and repack them into 40-50-candy packets by hand. Repartitioning takes some time, but afterwards the time spent by each worker will be roughly similar.
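Two common repacking moves in Spark are a plain repartition and key salting. A hedged sketch; the input path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
candies = spark.read.parquet("candies/")  # hypothetical path

# Hand-repacking: repartition spreads rows evenly across 200
# normal-sized packets, at the cost of a full shuffle.
evenly_packed = candies.repartition(200)

# If one colour dominates, salting splits that hot key into
# sub-packets first, then merges the partial counts.
salted = (candies
          .withColumn("salt", (F.rand() * 10).cast("int"))
          .groupBy("color", "salt").count()
          .groupBy("color").agg(F.sum("count").alias("count")))
```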
Now, what about AQE in Spark and the coalesce operation? Imagine it's not your first time going through warehouses counting candies; just by weighing packets in your hands, you already know which have more and which have fewer candies, so you can hand them out in balanced bundles. Doing this gives everyone approximately the same amount of work, so processing time is barely affected.
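Adaptive Query Execution is exactly this experienced weighing, done at runtime from actual shuffle statistics. A minimal sketch of the switches (available in Spark 3.x):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # AQE weighs the packets at runtime...
         .config("spark.sql.adaptive.enabled", "true")
         # ...merges many small packets into sensible bundles...
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # ...and splits the XXL packs it discovers in joins.
         .config("spark.sql.adaptive.skewJoin.enabled", "true")
         .getOrCreate())
```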
Shuffle service
Your helpers are ordinary people who work at an average speed (counting candies is not what they do for a living). Now, suppose there are a lot of inconsistencies in the number of candies per packet and you want to optimise for that. You can ask a specialist - a service provider who is professionally trained and does this for a living. They can sort and pack candies very efficiently and make them easily accessible. So, before handing out the boxes, you can pass them through this shuffler, depending on the situation. This will be faster overall, but it involves moving boxes in and out of the shuffler's station.
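Spark's external shuffle service plays this specialist's role: it serves shuffle files on each node so the executors don't have to. A hedged sketch; enabling the flag assumes the service itself is running on the worker nodes:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Delegate serving the packed boxes (shuffle files) to the
         # specialist, so executors can come and go without losing them.
         .config("spark.shuffle.service.enabled", "true")
         .getOrCreate())
```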
Row and Columnar formats
Another example is how working with CSV or other row-oriented formats differs from Parquet and columnar ones. With a CSV, you get a packet and process candy by candy, since the file hands them to you one at a time. With Parquet, you could pour them all onto the table and tackle them colour by colour. Also, if your data is partitioned, your box would hold packets already grouped by colour - and that's not all: you'd also get metadata about the candy counts in each packet, so you could just read those labels instead of opening every packet, and you're good to go!
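The practical difference shows up in how much Spark has to read. A hedged sketch with hypothetical paths; with Parquet, only the selected column's data (plus metadata) gets scanned:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Row format: every candy must be unwrapped even if you only
# care about colour.
rows = spark.read.csv("candies.csv", header=True)  # hypothetical file
rows.groupBy("color").count().show()

# Columnar format: Spark pours out just the colour column, and
# row-group metadata can answer some questions without opening
# every packet.
cols = spark.read.parquet("candies/")  # hypothetical path
cols.groupBy("color").count().show()
```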
Summary
Before I started writing this post, I thought the example would be trivial and that I wouldn't reach into so many areas of Spark - but here I am, having covered most of the core concepts.