Data Engineers/Plumbers - Superheroes in disguise
Talking about the different data engineer archetypes and covering skills and seniority levels, based on my experience.
Introduction
“Data is the new oil”, or “we’re data-driven”. There are many sayings like these, and people focus on what’s most visible. In this post, I’ll talk about the folks whose work usually becomes visible only when something breaks. This post will be the longest in the Data Roles series since this is where I have the most experience and knowledge picked up over the years. Let’s talk about data engineers.
Archetypes
Now, to systematise what I wrote on LinkedIn about the different Data Engineer archetypes.
Streamers
No, not Twitch or OnlyFans, but Data Engineers: those who have a particular interest in working mainly with streaming data. These individuals specialize in real-time data ingestion and processing. They are proficient in technologies like Apache Kafka, Apache Flink, and/or Apache Spark Streaming, which enable high-velocity data processing. Streaming data enthusiasts are highly sought after in industries that require real-time analytics, such as finance, telecommunications, and IoT. The downside is that they sometimes prefer to treat all data as streams, even where streaming isn’t needed, which can incur unwanted costs. On the other hand, if a Kappa architecture is used, switching back to batching is not a big deal.
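To make the mindset concrete, here is a minimal sketch of a streaming consumer using the kafka-python client; the “orders” topic and the localhost broker are hypothetical placeholders, not from any real setup:

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker address, purely for illustration.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Each record is handled the moment it arrives, not in a nightly batch.
for message in consumer:
    order = message.value
    print(f"processing order {order.get('id')} in near real time")
```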
Hype Chasers
Hype chasers are data engineers who stay on the cutting edge of emerging technologies and trends in the field. They are early adopters of new tools and frameworks and constantly seek innovative solutions to improve data engineering processes. Hype chasers are passionate about exploring and experimenting with the latest technological advancements, such as big data processing, real-time analytics, or machine learning. Their curiosity and drive to stay ahead of the curve make them valuable contributors to data engineering teams. While chasing shiny new tools is all fun and games, it can become a burden when a tool turns out to be immature or gets deprecated. I like hype chasers, in a way, because they continuously check what’s new and how it works, and they can push R&D departments towards some useful PoCs.
Problem Solver
Problem Solvers are a distinctive archetype in data engineering who possess an unwavering drive and passion for solving complex problems and optimizing data engineering processes. Their relentless pursuit of finding solutions and making things work sets them apart. They are not bound to specific technologies and are adaptable in tackling any problem. With their specialized skills and deep domain knowledge, Problem Solvers excel in overcoming challenges and continuously strive to optimize data engineering workflows. This archetype often complements and collaborates with the other archetypes, leveraging their expertise to achieve efficient and effective solutions. Their dedication and problem-solving mindset make them invaluable assets in the data engineering field. They are addicted to solving problems and get bored if things don’t challenge them.
Patcher
The Patcher archetype is similar to the Problem Solver, but Patchers aim to fix existing solutions by applying patches and making minimal code changes. They excel in swiftly resolving issues and ensuring the smooth operation of data engineering systems with minimal disruption. You can identify a Patcher by their extensive, or even excessive, use of custom handling instead of standardising.
Software Engineers
These individuals deeply understand programming languages, algorithms, and software development principles. They work on creating and maintaining software tools and frameworks that enable efficient data processing, workflow automation, and data pipeline management. They most likely prefer strictly typed languages like Java/Scala (since most of the Big Data ecosystem is JVM-based).
Software engineers in the data engineering field often work closely with other archetypes, such as the jacks of all trades or data warehouse specialists. They collaborate to build custom solutions, optimize data pipelines, and ensure the scalability and reliability of the data infrastructure. Their expertise in software development allows them to design and implement robust systems that can handle large volumes of data and meet the organisation’s specific needs. Interestingly, depending on the organisation, these folks’ titles might range from Software Engineer to Data Infrastructure Engineer.
Data Warehouse Expert
These folks are oriented towards data warehouse creation. These individuals have a deep understanding of data modelling, ETL (Extract, Transform, Load) processes, and the design of data warehouses. They excel in building robust data architectures and optimizing data storage for efficient querying and reporting. Data warehouse specialists are crucial in organizations where data consolidation and analysis are essential for decision-making. These specialists are rare since data modelling is a forgotten topic (let’s just query the data lake with no standards 😐). Business prioritises faster results, insights, etc., so people rarely get the time to think and design before creating multiple data marts with different calculations behind the same-named KPI.
Jacks of all trades
They are also known as generalists. These individuals have broad knowledge across multiple areas, although they have yet to gain deep expertise in any one of them. They are familiar with various tools and can quickly get you started using drag-and-drop or no-code approaches. They may also have some knowledge of data visualization techniques and be capable of performing data analysis. This makes them valuable assets for startups or small companies that would rather not invest in specialized professionals. Consultancy is another field where these individuals are commonly found: they can create and implement everything needed and then leave it functioning. The only drawback is their limited expertise, which may mean seeking a more specialized professional for complex tasks. However, the benefit is that they can operate as a one-person team and quickly get things up and running.
Skillsets
No one in data engineering belongs to a single archetype; it’s usually a mix of them. I could go and dissect the skillsets archetype by archetype, but many things would overlap. Just as I’m not paid for the lines of code I produce at work, my value here is giving you insights with less text to read.
As always, the skillset will differ from company to company and with their long-term data strategy, but here is what I think is generally needed (I also went over multiple job ads to make sure I’m not hallucinating requirements from inside my bubble). I won’t mention specific tools, only the fundamentals you need to succeed as a GREAT Data Engineer.
Hard Skills
Data Modelling - the better you are at it, the easier it will be to tackle issues. Depending on the use case, you might choose one approach or a combination of several, giving you more flexibility. It’s never a one-size-fits-all situation, and I think it’s the most needed skill, at least at the time of writing.
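As an illustration of one common approach, here is a hypothetical, scaled-down dimensional (star schema) model sketched with Python’s built-in sqlite3; the table and column names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension table: one row per customer, descriptive attributes only.
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        name         TEXT,
        country      TEXT
    )
""")

# Fact table: one narrow row per sale, with foreign keys plus measures.
conn.execute("""
    CREATE TABLE fact_sales (
        sale_id      INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer (customer_key),
        sale_date    TEXT,
        amount_eur   REAL
    )
""")
```

Whether to reach for a star schema, a Data Vault, or a wide denormalised table is exactly the judgment call this skill is about.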
SQL - “SQL is dead” and similar slogans pop up occasionally, but I think it’s the one constant that never changes (dialects differ, yes, but only in small details). I have a feeling SQL will still be king in 10-15 years. Keep in mind that by SQL I also mean query tuning. Everyone can write SQL, but can you do it efficiently?
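Query tuning starts with reading execution plans. Here is an illustrative sketch using SQLite (chosen only because it ships with Python; every engine has its own flavour of EXPLAIN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts TEXT, amount REAL)")

query = "SELECT * FROM events WHERE user_id = 42"

# Without an index, the planner has to scan the whole table.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
# -> [(..., 'SCAN events')]

conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")

# With the index in place, the same query becomes an index search.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
# -> [(..., 'SEARCH events USING INDEX idx_events_user_id (user_id=?)')]
```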
Programming language - no matter which one you choose, if you can write quality code, it will be super easy to pick up the next language. Yes, there are benefits to learning a stricter language first, so you won’t go crazy in Python with the same variable being a string, integer, or dictionary in different places in your code. You’ll learn better practices, and that will, in turn, make you a better programmer.
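A toy example of the kind of bug stricter habits prevent; plain Python runs this happily, while a type checker such as mypy flags the silent re-typing:

```python
def total_cents(amounts: list[int]) -> int:
    """Sum a list of order amounts expressed in cents."""
    return sum(amounts)

price = 1999     # inferred as int from the first assignment
price = "19.99"  # plain Python accepts this; mypy reports an
                 # incompatible assignment (str vs int)

print(total_cents([1999, 250]))  # 2249
```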
Distributed computing - there are tons of talks about Apache Spark, and there are also Impala, Trino, and others. The thing is, if you understand how distributed computing works, you’ll be able to adapt and understand how things are translated under the hood. If you needed to run some parallel calculations, you could quickly achieve that with threads or processes in your own Python code. Why am I not attaching myself to a specific framework or engine? As I’ve said, the data world moves fast; there are a bunch of tools available now, and looking forward, it makes more sense to understand the fundamentals than to bet on whatever is around today or gets released in the future.
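Here is a minimal local sketch of the split-apply-combine idea behind those engines, using processes rather than threads since the work is CPU-bound and Python’s GIL limits threads; the data is made up:

```python
from concurrent.futures import ProcessPoolExecutor

def transform(partition: list[int]) -> int:
    # Stand-in for a CPU-heavy per-partition computation.
    return sum(value * value for value in partition)

if __name__ == "__main__":
    # Split the data into partitions, map work over them, reduce the results:
    # the same shape as a distributed job, scaled down to one machine.
    partitions = [list(range(i, i + 1_000)) for i in range(0, 10_000, 1_000)]
    with ProcessPoolExecutor() as pool:
        partial_sums = list(pool.map(transform, partitions))
    print(sum(partial_sums))
```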
Cloud/On-Prem - know one cloud and you’ll know them all. All services are similar, with slight differences. Understand how you can leverage them and how they differ from the solutions you would build on-prem. By doing this, you could be a driving force in your company when those cloud costs suddenly start bothering your CFO.
Soft Skills
Business understanding - say what you want, but to be efficient and proactive, you must know the business to some extent. I personally consider myself a tinkerer: give me a problem and I’ll solve it. But knowing the business will allow you to maximise your efficiency and career in no time.
Communication - I think it’s a must-have for anyone, and it has to be both verbal and written. If you can articulate clearly, it will be easier to prove a point or counter an argument. This will help you in designing things, mentoring people, articulating shifts in priorities, and many other places.
Patience - might be pretty obvious, but dealing with long-running pipelines, hours of debugging, and so on will make you resilient. Though not as resilient as kids at home make you… The thing is that sometimes you will spend hours shaving time off a query or looking for a bug; without patience, you’d just move on to something less time-consuming, and you wouldn’t be that good of a data engineer.
Attention to detail - your ability to spot deviations will go a long way. When you know the data by heart, you can easily pinpoint brewing storms, push for prioritising technical debt over new features, etc. This is, in a way, an abstract skill that helps with accuracy and with thinking not only about the usual scenarios but also about the edge cases that eventually break your pipes.
Problem-solving - to be honest, I think it’s the primary soft skill of a data engineer. We are data plumbers. Our job is to move data from one place to another using pipelines. Badumtss. Jokes aside, our work is to think about how to do it as efficiently and robustly as possible. Later, when edge cases appear, our job is to incorporate them into the existing solutions for the longer term.
Last but not least - proactivity. You have no idea how far some extra steps will get you. It might be a new framework you pushed to the company, an initiative to unify some data models, adding data quality checks, or something similar. This might not give you an insane amount of experience, but it will give you a lot of exposure. It’s a win-win for you and the company you’re working for.
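As a hypothetical example, a data quality initiative can start as small as a few assertions on every loaded batch:

```python
def check_orders(rows: list[dict]) -> None:
    """Fail fast if a batch of order rows violates basic expectations."""
    assert rows, "empty batch: the upstream export may have silently failed"
    for row in rows:
        assert row.get("order_id") is not None, f"missing order_id: {row}"
        assert row.get("amount", 0) >= 0, f"negative amount: {row}"

check_orders([{"order_id": 1, "amount": 9.99}])  # passes silently
```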
Seniority
The fun thing is that I personally skipped some of these titles, or never held them formally, but hopefully, I can describe them adequately. Again, I won’t touch on the above-mentioned skills and how they change, since that’s very tightly coupled with your seniority. The higher you go, the more you know.
Junior
Uncle Data is teaming up with Turing College! They say they’re the #1 place to learn online for working professionals. Check out their Data Engineering program that they launched not so long ago!
While Turing College will provide you with the necessary skills, Uncle Data will help you navigate the caveats of the industry and things that are not visible straight from the trenches.
You have little to no experience. This might be your first gig. You’re going to ask many questions and “bother” people. “Bother” was chosen deliberately because you’ll feel like you don’t want to disturb people with small questions; at least, that’s how I thought when I was starting. But by asking questions, you’re learning better ways and absorbing the knowledge and experience of the people answering them. You will be given simple tasks that you can do on your own without breaking production (or, at least, your code will be reviewed thoroughly, with WTFs per minute as the KPI).
Mid
Now you’re more on your own; you don’t feel that you’re bothering people; you’re learning and delivering. If your team plans their work, you can adequately plan a week and take on tasks that are more abstract or have less detailed descriptions. You are now discussing potential implementations with your lead, suggesting design solutions, and being given more freedom. Your WTFs-per-minute KPI has decreased significantly. At the same time, you should have broken production at least once; if not, you’re doing something wrong. Your tasks are now more technical and might even be performance-tuning oriented: if you’re not creating value, at least you can lower the company’s costs. That’s how I imagine most companies look at data engineers, who are visible only through cost reports and broken pipelines that cause dashboards to show empty tables.
Senior
Now you own some end-to-end pipelines that run smoothly; you’re firefighting ever-changing source tables and data quality issues. In general, what changes is that the complexity of tasks naturally grows. With great power comes great responsibility. Your job now involves alignment and communication with other teams on cross-team projects; you’re creating designs that later either you or someone from your team has to implement. You’re the role model for the juniors in the group, helping them grow and guiding them to the dark side of data. You might have some libraries or frameworks you built running in production and used by multiple teams.
Team Lead
Here, I’m just going to copy things from my previous post about analysts. This part is the same:
Welcome to the level where you will try to juggle a people-manager role with delivering some IC work. You have the same responsibilities and expectations as your team’s Seniors, plus people management. It’s usually best to start small, with 1-3 people; otherwise, you’ll burn out, or you won’t be able to handle the trade-off between time spent on people vs. delivering actual work. People management is real work, usually in the form of meetings. It’s way different from what you imagined and from what you were used to doing as a Senior.
Later, you’ll get used to it and realise that the team is your most significant asset. Helping them grow and letting them achieve not only the company’s but also their personal goals will make you a successful lead/manager. At least for me, “manager” used to carry connotations of negativity and bossiness, but the meaning and role of the two titles are the same! You can read my post on Team Lead or Manager from some time ago here.
Staff
When you were a Senior, you probably had one or a couple of areas that were reasonably related to each other. When you reach Staff or higher IC roles, you work on better practices across multiple departments or all data engineering teams. You’re the technical role model, the initiative pusher, and the expert in the technicalities. You can quickly identify and solve almost any issue, being the data superhero you always wanted to be. You now focus on high-value projects, so most of the time you’re in meetings discussing system designs, writing documentation, and mentoring any data engineer in the company (ok, almost any, if there are some policies in your company). Keep in mind that some places have a separate title, “Architect”, but for me, it’s the same. You’re spending time designing solutions, sometimes high-level and sometimes detailed ones. Your job is to ensure that all the best practices are upheld, and you’re looking for ways to help others spend less time fixing crappy solutions. You’re there to suggest a better one or lead people to it.
Summary
Hopefully, my rambling is helpful to you, and you can now see the challenges in each role and what changes in the next one. Keep in mind that no one is perfect: an 80% match with the requirements is more than enough for the company, and enough to keep the imposter syndrome away.