How to be a great Data Engineer?
Tips and tricks from my experience that, in my opinion, make you stand out among other data engineers.
Introduction
I’ve been in the data area for around ten years. During that time, I’ve been through a bunch of interviews and conducted some as well. I already see some patterns that could easily be avoided or improved. Some people say that data is the new gold/oil, but I like bacon a lot, so let’s leave it as "data is the new bacon". To move forward and leverage data, companies need good data engineers. Finding a great one is almost a quest, since a great data engineer is nearly a unicorn. I decided to give my two cents on how you can become a great data engineer, what you can focus on improving to get there, or at least how to stand out from the crowd.
Soft Skills and your CV/portfolio
You might think that you don’t need to work on your soft skills if you’re an engineer. Early in my career, I noticed that soft skills are sometimes far more valuable than technical ones. Technical skills can be improved relatively quickly, but soft skills sometimes require a shift in mentality, which is usually not that straightforward.
Buzzwords in your CV
Problem: It’s tempting to throw some buzzwords into your CV. I enjoyed a post I saw somewhere (or maybe it was a talk) where the author listed a bunch of Pokémon in his CV and asked recruiters to point them out.
Solution: Don’t bullshit people. It’s super easy to spot cases where you have merely scratched the surface of some technology but promote yourself as a specialist. Yes, an interview is your sales pitch to the company, but you lose trust if you’re caught lying or misleading. If you want to put a skill on your CV, you should know more than the basics. Do a PoC, or learn it a bit better on some projects. Need content? There is plenty of free content on LinkedIn, YouTube, and many other websites. If you want to go deeper, books usually help you along the way.
Don’t be afraid to ask questions.
Problem: What I like most in interviews is seeing how people think: what questions they ask to understand the task. It seems trivial, but the majority of people try to force their way through. They either over-complicate the task (like trying to kill a fly with a shotgun or a grenade) or don’t properly understand the question and don’t ask to clear the air (i.e., you were talking about shoes, and they give you an answer about penguins).
Solution: It’s OK not to understand (although if it’s a super fundamental question, then maybe there is a problem). In the majority of cases, you can ask more questions until it’s clear, or, if you think you got it, summarize it in your own words and ask whether you understood it correctly. If it’s something like a live coding interview (especially in these Zoom times) and you’re sharing your screen, do what you’d normally do. I liked a guy who said, "I don’t know it by heart, but I can do it." He opened Google, found the relevant answer, and incorporated it into his code. It’s honest, and it shows that you know how to do the job even if you don’t know the answer right away.
Show off your projects.
Problem: Sometimes, it’s tough to judge people’s skill level when you’re doing an initial CV screen. As I mentioned before, the biggest issue is buzzwords: listing something as a skill even if you have only grasped the very basic idea of it.
Another case is when I’m looking for potential candidates on LinkedIn. I type my search in a specific way and open up many profiles. If a profile stands out in any way, I keep it open for later, when I check it more thoroughly (open GitHub, blog posts, etc.).
Solution: Highlight your posts on your profile and add your GitHub repository so that people can check you out. Sometimes making a website to substitute for your CV makes quite an impression. Keep in mind that you will have to back your words with actions or knowledge. People smell lies a mile away! All in all, work on your online presence/personal brand. Even if you’re not looking for new gigs, opportunities will come knocking on your door sooner or later.
Technical skills
SQL
Problem: This one is quite straightforward. It doesn’t matter whether you’re a Data Analyst, Data Scientist, or Data Engineer: SQL is the most required skill. The majority (like 80%+) of people in the interview pipeline fail the Codility SQL tasks. None of the test cases are super complicated. They just check fundamentals and maybe one or two more advanced topics.
Solution: You have to know the essentials you will use in your day-to-day job: window functions, GROUP BY, CTEs. Know the ins and outs of SQL, and you’ll be closer to getting your dream job. Use online platforms with SQL tasks to improve, e.g., HackerRank or W3Schools.
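To make the three topics above concrete, here is a minimal sketch that combines a CTE, a GROUP BY aggregate, and a window function in one query. It runs against an in-memory SQLite database from Python so you can try it without any setup. The orders table, its columns, and the sample rows are made up for illustration, and window functions assume a reasonably recent SQLite (3.25 or newer).

```python
import sqlite3

# Hypothetical sample data: (customer, order_date, amount)
ORDERS = [
    ("alice", "2024-01-05", 120.0),
    ("alice", "2024-02-11", 80.0),
    ("bob", "2024-01-20", 200.0),
    ("bob", "2024-03-02", 50.0),
    ("carol", "2024-02-14", 75.0),
]

QUERY = """
WITH customer_totals AS (               -- CTE: a named intermediate result
    SELECT customer,
           SUM(amount) AS total_spent,  -- aggregates computed per group
           COUNT(*)    AS order_count
    FROM orders
    GROUP BY customer
)
SELECT customer,
       total_spent,
       order_count,
       RANK() OVER (ORDER BY total_spent DESC) AS spend_rank  -- window function
FROM customer_totals
ORDER BY spend_rank, customer;
"""


def rank_customers(rows):
    """Load rows into a throwaway in-memory table and rank customers by spend."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in rank_customers(ORDERS):
        print(row)
```

The CTE keeps the aggregation readable and separate from the ranking step, which is roughly the shape many interview tasks take: aggregate first, then rank or compare rows with a window function.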
Data modeling
Problem: To build a robust DWH, you have to know different data models so that you can pick the one that suits the situation best. Sure, a star schema can work most of the time, but I doubt that one size fits all. In my experience, it depends on the case and on the technologies you’re using.
Solution: Read about Kimball modeling techniques, Data Vault, the Inmon approach, and the many others out there. Don’t stay in the dark ages where you know one approach and don’t even bother reading up on the others. If you want, you can read a high-level intro to data modeling that I wrote some time ago. It lists more sources if you decide to dive deeper into a particular approach.
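For readers who have not seen a star schema yet, the sketch below shows the basic idea in Kimball style: one fact table holding measures, surrounded by dimension tables holding descriptive attributes. It is only an illustration under assumptions: the table names (fact_sales, dim_product, dim_date), the columns, and the data are all hypothetical, and SQLite stands in for a real warehouse.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,  -- e.g. 20240105
    year     INTEGER,
    month    INTEGER
);
CREATE TABLE fact_sales (          -- one row per sale; measures plus keys
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    quantity    INTEGER,
    revenue     REAL
);
"""


def build_star_schema():
    """Create the hypothetical star schema in memory and load sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "bacon", "food"), (2, "eggs", "food"), (3, "pan", "kitchen")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(20240105, 2024, 1), (20240210, 2024, 2)],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [
            (1, 20240105, 3, 15.0),
            (2, 20240105, 12, 6.0),
            (3, 20240210, 1, 30.0),
            (1, 20240210, 2, 10.0),
        ],
    )
    return conn


def revenue_by_category(conn):
    """Typical star-schema query: join the fact to a dimension, then aggregate."""
    return conn.execute(
        """
        SELECT p.category, SUM(f.revenue) AS revenue
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
        """
    ).fetchall()


if __name__ == "__main__":
    print(revenue_by_category(build_star_schema()))
```

Other approaches such as Data Vault or Inmon would split the same data differently, which is exactly why it pays to know more than one.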
Programming languages
Problem: Data engineering is usually perceived as a Python-heavy area, and people sometimes look down on data engineers because of it. A lot of DE work is done in Python; e.g., orchestrators like Airflow and Luigi are built with Python. You have probably heard of pandas and other nice libraries that are gaining traction in the DE community. But take Spark, which is super important in the Big Data ecosystem. It runs on the JVM but has a Python interface for easier and faster adoption. Using it from Scala/Java has its own use cases, and sometimes performance is much better there because the code runs directly on the JVM.
Solution: Know the fundamentals of any language and the principles of OOP, and everything else will follow. Please don’t get stuck on one language; sometimes, it doesn’t make sense to have the whole stack tied to Python only. Maybe you want the advantages of statically typed languages, such as getting errors at compile time and not only at runtime. Discuss it with the team, try to introduce it, and see if it works. Sometimes you will need a different tool to do the job right. You wouldn’t drive a nail into a wall with a screwdriver, right?
Docker/Virtual environments
Problem: Docker and Kubernetes are the future. Spinning up local isolated environments helps you develop faster and more efficiently, with no need to push to master and run on prod. It’s super easy to spin up a small instance of a DB locally, insert some test data, and develop pipelines faster on a smaller scale. The same goes for Python virtual environments. It’s a pain in the a$$ to manage dependencies and keep them installed locally. Use virtual environments (or even Docker images; at least VSCode and PyCharm work really nicely with them, too!) to clean up the mess and have more structure.
Solution: Learn the basics, use them, master them. I recommend a fantastic hands-on tutorial on Docker and a similar one on Kubernetes. They will give you an excellent perspective on those tools and some experience with them too. But I strongly advise that after you go through them, you don’t list them as a skill yet. That’s just the basics.
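As a small illustration of the virtual environment part, the sketch below creates an isolated Python environment using the standard library venv module. In day-to-day work you would more likely run python -m venv .venv from a terminal; this is the programmatic equivalent. The helper name create_env and the .venv directory are assumptions for the example, and pip is deliberately left out (with_pip=False) to keep it fast.

```python
import tempfile
import venv
from pathlib import Path


def create_env(target: Path) -> Path:
    """Create an isolated virtual environment at `target` and return its path.

    with_pip=False skips bootstrapping pip, so the environment is created
    quickly; pass with_pip=True if you want to install packages into it.
    """
    venv.create(target, with_pip=False)
    return target


if __name__ == "__main__":
    # Use a temporary directory so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = create_env(Path(tmp) / ".venv")
        # pyvenv.cfg marks the directory as a virtual environment.
        print((env_dir / "pyvenv.cfg").read_text())
```

Each project gets its own environment this way, so dependencies of one pipeline do not clash with another.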
Data Area in general
Problem: New tools and frameworks for working with data are born every day. You have to adapt quickly to different tools and to what’s happening in the landscape in general. I see many people saying they work with Apache Spark 1.6, Sqoop, etc. Some of these have newer and better versions released, and others have been retired in favor of more recent and more performant tools.
Solution: Follow people on Twitter and LinkedIn who share news, and have a basic understanding of the landscape and of the tools that can do the job. Try to keep up with the market and try out new things.
Summary
By working on these topics/areas, you’d at least catch my eye on social media. That’s what I usually look for in a great data engineer. If you have any questions or want some advice, feel free to connect or DM me on LinkedIn. It’s my way of contributing to the community.