Data Scientist AKA OG Jack of all trades
A look at the Data Scientist role: how it's perceived in today's world and what data scientists actually do.

Introduction
Let’s start with a definition. Data Scientist - “a person employed to analyse and interpret complex digital data, such as the usage statistics of a website, especially to assist a business in its decision-making.” When I asked Erikas to check that what I’d written wasn’t complete nonsense, he mentioned another definition he has seen: a Data Scientist solves problems by creatively applying mathematics to data. This one is more down to earth and shows that, as always in data, you need creativity.
This part is less familiar to me, so I will rely on some things I found online. I don’t have any production ML experience, and I haven’t built an ML platform. I am a mere user of available libraries and frameworks.
If you look at Google Trends for “Data Scientist”, you can see that interest took off around 2013.
My wild guess is that it became more popular because people running Hadoop had started to pile up tons of data. Once the data had been collected, it had to be put to some use, but until something easy to use appeared (such as reading data from HDFS with Spark, which made its entrance in 2013), the role wasn’t that popular. What also changed with Hadoop is that compute was beginning to separate from storage, so you could scale horizontally more easily. This also made computing power more available to the public.
At the time, I didn’t even know the term “data science”. It might be that the term came to Lithuania quite late, because I had already heard about TensorFlow, PyTorch and other libraries and had written a master’s thesis on neural nets in Java.
The average Data Scientist
Have you wondered why the term “Recovering Data Scientist” exists, or why people with this title switch to Data Engineering? I noticed this too when talking to Data Scientists on my podcast and at live events. The thing is that companies are usually looking for unicorns. They think they have clean data, and that they can
select * from my_very_clean_and_structured_table
and then
trained_model = some_model.train(my_very_clean_and_structured_dataframe)
predict_data = trained_model.predict(my_new_values)
In reality, if your first data hire is a data scientist and you live in some Barbie world expecting them to deliver a fancy ML model and increase your business metrics - you’re wrong. There is a reason data people say: Garbage In, Garbage Out.
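To make “Garbage In, Garbage Out” a bit more concrete, here is a minimal sketch of the kind of checks I would want to pass before any model sees the data. Everything in it is an assumption for illustration: the function name data_quality_report, the field names and the two-day freshness threshold are hypothetical, and a real setup would more likely lean on a dedicated data quality tool.

```python
from datetime import datetime, timedelta, timezone


def data_quality_report(rows, required_fields, timestamp_field, max_age, now=None):
    """Summarise basic quality problems in a list of dict rows (illustrative only)."""
    now = now or datetime.now(timezone.utc)
    total = len(rows)

    # Share of rows where a required field is missing or None.
    missing_rate = {
        field: (sum(1 for r in rows if r.get(field) is None) / total) if total else 0.0
        for field in required_fields
    }

    # Exact duplicates: rows whose key/value pairs are all identical.
    unique_rows = {tuple(sorted(r.items())) for r in rows}
    duplicate_rows = total - len(unique_rows)

    # Freshness: the table counts as stale if its newest record is older than max_age.
    timestamps = [r[timestamp_field] for r in rows if r.get(timestamp_field) is not None]
    newest = max(timestamps) if timestamps else None
    is_stale = newest is None or (now - newest) > max_age

    return {
        "rows": total,
        "missing_rate": missing_rate,
        "duplicate_rows": duplicate_rows,
        "newest": newest,
        "is_stale": is_stale,
    }


# Hypothetical sample data: one missing amount and one exact duplicate row.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
rows = [
    {"user_id": 1, "amount": 10.0, "updated_at": datetime(2024, 1, 9, tzinfo=timezone.utc)},
    {"user_id": 2, "amount": None, "updated_at": datetime(2024, 1, 8, tzinfo=timezone.utc)},
    {"user_id": 1, "amount": 10.0, "updated_at": datetime(2024, 1, 9, tzinfo=timezone.utc)},
    {"user_id": 3, "amount": 5.0, "updated_at": datetime(2024, 1, 7, tzinfo=timezone.utc)},
]
report = data_quality_report(
    rows, ["user_id", "amount"], "updated_at", max_age=timedelta(days=2), now=now
)
print(report)
```

If a report like this already looks bad on the “very clean and structured” table, no amount of model tuning is going to save the project.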
First, you must have a good data foundation - clean and well-structured data which is refreshed on time. Then, you must have good, business-oriented people who could help build a case and think of actual implementations of ML/AI, not just chasing hype. What’s the plan for releasing some ML models in prod? Do you have infrastructure and people who could do deployments? Is there a way to do A/B testing of models or at least do gradual rollouts with your ML models?
Doing ML in production is quite a heavy and important task. You might already have some of the things mentioned above in software engineering, but can your data folks actually use them? Or do you have the capacity to support them in doing it?
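As a rough idea of what a gradual rollout can look like in code, here is a minimal sketch that routes a stable share of users to a new model by hashing their id. The names rollout_bucket and choose_model, the salt and the "candidate"/"current" labels are hypothetical, and this is just one common approach, not a description of any specific platform.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "model-v2-rollout") -> float:
    """Map a user id to a stable number in [0, 1) using a hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # The first 8 hex characters are a 32-bit integer; scale it into [0, 1).
    return int(digest[:8], 16) / 0x100000000


def choose_model(user_id: str, rollout_share: float) -> str:
    """Send roughly `rollout_share` of users to the candidate model."""
    return "candidate" if rollout_bucket(user_id) < rollout_share else "current"


# Hypothetical user ids: check how many land on the candidate at a 10% rollout.
users = [f"user-{i}" for i in range(10_000)]
share_on_candidate = sum(choose_model(u, 0.10) == "candidate" for u in users) / len(users)
print(f"share on candidate model: {share_on_candidate:.3f}")
```

Because the bucket depends only on the user id and the salt, the same user always gets the same model, and raising the share from 10% to 50% keeps the original 10% on the candidate. The same split can serve as the assignment step of an A/B test.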
Let’s talk about skillsets. I’m generalising them quite a bit, and some are squished together.
Statistics - to be good at your job and not blindly call fit() and predict(), you must understand the math behind the most popular models. That lets you adjust, fork, and build your own model. A strong base in statistics will also let you judge whether a change in an impact metric is significant, so you can be sure whether the result is actually better.
Programming - you must produce code no matter what you do, whether it is feature engineering or building pipelines. Also, don’t forget basic version control and SQL (you can’t avoid it, sorry).
Data visualisation - it’s hard to notice patterns when looking at a table; you’re not a robot, and it’s easier with a chart. Usually, knowing some basic data viz techniques is enough for data exploration.
Domain knowledge and business acumen - knowing the domain will help you better understand potential suggestions for ML applications; it might even help you with some feature extraction/generation tasks. Knowing the business will allow you to shape and tackle the task from the right angle, so you spend less time bashing your head against the wall.
Communication and project management - you will treat every task as a project here. You’ll have control of the planning and how you tackle the problem. Still, you’ll also have to communicate the plan to stakeholders, inform them about findings and present the final solution and its impact on the business.
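To illustrate the statistics point about checking whether an impact metric really moved, here is a minimal sketch of a two-proportion z-test on conversion counts from two variants. The function name and the numbers are made up for illustration, and a real analysis would also need to think about sample size planning and peeking at results early.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: the current model (A) versus the new model (B).
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value only says the difference is unlikely to be noise; whether a lift of that size matters to the business is still a conversation with stakeholders.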
Seniority
Again, here, juniors don’t exist. You can’t be successful if you’re not battle-tested in other areas. Theoretical knowledge counts a bit, but if you don’t know how to apply it in practice, you’ll have a bad time.
Both paths are beneficial - moving to a Data Scientist role from either a Data Analyst or a Data Engineer role. The analyst role will help you with exploratory analysis, and the engineer role will help you with the engineering part. But you won’t be good at coding or analytics if you haven’t been in those roles. I tried to write an article about a data science project I was attempting, and it was hard: I bashed my head against a wall or a blocker many times because I lacked experience. You can read about my attempt here.
Data Science 101
Background
I have a statistical background, though I’ve never actually worked with statistics. I got rusty; I still remember fragments about what MSE, AUC, etc. are, but that is it. I would have to start almost from zero if I wanted to progress further (the statistics courses I took at university were 10 years ago).
Mid-Senior
In a perfect world, you’re going to do a lot of research on the task at hand and communicate with different business stakeholders to understand the impact and how to frame the problem as an ML task (I’ve been reading some books to improve my knowledge, at least in theory, so I’ll use some book-specific terminology). When you have that figured out, you’ll create datasets, version them and explore the data, trying to identify which features have an effect and which have zero importance. After that, you’ll build data wrangling pipelines to get everything into the input format your model expects, and you’ll fine-tune the model by changing hyperparameters and checking which variant gives the best result or moves the primary business metric in the right direction. Then you might deliver a presentation or try to productionise the model yourself if there are no MLOps folks.
You might ask why I’m mashing Mid and Senior together. It’s because, depending on the company, you might get a subset of the flow or all of it, even if you’re technically mid-level. The only difference is that a Senior is already a seasoned veteran in this field and can handle deployment-related matters (or even the whole thing, if that’s the usual practice in the company).
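The hyperparameter step of that flow can be sketched roughly like this: fit the same model with several hyperparameter values and keep the one with the best score on held-out data. To keep it self-contained, I swapped in a toy one-feature ridge regression with synthetic data instead of a real library model; fit_ridge_1d, grid_search and the list of alpha values are all hypothetical names and choices.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge regression for one feature and no intercept."""
    # Assumes the denominator is non-zero, which holds for the data below.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def grid_search(train, valid, alphas):
    """Return (validation MSE, alpha, weight) for the best-scoring alpha."""
    results = []
    for alpha in alphas:
        weight = fit_ridge_1d(train[0], train[1], alpha)
        score = mse(valid[1], [weight * x for x in valid[0]])
        results.append((score, alpha, weight))
    return min(results)  # the lowest validation error wins


# Synthetic data: y = 3x plus noise, split into train and validation sets.
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [3.0 * x + rng.gauss(0, 0.5) for x in xs]
train = (xs[:150], ys[:150])
valid = (xs[150:], ys[150:])

alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
best_score, best_alpha, best_weight = grid_search(train, valid, alphas)
print(f"best alpha = {best_alpha}, validation MSE = {best_score:.3f}")
```

In practice you would use your library’s own search utilities and, ideally, also check the winner against the primary business metric rather than trusting the offline score alone.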
Team Lead
There is an option to centralise the Data Science team, but that makes the team a bottleneck for deliveries, and all the knowledge stays bottled up inside it. Still, on the upside, it allows the team to tackle everything as a project and cover for each other if needed. Most likely, as always, a team lead is still trying to deliver something, but here they’re more involved in setting priorities, pushing back on new requests from management, etc. The person should make sane decisions on what comes next, what is important from a DS point of view, and what is a vanity project. On top of that, this person may also be expected to deliver some things too.
Staff
This person should take a broader, more holistic view: pushing for better practices, trying new things and spreading standards across the organisation. They should agree strategically with management and other teams on the importance of potential tasks at the organisational level, which later trickle down to data science teams for implementation. As usual, mentoring and education are expected from this person as well. Don’t be surprised if you write tons of documentation and sit in cross-functional/strategic meetings. You’re delivering more value by showing what is what from a DS point of view. You’re the person who’s guiding them. You must lead by example (technical leadership is leadership, too).
However, I think people often switch to an ML Engineer role, which is more technical and, in a way, gives more satisfaction, with fewer failed projects.
Summary
Be ready to deal with many tasks and responsibilities of other data functions, and be aware that some initiatives might lead to failures and frustrations. If you see that it’s not a fit for you, look for a role in a bigger org where work is more divided by specialisation. Move to a startup only if you’re quite experienced and resilient, so you won’t burn out doing a job that would usually be done by at least three different people.




