Airflow, dbt, and Postgres
Tired of recreating Docker images for PoCs in data engineering? I created one for you.
Introduction
I'm not sure about you, but I do side projects quite a lot. Usually they're there to test different technologies and tools, see how they interact, and check whether I could use them to some extent in my work.
My usual flow is to create a Docker image, or a docker-compose setup if several apps need to interact, so that I have an isolated and controlled environment. While I'm not a Docker expert, creating the environment is sometimes more than 50% of the work of the whole PoC. I run the PoC, get some results, and rarely commit my code to GitHub. A couple of months pass, a new idea pops into my head, and suddenly I can't find where I put the old code. Almost every time I do this, I hit the same issue.
To stop this from happening, I'll be creating multiple repositories that people can fork and use to play around with different tools and their interactions as they like.
Airflow
My de facto go-to scheduler. I've been using it since ~2019, starting as a simple DAG author and ending up migrating, together with a team, from Airflow 1.10.X to 2.1.X while applying best practices. You can read about it in the blog post I wrote while at HomeToGo:
Apache Airflow at HomeToGo
HomeToGo journey in having data flows orchestrated with Airflow. Pains and issues we encountered and how we tackled…
I chose it because it's the tool I'm most familiar with, but I won't limit myself to it later, so there will be repositories with other orchestrators too!
The Airflow image I create is a simpler one, without Redis and Celery. I run two separate containers, one as a webserver and one as a scheduler, and of course a back-end DB.
As you can see from the Dockerfile, I'm using Airflow 2.4.2 with Python 3.9. I'm not going to lie; I chose this Airflow version to test dataset scheduling later and to familiarize myself with the new UI. I had stopped keeping track of versions after 2.2.2.
The code itself is pretty straightforward. I install a few system packages, e.g., libpq-dev for interacting with the Postgres DB and git for dbt, plus the Python packages from requirements.
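To give a rough idea, here is a minimal Dockerfile sketch along those lines; the base image, constraint pinning, and extra packages (gcc, postgresql-client) are my assumptions rather than a copy of the repository's file:

```dockerfile
FROM python:3.9-slim

# System packages: libpq-dev for Postgres connectivity, git for dbt packages.
# gcc and postgresql-client are my additions (building psycopg2, psql in the entrypoints).
RUN apt-get update && apt-get install -y --no-install-recommends \
        libpq-dev gcc git postgresql-client \
    && rm -rf /var/lib/apt/lists/*

ENV AIRFLOW_HOME=/opt/airflow

# Airflow 2.4.2 on Python 3.9, pinned with the official constraints file
RUN pip install --no-cache-dir "apache-airflow==2.4.2" \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.4.2/constraints-3.9.txt"

# Project Python dependencies (dbt-core, dbt-postgres, providers, etc.)
COPY requirements.txt /requirements.txt
RUN pip install --no-cache-dir -r /requirements.txt
```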
I've created two bash scripts to control the scheduler and webserver start-up behavior.
The scheduler script runs dbt clean and dbt deps, creates the data_warehouse DB if it doesn't exist, initializes the Airflow database, and starts the scheduler; a sketch follows below.
Now before you go and judge me:
I never knew bash well; all comments are more than welcome, and I'm glad to improve and learn better practices.
Hard-coding the username, password, and IP: at the moment I'm not sure how to do it another way. I guess I could have used some environment variables and added them all over the place, but I had already been bashing my head against this for longer than I wished.
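For orientation, here is a rough sketch of what such a scheduler entrypoint could look like; the paths, credentials, and IP are illustrative placeholders, and it assumes psql is available in the image:

```bash
#!/usr/bin/env bash
set -e

# Point dbt at the project-local profiles.yml instead of ~/.dbt (assumed path)
export DBT_PROFILES_DIR=/opt/airflow/transform/data_warehouse

# dbt housekeeping: clean stale artifacts and pull package dependencies
cd /opt/airflow/transform/data_warehouse
dbt clean
dbt deps

# Create the data_warehouse DB if it doesn't exist yet
# (hard-coded credentials/host, as admitted above; purely illustrative values)
export PGPASSWORD=airflow
psql -h 172.28.0.2 -U airflow -d postgres -tAc \
  "SELECT 1 FROM pg_database WHERE datname = 'data_warehouse'" | grep -q 1 || \
  psql -h 172.28.0.2 -U airflow -d postgres -c "CREATE DATABASE data_warehouse"

# Initialize the Airflow metadata DB and hand over to the scheduler
airflow db init
exec airflow scheduler
```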
The webserver script is way more straightforward:
create admin user
start web server
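A minimal sketch of such a webserver entrypoint, with placeholder user details:

```bash
#!/usr/bin/env bash
set -e

# Create an admin user; ignore the error if it already exists
airflow users create \
  --username admin \
  --password admin \
  --firstname Admin \
  --lastname User \
  --role Admin \
  --email admin@example.com || true

# Start the webserver in the foreground
exec airflow webserver
```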
Since I've mounted the dags folder from the repository, put your DAGs there and the scheduler will parse them.
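To sanity-check the mount, a trivial DAG like this hypothetical dags/hello_dag.py should show up in the UI shortly after you drop it in:

```python
# dags/hello_dag.py -- minimal DAG to confirm the mounted folder is parsed
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_dag",
    start_date=datetime(2022, 10, 1),
    schedule="@daily",  # Airflow 2.4+ spelling of schedule_interval
    catchup=False,
) as dag:
    BashOperator(
        task_id="say_hello",
        bash_command="echo 'hello from the mounted dags folder'",
    )
```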
dbt
At the moment, it's my go-to tool for transformations. Installation is relatively easy: the dbt-core package plus a dbt-<database> adapter package (dbt-postgres in this case).
The only adjustment I've made for maintainability is a separate profiles.yml file for credentials, so the PoC doesn't interfere with anything in prod. I've also added a reminder of how it should look for Postgres.
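As a shape-only reminder, a Postgres profiles.yml could look roughly like this; the profile name, schema, and credentials are placeholders you'd align with your docker-compose values:

```yaml
# transform/data_warehouse/profiles.yml -- illustrative values only
data_warehouse:            # must match the profile name in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: postgres
      host: 172.28.0.2     # the static IP assigned in docker-compose (see the Postgres section)
      port: 5432
      user: airflow
      password: airflow
      dbname: data_warehouse
      schema: public
      threads: 4
```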
Create your models in the transform/data_warehouse folder as you would for dbt projects.
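For example, a hypothetical first model can be as small as a single SQL file:

```sql
-- transform/data_warehouse/models/example/my_first_model.sql (hypothetical path)
{{ config(materialized='table') }}

select
    1 as id,
    'hello dbt' as message
```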
If you're interested in dbt, you can also check out my post about getting started.
Oh, my dbt (data build tool)
My experience and a couple of notes of using this superb tool for a month
Postgres
Creating a docker-compose setup for local Airflow development is nothing new; you quickly get the hang of what goes where. Why am I talking about Airflow when I named this section Postgres? Because I always use Postgres as the back-end DB for Airflow: I hate MySQL with all my heart, and SQLite doesn't support parallelism (it would work for a PoC, but why wait for tasks to run one after another?). Since I decided to also make it my data warehouse DB, not only the Airflow back end, I had to make some adjustments, especially for dbt, which needs a YAML file with connection information (credentials included). That file needs a host it can always reach, and here came the issue: I had no clue how to pin the database to a stable address, so I stitched up this beauty from Stack Overflow and some other posts I found on Google.
It's pretty simple: define a networks object with the relevant information and then reference it in your services. That way, each time I spin up my docker-compose, the database comes up on the same IP and port.
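A sketch of the relevant docker-compose pieces; the image tag, subnet, and address are made up:

```yaml
# docker-compose.yml (excerpt) -- illustrative subnet and address
version: "3.8"

networks:
  airflow_net:
    ipam:
      config:
        - subnet: 172.28.0.0/16

services:
  postgres:
    image: postgres:14
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5432:5432"
    networks:
      airflow_net:
        ipv4_address: 172.28.0.2   # fixed address the scripts and profiles.yml can point at
```

Within a compose network the containers could also reach each other by service name, but pinning an address like this keeps every hard-coded host in the scripts and profiles stable between runs.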
Running it
Running it is also relatively easy if you're familiar with docker-compose.
docker-compose build
to build your images, and
docker-compose up
to run it.
Summary
All in all, it was fun and a pleasant experience getting this all to run. I learned a bit about Docker networks and how to make at least some apps use a static IP to mimic real life.
You can find this repository here.