
When I started in data engineering, we used monolithic solutions to solve our problems. There was the ETL tool, the data warehouse (probably an SQL DB), and some reporting tool on top (or just plain MSSQL with SSIS, SSAS, SSRS 😂). Fast-forward to today, and we've gone from three components to fifteen or more, each more specialised than the last.
I won't lie; I've been part of this expansion. In one of my previous places, I advocated for specialised tools for specific problems. Need orchestration? Let's use Airflow! For transformations? dbt! Data quality assurance? Maybe Great Expectations would fit?! While each tool solves a specific problem exceptionally well, I've noticed a pattern recently—sometimes, too many tools create more problems than they solve.
The Problem
With the rise of the modern data stack, we've been building increasingly distributed architectures. Here's what I've observed in the field:
Onboarding time for new team members has increased dramatically
Debugging issues has become more complex as data passes through multiple systems
Integration points have become single points of failure
Teams spend more time maintaining infrastructure than delivering insights
At one company, a simple pipeline modification required changes in five different systems. What should have taken hours took days because so many specialised tools were involved. This is not sustainable.
The Market Is Responding
The good news is that I'm not alone in recognising this problem. Recent market trends suggest a shift toward consolidation in data engineering tooling. lakeFS's 2024 State of Data Engineering report identifies consolidation between categories as one of the primary trends in the data engineering space.
For example, Dagster now offers Dagster+, which includes a built-in catalogue, lineage, and data observability capabilities – functions that previously required separate tools (some of them exist in the open-source version too).
Similarly, Orchestra, a newer entrant to the market, emphasises this shift by consolidating "all the good bits of the data stack into a platform a data team of any size can use," specifically mentioning "No more Airflow, platform team kubernetes, monte carlo $100k p.a. nonsense". This suggests a growing recognition that the fragmented approach isn't working for many teams.
Airflow 3: Addressing Fragmentation
One of the most interesting developments in this space is the upcoming Airflow 3 release. Apache Airflow 3 represents one of the most significant architectural shifts in the project's history, focusing on consolidation and improved integration. A key component of this is the Task Execution Interface (AIP-72), which evolves Airflow toward a client-server architecture.
This architectural change is specifically designed to address several core issues with fragmentation:
Multi-language support: By introducing a Task SDK, Airflow 3 will allow developers to write tasks in languages other than Python, reducing the need for separate tools for different language environments. The same foundation also enables multi-cloud deployments (a short authoring sketch follows these points).
Unified security model: The new Task Execution Interface provides better isolation and security, addressing a key concern in fragmented systems where security models often differ across tools.
DAG Versioning: One of the most requested features in Airflow has been proper versioning support, which helps manage the complexity of distributed systems. DAG versioning ensures that a run completes with the DAG version it started on, even if a newer version is deployed while the run is in progress.
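To make the Task SDK idea a bit more concrete, here's a minimal authoring sketch in the TaskFlow style. It assumes the public decorators land under an airflow.sdk import path as proposed; the DAG and task names are purely illustrative, and on Airflow 2.x the equivalent decorators live in airflow.decorators.

```python
from datetime import datetime

# Assumption: Airflow 3 exposes the authoring decorators through the new
# Task SDK package (`airflow.sdk`); on Airflow 2.x use
# `from airflow.decorators import dag, task` instead.
from airflow.sdk import dag, task


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def orders_pipeline():
    @task
    def extract() -> list[dict]:
        # Stand-in for pulling records from a source system.
        return [{"order_id": 1, "amount": 42.0}]

    @task
    def load(rows: list[dict]) -> None:
        # Stand-in for writing to the warehouse.
        print(f"loading {len(rows)} rows")

    load(extract())


orders_pipeline()
```

The interesting part is less the Python itself and more that, with the Task Execution Interface in place, the same scheduler could drive tasks written against SDKs in other languages.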
Dagster: Rebundling the Data Platform
Dagster has taken a slightly different approach to addressing the fragmentation problem, but with a similar goal of consolidation. Their team explicitly articulated this vision in a blog post titled "Rebundling the Data Platform", arguing that "Having this many tools without a coherent, centralised control plane is lunacy, and a terrible endstate for data practitioners and their stakeholders".
Dagster approaches consolidation through "software-defined assets" - a fundamentally new orchestration approach focusing on assets rather than tasks. This enables orchestration and a fully integrated understanding of data assets declared in other tools, effectively rebundling the platform by ingesting these assets into a single surface that provides a base layer of lineage, observability, and data quality monitoring.
Their fully managed Dagster+ platform further exemplifies this consolidated approach by offering "a consolidated view" of asset information, including "freshness, status, schema, metadata, and dependencies." This brings together multiple aspects of data management that previously required separate specialised tools.
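To make "software-defined assets" concrete, here's a minimal sketch using Dagster's asset decorator; the asset names and logic are invented for illustration. Each function declares a named data asset, and Dagster derives the lineage graph from the function parameters, which is what makes a single consolidated surface possible.

```python
from dagster import Definitions, asset


@asset
def raw_orders():
    # Stand-in for an ingestion step (API pull, warehouse extract, etc.).
    return [{"order_id": 1, "amount": 42.0}]


@asset
def daily_revenue(raw_orders):
    # Dagster infers the lineage edge from the upstream asset's parameter name.
    return sum(row["amount"] for row in raw_orders)


# Registering both assets gives the single "surface" described above:
# lineage, status, and metadata for each asset in one place.
defs = Definitions(assets=[raw_orders, daily_revenue])
```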
Orchestra: A Unified Control Plane
Orchestra represents another approach to tackling the fragmentation problem, positioning itself as a "unified control plane" for data teams that helps them "connect any Data Tool to reliably and efficiently release data." What's particularly interesting about Orchestra's approach is that it explicitly targets the pain points of the modern data stack's fragmentation.
Orchestra's founders understand that in a real-world scenario, data teams are likely already using multiple specialised tools. So, rather than replacing everything, they've built what they call a "Data Orchestration" platform that goes beyond simple "workflow orchestration" to provide a holistic view of the entire data pipeline.
Their platform focuses on solving the real operational problems that arise from fragmentation, like the inability to see the status of your end-to-end pipeline without jumping between multiple platforms, or the challenges in debugging when something goes wrong across system boundaries.
dbt vs SQLMesh: The Transformation Layer Consolidation
The trend toward consolidation isn't limited to orchestration platforms. Even in the transformation layer, we see signs of competition and consolidation. dbt has been the dominant player in SQL-based transformations, but newer tools like SQLMesh are emerging with a more comprehensive approach.
SQLMesh bills itself as a "robust DataOps framework" that was "designed from the ground up" to address limitations in dbt, particularly around state management, incremental loads, and other features that are "patched in" to dbt but are "fundamental building blocks" in SQLMesh.
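As a rough illustration of what "fundamental building blocks" means in practice, here's a sketch of a SQLMesh Python model. The model name, columns, and time column are invented, and the import paths and kind configuration reflect my reading of SQLMesh's documented Python model API rather than anything from this article; the point is that SQLMesh hands the model the time interval to compute and tracks processed intervals itself, rather than leaving that state management to the user.

```python
import typing as t
from datetime import datetime

import pandas as pd
# Assumption: these import paths follow SQLMesh's Python model API.
from sqlmesh import ExecutionContext, model
from sqlmesh.core.model.kind import ModelKindName


@model(
    "analytics.daily_orders",  # hypothetical model name
    kind=dict(
        name=ModelKindName.INCREMENTAL_BY_TIME_RANGE,
        time_column="event_ts",
    ),
    columns={"event_ts": "timestamp", "order_count": "int"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # SQLMesh passes the interval it wants materialised and records which
    # intervals have already been processed, so the model only produces
    # rows for [start, end) instead of managing its own state.
    return pd.DataFrame({"event_ts": [start], "order_count": [0]})
```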
Interestingly, some observers in the community see dbt "copying" features from SQLMesh as it works to retain its dominance. This kind of feature convergence is common in software - remember the saying that "imitation is the sincerest form of flattery." We see dbt gradually expanding its capabilities beyond its core transformation focus, while newer entrants like SQLMesh start with a more comprehensive feature set from day one.
This dynamic is further evidenced by dbt Labs' recent acquisition of SDF Labs in early 2025. The acquisition brings "SQL comprehension technology" into dbt with the promise of "faster dbt project compilation (~2 orders of magnitude)" and significantly improved developer experience with features like type-ahead suggestions in IDEs. This move acknowledges some of the key pain points in the dbt ecosystem, particularly around compile times and developer experience, that competitors like SQLMesh had previously addressed.
Before the acquisition, dbt treated SQL as strings, but with SDF's technology, dbt can now understand "the SQL code a user is writing, immediately as it's being written", which "allows developers to embrace modern development accelerants like code completion and content assist as well as pinpoint errors and ensure data quality far earlier in the development process". This is a clear example of a tool that began with a focused approach, now expanding to address adjacent concerns as the market matures.
However, dbt's approach to state management, with features like the microbatch incremental strategy, still requires the user to handle much of the complexity, unlike SQLMesh's more built-in approach to these challenges. It's another area where the competition between tools drives improvements across the ecosystem.
The Open Table Format Consolidation
A similar pattern of competition and consolidation is playing out in the open table format space, where multiple formats (Apache Iceberg, Delta Lake, Apache Hudi, and newer entrants like Apache Paimon) compete for adoption while simultaneously developing interoperability layers.
The challenge is clear: organisations need standards for data storage, but different table formats excel at different use cases. As Kai Waehner's analysis points out, "Apache Iceberg seems to become the de facto standard across vendors and cloud providers", but still competes with other formats like Hudi, Paimon, and Delta Lake.
Rather than forcing a winner-takes-all approach, the community has responded with interoperability solutions like Apache XTable (formerly OneTable), which "provides abstraction interfaces that allow omnidirectional interoperability across Delta, Hudi, Iceberg, and any other future lakehouse table formats".
This approach elegantly solves the fragmentation problem without requiring complete standardisation. Organisations can choose the best format for each workload while maintaining interoperability across their data ecosystem.
As Jack Vanlightly notes in his analysis of table format interoperability, this cross-publishing approach allows you to "write in one format, read as another," which means you "write the Parquet files only once (the most costly bit) and reuse them for the secondary formats".
The parallels to the broader data engineering ecosystem are striking: just as Apache XTable provides a translation layer between different table formats, tools like Airflow 3, Dagster, and Orchestra are working to provide a unified layer across different data engineering tools and processes.
Imitation: The Sincerest Form of Flattery
As the saying goes, "Imitation is the sincerest form of flattery." It's fascinating to observe how these different platforms and frameworks - from orchestration tools like Airflow, Dagster, and Orchestra to transformation tools like dbt and SQLMesh, to storage formats like Iceberg, Delta Lake, and Hudi - are all implementing similar solutions to the fragmentation problem, albeit with different approaches and emphases.
This convergence on the need for consolidation suggests that the market has recognised a genuine pain point. The fact that different teams, working independently, have arrived at similar conclusions about integrating and unifying data tooling further validates that the neo-monolith trend is real and addresses a legitimate industry need.
We're seeing not direct copying but multiple engineering teams experiencing the same problems and independently arriving at similar solutions. This parallel evolution indicates that these approaches are likely on the right track.
Engineering Is Cyclical, Not Just Data Engineering
The pendulum swing between monolithic and distributed architectures isn't unique to data engineering - it's a pattern we see across software engineering. Even Amazon has demonstrated this cycle in action. In 2023, the Amazon Prime Video team found that switching from microservices to a monolithic architecture for their video monitoring service reduced costs by 90% and improved scalability.
Their initial distributed system, built on AWS serverless components, hit scaling limits at only 5% of the expected load, with high costs from data transfers between components. They achieved better performance by consolidating everything into a single process, though the team emphasised that this solution worked for their specific case.
Finding the Right Approach
The question isn't whether we should consolidate our data stacks, but how and to what extent consolidation makes sense for each organisation's needs. In some scenarios, there truly is "a best tool for the job" based on market adoption and specific requirements. Tools with broad adoption often benefit from robust community support, extensive documentation, and proven stability in production environments.
It's not about blindly adopting a neo-monolithic approach but rather about being thoughtful about where consolidation provides the most value. This requires understanding your organisation's unique data landscape, team composition, and business requirements.
Some considerations that might help in evaluating your current stack:
Cognitive load: How much mental overhead does your current stack place on team members? Are engineers spending more time context-switching between tools than solving actual business problems?
Integration points: Where are your current architecture's most fragile or problematic integration points? These might be candidates for consolidation.
Team expertise: What tools does your team already know well? Sometimes, consolidating around existing expertise is more practical than adopting an entirely new platform.
Future flexibility: Will a consolidated approach allow you to adapt as your needs evolve?
Summary
Engineering is cyclical. We've gone from monoliths to extreme distribution and are now finding a middle ground. The neo-monolith approach isn't about going backwards but being intentional with complexity. Sometimes, the best solution isn't adding another specialised tool but finding one good tool that adequately solves multiple problems.
This trend is further supported by industry projections: the big data and data engineering market is expected to grow significantly in the coming years, with an emphasis on "scalable, agile, and innovative data management strategies" (https://www.rishabhsoft.com/blog/latest-trends-in-data-engineering). The most successful organisations will be those that can balance specialised functionality with operational simplicity.
As we see with the upcoming Airflow 3 release, Dagster's rebundling approach, Orchestra's unified control plane, the competition between dbt and SQLMesh, and the interoperability efforts in the open table format space, the pendulum is swinging back toward more consolidated approaches that preserve flexibility while reducing the operational burden of managing dozens of specialised tools.