Versus

March 2, 2024 and updated March 24.

There’s been a quiet war going on for the past 20 years between DBAs and software developers.

The traditional “data warehouse” was designed to make OLAP workloads look like queries against any other SQL database. As per tradition, the DBAs were in charge: if you wanted a new table or index, you had to go through them.

Then developers came along with Hadoop (later Spark) and convinced everyone to write MapReduce jobs on a distributed filesystem (later blob stores like S3). Individual programmers had all the power, and the DBAs were left to drown in the Data Lake.

The DBAs fought back with metadata stores (Hive) and query engines (Hive, Presto/Trino) to make that big mess of compute and storage act like a halfway-decent SQL database again (the Lakehouse) and bring some order to the chaos.

Developers countered with Data Mesh. You want a new table? Go build a microservice.

If I were to predict the next step of this cycle, it would be the “Mesh House,” a curated view of the Data Mesh via a centralized SQL-like interface.

Of course, “DBAs” and “developers” are really just representatives of two opposing goals: curation and autonomy. The tech industry regularly swings from one to the other and back again, never reaching a stable equilibrium.

In principle, there’s no reason why we can’t have both: a curated core with autonomy at the edges. But somehow the curators can never resist trying to control absolutely everything, while the individualists always chafe at the merest suggestion of centralized authority. Chalk it up to human nature, I suppose. We’re not all that good at compromise.

Update March 24, 2024:

This extends to naming too. The DBAs want everything to have a name like database_name.table_name. Sometimes you can add a schema_name level as well, but three layers is the max you’re going to get in terms of hierarchical divisions.

Meanwhile, the software developers work with filesystems. You can put almost any character you want in a name, with as many layers of directory hierarchy as you can come up with.

At some point in the past, the two camps had a parley and agreed on Hive-style partitioning, a convention which, as far as I can tell, has never been formally specified but is so ubiquitous that many systems just assume or infer that’s how your directories are organized.
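For anyone who hasn't run into it, the convention is that each directory level is named key=value, so the path itself carries column values. Here is a minimal sketch of how a reader might infer partitions from a path. The function name parse_hive_partitions and the example path are my own inventions for illustration, and real engines handle more cases (escaped characters, null placeholders, type inference) than this does.

```python
def parse_hive_partitions(path: str) -> dict[str, str]:
    """Extract key=value directory segments from a file path.

    Hypothetical helper: assumes the last segment is a file name and
    that every directory of the form key=value is a partition column.
    Values are kept as strings; no type inference is attempted.
    """
    partitions: dict[str, str] = {}
    # Skip the final segment, which is assumed to be the data file itself.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        # Only directories that actually contain "=" count as partitions.
        if sep and key:
            partitions[key] = value
    return partitions


# Made-up example path, not taken from any real system.
example = "warehouse/sales/year=2024/month=03/part-0000.parquet"
print(parse_hive_partitions(example))
```

The directories without an equals sign (warehouse, sales) are simply skipped, which is roughly where the filesystem's free-form hierarchy ends and the DBA-friendly column structure begins.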