Against Metadata

Frustrations with YAML

This is a rant that will probably get me yelled at by librarians, but I feel it strongly from the perspective of a programmer. It's a visceral response, so don't take it too seriously; I may not even fully believe it on reflection.

Anyway, here goes: the distinction between data and metadata is a false dichotomy.

Here's an archetypical example of the distinction as usually drawn: digital photography.

Data: RGB pixel values (or, if you're shooting RAW: a mapping of voltages to sensors)
Metadata: the date the photograph was taken, aperture, location, etc.

In short, most of what we actually care about is in the metadata. It lets us understand the process by which the photograph was created, its temporal context, and (in the case of a geotag) what it actually depicts. The only meaningful basis I can see for drawing this distinction is that data is what matters to the machine, and the metadata is what matters to us.

But the data, the real stuff out of which the photograph is composed, only has meaning because of us. This might seem obvious, but I think it has important consequences. Why did I write this? Because I don't like YAML. YAML was designed for metadata, and metadata alone. But now it does a lot more than what was initially asked of it. It was originally used to describe; now it is expected to perform. What was once "Yet Another Markup Language" is now "YAML Ain't Markup Language."

Sooner or later, the values in a YAML file are not metadata. They are data, and should be managed like data. When your YAML file defines the various libraries that are constituent parts of your Docker image, across its revision history it defines the successive states of your software stack. When it defines the health checks for your Kubernetes applications, it provides a window into all the times the stack failed to work and was made to work again. These are all hugely valuable sources of context for how real software gets made. Shoving them into underdesigned storage formats (or worse, spreading those values across several files with templating languages that break when using hyphens in names) and forcing people to do microsurgery on git revisions to reconstruct how the values have changed over time is tantamount to throwing this history away, because a git commit hash only tells you which values changed together, not how they relate. These two values were changed by the same commit? Cool. Do they depend on one another, or can I change one back to its previous state without altering the other? A "configuration as code" practice does not answer this question.

I feel strongly about this because it seems like programmers are doomed to reinvent wheels, over and over again. You might think that managing configuration in YAML will force you to "keep things simple." But sooner or later, the complexity spills over.

I doubt I'm the first person to notice these shortcomings; they likely motivated the development of the templating tools that make it easier to produce YAML files flexibly. It would be nice to not hard-code every JSON value, right? Maybe we can create simple expressions that allow us to swap in the results of simple arithmetical expressions, so we can increment the build number easily. String interpolation and concatenation would help us name things more easily, too. What a concept! So, the config now consists of a sequence of referentially transparent expressions that yield values based on the application of simple rules of substitution and a few special forms? Wait, I think I've seen something like this before...

Metadata is data. The same tools we use to process data efficiently, store it reliably, and link it together should be used with it. Doing things declaratively is a laudable goal, but those taking it on must recognize how complex declarative programs actually are. Otherwise, they'll just sweep the existing complexity of the subject under the rug.


Thoughts on Fossil

fossil is an alternative to git developed for and atop the sqlite database. When I first read about it, I was working through the configuration of automated build snapshots using CI tools and wanted a place to reliably store information about the dates of the failed builds and the underlying causes. I thought that a DVCS with an integrated database would be ideal for this: I could add a "builds" table with the dates of successful and failed builds and track this history right alongside the project. More generally, it would help tremendously with all the per-repository configuration management that currently gets shoved into YAML files: instead of flat key-value files, you could record the configuration values in the database backing the project with a consistent relational model.
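
To make the idea concrete, here is a minimal sketch of that kind of "builds" table, written in Python against the standard sqlite3 module rather than against fossil itself. The schema, the column names, and the helper functions (open_project_db, set_config, record_build, failed_builds) are all assumptions made up for illustration; none of this is a fossil feature.

```python
import sqlite3

# Hypothetical schema: one table for configuration values, one for build records.
SCHEMA = """
CREATE TABLE IF NOT EXISTS config (
    key   TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS builds (
    id        INTEGER PRIMARY KEY,
    built_at  TEXT NOT NULL,     -- ISO 8601 timestamp
    commit_id TEXT NOT NULL,     -- revision the build ran against
    succeeded INTEGER NOT NULL CHECK (succeeded IN (0, 1)),
    cause     TEXT               -- underlying cause, for failed builds
);
"""

def open_project_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def set_config(conn, key, value):
    """Insert or overwrite a single configuration value."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)",
            (key, value),
        )

def record_build(conn, built_at, commit_id, succeeded, cause=None):
    """Append one build record."""
    with conn:
        conn.execute(
            "INSERT INTO builds (built_at, commit_id, succeeded, cause) "
            "VALUES (?, ?, ?, ?)",
            (built_at, commit_id, int(succeeded), cause),
        )

def failed_builds(conn):
    """Return (built_at, commit_id, cause) for every failed build, oldest first."""
    return conn.execute(
        "SELECT built_at, commit_id, cause FROM builds "
        "WHERE succeeded = 0 ORDER BY built_at"
    ).fetchall()
```

With something like this, "which builds failed, and why?" becomes a query rather than an archaeology project across CI logs.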

As far as I can tell, this sort of thing isn't possible "out of the box" in fossil, but it would be a fantastic concept for a VCS that returns to the more fully expanded notion of "SCM": software configuration management instead of source control management.

Goodbye YAML?

Eno Compton and Tyler van Hensbergen gave a talk at Clojure/conj 2019 this past weekend called Goodbye YAML: Infrastructure As Code in Clojure. Unsurprisingly, I found Compton's observation that "declarative configuration" actually means wading through a "thicket of YAML" just to do anything useful with your actual code very cathartic.

Unfortunately, the hope that the library they released would solve all my YAML woes vanished when van Hensbergen mentioned that it's a Clojure wrapper atop very AWS-specific tooling. However, it got me thinking more on the prospect of embedding some kind of lightweight configuration database with an application's code in the same repository. Here are some links!

Kenneth Truyers: Git as a NoSQL Database

This short post sketches out a better way of storing key-value data in Git than shoving it in a flat whitespace-delimited file under revision control: using git write-tree, commit, and update-ref to record and construct views of arbitrary key-value data embedded directly into a bare repository rather than by putting that data into files tracked by git. The UX of this approach leaves a lot to be desired, but the fact that it's possible with the primitives that git provides means that a porcelain that makes these types of operations easier could be as widely distributed as git itself is.
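
As a rough sketch of what such a porcelain might wrap, here is a tiny key-value store built on git plumbing, driven from Python via subprocess. It is not the code from Truyers' post. To avoid needing an index or working tree, it uses hash-object, mktree, and commit-tree in place of write-tree and commit, which is my substitution. The function names (put, get, history), the placeholder commit identity, and the flat one-blob-per-key layout are assumptions for illustration, and keys are assumed to be simple names without slashes, tabs, or newlines.

```python
import subprocess

# Placeholder identity so commit-tree works even with no global git config.
IDENTITY = ["-c", "user.name=kv-sketch", "-c", "user.email=kv-sketch@example.invalid"]

def git(repo, *args, stdin=None):
    """Run a git command inside `repo` and return its stdout, stripped."""
    result = subprocess.run(
        ["git", "-C", repo, *IDENTITY, *args],
        input=stdin, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def put(repo, ref, key, value):
    """Store `value` under `key` as a new commit on `ref`; return the commit id."""
    blob = git(repo, "hash-object", "-w", "--stdin", stdin=value)
    try:
        parent = git(repo, "rev-parse", "--verify", "--quiet", ref)
    except subprocess.CalledProcessError:
        parent = None  # first write: the ref does not exist yet
    # Carry over every existing entry except the key being replaced.
    entries = []
    if parent:
        for line in git(repo, "ls-tree", parent).splitlines():
            if line.split("\t", 1)[1] != key:
                entries.append(line)
    entries.append(f"100644 blob {blob}\t{key}")
    tree = git(repo, "mktree", stdin="\n".join(entries) + "\n")
    commit_args = ["commit-tree", tree, "-m", f"set {key}"]
    if parent:
        commit_args += ["-p", parent]
    commit = git(repo, *commit_args)
    git(repo, "update-ref", ref, commit)
    return commit

def get(repo, ref, key):
    """Read the current value of `key` (surrounding whitespace is stripped)."""
    return git(repo, "cat-file", "-p", f"{ref}:{key}")

def history(repo, ref, key):
    """Return every value `key` has held, newest first."""
    commits = git(repo, "log", "--format=%H", ref, "--", key).splitlines()
    return [git(repo, "cat-file", "-p", f"{c}:{key}") for c in commits]
```

Pointed at a bare repository and a ref outside refs/heads (say, refs/kv/config), this keeps a per-key history without the data ever appearing as a tracked file.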

Paul Stovell: Hybrid Storage: Git + SQLite vs. RavenDB?

Another post on a similar theme, with a quick sketch of an implementation of a key/value store atop Git. Stovell offers a very succinct and general problem statement for any tool that tackles this problem:

The two sets of data I'm referring to are:
  • Definitions. Environments, machines, projects, deployment steps, variables, and so on. They define what will happen when you deploy something. There usually aren't more than a couple of thousand of any of them. As a user, you spend time crafting and modifying these.
  • Records. These collections include releases, deployments, audit events, and so on. Basically, they are a record of the actions you've done using the definitions above. These collections can balloon out to many thousands, perhaps millions of documents. As a user, these are created from simple clicks or even automatically via API's/integration.
Paul Stovell

On a Kubernetes deployment, for example, the definitions would be manifest files, and the records would be instances of applying or patching those manifest files. One of the big challenges in developing Kubernetes applications is maintaining a clear sense of how your definitions and records have changed over time using the impoverished semantics of YAML configuration, a problem that Compton and van Hensbergen dealt with using Clojure in the talk above. Having better tooling within your repository itself for tracking definitions and records like this would also be very helpful.

Noms: The Versioned, Forkable, Syncable Database

Linked in the comments on Truyers' post was noms, a database directly inspired by Git's decentralized and immutable data model, but designed from the ground up to have a better query model and more flexible schema. Unfortunately, it seems to be unmaintained and not ready for prime time. Additionally, for the use case I'm describing, it's unclear how to effectively distribute the configuration data stored in a noms DB alongside the code that is affected by that configuration in a way that reliably links the two.


Obviously, when talking about a scalable database with a flexible schema and a strong emphasis on immutability, it would be silly of me not to mention the big kahuna: Datomic. However, that seems like overkill for replacing YAML in a single Git repository. Datascript is Datomic's younger cousin, designed to make working with Datalog-driven data structures as easy as working with hash tables in both Clojure and Clojurescript.

Nikita Prokopov, the library's author, has a fun example on his blog: acha-acha, a radical re-imagining of "front-end" applications that just uses Transit to dump all the application data into the browser on the first page load and then builds out the page logic using Datascript queries against that data. It's hard to get simpler than a single API endpoint, and the performance is pretty impressive: 50,000 datoms loaded in under 20 ms, according to a not-so-rigorous benchmark.

Reading about it made me think of a potential middle ground between building a whole new database atop a DVCS and storing everything in a flat YAML file: storing both the definitions and records (to use Stovell's terms) as datoms in a flat EDN file, but only ever interacting with that file from within Datascript, which has domain-specific queries for dealing with the intrinsic structure of those entities.
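
To give a feel for the shape of that data, here is a plain-Python stand-in. It is not Datascript's API and it does not read EDN; it only treats definitions and records as (entity, attribute, value) triples and answers one question by joining across them. The attribute names (definition/name, record/applied, record/status) and the helper functions are invented for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional

class Datom(NamedTuple):
    entity: int
    attribute: str
    value: object

def match(datoms: Iterable[Datom],
          entity: Optional[int] = None,
          attribute: Optional[str] = None,
          value: object = None) -> Iterator[Datom]:
    """Yield datoms that agree with every field given (None means "anything")."""
    for d in datoms:
        if entity is not None and d.entity != entity:
            continue
        if attribute is not None and d.attribute != attribute:
            continue
        if value is not None and d.value != value:
            continue
        yield d

def failed_definition_names(datoms: List[Datom]) -> List[str]:
    """Names of definitions that have at least one failed record applied to them."""
    failed_records = {d.entity for d in match(datoms, attribute="record/status", value="failed")}
    definition_ids = {
        d.value for d in match(datoms, attribute="record/applied")
        if d.entity in failed_records
    }
    return sorted(
        str(d.value) for d in match(datoms, attribute="definition/name")
        if d.entity in definition_ids
    )

# Hypothetical data: two definitions (entities 1 and 2) and three records.
DATOMS = [
    Datom(1, "definition/name", "web"),
    Datom(1, "definition/replicas", 3),
    Datom(2, "definition/name", "worker"),
    Datom(10, "record/applied", 1),
    Datom(10, "record/status", "failed"),
    Datom(11, "record/applied", 1),
    Datom(11, "record/status", "ok"),
    Datom(12, "record/applied", 2),
    Datom(12, "record/status", "ok"),
]
```

A real Datalog query would express the same join declaratively instead of through hand-written set intersections, which is exactly the appeal of doing this in Datascript.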