Against Metadata
Frustrations with YAML
This is a rant that will probably get me yelled at by librarians, but I feel it strongly from the perspective of a programmer. It's a visceral response, so don't take it too seriously; I may not even fully believe it on reflection.
Anyway, here goes: the distinction between data and metadata is a false dichotomy.
Here's an archetypical example of the distinction as usually drawn: digital photography.
- Data
- RGB pixel values (or, if you're shooting RAW: a mapping of voltages to sensors)
- Metadata
- the date the photograph was taken, aperture, location, etc.
In short, most of what we actually care about is in the metadata. These offer the ability to understand the process by which the photograph was created, its temporal context, and (in the case of a geotag) what it actually depicts. The only meaningful basis I can see for drawing this distinction is that data is what matters to the machine, and the metadata is what matters to us.
But the
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: em in this context
- Error phase
:compile-syntax-check
- Location
- Line 23, Columns 9-24
Source expression
(em "data")
, the
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: em in this context
- Error phase
:compile-syntax-check
- Location
- Line 23, Columns 30-51
Source expression
(em "real stuff")
, out of which the photograph is composed, only has meaning due to us. This might seem obvious, but I think it has important consequences. Why did I write this? Because I don't like YAML. YAML is designed for metadata, and metadata alone. But now it does a lot more than what was initially asked of it. It was originally used to
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: em in this context
- Error phase
:compile-syntax-check
- Location
- Line 23, Columns 380-399
Source expression
(em "describe")
; now it is expected to
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: em in this context
- Error phase
:compile-syntax-check
- Location
- Line 23, Columns 423-441
Source expression
(em "perform")
. What was once "Yet Another Markup Language" is now "YAML ain't markup language."
Sooner or later, the values in a YAML file are not metadata. They are data, and should be managed like data. When your YAML file defines the various libraries that are
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 25, Columns 169-255
Source expression
(link "https://docs.docker.com/compose/" "constituent parts of your Docker image")
, across its revision history it defines the successive states of your software stack. When it defines the
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 25, Columns 362-543
Source expression
(link "https://github.com/helm/charts/blob/12754f06cee246f7e89d0ffbfa66cbadb644e443/stable/mysql/templates/deployment.yaml#L113" "health checks to your Kubernetes applications")
, it provides a window into all the times the stack failed to work and was made to work again. These are all hugely valuable sources of context for how real software gets made. Shoving it into underdesigned storage formats (or worse, spreading those values across several files with templating languages that break when using hyphens in names) and forcing people to do microsurgery on git revisions to reconstruct how the values have changed over time is tantamount to throwing this history away, because git commit hashes are inadequate to the task anyway. These two values were changed by the same commit? Cool. Do they depend on one another, or can I change one back to its previous state without altering the other? A "configuration as code" practice does not answer this question.
I feel strongly about this because it seems like programmers are doomed to reinvent wheels, over and over again. You might think that managing configuration in YAML will force you to "keep things simple." But sooner or later, the complexity spills over.
I doubt I'm the first person to notice these shortcomings; they likely motivated the development of the templating tools that make it easier to more flexibly produce YAML files. It would be nice to not hard-code every JSON value, right? Maybe we can create simple expressions that allow us to swap in the results of simple arithmetical expressions, so we can increment the build number easily. Also, string interpolation and concatenation would help us name things more easily, too.
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 29, Columns 484-534
Source expression
(link "https://jsonnet.org" "What a concept!")
So, the config now consists of a sequence of referentially transparent expressions that yield values based on the application of simple rules of substitution and a few special forms? Wait,
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 29, Columns 724-844
Source expression
(link "https://en.wikipedia.org/wiki/Lisp_(programming_language)" "I think I've seen something like this before...")
Metadata is data. The same tools we use to process data efficiently, store it reliably, and link it together should be used with it. Doing things declaratively is a laudable goal, but those taking it on must recognize how
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 31, Columns 223-283
Source expression
(link "https://en.wikipedia.org/wiki/Datalog" "complex")
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 31, Columns 284-344
Source expression
(link "https://en.wikipedia.org/wiki/SQL" "declarative")
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 31, Columns 345-409
Source expression
(link "https://en.wikipedia.org/wiki/MiniKanren" "programs")
actually are. Otherwise, they'll just sweep the existing complexity of the subject under the rug.
Previously:
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: ul in this context
- Error phase
:compile-syntax-check
- Location
- Lines 35-36, Columns 1-97
Source expression
(ul (link "https://mikehadlow.blogspot.com/2012/05/configuration-complexity-clock.html" "The Configuration Complexity Clock")
(link "https://twitter.com/dr_c0d3/status/1040092903052378112" "At least XML had schemas..."))
Thoughts on Fossil
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 40, Columns 1-60
Source expression
(link "https://www.fossil-scm.org/" (in-code "fossil"))
is an alternative to git developed for and atop the sqlite database. When I first read about it, I was working through the configuration of automated build snapshots using CI tools and wanted a place to reliably store information about the dates of the failed builds and the underlying causes. I thought that a DVCS with an integrated database would be a ideal for this: I could add a "builds" table with the dates of successful and failed builds and track this history right with the project. More generally it would help tremendously with all the per-repository configuration management that currently gets shoved into YAML files: instead of flat key-value files, you could record the configuration values in the database backing the project with a consistent relational model.
As far as I can tell, this sort of thing isn't possible "out of the box" in
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: in-code in this context
- Error phase
:compile-syntax-check
- Location
- Line 42, Columns 77-99
Source expression
(in-code "fossil")
, but it would be a fantastic concept for a VCS that returns to the more fully expanded notion of "SCM":
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: em in this context
- Error phase
:compile-syntax-check
- Location
- Line 42, Columns 204-248
Source expression
(em "software configuration management")
instead of
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: em in this context
- Error phase
:compile-syntax-check
- Location
- Line 42, Columns 260-296
Source expression
(em "source control management")
.
Goodbye YAML?
Eno Compton and Tyler van Hensbergen gave a talk at Clojure/conj 2019 this past weekend called
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 46, Columns 96-202
Source expression
(link "https://www.youtube.com/watch?v=yruVUkwlffk" "Goodbye YAML: Infrastructure As Code in Clojure")
. Unsurprisingly, I found Compton's observation that "declarative configuration" actually means wading through a "thicket of YAML" just to do anything useful with your actual code very cathartic.
Unfortunately, the hope that the library they released would solve all my YAML woes vanished when van Hensbergen mentioned that it's a Clojure wrapper atop very AWS-specific tooling. However, it got me thinking more on the prospect of embedding some kind of lightweight configuration database with an application's code in the same repository. Here are some links!
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 50, Columns 1-142
Source expression
[:h6 {:class "f3"} "Kenneth Truyers: " (link "https://www.kenneth-truyers.net/2016/10/13/git-nosql-database/" "Git as a NoSQL Database")]
This short post sketches out a better way of storing key-value data in Git than shoving it in a flat whitespace-delimited file under revision control: using
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: in-code in this context
- Error phase
:compile-syntax-check
- Location
- Line 52, Columns 158-188
Source expression
(in-code "git write-tree")
,
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: in-code in this context
- Error phase
:compile-syntax-check
- Location
- Line 52, Columns 190-212
Source expression
(in-code "commit")
, and
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: in-code in this context
- Error phase
:compile-syntax-check
- Location
- Line 52, Columns 218-244
Source expression
(in-code "update-ref")
to record and construct views of arbitrary key-value data embedded directly into a bare repository rather than by putting that data into files tracked by git. The UX of this approach leaves a lot to be desired, but the fact that it's possible with the primitives that git provides means that a porcelain that makes these types of operations easier could be as widely distributed as git itself is.
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 54, Columns 1-140
Source expression
[:h6 "Paul Stovell: " (link "http://paulstovell.com/blog/hybrid-storage-git-sqlite-raven" "Hybrid Storage: Git + SQLite vs. RavenDB?")]
Another post on a similar theme, with a quick sketch of an implementation of a key/value store atop Git. Stovell offers a very succinct and general problem statement for any tool that tackles this problem:
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: blockquote in this context
- Error phase
:compile-syntax-check
- Location
- Lines 58-61, Columns 1-351
Source expression
(blockquote
{:author "Paul Stovell"}
"The two sets of data I'm referring to are:" (ul "Definitions. Environments, machines, projects, deployment steps, variables, and so on. They define what will happen when you deploy something. There usually aren't more than a couple of thousand of any of them. As a user, you spend time crafting and modifying these."
"Records. These collections include releases, deployments, audit events, and so on. Basically, they are a record of the actions you've done using the definitions above. These collections can balloon out to many thousands, perhaps millions of documents. As a user, these are created from simple clicks or even automatically via API's/integration."))
On a Kubernetes deployment, for example, the definitions would be manifest files, and the records would be instances of applying or patching those manifest files. One of the big challenges in developing Kubernetes applications is maintaining a clear sense of how your definitions and records have changed over time using the impoverished semantics of YAML configuration, a problem that Compton and van Hensbergen dealt with using Clojure in the talk above. Having better tooling within your repository itself for tracking definitions and records like this would also be very helpful.
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 65, Columns 1-105
Source expression
[:h6 (link "https://github.com/attic-labs/noms" "Noms: The Versioned, Forkable, Syncable Database")]
Linked in the comments on Truyers' post was
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: in-code in this context
- Error phase
:compile-syntax-check
- Location
- Line 67, Columns 45-65
Source expression
(in-code "noms")
, a database directly inspired by Git's decentralized and immutable data model, but designed from the ground up to have a better query model and more flexible schema. Unfortunately, it seems to be unmaintained and not ready for prime time. Additionally, for the use case I'm describing, it's unclear how to effectively distribute the configuration data stored in a
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: in-code in this context
- Error phase
:compile-syntax-check
- Location
- Line 67, Columns 430-450
Source expression
(in-code "noms")
DB alongside the code that is affected by that configuration in a way that reliably links the two.
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 69, Columns 1-79
Source expression
[:h6 (link "https://github.com/tonsky/datascript" (in-code "Datascript"))]
Obviously, it would be silly of me when talking about the idea of a scalable database with a flexible schema and a strong emphasis on immutability not to mention the big kahuna (
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 71, Columns 179-236
Source expression
(link "https://www.datomic.com/" (in-code "Datomic"))
). However, that seems like overkill for replacing YAML in a single Git repository.
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: in-code in this context
- Error phase
:compile-syntax-check
- Location
- Line 73, Columns 1-27
Source expression
(in-code "Datascript")
is Datomic's younger cousin, designed to make working with Datalog-driven data structures as easy as working with hash tables in both Clojure and Clojurescript.
Nikita Prokopov, the library's author, has a fun example on his blog of
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: link in this context
- Error phase
:compile-syntax-check
- Location
- Line 75, Columns 73-141
Source expression
(link "https://tonsky.me/blog/acha-acha/" (in-code "acha-acha"))
a radical re-imagining of "front-end" applications that just uses
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: in-code in this context
- Error phase
:compile-syntax-check
- Location
- Line 75, Columns 208-231
Source expression
(in-code "Transit")
to dump
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: em in this context
- Error phase
:compile-syntax-check
- Location
- Line 75, Columns 240-254
Source expression
(em "all")
the application data into the browser on the first page load and then builds out the page logic using
Error
- Error type
class clojure.lang.Compiler$CompilerException
- Error message
Unable to resolve symbol: in-code in this context
- Error phase
:compile-syntax-check
- Location
- Line 75, Columns 357-383
Source expression
(in-code "Datascript")
queries against that data. Hard to get simpler than a single API endpoint, and the performance is pretty impressive: 50,000 datoms loaded in under 20ms, according to a not-so-rigorous benchmark.
Reading about it made me think of a potential middle ground between building a whole new database atop a DVCS and storing everything in a flat YAML file: storing both the definitions and records (to use Stovell's definition) as datoms in a flat EDN file, but only every interacting with that file from within Datascript, which has domain-specific queries for dealing with the intrinsic structure of those entities.