This Website Is Not A Tree


It must be emphasized, lest the orderly mind shrink in horror from anything that is not clearly articulated and categorized in tree form, that the idea of overlap, ambiguity, multiplicity of aspect and the semilattice are not less orderly than the rigid tree, but more so. They represent a thicker, tougher, more subtle and more complex view of structure.
Christopher Alexander, A City Is Not A Tree


Preliminaries

This is the first post of respatialized, a website about actual and potential spaces. Part of the reason it took me so long to launch it is that nearly every static site generator forces your writing into a tree-like structure. Only one lets you extend the site generation methods to reflect your own ideas: Matthew Butterick's pollen. Because I want to combine the sequential and additive writing style of a blog with the associational and iterative nature of a wiki, this was the only choice.

It would have taken me even longer to get started if Joel Dueck hadn't already done the excellent work of creating thenotepad, which includes functions to produce many of the things we expect from blogs, like sequential indices and RSS feeds, and many that we should expect, but don't (like the ability to generate a PDF from the blog). The code that generates this blog is forked from thenotepad and licensed under the MIT License.

Extensible textual notation

I recently switched from pollen to perun. perun's model of publishing everything via a composable collection of boot tasks gives me what I want from pollen's organizational and compositional capabilities. pollen's pagetrees can be recreated by mapping and filtering the sequential collections of hiccup data structures perun generates, and applying those transformations to generic collections comes more readily to me than creating .ptree files (the Clojure refrain: it's just data, etc.). My artwork and other content is also written in and generated using Clojure, so I don't want to have to drop into a different language that I don't know as well just to get it out. For me, the ability to iterate quickly depends on low friction and the power of simplicity: boot's Swiss army knife approach matches that perfectly.

However, I agree wholeheartedly with Matthew Butterick when he argues that Markdown is a constraining environment in which to write, especially if you're looking to write a sustained treatment of a topic, which usually generates a deep and rich collection of self-references and cross-references and its own conventions for referring back to subtopics organically over time. Markdown supports only the lowest level of this: links. For anything else you're on your own, with a severely restricted means of manipulating the input texts.

Also, sometimes I want to contextually distinguish textual elements using CSS, and I want to do it without rewriting my markdown parser. I currently do this by littering my markdown posts with div class="..." tags, which is kludgy and offers no systematic way of changing the classes applied to a textual element apart from running find-and-replace over all of them with grep.

The #lang pollen directive provides a beautiful way of letting prose be prose while still letting you access the full power of a programming language whenever you need it via the lozenge ◊ special character.

What I'm looking for, basically:

I'd like the ability to embed ◊(link hiccup "https://github.com/weavejester/hiccup")/clojure data structures into my textual input. They can either be data (see below) or functions called at render time that evaluate into data (see above).

[:em {:class "topic"} "Extensibility"]

Clojure already supports this notion in its canonical representation of data, extensible data notation. I want to bring it to textual information, and maybe HTML Canvas objects as well. The full power of a programming language means that we can flexibly switch between graphical and textual representations, something that pollen doesn't yet support.

This approach also acts as a force-multiplier on immutable, compositional CSS tools like tachyons or tailwind, because it brings the power of Clojure into the tool you're using to write the text, which in turn leverages tools like tachyons to apply a unified style to what you're writing using inline, simple notation.
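For instance, a pair of hypothetical helper functions (the class strings are tachyons-style utilities) would let one definition control the styling of every topic marker or aside across the site:

(defn topic
  "Mark a span of text as a topic, styled with tachyons utility classes."
  [text]
  [:em {:class "b sans-serif"} text])

(defn aside
  "A visually de-emphasized aside."
  [& contents]
  (into [:div {:class "f6 gray pl2 bl bw1"}] contents))

(topic "Extensibility")
;; => [:em {:class "b sans-serif"} "Extensibility"]

Changing the look of every topic or aside then means editing one function rather than find-and-replacing class strings across posts.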

Other examples in this space:

Both of these are built atop Javascript, and are more focused on interactivity for users than on procedural generation of text at write-time. Clojure(script)'s homoiconicity makes it ideally suited for both purposes: it can be used to generate interactive programs as well as any other form of data you'd want to display. I'm personally more interested in the latter right now.

Extensible textual notation, part 2

Within the Clojure world and beyond, there are a few tools that suggest directions for what I'm thinking of here.

perun

This is what I'm using to write and compile the blog itself right now. However, I don't like that most of the decisions about how to parse markdown into HTML are made by the fiats of flexmark-java. I would much prefer to be able to manipulate the content in the form of hiccup data structures as I see fit before passing it to hiccup.page/html5 for rendering. This was discussed in the perun repo, but was set aside when the use case of hyphenation didn't actually require it.

However, there are many other reasons you'd be interested in representing your writing as data.

if the book is a program, the source for that book should look more like your brain, and less like HTML (or XML or LaTeX or ...)?
Matthew Butterick

Personally, I'm interested in using the most powerful tools I have. For example, you could parse the writing into discrete chunks represented as hiccup data structures, record them as facts in a datascript DB, and query them like any other source of data. This would be even more powerful if you ran it against not just the current state of your writing, but its revision history.
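A minimal sketch of that idea with DataScript; the attribute names and chunking are invented for illustration:

(require '[datascript.core :as d])

;; each chunk of writing is an entity: the post it came from, its hiccup
;; form, and zero or more topics
(def conn (d/create-conn {:chunk/topics {:db/cardinality :db.cardinality/many}}))

(d/transact! conn
  [{:chunk/post   "this-website-is-not-a-tree"
    :chunk/hiccup [:p "Markdown supports the lowest level of this: links."]
    :chunk/topics ["markdown" "structure"]}])

;; query the writing like any other source of data: every chunk,
;; from any post, tagged with a given topic
(d/q '[:find ?post ?hiccup
       :in $ ?topic
       :where
       [?c :chunk/topics ?topic]
       [?c :chunk/post ?post]
       [?c :chunk/hiccup ?hiccup]]
     @conn "markdown")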

oz

As a Clojure enthusiast who got started with Jupyter notebooks, I quite like the idea of using Clojure for interactive documents. I just wish Markdown wasn't so uncritically accepted as the default for text authoring, because it seems silly to give yourself the whole power of a programming language in rendering a document and then arbitrarily restrict its scope to making graphs. Scientific documents in particular deal with lots of structured data: batteries of tests, statistical analyses, summary tables. Presentation of that data is not limited to graphs: a more powerful authoring model would allow you to dynamically generate and restructure the prose annotations of scientific data as easily as the graphs that summarize it.

Anything less feels like an arbitrary step backward.
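As a small illustration, the same map of summary statistics that feeds a chart could just as easily feed a sentence; the data shape here is invented:

;; hypothetical summary statistics for a batch of test runs
(def results {:n 128 :mean 4.2 :sd 0.7 :unit "ms"})

(defn summary-paragraph
  "Generate a prose annotation from the same data that feeds the graphs."
  [{:keys [n mean sd unit]}]
  [:p "Across " n " runs, the mean latency was "
   (format "%.1f%s" mean unit)
   " (σ = " (format "%.1f%s" sd unit) ")."])

(summary-paragraph results)
;; => [:p "Across " 128 " runs, the mean latency was " "4.2ms" " (σ = " "0.7ms" ")."]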

Code documentation tools

The major area in which programmers currently perform programmatic manipulation of prose and data in tandem is in the realm of documentation generators. In my experience, these tools fall into two broad categories:

Narrative-first tools like sphinx, reStructuredText, etc. Their primary benefit: to be of any use at all, they require the author to write a good amount of prose introducing the project, its rationale and purpose, and the main ways of interacting with it. However, in these tools docs are generally separate from code - even if they live in the same repo, they're often in docs/ and can easily fall out of sync with the actual code.

Code-first tools like javadoc, docco, Roxygen2, etc. These have the benefit of being much closer to the day-to-day work of developers and are much less likely to fall out of sync with the code, because they are usually parsed out of docstrings and special comments and the process of updating documentation can be built into a project's deployment pipeline without much overhead. The drawbacks? You generally end up with a completely decontextualized list of classes or functions that doesn't explain or give examples of how you'd actually use them.

marginalia


marginalia generates elegant-looking literate programming documents from plain Clojure source code. Like perun, however, it returns rendered HTML rather than structured data from its parsing of source files.

cod

As a code documentation tool, cod feels like it has the right idea at its core. Instead of deciding how to present the documentation it pulls out of the source code for you, cod simply returns JSON data representing the annotations. Any further decisions about how to represent that JSON data in the final documentation are up to the author, allowing for a better blend of narrative and code than other documentation tools.

scribble

The fact that Racket libraries tend to have vastly superior documentation (on average) than any other programming language is a testament to the power of Scribble. Naturally, pollen owes a lot to the starting point that scribble created.

Why look at code documentation tools?

Mostly because I know that I'm going to have to write my own solution to this problem. I want the solution's source code to itself generate an example of the kind of document I want it to produce, so I'm hoping I can steal as many existing functions as possible from these other libraries while I'm bootstrapping the project.

Structural features of writing and information management systems

I've gone through myriad to-do apps, organizers, journaling systems. Here's a table depicting my overall thoughts.

| Type | Disadvantages | Examples | Advantages |
| --- | --- | --- | --- |
| Binder notebook | atemporal, apresentist | Filofax | associative, organic, frictional, multi-modal, simple |
| Diary | apresentist | Bullet journal | chronological, frictional, reflective, multi-modal, simple |
| To-do app | decontextualized | Nozbe, todoist | fast, simple, portable |
| Kanban | decontextualized, information-poor | Trello | situated, simple |
| Free-form/wiki | hierarchical, laborious | Notion | associative, compositional, iterative |
| Website | laborious, atemporal | This | frictional, multi-modal, associative |

All of these advantages and disadvantages stem from one real underlying issue, in my view: each tool imposes its own view over the data you put into it, making alternative ways of looking at the same information difficult or impossible. Paul Chiusano has written nicely about the conceptually weak data model an "intuitive" design imposes on the information it represents:

We often think about views first because views are concrete, and it’s what we interact with directly when we use software. But actually designing software ‘view first’ is problematic because it leads to rigid models that aren’t flexible enough to support the myriad of creative ways that people use your software. It also leads invariably to feature creep—when your model is overly influenced by some concrete views you had in mind during design, it invariably ends up insufficiently general purpose. So as your software becomes more popular, you start adding one-off ‘features’ to support concrete use cases that your users are asking for. A few years pass of this feature creep, and you have a bloated, complicated piece of software that no one gets joy out of using.
Paul Chiusano, The design failures of view-first

Every to-do list and knowledge management system suffers from this problem. In fact, I can feel the constraints imposed by the table above limiting what I want to say about each tool, so let's dive into what I mean by each of these words:

Most recently, I've been using a Zettelkasten-style system for my notes with a filofax binder. It's superb for free association, quick entry, and the generative friction that only putting pen to paper can provide. It's not so good for revisiting previous notes, synthesizing them into new information, reflecting on the past, or maintaining a view of what's "current." Before that, I used a journal-style notebook that was similarly good at quick free-form entry and helped maintain a chronological view of things that aided in reflection, but failed to support associational views of the information recorded within it and similarly suffered from difficulties in keeping things current. I think a two-phase system that facilitates the refinement of paper "drafts" into digital "facts" would be ideal for me, personally.

Many digital systems for doing this exist already. I chafe at using them because they all uncritically accept that markdown is a useful format for representing semantically rich textual information, and then shoehorn features on top of it to make up for its limitations.

Obviously, I'm also taking notes here instead of on paper. Writing this doesn't provide exactly the same generative friction as pen and paper, but it does a good enough job of forcing me to clarify my thoughts through the pressure of putting them in a public format. I also have complete control over the content (once I can overcome the limitations of markdown). Given that what I write currently has a 1:1 file-to-destination relationship, though, it also prevents association and composition of the information I record here. Ideally you'd want to break this input/output link, which would support both private views of some information and also let you think about how to refer to the same piece of information from multiple public views.

The question of how to individuate pieces of information is permanently open, so an ideal system would support "contention" in that it can facilitate multiple methods of splitting up and representing a topic. How to do that on a technical level is obviously also an open and extremely difficult problem.

It seems daunting to come up with a solution for this, but I've recently been reading about something that may offer a partial way out: DataScript, mentioned in passing earlier. Where Chiusano proposes algebraic data types to manage this, I would prefer to start with datoms that get freely composed into views through Datalog queries. Pieces of information (or even bits of writing themselves) would be decomposed into EAVT facts and recorded in some persistent database where they can change in the future without fear of losing knowledge by revising it.

There's a lot more to say on the design of this, but mostly I wanted to get this concept "on paper" for further development into a design.

Structural features of writing and information management systems, part 2

Another distinction that cuts across everything that I referenced in that table above is the idea of closed versus open knowledge management systems. While my notebook has acquired a significant amount of internal complexity, it is largely a closed system, making interaction with other sources of information more difficult. I have a gigantic pinboard backlog, highlights in a kindle, scattered paper notes about physical books, and no means of integrating them or refining them into something more meaningful.

A lot of PIMs designed to support academic research are "open" towards producing and consuming the primary objects of academic research: papers and monographs. I'm not an academic. While writing helps me clarify my ideas, I also need tools oriented towards the work I do in programming, which means supporting a more situated understanding of what I'm doing. By that I mean supporting a "keep this in mind as you act" understanding of something rather than a "discrete textual description" understanding of something. In cybernetics terms, one might say that my information management systems have not had the requisite variety to handle the tasks I want them to support. They are not open to non-textual workflows.

Here's a quick sketch of what this might look like: views-sketch

The bottom has a pomodoro-style task tracker and the "current task", the right pane has a grouping of recent commits to keep the actual output of that task in mind as well. The role of these panes isn't the important part - the mechanism by which they're generated is. By pulling information from a common store, simple contextual visualizations of relevant parts of it would be easy to construct via Datalog queries.

A further source of information comes from the seemingly simple fact that these pieces of information are displayed together. The entities referenced by the views currently active could be linked through additional queries - for example, the commits happening in the text editor could create corresponding entities with attributes linking them to the entity of the current task. Similarly, information entries updated when a text file is open or a namespace is edited could be linked with that text file. This establishes a notion of relevance for the supporting materials of the work being done.
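For example, a hypothetical Datalog query over such a store (the attribute names are invented) could answer "which commits happened while this task was current?":

;; attribute names here are purely illustrative
'[:find ?sha ?message
  :in $ ?task
  :where
  [?commit :commit/during-task ?task]
  [?commit :commit/sha ?sha]
  [?commit :commit/message ?message]]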

This website (could be) a CRDT

While considering potential applications of relay software, I recalled the notion of a conflict-free replicated data type, a data structure that provides a provably correct solution to the problem of imposing a total order on a sequence of edits to a file that arrive out of order, edit different subsets of text, and carry unreliable timestamps. This data type would be what you reach for if you were designing a collaborative text editor with online and offline editing capabilities, because it would save you from making hard choices about which text to discard and which to keep (or worse, making the user deal with any errors caused by your software and imposing those choices on them).

I started reading about the concept, glossing over the mathematical details in favor of an interest in its potential as an expressive medium for thought. Some ideas that fell out of this:

Making the library metaphor in 'code library' concrete


Right now you have to take home the whole library when you write some code that uses one page of one book.

Statically typed languages that rely on complex class hierarchies force you to ship all of this supporting material just to use one part of it, especially because the compiler may make multiple passes across the codebase for definitions in different files ("... all you wanted was the banana." - Joe Armstrong).

So if instead of classes defined across files or nested relationships between algebraic data types defined at compile time, we had functions operating on simple, immutable values defined in self-contained s-expressions, plus some annotations:

(defn myfunc
  {:calls #{'this.ns/func 'other.ns/func}}
  ...)

This could perhaps be achieved even without the manual annotation by using a macro to pull the symbols out of the function expression at compile time; a rough sketch of the symbol-collection step follows.
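A minimal sketch of that collection step, assuming the function is available as an unevaluated form (the helper name is hypothetical):

(defn called-symbols
  "Naively collect the namespace-qualified symbols referenced in a form.
  Doesn't resolve aliases or locals; illustrative only."
  [form]
  (->> form
       (tree-seq coll? seq)
       (filter symbol?)
       (filter namespace)
       set))

(called-symbols '(defn myfunc [x] (other.ns/func (this.ns/func x))))
;; => #{other.ns/func this.ns/func}

A defmacro could call this on its body and attach the result as :calls metadata automatically.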

Rather than a scope defined by a global namespace of evaluated expressions, these explicit references define exactly what a function needs to be lifted out of its lending library and used independently of the codebase it came from.

S-expressions would slot naturally into the delimited data structures required by a CRDT, making this serialization easy (other programming languages may have a harder time). This opens up another application:

New forms of revision control

CRDTs can contain arbitrary series of revisions to the same underlying data, in a manner guaranteed to converge on a consistent result.

Documentation could be stored in the same CRDT. If the documentation has old timestamps, a tool could be built atop them to warn the user or author that they're stale relative to the rest of the code. Test results could be stored with the hash of the CRDT at the time they were executed, making failing tests trivial to reproduce. With a clever index, the failing tests associated with a given function could be recalled from the codebase's history with a query, providing useful context for finding the source of an unexpected regression.

Configuration for external systems, or expressions that modify it, could be stored in the same CRDT as the code itself. Integration and system tests could be linked with configuration changes in the manner I describe above, providing context for when the components of a distributed system fail and are made to work again.

Tests could be shipped around with their functions using a similar annotation syntax to the one above so that someone can have guarantees about the external code they're relying on.

A notebook for the table beside your hammock

If code and documentation are part of the same data structure as a whole, then an "ideas first" approach to software is as easy to start and maintain as a new experimental repo. The recorded ideas can evolve in tandem with the code that implements them, and their interplay gets expressed through the immutable history of the data structure recording them. It's an environment that makes hammock-driven development as easy as flow-state coding and bug squashing, with the ability to fluidly switch between them without breaking the flow. Code itself becomes one component of an open system that doesn't treat writing down the problem and writing the code that solves it as separate activities.

What else is possible? Right now, code takes on the shape that Git repositories, and the software we use to interact with them, want it to take. Can we break code revision history and reuse out of the paradigm of discrete individual repositories? Is a distributed data structure like this enough to make the distinction between "monolithic" and "microservice-oriented" code obsolete?

Alex Miller writes of the new model embraced by deps.edn:

deps was designed to find a sweet spot in the middle of this with deps defined as data, aliases capturing program executions as data, but builds as programs. As such, the scope is drastically narrowed in deps to just a) building classpaths (by resolving dependency graphs) and b) launching programs. As such, this tends to be a dramatically simpler model to start with (your initial deps.edn can be empty), and a model that is easy to understand as you scale up. I think there is more to do in how we model "tools" (esp tools shared across projects) and program composites, but nothing prevents you from building these yourself if needed (as you have the full power of Clojure at your disposal).
Alex Miller

Storing code in a CRDT has the potential to explore new parts of the misty space between "tools" and "programs" for Clojure code. I'm definitely interested in where this could lead, but I have to figure out how to create s-expressions from my prose first.


Extensible textual notation, part 3


Structure from text

I want to replicate pollen's ability to let prose be prose while still incrementally bringing in a programming language when it's needed, but also to combine it with Clojure's own data structures to capture the structure that emerges organically from the act of writing - so that I could, for example, capture the table above not just as a sequence of textual elements but also preserve the structure of the tabular data itself for future use somewhere else.

The simplest implementation of that would be just reading in the file line by line and constructing maps from the paragraphs separated by line breaks:

{:id e4268ac2
 :text "Paragraph one."}
{:id e4268ac3
 :text "Paragraph two."}

The initial reading process creates entities that serve as placeholders for text as it is when read and as it may be in the future, all recorded as facts in an EAVT/RDF semantic triple format. Knowledge atoms instead of data atoms. But a collection of facts doesn't preserve the ordering of their original composition, which is a lot of structure to throw away. There are two ways of preserving it that initially occurred to me:

[1] files are entities too - just have them refer to their contents as distinct entities.

{:entity 23542
 :attribute :filename
 :value "plaintext-file.txt"}
{:entity 23542
 :attribute :contents
 :value [52952 29587 29042]}

In this mode, order of paragraphs is asserted as a fact on the basis of the vector of entity ids of the constituent paragraphs.

[2] Alternatively, the facts about the paragraph order could just be composites of other facts:

{:entity 23542
  :attribute :contents
  :value [{:uuid ab50234 :text "opening paragraph goes here"}
          {:uuid ab50235 :text "second paragraph goes here"}]}

I don't really like option 2; it feels ad hoc and non-relational. Option 1 seems more relationally correct, but is semantically not as rich as an individual fact. That shortcoming is easily resolved by a query to pull in the relevant text, however.

Speaking of which:

Text from structure

When thinking about where to store this data, I was led to Chris Smothers' cause, a very well-documented Clojure implementation of a causal tree, a type of CRDT. It places the notion of a CausalBase front and center, which sounds great, except that it doesn't quite have the power of the "database" its name alludes to - which is generally okay in Clojure, because the language already has pretty powerful facilities for quick operations on collections of maps.

But what if someone went further than that, combining a CRDT with a data model and query engine like DataScript's? It turns out that in describing that I'm describing datahike, a Datalog implementation built atop the hitchhiker tree.

With existing text snapshotted as facts and recorded in a CRDT, queries could be run against that data to associate formerly disparate pieces of data into new forms, and the composites those queries create could themselves be recorded and annotated as new facts about the collection. The query that retrieves those facts could be stored as data itself, with the new structure that the query identifies added as an annotation to it. These queries and the expressive power they create could give new life to Structur and Alpha, the venerable extensions to Kedit written by Howard J. Strauss to aid John McPhee in his writing process.

He listened to the whole process from pocket notebooks to coded slices of paper, then mentioned a text editor called Kedit, citing its exceptional capabilities in sorting. Kedit (pronounced 'kay-edit'), a product of the Mansfield Software Group, is the only text editor I have ever used. I have never used a word processor. Kedit did not paginate, italicize, approve of spelling, or screw around with headers, WYSIWYGs, thesauruses, dictionaries, footnotes, or Sanskrit fonts. Instead, Howard wrote programs to run with Kedit in imitation of the way I had gone about things for two and a half decades.

He wrote Structur. He wrote Alpha. He wrote mini-macros galore. Structur lacked an “e” because, in those days, in the Kedit directory eight letters was the maximum he could use in naming a file. In one form or another, some of these things have come along since, but this was 1984 and the future stopped there. Howard, who died in 2005, was the polar opposite of Bill Gates—in outlook as well as income. Howard thought the computer should be adapted to the individual and not the other way around. One size fits one. The programs he wrote for me were molded like clay to my requirements—an appealing approach to anything called an editor.

Structur exploded my notes. It read the codes by which each note was given a destination or destinations (including the dustbin). It created and named as many new Kedit files as there were codes, and, of course, it preserved intact the original set. In my first I.B.M. computer, Structur took about four minutes to sift and separate fifty thousand words. My first computer cost five thousand dollars. I called it a five-thousand-dollar pair of scissors.

I wrote my way sequentially from Kedit file to Kedit file from the beginning to the end of the piece. Some of those files created by Structur could be quite long. So each one in turn needed sorting on its own, and sometimes fell into largish parts that needed even more sorting. In such phases, Structur would have been counterproductive. It would have multiplied the number of named files, choked the directory, and sent the writer back to the picnic table, and perhaps under it. So Howard wrote Alpha. Alpha implodes the notes it works on. It doesn’t create anything new. It reads codes and then churns a file internally, organizing it in segments in the order in which they are meant to contribute to the writing.

Alpha is the principal, workhorse program I run with Kedit. Used again and again on an ever-concentrating quantity of notes, it works like nesting utensils. It sorts the whole business at the outset, and then, as I go along, it sorts chapter material and subchapter material, and it not infrequently arranges the components of a single paragraph. It has completely served many pieces on its own.

John McPhee


The book is a program. Tools for writing digital books should be at least as powerful as the tools created for conventional books decades ago. CRDTs provide a reliable and immutable foundation to the discrete chunks of knowledge that McPhee has used for his entire career. A query engine provides the toolkit to devise new ways of composing them together as powerful as structur and alpha, but with the added benefit of an entire programming language so that the text (or the collection of notes used to produce it) is no longer a closed system but can instead pull in data from the rest of the world.
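Concretely, a Structur-like "explode by destination code" step could be a single query over paragraph facts; the attribute names here are invented for illustration:

;; group every paragraph of the notes by the destination code(s)
;; assigned to it, without mutating the original set
'[:find ?code (distinct ?text)
  :where
  [?p :paragraph/code ?code]
  [?p :paragraph/text ?text]]

Unlike Structur's output files, the "exploded" result is itself just more data, ready to be re-queried or re-annotated.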

The Markdown Cargo Cult


First and worst, Markdown isn’t semantic.
Matthew Butterick

I view basically every other problem with Markdown as downstream from this. Like Butterick, I'm utterly baffled by the degree to which everyone developing new types of interactive authoring tools simply assumes that everyone will want to write text in a format that's completely blind to the organic structure that emerges from ordinary writing. Jupyter notebooks, documentation tools, interactive documentation tools, interactive data science toolboxes, JS-based explorable explanation tools, revolutionary new prototypes of combined programming languages and visual environments, further explorations of how programming could be different: all of them voluntarily tie the millstone of this impoverished format around their necks, despite serious attempts to rethink the combination of code and prose.

Why do we use it? Because one of Apple's court intellectuals decided it was convenient for him?

How is Markdown innovative exactly? It took ideas from the 70s, dropped the interesting parts, and was hailed as a revolutionary approach to marking up documents. Ie, the past 30 years of computing have been about narrowing the interface between programmer and computer to the equivalent of a straw (everything as text!) and then try to build an entire system around that.
Hacker News commenter

Spotted on Hacker News, the only reasonable response to someone calling Markdown 'a triumph of programmer ergonomics.'

Every system built atop Markdown will invariably have some ad-hoc and kludgy method of attempting to recapture some part of the structure that emerges from text authored in markdown, and it will be different from every other one because Markdown is blind to structure in all but the most basic of ways. In that regard it is very similar to "plain-text configuration" tools like YAML, which have all kinds of templating engines bolted on to them to overcome the limitations of what has in practice become a flat-file key-value store.

Just be aware of what you're giving up as an author in pursuit of that, and what you may be imposing on yourself later on down the line if you want to overcome these constraints.

And yes, there's no small irony in the fact that the source code for this post is currently written in Markdown. It is indeed fast and easy to get started writing with it, but I'd largely attribute that to path dependence, and the fact that my particular parser leaves the div tags I've littered throughout this post intact, which is an accident of choosing to use perun and thereby flexmark-java rather than the virtues of the format itself. I have every intention of changing the authoring tool I use into something semantically richer, but I had to get my resistance to the format on paper first.

Extensible textual notation, part 4


laptop sticker reading 'thinking about things'

I read relentlessly. I don’t do any programming not directed at making the computer do something useful, so I don’t do any exercises. I try to spend more time thinking about the problem than I do typing it in.
Rich Hickey



Beyond plain text: storing prose within datahike

Here's a background post on the Datahike internals for context about how the hitchhiker B-tree structure allows for self-balancing and efficient updates that "hitchhike" on queries.

Here's another on using the dat:// protocol for P2P replication of the data stored in a Datahike instance. It serves as a useful starting point for getting a Datahike instance up and running.

Here's what would be a useful starting point for programmatic prose parsing: including a quotation in a piece of prose writing that gets parsed as a separate component and then added to a global list of quotations maintained by the text parser, with a link back to its original positional context within the piece of writing that quoted it.
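Roughly sketched, with a hypothetical registry atom and context map rather than an existing API:

(def quotations
  "Global registry of quotations encountered while parsing."
  (atom []))

(defn quote*
  "Render a quotation as hiccup and record it, with a pointer back to
  the page and position where it appeared."
  [{:keys [page index]} source & paragraphs]
  (swap! quotations conj {:source     source
                          :paragraphs (vec paragraphs)
                          :quoted-in  {:page page :index index}})
  (into [:blockquote [:cite source]]
        (map (fn [p] [:p p]) paragraphs)))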

Extensible textual notation, part 5


A concrete starting point

I've managed to come up with a lot of Xanadu-like vaporware ideas in thinking through this tool without producing anything concrete.

Per the above: I'd define my first concrete goal for this library as a replacement for markdown so I can begin to dig myself out of the pit I've put myself in by relying on something I don't like using.

In order to do this, I want to parse markdown into hiccup and pull information out of the file. Whatever replaces markdown will use hiccup data structures anyway, so it's not wasted effort to build functions that process the markdown once it's represented as Clojure data. I can create functions and specs that define the expected behavior of a markdown replacement.

Based on some unscientific experimentation, the only markdown->hiccup toolchain that properly understands tables is markdown-clj + hickory, so that's what I'll go with. I have a few tests written that don't do much yet.
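The basic pipeline is short. A sketch, with the exact attribute maps in the output depending on the parser's options:

(require '[markdown.core :as md]
         '[hickory.core :as hickory])

(defn md->hiccup
  "Parse a markdown string into a sequence of hiccup forms by way of HTML."
  [md-str]
  (->> (md/md-to-html-string md-str)
       hickory/parse-fragment
       (map hickory/as-hiccup)))

(md->hiccup "# A heading\n\nSome *emphasized* text.")
;; => a seq of hiccup forms along the lines of ([:h1 ...] [:p ...])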

Other scattered thoughts:

Plaintext and database

Plain text has a lot of virtues as a long-term storage format, so I plan to make it a core part of however I persist the writing that gets parsed into data by ETN.

The current snapshot and any derived views of it should exist as plaintext; its history can be preserved using database backup and persistence methods. But perhaps other defined snapshots in the history of the information should be serialized as plain text as well, in a manner similar to git commits or releases.

Thoughts on Roam

I signed up for Roam because on paper it seems to be exactly what I want: a PIM with the ability to run arbitrary Datalog queries across your thoughts and embed hiccup data structures for visual depictions of the concepts. It's built on Clojure, front to back! What's not to like?

Mostly, the UX. I don't like the aggressively hierarchical format it imposes on all the writing you put into it, I don't like the web interface, which will never be as fast and flexible as plaintext with a good editor, and I don't like the default views it chooses for you.

More than anything, I want a tool of my own making, free from any compromises made to accommodate commercial success or adoption among its target cohort.

I don't want an outlining tool that helps me produce writing. I want a writing tool that helps me identify and work with the structure that emerges from what I write.


Extensible textual notation, part 6

Yesterday I got too caught up reading the documentation for libraries. Today I'm disabling my wifi and striking out into the wilderness with only the standard library (and my reference book) to help me.

First discovery: I probably don't need to use specter when tree-seq will do. clojure.walk will also help, but I don't quite understand it yet.
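For example, pulling every blockquote out of a page's hiccup takes only tree-seq and a filter:

(def page
  [:article
   [:p "Some prose."]
   [:blockquote [:p "A quotation."]]
   [:div [:blockquote [:p "A nested quotation."]]]])

;; walk the whole tree, keeping only the blockquote elements
(->> (tree-seq vector? rest page)
     (filter #(and (vector? %) (= :blockquote (first %)))))
;; => ([:blockquote [:p "A quotation."]]
;;     [:blockquote [:p "A nested quotation."]])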

One thing that occurred to me when thinking about pulling quotes out of plaintext: while an individual paragraph should be the basic semantic unit of my own writing, the basic semantic unit of a quotation or reference should be a sequence of one or more paragraphs. This preserves more of the structure of the origin and aids in its display in other contexts. For storage purposes, though, it should merely be (for now) a sequence of strings. Worrying about the internal structure of the quote itself (lists, etc) can come later when the specs get more refined.

Extensible textual notation, part 7

With a markdown->hiccup parser in hand, I took on a warm-up exercise to map out the problem space and get comfortable with parsing the data I've already dumped into these markdown files. I defined some contracts for the data formats I want to pull from the textual information using clojure.spec, which will serve as constraints on the expected behavior of other parsers I write to replace Markdown. It's pretty bare-bones so far, with stuff like the following:

(require '[clojure.spec.alpha :as spec])

(defn same-size? [colls]
  (apply = (map count colls)))

(defn col-kvs? [table-map]
  (spec/valid? (spec/map-of string? vector?)
               (dissoc table-map ::table-meta)))

(spec/def ::same-size same-size?)
(spec/def ::eq-columns #(same-size? (filter sequential? (vals %))))
(spec/def ::col-kvs col-kvs?)

(spec/def ::table-whole-meta map?)
(spec/def ::table-body-meta map?)
(spec/def ::table-header-meta map?)

(spec/def ::table-meta
  (spec/keys :req [::table-whole-meta ::table-body-meta ::table-header-meta]))

(spec/def ::tidy-table
  (spec/and
   (spec/keys :opt [::table-meta])
   ::col-kvs
   ::eq-columns))

These plus some functions to transform parsed hiccup data structures into these canonical formats will allow me to capture some of the emergent structure of what I've already written here. Using spectomic I can define the constraints using spec and automatically generate datahike schemas from them.

But that's only the starting point. The real purpose here is to replace markdown with a notation format that represents text as data and lets the user easily convey other types of structured data within the text itself. To that end, I have to come up with a different format for notating the documents, a very brief example of which I sketched out above.

There is plenty of prior art for this: I generated the first version of this blog using pollen. I wanted to learn about the way scribble, on which pollen is built, parses plaintext into Racket data structures for manipulation, but the library is quite complex (the documentation is fantastic, but it focuses mostly on the API rather than how the parsers and readers are implemented internally). It also reimplements much of what I intend to use hiccup to do.

Luckily, I have been spared the experience of suffering through the entire scribble codebase by Bogdan Opanchuk's Clojure implementation. I could simply use it directly, but I'm not interested in taking more shortcuts and adding more libraries to this project, especially when I know I'll have to add my own syntax to the notation and the parser rules to support them. I also won't really understand the way these parsers work unless I go forth and implement one myself. Fortunately, the project uses marginalia to give a guided tour through the internals of the code. This library may not be as comprehensive as the original implementation of scribble, but Clojure's expressivity makes it far easier to understand the global structure of this smaller implementation and thus learn from it.

Rather than a plaintext spec that monotonically grows in complexity due to workarounds for the corner cases generated by ambiguous syntax, I'm hoping to define as much of the expected structure as possible using clojure.spec, which strikes the right balance between the rigidity of a BNF grammar and the ambiguity of a plaintext spec. It also brings the additional leverage that comes with defining a spec as code: the ability to generate arbitrary adversarial examples so that corner cases are found more quickly and dealt with in a more systematic fashion.

Extensible textual notation, part 8

One motivation for this concept was an incredibly useful design exercise when I was building a backend system at work: creating a feature matrix for the various sub-components to understand their interactions with one another and how those translate into both library code and user-facing features. (read Evan Miller's whole essay; it's a quick read and a succinct, lucid statement of a very powerful idea). In order to produce one, I had to step outside of my trusty text editor and flip over to Google Sheets to create a NxN grid, fill it up with the pair-wise interactions between the components, and then use an unholy spreadsheet formula to transpose those comma-delimited features into a discrete list with separate references to each column in the body of the matrix:

=unique(transpose(split(join("|", 'Sheet1'!$B$3:$B$17, 'Sheet1'!$C$3:$C$17, 'Sheet1'!$D$3:$D$17, ...), "|")))

It was the most beautiful waterfall planning I've ever done. As you'd expect, looking back over the matrix now, I see how hopelessly out of date it is and how badly it serves its original purpose of defining tasks with enough granularity to yield tickets.

The codebase is always the most up-to-date part of any software project. Everything else tends to lag behind, mostly due to the unavoidable fact that you don't always know what you need to build before you build it. But why can't we write documents in a way that lends itself a little better to the day-to-day work of software development? Why can't I plop a feature matrix right into my readme and then programmatically generate a set of test suites for the features within it? That way, by changing the top-level description of the project, I also change the definition of the software that ensures it functions as intended.

You might say something like "design should be design, and code should be code. Just because a problem is represented a certain way in the design phase doesn't mean that the actual code should be laid out that way." I agree with part of the spirit behind this. Stepping away from the laptop to think through the problem is something that everyone should do more often. I deeply enjoy the more embodied sense of problem-solving that sketching on a whiteboard gives me. But sometimes, you need more than a sketch or description; structured data can represent facts about a codebase that spec documents and architecture diagrams cannot. Keeping this up-to-date means seeing and editing it directly in tandem with the code. It means automatically checking off one of the "to-dos" generated from the structured documentation when a given test passes. It means storing information about failed and successful builds not in some web interface that has no direct interaction with the code but perhaps in the same data structure as the code itself.

Saying something like "code should look like code" assumes that we can only have one canonical way of representing the data that ultimately forms the program. But a richer data structure than flat files (like a CRDT) could be made to form the backend of multiple representations, one "API-first" view according to the code layout and another "feature-first" view according to the table of requirements. What if you could filter down the test suites you run as easily as filtering a spreadsheet? (you might call this way of interacting with the code view-inspired, model-driven, to borrow a term from Chiusano's essay I link above).

Sometimes it feels like our mental model of tooling for software still comes from a pre-network era, where the notion of the software artifact prevails and tools for managing individual artifacts win the day. They define the software we write as a discrete closed system, and then we bolt more complexity back on to get around this model. It's why we ship around many megabytes more code than we need to when checking out a function from a library and it's why revision control operates fundamentally at the level of a single folder rather than the sub-units defined by the code within that folder - we ship around the whole folder because that's what we have isolated and replicable history for. Software still yields artifacts: docker images, executable binaries, etc. That hasn't gone away (and it won't until we're all using ultra-live environments that live up to the legacy of the first Smalltalk systems), but now code is much more likely to be a part of a open system that interacts with build tools, clusters of virtual machines, live data stores, and the like. The mess of information those tools generate informs choices about the code we write, so why not figure out a way to represent the information we need closer to the code itself?

For one possible foundation for a different approach to managing the history of code, see Unison's concept of content-addressed code. Content-addressing and tracking the history of functions as individual units means that they can break out of their original repos while maintaining their lineage. Instead of snapshots of whole libraries, functions could migrate from one library to another, picking up unique changes as they propagate their descendants into that new project's context. Code would have a genealogy rather than dependencies. Similarly, I think there's a lot of potential power in storing arbitrary structured data in the same data structure as the code itself. We haven't really taken the conceptual leap towards developing applications around that model because we default to git for any new project and thus rely on its implicit view of the world. Hopefully that can change.

Extensible textual notation, part 9

Turns out I was more right about needing to implement my own scribble-like syntax than I knew: while taking the Clojure implementation out for a spin, I discovered that it relies heavily on an unmaintained set of reader macros that are incompatible with recent versions of Clojure, because clojure.spec now enforces the compile-time syntax of core macros (like defn).

I'm going to have to write my own text->EDN parser that replicates what scribble relies on reader macros to do. I'm fortunately not trying to alter in any fundamental way how Clojure data is transformed into its AST; I'm just doing some preprocessing on plaintext so it's the right shape when it hits the clojure reader. Into the wilderness.

Extensible textual notation, part 10

While I have a decent enough concept of the source of the data generated through writing in plaintext, I don't yet have a good concept of the target of that data. I intend to use datahike as a persistent storage format, but it's daunting to think about where to start.

Here's a good bootstrapping exercise for understanding the format and how it works: a quotes page in Perun. It will read a quotes.edn file, dump the data parsed out from that quotes file into a datahike db, and then use that data to generate the hiccup content for the page. Once that's in place, I'll have a better idea of the schema I need to yank quotes out of the posts where they're quoted and add them to this DB.
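A sketch of that bootstrapping exercise; the shape of quotes.edn is assumed, and the exact datahike config keys vary between versions:

(require '[clojure.edn :as edn]
         '[datahike.api :as d])

;; assumed shape of quotes.edn:
;; [{:quote/text "..." :quote/author "..." :quote/source "..."} ...]
(def quotes (edn/read-string (slurp "resources/quotes.edn")))

;; in-memory store for a first pass; :schema-flexibility :read skips
;; defining an explicit schema for this sketch
(def cfg {:store {:backend :mem :id "quotes"} :schema-flexibility :read})
(d/create-database cfg)
(def conn (d/connect cfg))
(d/transact conn quotes)

(defn quotes-page
  "Generate the hiccup for a quotes page from the database."
  []
  [:main
   (for [[text author] (d/q '[:find ?text ?author
                              :where
                              [?q :quote/text ?text]
                              [?q :quote/author ?author]]
                            @conn)]
     [:blockquote [:p text] [:footer author]])])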

One important benefit of using functions rather than markup to define quotes is the ability to preserve context by including the references back to the piece of writing containing the quotes. This was one of the great promises of project Xanadu, the ability to see in tandem the multiple layers of context surrounding a link to a passage from another page. I cannot create a system as fully dynamic as Xanadu, but I can use an intermediate evaluation step as text is read from its input format to capture the structure created by the text and its references.

Extensible textual notation, part 11

After a couple of half-hearted attempts to replicate the lozenge syntax of pollen using a Clojure ANTLR parser, I discovered the very new but very fully-featured ash-ra-template library.

While I recognize that building my own parser is a good programming challenge, I also need to ask myself whether I need to undertake it before doing the work that I want to do in a medium that actually supports it. Right now, I'd prefer the latter.

Extensible textual notation, part 12

The text has extended itself beyond the limitations of Markdown. I now have the power of a real programming language at my disposal in my own writing, and I used it to replace every single backtick and bracket that Markdown required. In so doing, I more fluidly composed the structures provided by HTML, using tools designed to manipulate them directly instead of burying them under a pre-selected menu of abstractions. I understand HTML better as a result - its structure was not hidden from me. By using a dynamic, computational medium for writing, I can perform what Jenny Odell calls "context creation" - I can fully express the context of this page's own creation, false starts, half-baked parsers, and all.

I just wish it hadn't taken a once-in-a-generation global crisis to get here.

Paragraph detection within ash-ra-template

Overall, I'm quite pleased with how well ash-ra-template is working to directly create HTML using the power of Clojure and hiccup only when I need it. Now that I know how to make my own rendering library available to ash-ra-template, I'm seamlessly replacing Markdown's defaults with functions that I have total control over. But there's one thing missing from the experience of using markdown or other plaintext formats - paragraph detection. If I want paragraphs now, I have to insert [:p "content"] blocks, which effectively means I'm just writing in hiccup data structures and not plaintext, defeating the purpose of using a templating engine at all.

Luckily, the documentation for pollen suggests a path forward: post-processing the text after template evaluation to infer paragraphs and line breaks. The documentation for the detect-paragraphs function also has some useful test cases for paragraph inference on the basis of pre-existing blocks.
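A sketch of that inference step, assuming the evaluated template arrives as a mix of strings and hiccup elements (the real function also needs to know which elements are allowed to contain paragraphs):

(require '[clojure.string :as str])

(defn detect-paragraphs*
  "Wrap runs of plain text in [:p ...] forms, splitting on blank lines.
  Existing hiccup elements pass through untouched."
  [contents]
  (mapcat (fn [c]
            (if (string? c)
              (->> (str/split c #"\n\n+")
                   (remove str/blank?)
                   (map (fn [p] [:p (str/trim p)])))
              [c]))
          contents))

(detect-paragraphs* ["First paragraph.\n\nSecond paragraph." [:h1 "A header"]])
;; => ([:p "First paragraph."] [:p "Second paragraph."] [:h1 "A header"])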

Tokenizing the text into paragraphs after evaluation also potentially allows for the recording of the text as facts in a database (see above). I'm not there yet, but I'm about to merge into master and leave Markdown behind for good.

Holotype: further steps towards simplicity

I made good progress generating content with ash-ra-template - until I wanted to dynamically render an image as part of the build process. I ran into two walls imposed by the closed evaluation environment of ShimDandy: no access to the local filesystem, and no access to the Java classes needed to render images with clojure2d.

Fortunately, I recently discovered comb, a library much like ash-ra-template written by the author of hiccup. It's pretty much exactly what I was thinking about building above, and its parser is exceptionally simple. It lets me seamlessly embed generative artwork into my pages as easily as I can use my templating functions to emphasize text or add headers. You can see an example here.

In addition to that, the fact that the templates are evaluated in the same context as the rest of my code adds the following benefits:

It even uses the same delimiters, which made it trivial to port all my existing pages over to the new library and simplify my project into a single namespace.

(loop (render (eval (create))))

A hylozoic approach to site rendering

I wrote this entry using an assemblage of tools intended to give me real-time feedback on the HTML and CSS generated by my templating functions.

Local development is furnished by nasus. Using this simple and lightweight HTTP server, I can launch clojure -A:serve to fire up a web server on localhost to preview what page changes look like as soon as the render loop finishes.

Originally, I performed batch builds of all files or of individual files by passing the names of files to render to respatialized.build/-main, then refreshed my web browser to get the update via nasus. But the latency and context switching of manually rendering files broke my focus, which I want to keep on what I write, not on actively monitoring the build. I wanted a create-eval-render-loop.

So I rewrote respatialized.build to support a simple file watch loop that performed a per-page build on any changed file. It felt satisfying to get that feedback so quickly! But it promptly stopped working on the first misplaced parenthesis. I wanted to enable this event loop to recover from failure easily, so I created a rudimentary self-healing mechanism.

Invoking the shell command clojure -M -m respatialized.build performs a first-pass render, loads all the page contents into an atom, and then uses hawk to watch each HTML template file in the source directory for changes. On a file change, the rerender loop triggers. If the template within the changed file renders successfully, the atom gets updated and the new HTML gets written to the target directory. If it hits an error in parsing or evaluating the page content, the atom doesn't get updated, thereby preserving stable state as a fallback mechanism.
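The core of that loop looks roughly like this; render-page and write-page! stand in for the real template-evaluation and file-writing functions:

(require '[hawk.core :as hawk])

;; last known-good rendered output for each source file
(def pages (atom {}))

(defn rerender!
  "Re-render a single changed template, keeping the previous output if
  parsing or evaluation throws."
  [file]
  (try
    (let [html (render-page (slurp file))]
      (swap! pages assoc (.getPath file) html)
      (write-page! file html))
    (catch Exception e
      (println "Render failed, keeping last good version:" (.getMessage e)))))

(hawk/watch! [{:paths ["content/"]
               :handler (fn [ctx {:keys [file kind]}]
                          (when (#{:create :modify} kind)
                            (rerender! file))
                          ctx)}])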

Restarting this event loop is a little slow, owing mostly to Clojure rebuilding the classpath on startup. But there's a way around that. When I needed to redefine some of the library code like respatialized.render/header on the fly, it wasn't a problem. respatialized.build/-main also runs trivially in a REPL. All I had to do was switch to that namespace, call (future (-main)), and I was off. I could extend, rewrite, or add functions from respatialized.render, re-refer the namespace, and use the new definitions without restarting the loop.

I have to use it with caution; I once started a recursive succession of multiple file watchers because I forgot to wrap "(future (-main))" in a string. A more defined approach to managing application components could avoid that, but I was surprised by the fecundity of my own creation, even as it slowed my text editor to a crawl.

Growing a website generator

This multi-entry essay has detailed the long and strange history of its own creation. The critiques, gripes, and observations I make here outline my motivation for rejecting existing static website generators, which can largely be summarized as: any sufficiently complex static site generator contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of a graph database and query language.

Through my experience with infrastructure tooling that stores the definitions of critical systems in a state where consistency is not enforced, composition is impossible, and you have no query capabilities, I realized that my observations about static website generators extended in some respects to the templating tools underlying YAML-based configuration as code.

While these critiques formed the background context of my perpetual rewriting of this website's code (pollen -> boot/perun -> ash-ra-template -> comb -> respatialized.parse/respatialized.render), in many ways, I did not intentionally design the components that have now become a robust part of ensuring that I generate pages in a consistent format. The validation logic for HTML I currently use is an interesting example of this.

Accidentally backing into a structure for HTML

If you go back far enough in the history of this codebase, you can find some early failed experiments with Rasmus Andersson's raster CSS grid system. It appealed to me because it specified grid cells explicitly, as HTML elements, rather than implicitly as presentational CSS rules. The goal for me then, as now, was to assign a semantics to document structure, one that captures the idea behind juxtaposing two sections of text with one another. I eventually abandoned raster for tachyons, which, despite being CSS rather than HTML, had a "good enough" grid model for presentation. I now realize I was unable to effectively leverage raster because I didn't know the HTML document model of flow content and phrasing content well enough to see how to fit what I wrote into raster's expansion of that model.

I happened to write my paragraph detection algorithm while attempting to use raster for the second time. This meant that I designed the function such that a double linebreak would form a paragraph break within a grid cell and a triple linebreak would form a break between grid cells. I didn't realize how important the rules of where a paragraph can begin and end would eventually become.

I wanted to identify pathological inputs and corner cases where my paragraph detection algorithm might break down. Instrumenting respatialized.document/detect-paragraphs with clojure.spec.alpha for automated generative testing seemed like the best way to get there. However, with the recursive document structures of hiccup forms, performance rapidly became a limiting factor on input generation.

I already knew about malli, a library for expressing specs as pure EDN that focuses heavily on high-performance use cases. However, when I resumed work on raster forms and paragraph detection for them, the initial implementation of sequence expressions had not yet been merged into the codebase, which made validation of nested hiccup forms a non-starter. I instead chose to use minimallist, which had a working implementation of sequence expressions and recursive hiccup forms that I gradually refined into a schema for non-interactive HTML forms by following the MDN HTML spec. While the minimallist design document says performance isn't a primary goal, it was more than performant enough for some simple generative tests of the paragraph detection algorithm.

By the time I had refined this work enough to re-render my existing pages, sequence expressions had landed in malli. With one iteration of the HTML spec under my belt, and with corresponding tests for form validity, I re-translated my work from one spec library to another and expanded it to encapsulate the root elements of HTML.

Without even intending to design it at the outset, I now had a model for HTML expressive enough to reuse within my paragraph detection algorithm, which could then process forms differently depending on the surrounding context - paragraphs cannot be contained within paragraphs in HTML. Before long, I had a test suite that continuously conformed all of my existing pages to the top-level model of HTML. I had assigned a semantic structure to the forms of the document that I wanted to process, one that captures the hierarchy of HTML forms far better than anything I could have come up with if I had tried to design it from scratch.
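To give a flavor of what such schemas look like, here is a minimal sketch using malli's sequence expressions and a recursive local registry - not the schema this site actually uses:

(require '[malli.core :as m])

;; phrasing content can nest freely, but a paragraph may not contain
;; another paragraph
(def phrasing
  [:schema
   {:registry
    {::phrasing
     [:or
      :string
      [:catn
       [:tag [:enum :em :strong :a :span]]
       [:attrs [:? [:map-of :keyword :any]]]
       [:contents [:* [:schema [:ref ::phrasing]]]]]]}}
   [:ref ::phrasing]])

(def paragraph
  [:catn
   [:tag [:= :p]]
   [:attrs [:? [:map-of :keyword :any]]]
   [:contents [:* phrasing]]])

(m/validate paragraph [:p "text with " [:em "emphasis"]]) ;; => true
(m/validate paragraph [:p [:p "a nested paragraph"]])     ;; => false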

None of the validation logic that now exists would work without the efforts of Vincent Cantin, as well as Tommi Reiman, Pauli Jaakkola and the rest of the contributors to malli. I no longer have to shy away from the complexity of HTML - I now have a spec expressive enough to give me the leverage I need over it.

Fabricate


...if you can find a better digital-publishing tool, use that. But I’m never going back to the way I used to work.
Matthew Butterick

I have simplified the code that generates this website to the point where I can successfully extract it into its own static site generation library, called fabricate. This source repo now consumes that code instead of defining its generation process itself. The holotype has become a prototype.

Using the organizing concept of finite-state machines, I have developed an extensible method of defining the steps necessary to create a collection of HTML pages.
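Not fabricate's actual API, but a minimal sketch of the idea: each state names the shape a page is currently in, each transition function advances it, and a page is finished when no transition applies. read-template, eval-template, render-html, and the file path are hypothetical stand-ins.

;; read-template, eval-template, and render-html are hypothetical stand-ins
(def transitions
  {:raw       (fn [page] (-> page
                             (assoc :forms (read-template (:source page)))
                             (assoc :state :parsed)))
   :parsed    (fn [page] (-> page
                             (assoc :hiccup (eval-template (:forms page)))
                             (assoc :state :evaluated)))
   :evaluated (fn [page] (-> page
                             (assoc :html (render-html (:hiccup page)))
                             (assoc :state :rendered)))})
;; :rendered has no transition, so it is a terminal state

(defn build-page
  "Advance a page through the state machine until it reaches a terminal state."
  [page]
  (if-let [step (transitions (:state page))]
    (recur (step page))
    page))

(build-page {:state :raw :source (slurp "content/some-page.html")})

Adding a new step - say, validating the finished HTML against a schema - then means adding one entry to the map rather than rewriting the build pipeline.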

You can read more about the intent and point of view informing Fabricate here. I hope it brings the power and flexibility of Pollen's publishing system to the Clojure ecosystem. I also hope it means that I can write about things apart from the website generation process on this site.