Reliability and recovery

The real test of a platform is not the good day; it is the bad one. What happens when an index corrupts, when two editors save at once, when a token leaks, when the site goes dark? CTXR is built so that the answer to each of these is short and known in advance, rather than improvised under pressure.

Derivability is the safety net

Almost nothing the platform serves is irreplaceable, because almost nothing is original. The raw nodes are the single source of truth; the index and the compiled graph are derived from them. If a derived artefact breaks — a stale compile, a corrupt index — you do not restore it from a backup, you regenerate it. Recovery becomes a command, not a procedure.

This is the operational payoff of the file-based design: the surface area of “things that can be lost permanently” shrinks to just the raw content, and everything else can be rebuilt from it on demand.
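As a rough illustration of "recovery becomes a command", the sketch below rebuilds an index from nothing but the raw node files. The layout is an assumption made for the example, not CTXR's documented format: nodes are taken to be JSON files under a `nodes/` folder, the derived index is taken to be `index.json`, and `rebuild_index` is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path


def rebuild_index(space_dir: Path) -> dict:
    """Regenerate the derived index using only the raw node files."""
    nodes_dir = space_dir / "nodes"        # assumed location of raw nodes
    index_path = space_dir / "index.json"  # assumed derived artefact

    index = {}
    for node_file in sorted(nodes_dir.glob("*.json")):
        node = json.loads(node_file.read_text(encoding="utf-8"))
        index[node_file.stem] = {
            "title": node.get("title", ""),
            "links": node.get("links", []),
        }

    # The old index is never read or repaired; it is simply overwritten.
    index_path.write_text(
        json.dumps(index, indent=2, sort_keys=True), encoding="utf-8"
    )
    return index


if __name__ == "__main__":
    space = Path(tempfile.mkdtemp())
    (space / "nodes").mkdir()
    (space / "nodes" / "intro.json").write_text(
        json.dumps({"title": "Intro", "links": ["setup"]}), encoding="utf-8"
    )
    (space / "index.json").write_text("{ corrupt", encoding="utf-8")
    print(rebuild_index(space))
```

The function never inspects the broken index, which is the point: a derived artefact needs no repair logic when it can always be regenerated.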

Two backups, two jobs

Code and content are protected separately because they fail separately:

Code lives in version control. Every change is tracked, diffable, and revertible. “What changed and when?” is a git question with an exact answer.
Content is backed up at the storage layer. Because a space is a directory, backing one up means copying a folder, and restoring one means putting the folder back and regenerating the derived parts.
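A content backup along these lines could be as small as a folder copy. The sketch below is illustrative only: the derived file names in `DERIVED` and the functions `backup_space` and `restore_space` are assumptions, and a real storage-layer backup would more likely be a snapshot or a sync job than a Python script.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical names for derived artefacts; they are skipped because
# they can be regenerated from the raw content after a restore.
DERIVED = ("index.json", "graph.json")


def backup_space(space_dir: Path, backup_root: Path, label: str) -> Path:
    """Copy a space directory to backup_root, leaving derived files out."""
    dest = backup_root / f"{space_dir.name}-{label}"
    shutil.copytree(space_dir, dest, ignore=shutil.ignore_patterns(*DERIVED))
    return dest


def restore_space(backup_dir: Path, space_dir: Path) -> Path:
    """Put the folder back; derived parts are regenerated afterwards."""
    if space_dir.exists():
        shutil.rmtree(space_dir)  # a cautious tool might move it aside instead
    shutil.copytree(backup_dir, space_dir)
    return space_dir
```

After `restore_space` returns, the space holds only raw content, so the last step of a restore is the same regeneration used for any other broken derived artefact.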

Concurrency without surprises

Writes are atomic and serialized — file locks ensure that two saves cannot interleave and leave a node half-written. On top of that, editors get an advisory signal when someone else is already working on a node, so collisions are visible rather than silent. Stricter modes are available where a space needs hard locking rather than a polite heads-up.
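One common way to get writes that are both serialized and atomic on a POSIX system is to combine an exclusive lock with a write-to-temp-then-rename step. The sketch below shows that general pattern, not CTXR's actual implementation; `save_node` and the `.lock` sidecar file are hypothetical, and `fcntl` is not available on Windows.

```python
import fcntl
import os
import tempfile
from pathlib import Path


def save_node(path: Path, content: str) -> None:
    """Write a node so readers see the old version or the new one, never a mix."""
    lock_path = path.with_name(path.name + ".lock")  # assumed sidecar lock file
    with open(lock_path, "w") as lock_file:
        # Exclusive lock: a second writer blocks here until the first finishes.
        fcntl.flock(lock_file, fcntl.LOCK_EX)

        # Write the full content to a temp file in the same directory.
        fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                tmp.write(content)
                tmp.flush()
                os.fsync(tmp.fileno())
            # Swap the finished file into place in one step.
            os.replace(tmp_name, path)
        except BaseException:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)
            raise
    # Closing lock_file releases the lock.
```

The lock prevents two saves from interleaving, and the rename means a crash mid-write leaves the previous version of the node intact rather than a half-written file.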

The system maintains itself

Housekeeping runs alongside normal operation rather than waiting for a person to remember it: expired sessions are cleared, caches are pruned, soft-deleted nodes pass out of their retention window, and logs are rotated so they cannot grow without bound. A space stays healthy without an administrator babysitting it.
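Most of these chores reduce to "delete what is older than its retention window" plus bounded logs. The sketch below assumes that sessions, cache entries and soft-deleted nodes are plain files whose modification time marks their age; `purge_older_than`, `rotating_logger` and the logger name are hypothetical, while `RotatingFileHandler` is the standard-library class.

```python
import logging
import logging.handlers
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def purge_older_than(
    directory: Path, max_age_seconds: float, now: Optional[float] = None
) -> List[str]:
    """Delete files whose mtime is older than the retention window."""
    now = time.time() if now is None else now
    removed = []
    for entry in sorted(directory.iterdir()):
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            entry.unlink()
            removed.append(entry.name)
    return removed


def rotating_logger(log_path: Path) -> logging.Logger:
    """Return a logger whose file is capped in size and in number of backups."""
    logger = logging.getLogger("ctxr.housekeeping")  # hypothetical logger name
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=1_000_000, backupCount=5
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    sessions = Path(tempfile.mkdtemp())
    stale = sessions / "stale.session"
    stale.write_text("x", encoding="utf-8")
    two_hours_ago = time.time() - 7200
    os.utime(stale, (two_hours_ago, two_hours_ago))
    print(purge_older_than(sessions, max_age_seconds=3600))
```

The same function can be pointed at each directory with a different window, run from whatever scheduler the deployment already has.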

A crisis playbook exists

When something does go wrong, there is a known path rather than panic. Conceptually:

Symptom	Path
Site down	Check the health endpoint, read the logs, rebuild the derived graph
Data corruption	Stop writes, diff against version control, restore, rebuild
Leaked token	Revoke the credential, then audit what it touched

Deployment supports this too: a lint gate runs before anything ships, releases go out by push-to-deploy, and a bad release can be rolled back rather than hot-patched in place.

The point of a playbook is that the decisions are made in advance. None of these steps is invented during the incident — they follow from a design where content is files and everything else is derived from it.
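One way to keep "decisions made in advance" honest is to store the playbook as data, so that the steps are looked up during an incident rather than recalled. The sketch below encodes the three rows of the table above; the `PLAYBOOK` structure and `steps_for` are illustrative assumptions, not part of CTXR.

```python
from typing import Dict, List

# The symptom-to-path table, encoded as ordered steps.
PLAYBOOK: Dict[str, List[str]] = {
    "site down": [
        "check the health endpoint",
        "read the logs",
        "rebuild the derived graph",
    ],
    "data corruption": [
        "stop writes",
        "diff against version control",
        "restore",
        "rebuild",
    ],
    "leaked token": [
        "revoke the credential",
        "audit what it touched",
    ],
}


def steps_for(symptom: str) -> List[str]:
    """Return the pre-agreed steps for a symptom, in order."""
    key = symptom.strip().lower()
    if key not in PLAYBOOK:
        # A missing entry is a gap to close before the next incident.
        raise KeyError(f"no playbook entry for {symptom!r}")
    return list(PLAYBOOK[key])


if __name__ == "__main__":
    for number, step in enumerate(steps_for("Leaked token"), start=1):
        print(number, step)
```

An unknown symptom raises an error instead of returning an improvised answer, which mirrors the intent of the playbook itself.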