Architecture
The VM
- Deployment from host over Tailscale SSH to a remote VM TTY
When `nixos-rebuild switch` activates a new generation on node1 with

```shell
export NIX_SSHOPTS="-t"
nixos-rebuild-ng switch \
  --flake .#node1 \
  --target-host xameer@100.105.10.61 \
  --build-host xameer@100.105.10.61 \
  --use-remote-sudo \
  --ask-sudo-password \
  --impure \
  -j 4
```

systemd restarts services – including tailscaled itself – if it or any of its dependencies changed. The moment tailscaled restarts, the SSH connection over Tailscale drops. The copy phase (`nix copy`) was mid-flight when this happened, leaving the store in a partially written state: paths recorded in the SQLite DB but missing from disk, with foreign key constraints preventing cleanup.
A small Python script performs the SQL operations for the dangling store paths. `store-fix.py` automates the tasks: find all DB entries whose paths don't exist on disk, delete their references in both directions, then re-enable constraints. After that, `nix-collect-garbage` and a fresh rebuild restored the system.
The runner token and attic token failures that followed were secondary - the store corruption had wiped the previous activation, so agenix secrets from the old generation were gone and services re-registered with stale or missing credentials.
The fix order that worked: repair store → rebuild → fix runner token newline → regenerate attic token against the live HS256 secret.
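A minimal sketch of what such a script does, assuming the default Nix database location and the `ValidPaths`/`Refs` schema; the function name and dry-run flag are illustrative, not quoted from the actual store-fix.py:

```python
import os
import sqlite3

DB = "/nix/var/nix/db/db.sqlite"  # default Nix store database location

def fix_dangling(db_path=DB, dry_run=True):
    """Delete ValidPaths rows whose store path no longer exists on disk,
    removing their Refs rows in both directions first."""
    con = sqlite3.connect(db_path)
    con.execute("PRAGMA foreign_keys = OFF")  # bypass constraints during cleanup
    # materialise the list first so we are not deleting while iterating
    dangling = [
        (rowid, path)
        for rowid, path in con.execute("SELECT id, path FROM ValidPaths")
        if not os.path.exists(path)
    ]
    for rowid, path in dangling:
        if dry_run:
            print("would delete", path)
            continue
        # delete references in both directions: as referrer and as referenced
        con.execute("DELETE FROM Refs WHERE referrer = ? OR reference = ?",
                    (rowid, rowid))
        con.execute("DELETE FROM ValidPaths WHERE id = ?", (rowid,))
    con.execute("PRAGMA foreign_keys = ON")  # re-enable constraints
    con.commit()
    con.close()
    return [p for _, p in dangling]
```

Run with `dry_run=True` first to see what would be removed before mutating the store database.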
- Disk Pressure
The VM runs a non-trivial set of Kubernetes API-managed services, with state encrypted in etcd and deployed with `kubectl`. Some Nix store paths' closures fail to copy to the remote VM due to OOM errors; at times the gitlab-runner hosted by this VM (which deploys this blog) also fails for the same reason. To observe:
```shell
watch -n2 'free -h && echo "---" && ps aux | head -5'
# for suspicious units
top -bn1 | head -20 | grep etcd
# to observe disk usage
df -h /home/xameer/clone/
# direnv and devenv paths from kubectl deployments; deployments can also run
# from GitLab CI with k8s executors, but as long as the runner is hosted here
# it too will die unless fixed
```
Once `top` has shown the zombie systemd units or processes causing the OOM, stop them with `systemctl`, clean the garbage, and rebuild.
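The same observation can be scripted. A minimal sketch that reads `/proc` directly (so it works even when `ps` itself is under memory pressure); the function name and threshold are illustrative:

```python
import os

def high_rss(threshold_kb=500_000):
    """Return (pid, comm, rss_kb) for processes whose resident set
    exceeds threshold_kb, largest first. Reads /proc/<pid>/status."""
    hogs = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/status") as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:  # process exited while scanning
            continue
        # kernel threads have no VmRSS line; treat them as 0
        rss = int(fields.get("VmRSS", "0 kB").strip().split()[0])
        if rss > threshold_kb:
            hogs.append((int(pid), fields["Name"].strip(), rss))
    return sorted(hogs, key=lambda t: -t[2])
```

Calling `high_rss()` periodically from a timer unit would surface the etcd memory balloon before the OOM killer does.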
- Binary Caching layer
The Problem – Deployment state
Nix captures the declared state – packages, service configurations, volume mounts, environment variables, and their dependencies – reproducibly and hermetically. What it cannot capture is operational state. Hercules CI and Arion address this gap.
Traditional CI/CD often results in:
[Declared Nix] -> [Imperative Runner Setup] -> [Imperative Deploy Script] -> [Operational State]
Hercules CI/Arion results in:
[Declared Nix + Hercules Agent Config] -> [Hermetic Build] -> [Declarative Effect/Arion] -> [Operational State]
They make the operational state, such as which runner handles a job or how containers are structured, part of the declared, version-controlled Nix configuration, moving it closer to the ideal of perfect reproducibility.
Terraform and Kubernetes do not reduce this gap – they shift it. Terraform manages cloud resource declarations,
but still requires manual secret rotation, manual runner registration, and manual responses to OOM events,
unless paired with auto-scaling and pod disruption budgets.
Kubernetes adds liveness probes, resource limits, and automatic
pod restarts, which would have caught the etcd memory balloon automatically via `resources.limits.memory` and restarted the pod before it consumed the host. But Kubernetes itself runs on the same VM here, making the cluster a tenant of the system it is supposed to manage – which is why kubelet's own `MemoryPressure` condition appeared in the logs as a symptom of the problem it was meant to prevent.
The gap between declared and operational state is where all the debugging happened.
`journalctl -u gitlab-runner`, `systemctl show minio | grep Condition`, and
`dmesg | grep oom` were the tools that made operational state visible. None of this
information is available to Nix at evaluation time.
The debugging happened where the build environment, runner configuration, and service state were in constant flux.
The same class of errors – command not found, Nix daemon socket failures, Cachix auth
failures, OOM kills, and stuck jobs – recurred across multiple attempts not because the
fixes were wrong, but because each fix exposed the next layer of drift between the declared
configuration and the running system.
Reasoning about the recurring errors
- The `command not found` errors for `grep`, `which`, and `ls` inside the CI container were fixed at least twice during this process. They recurred because the fix – mounting `/nix:/nix:ro` as a single volume – overwrote the container's own Nix store, destroying its internal binaries. Reverting to individual mounts (`/nix/store:/nix/store:ro`, `/nix/var/nix/db:/nix/var/nix/db:ro`) restored the container's tools, but the daemon socket mount also needed `:ro` removed to allow IPC. Each of these changes required a `nixos-rebuild switch` to propagate from `gitlab-runner.nix` into the running `config.toml`, and a service restart to apply. The window between declaration and activation was where drift lived.
- Stuck CI jobs were caused by the runner losing contact with GitLab after an OOM event restarted the gitlab-runner service with a new system ID. GitLab continued showing the old runner as online while the new instance was pending Protected branch approval. The fix was to assign and mark the new runner instance as Protected in the GitLab UI – a manual step that cannot be automated without a GitLab API token, and that must be repeated after every runner re-registration.
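The working mount set, sketched as a gitlab-runner `config.toml` fragment. The `[runners.docker]` section layout follows gitlab-runner's Docker executor config; the daemon socket path is an assumption based on the standard Nix daemon socket location, not quoted from the actual config:

```toml
[[runners]]
  [runners.docker]
    volumes = [
      "/nix/store:/nix/store:ro",             # store contents, read-only
      "/nix/var/nix/db:/nix/var/nix/db:ro",   # Nix database, read-only
      # daemon socket must NOT be :ro, or IPC with the Nix daemon fails
      "/nix/var/nix/daemon-socket:/nix/var/nix/daemon-socket",
    ]
```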
- Speed of deployment
Cachix
Managed CDN-backed binary cache. Fast substitution via global edge network. Used as the primary pull cache for both pipelines.
```yaml
before_script:
  - cachix authtoken "$CACHIX_AUTH_TOKEN"
  - cachix use "$CACHIX_CACHE"
```

Push after build: host CI uses `nix path-info --recursive`, blog CI uses `nix develop --print-out-paths`:
```shell
# Host CI full closure push
nix path-info --recursive "$spec" \
  | xargs cachix push -j 2 -n 2 $CACHIX_CACHE_NAME
# Blog CI devShell closure push
nix develop .#ci --print-out-paths 2>/dev/null \
  | xargs nix-store -qR \
  | cachix push -j 2 -n 2 "$CACHIX_CACHE"
```

`cachix push` in the primary build job must not use `|| true`: if it fails, the closure is not in cache and downstream jobs will fail pulling paths. `|| true` is only appropriate for supplementary re-push steps.
Atticd
Self-hosted binary cache server backed by Cloudflare R2 (zero egress). Runs on localhost:8083 on thinkpad for host CI (shell executor), on node1 for blog CI (Docker executor with network_mode = host).
```yaml
before_script:
  - attic login local http://127.0.0.1:8083 $ATTIC_TOKEN    # host CI
  - attic login local http://localhost:8083 "$ATTIC_TOKEN"  # blog CI
  - attic use local:blog-cache                              # configure as substituter
```

Push after build:
```shell
# Host CI
nix path-info --recursive "$spec" \
  | xargs attic push -j 2 dev-cache || true
# Blog CI
nix develop .#ci --print-out-paths 2>/dev/null \
  | xargs nix-store -qR \
  | xargs attic push -j 2 blog-cache || true
```

Atticd push always uses `|| true`: it is supplementary to Cachix and never blocks the pipeline. If R2 is unreachable, the build still succeeds via Cachix.
On the VM, though, atticd binds to 0.0.0.0:8083 on node1 so that the Docker CI container (which gets a different network namespace) can reach it via 172.17.0.1:8083 (the Docker bridge gateway). If it bound to 127.0.0.1 only, the container couldn't reach it. The funnel/tunnel is for the comment-api and is unrelated. This is correct behaviour.
Does regenerating the token destroy the cache?
No. The attic token is a JWT signed with the HS256 key. It controls access to the cache, not the cache data itself. The chunks in R2 and the database records are untouched by token rotation. attic login just writes a new JWT to ~/.config/attic/config.toml. The cache data in R2 survives indefinitely unless you explicitly delete it.
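To see why, here is a minimal sketch of HS256 signing with only the standard library (illustrative, not Attic's actual token code): producing a new token is a computation over the secret and the claims; the stored chunks never enter it.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_hs256(claims: dict, secret: bytes) -> str:
    """Minimal HS256 JWT: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_hs256(token: str, secret: bytes) -> bool:
    """Recompute the MAC over header.payload and compare in constant time."""
    header, payload, sig = token.split(".")
    expected = hmac.new(secret, f"{header}.{payload}".encode(),
                        hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)
```

Rotation is just calling `sign_hs256` again; verification only needs the live secret, which is why the earlier regenerated token had to match the live HS256 secret.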
R2 Storage
Single bucket `attic-cache` serves both logical caches. No Cloudflare UI changes are needed when adding new caches; atticd routes by key prefix internally.
```shell
# Create logical caches (one-time setup)
attic cache create blog-cache
attic cache create dev-cache
# Verify
attic cache info blog-cache
attic cache info dev-cache
```

Global deduplication means a 500MB GHC derivation shared between dev-cache and blog-cache is stored once in R2. Storage cost is proportional to unique content, not cache count.
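A toy model of that deduplication (names illustrative, not Attic's implementation): the bucket is keyed by content hash, and each logical cache holds only references.

```python
import hashlib

class ChunkStore:
    """Content-addressed store: identical chunks are stored once;
    logical caches keep only hashes pointing into the shared bucket."""
    def __init__(self):
        self.chunks = {}   # hash -> bytes (the "R2 bucket")
        self.caches = {}   # cache name -> list of chunk hashes

    def push(self, cache: str, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(key, data)          # stored once, ever
        self.caches.setdefault(cache, []).append(key)
        return key

    def storage_bytes(self) -> int:
        """Cost is proportional to unique content, not cache count."""
        return sum(len(c) for c in self.chunks.values())
```

Pushing the same derivation to both caches adds a second reference, not a second copy.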
Auto-Cancel Redundant Pipelines
```yaml
workflow:
  auto_cancel:
    on_new_commit: interruptible
pages:
  interruptible: true
```

Cancels the older pipeline immediately when a newer commit is pushed. Cachix push is atomic per path; a cancellation mid-push does not corrupt the cache.
The AWS Credential Chain Problem
Attic uses the official aws-sdk-s3 Rust crate (1.96.0) with aws-config (1.8.1).
The storage backend initialisation runs the full AWS credential provider chain, including an EC2 IMDS lookup at the link-local metadata IP (169.254.169.254). On a laptop or any non-EC2 host, this times out after 5 seconds
before falling back to environment variables.
Without AWS_EC2_METADATA_DISABLED=true in your atticd environment file, every
credential refresh attempt stalls for 5 seconds hitting a non-existent EC2 metadata
endpoint. This manifests as identity resolver timed out after 5s in atticd logs and
cascading 500 errors to the push client.
The credential config in s3.rs shows why env vars are the right approach for R2:
```rust
if let Some(credentials) = &config.credentials {
    builder = builder.credentials_provider(Credentials::new(
        &credentials.access_key_id,
        &credentials.secret_access_key,
        None,
        None,
        "s3",
    ));
}
```

If `credentials` is not set in atticd.toml, the SDK falls through to env vars.
Either works — but env vars via agenix keeps secrets out of the nix store.
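A toy model of the lookup behaviour described above (not the SDK's actual code; provider names and ordering are simplified to the two steps that mattered here):

```python
import os

def resolve_credentials(env=None, imds_lookup=None):
    """Sketch: IMDS is consulted unless AWS_EC2_METADATA_DISABLED=true;
    on a non-EC2 host that consultation blocks ~5s and yields nothing,
    after which environment variables are used."""
    env = os.environ if env is None else env
    if env.get("AWS_EC2_METADATA_DISABLED", "").lower() != "true":
        if imds_lookup is not None:
            creds = imds_lookup()   # on a laptop: stalls, then returns None
            if creds:
                return creds
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"access_key_id": key, "secret_access_key": secret}
    return None
```

Setting `AWS_EC2_METADATA_DISABLED=true` short-circuits the stalling step entirely, which is exactly what the atticd environment file does.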
---
Execution Context and Network Identity
This was the least obvious class of failure, and the one that explained the most.
Atticd binds to 0.0.0.0:8083 on node1. The address 0.0.0.0 is a bind directive,
not a routable address. What is actually reachable depends on where the request
originates:
- From the host machine: `127.0.0.1:8083` resolves to node1 via Tailscale or SSH tunnel.
- From node1 itself: `127.0.0.1:8083` is the loopback interface and reaches atticd directly.
- From a Docker container on node1: `127.0.0.1` resolves to the container's own loopback. Atticd is not there. The correct address is `172.17.0.1`, the Docker bridge gateway.
127.0.0.1 is not a stable address. It is a namespace-relative promise. In a Docker
container, localhost is the container. The host is 172.17.0.1. In the shell executor
or on the host, localhost is the machine running atticd.
This distinction is invisible in configuration files and only becomes apparent when the
same substituters entry works in one context and silently fails in another.
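One way to make the execution context explicit rather than implicit, as a sketch (the `/.dockerenv` marker is a common convention for detecting Docker containers, not a guarantee, and the bridge gateway address assumes Docker's default bridge):

```python
import os

def atticd_endpoint(port=8083):
    """Pick the atticd address by execution context: inside a Docker
    container, localhost is the container itself, so target the default
    bridge gateway instead of the loopback."""
    in_container = os.path.exists("/.dockerenv")
    host = "172.17.0.1" if in_container else "127.0.0.1"
    return f"http://{host}:{port}"
```

A CI script that computes the endpoint this way carries its namespace assumption in code instead of in a config file that is only valid for one context.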
The consequence was that dev-cache and blog-cache, though served by the same atticd
instance, behaved as if they were different services:
| Cache | Endpoint used | Reachable from |
|---|---|---|
| dev-cache | 127.0.0.1:8083 | host, shell executor |
| blog-cache | 172.17.0.1:8083 | Docker CI containers |
The caches are logically identical. The difference is entirely in which endpoint was
registered at creation time, which determined where the binary cache endpoint was
advertised in nix-cache-info.
If a cache is created with atticd reachable at 127.0.0.1, the binary cache endpoint
recorded in atticd’s database reflects that address. Nix substituters will attempt that
address regardless of where the build is running. Recreating the cache with the correct
endpoint is the only fix – there is no in-place address migration.
401 Is Not a Failure
A 401 Unauthorized from the cache endpoint is correct behavior for a private cache.
Nix substitution (downloads) does not require authentication. attic push and API
operations do. A 401 from curl confirms the server is reachable and auth is
functioning. It is not a signal that the cache is broken.
The failure modes that actually indicate a problem are connection refused (wrong
address or atticd not running), 404 (cache does not exist), and silent substituter
fallback (Nix tried the cache, got no response, fell back to Cachix without logging why).
The substituter fallback is the hardest to diagnose because Nix does not emit a warning
when it skips a substituter. --verbose or --debug on nix build will show which
substituters were tried and which paths were pulled from where. Without this, a misconfigured
atticd address looks identical to a cache miss.
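Those distinctions can be encoded in a small probe helper; a sketch, with illustrative diagnosis strings:

```python
def classify(status):
    """Map a probe of the cache endpoint to a diagnosis.
    `status` is the HTTP status code, or None for connection refused."""
    if status is None:
        return "down: wrong address or atticd not running"
    if status == 404:
        return "missing: cache does not exist"
    if status == 401:
        return "healthy: reachable, auth enforced (expected for a private cache)"
    if 200 <= status < 300:
        return "healthy: reachable"
    return f"unexpected: HTTP {status}"
```

The point of the mapping is the middle row: 401 lands in the healthy bucket, not the failure bucket.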
Root vs User Token Split
nixos-rebuild switch runs as root. The attic token lives at
~/.config/attic/config.toml for the invoking user. Root has no token. This means
attic push from a nixos-rebuild context silently fails – the build succeeds, the
cache is not populated, and the next CI run pulls from Cachix instead.
The fix is either to run attic login as root separately, or to push explicitly from
the CI job as the user who has the token. The latter is preferable: it keeps the push
path consistent and does not require managing root credentials.
The Three Gaps
Looking across all the failures, they fall into three categories:
Declared vs operational state. Nix captures what should exist. It cannot capture
what is currently running, what credentials are live, or what the OOM killer removed.
journalctl, systemctl, and dmesg are the tools that close this gap, and they
have to be used reactively.
Network identity vs network address. 127.0.0.1 means different things in different
execution contexts. Configuration written for one context silently fails in another.
The fix is to treat network endpoints as context-dependent and make the execution
environment explicit in CI configuration rather than assuming a shared namespace.
Expected errors vs actual failures. 401 from atticd, a dropped Tailscale connection
during nix copy, and a 429 from an RSS feed all look like failures. Only one of
them is. Distinguishing expected behavior from actual failure modes requires knowing
what the system is supposed to do, which is not always documented in error messages.
The gap between declared and operational state is where all the debugging happened. The gap between network identity and network address is where the most time was lost. The gap between expected errors and failures is where the most false fixes were applied.