Tale of a GitLab Job Pipeline

2026-03-25

Architecture

When nixos-rebuild switch activates a new generation on node1 with

export NIX_SSHOPTS="-t"
nixos-rebuild-ng switch \
  --flake .#node1 \
  --target-host xameer@100.105.10.61 \
  --build-host xameer@100.105.10.61 \
  --use-remote-sudo \
  --ask-sudo-password \
  --impure \
  -j 4

systemd restarts services – including tailscaled itself if it or any of its dependencies changed. The moment tailscaled restarts, the SSH connection over Tailscale drops. The nix copy phase was mid-flight when this happened, leaving the store in a partially written state: paths recorded in the SQLite DB but missing from disk, with foreign key constraints preventing cleanup.

With some Python for the required SQL operations on dangling store paths, store-fix.py automates the task: find all DB entries whose paths don't exist on disk, delete their references in both directions, then re-enable constraints. After that, nix-collect-garbage and a fresh rebuild restored the system.
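
A minimal sketch of the same repair expressed as sqlite3 over the stock Nix database schema (ValidPaths and Refs in /nix/var/nix/db/db.sqlite); the actual store-fix.py wraps this logic in Python, and the table names here are the standard upstream schema, not anything specific to this setup.

# Sketch only: prune DB entries whose store paths are gone from disk.
# Stop nix-daemon first so nothing else touches the database.
db=/nix/var/nix/db/db.sqlite
sudo sqlite3 "$db" "SELECT path FROM ValidPaths;" | while read -r p; do
  [ -e "$p" ] && continue                     # still on disk, keep it
  echo "pruning $p"
  sudo sqlite3 "$db" <<SQL
PRAGMA foreign_keys = OFF;
DELETE FROM Refs WHERE referrer  IN (SELECT id FROM ValidPaths WHERE path = '$p');
DELETE FROM Refs WHERE reference IN (SELECT id FROM ValidPaths WHERE path = '$p');
DELETE FROM ValidPaths WHERE path = '$p';
PRAGMA foreign_keys = ON;
SQL
done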

The runner token and attic token failures that followed were secondary - the store corruption had wiped the previous activation, so agenix secrets from the old generation were gone and services re-registered with stale or missing credentials.

The fix order that worked: repair store → rebuild → fix runner token newline → regenerate attic token against the live HS256 secret.

Disk Pressure

The VM runs a non-trivial set of Kubernetes-API-managed services, with secrets encrypted in etcd and deployments applied via kubectl. Some Nix store path closures fail to copy to the remote VM due to OOM errors, and at times the gitlab-runner hosted on this VM (which deploys this blog) fails for the same reason. To observe:

watch -n2 'free -h && echo "---" && ps aux | head -5'
# for suspicious units
top -bn1 | head -20 | grep etcd
# to observe disk usage
df -h /home/xameer/clone/
# direnv and devenv paths from kubectl deployments, which could also be driven from GitLab CI
# with k8s executors, but as long as the runner is hosted here it too will die unless fixed

Once top has shown the zombie systemd units or processes causing OOM, stop them with systemctl, clean the garbage, and rebuild.
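
A rough sequence, assuming etcd is the offending unit (substitute whatever top or the journal actually points at):

# on node1, after identifying the culprit (etcd here is only an example)
sudo systemctl stop etcd.service
sudo nix-collect-garbage -d
# then re-run the nixos-rebuild-ng invocation from the Architecture section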

Binary Caching layer

The Problem – Deployment state

Nix captures the declared state – packages, service configurations, volume mounts, environment variables, and their dependencies – reproducibly and hermetically. What it cannot capture is operational state, which is where Hercules CI and Arion come in.

Traditional CI/CD often results in:

[Declared Nix] -> [Imperative Runner Setup] -> [Imperative Deploy Script] -> [Operational State]

Hercules CI/Arion results in:

[Declared Nix + Hercules Agent Config] -> [Hermetic Build] -> [Declarative Effect/Arion] -> [Operational State]

They make the operational state, such as which runner handles a job or how containers are structured, part of the declared, version-controlled Nix configuration, thus moving it closer to the ideal of perfect reproducibility.

Terraform and Kubernetes do not reduce this gap – they shift it. Terraform manages cloud resource declarations, but still requires manual secret rotation, manual runner registration, and manual responses to OOM events, unless paired with auto-scaling and pod disruption budgets. Kubernetes adds liveness probes, resource limits, and automatic pod restarts, which would have caught the etcd memory balloon automatically via resources.limits.memory and restarted the pod before it consumed the host. But Kubernetes itself runs on the same VM here, making the cluster a tenant of the system it is supposed to manage – which is why the kubelet's own MemoryPressure condition appeared in the logs as a symptom of the problem it was meant to prevent.
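
Whether the kubelet is currently reporting that condition can be checked from the same VM; the node name here is an assumption:

# check whether the kubelet is reporting MemoryPressure (node name assumed to be node1)
kubectl describe node node1 | grep MemoryPressure
kubectl top node node1    # requires metrics-server; shows live memory use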

The gap between declared and operational state is where all the debugging happened. journalctl -u gitlab-runner, systemctl show minio | grep Condition, and dmesg | grep oom were the tools that made operational state visible. None of this information is available to Nix at evaluation time.
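
Collected in one place (the journalctl time window is illustrative):

journalctl -u gitlab-runner --since "1 hour ago" --no-pager | tail -n 50
systemctl show minio | grep Condition
dmesg | grep -i oom    # may need sudo depending on kernel.dmesg_restrict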

The debugging

The build environment, runner configuration, and service state were in constant flux.

The same class of errors – command not found, Nix daemon socket failures, Cachix auth failures, OOM kills, and stuck jobs – recurred across multiple attempts not because the fixes were wrong, but because each fix exposed the next layer of drift between the declared configuration and the running system.

Speed of deployment

Cachix

Managed CDN-backed binary cache. Fast substitution via global edge network. Used as the primary pull cache for both pipelines.

before_script:
  - cachix authtoken "$CACHIX_AUTH_TOKEN"
  - cachix use "$CACHIX_CACHE"

Push after build: host CI uses nix path-info --recursive, blog CI uses nix develop --print-out-paths:

# Host CI (full closure push)
nix path-info --recursive "$spec" \
  | xargs cachix push -j 2 -n 2 $CACHIX_CACHE_NAME

# Blog CI (devShell closure push)
nix develop .#ci --print-out-paths 2>/dev/null \
  | xargs nix-store -qR \
  | cachix push -j 2 -n 2 "$CACHIX_CACHE"

cachix push in the primary build job must not use || true: if it fails, the closure is not in cache and downstream jobs will fail pulling paths. || true is only appropriate for supplementary re-push steps.

Atticd

Self-hosted binary cache server backed by Cloudflare R2 (zero egress). Runs on localhost:8083 on thinkpad for host CI (shell executor), on node1 for blog CI (Docker executor with network_mode = host).

before_script:
  - attic login local http://127.0.0.1:8083 $ATTIC_TOKEN   # host CI
  - attic login local http://localhost:8083 "$ATTIC_TOKEN"  # blog CI
  - attic use local:blog-cache                               # configure as substituter

Push after build:

# Host CI
nix path-info --recursive "$spec" \
  | xargs attic push -j 2 dev-cache || true

# Blog CI
nix develop .#ci --print-out-paths 2>/dev/null \
  | xargs nix-store -qR \
  | xargs attic push -j 2 blog-cache || true

Atticd push always uses || true: it is supplementary to Cachix and never blocks the pipeline. If R2 is unreachable the build still succeeds via Cachix.

On the VM, though, atticd binds to 0.0.0.0:8083 on node1 so the Docker CI container (which gets a different network namespace) can reach it via 172.17.0.1:8083 (the Docker bridge gateway). If it bound to 127.0.0.1 only, the container couldn't reach it. The funnel/tunnel is for the comment-api and is unrelated. This is correct behaviour.

Does regenerating the token destroy the cache? No. The attic token is a JWT signed with the HS256 key. It controls access to the cache, not the cache data itself. The chunks in R2 and the database records are untouched by token rotation. attic login just writes a new JWT to ~/.config/attic/config.toml. The cache data in R2 survives indefinitely unless you explicitly delete it.
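
Rotating the token is therefore just minting a new JWT on the server and logging in with it again. Roughly, with flags as described in the upstream attic documentation and example values for sub/validity:

# on node1: mint a new JWT against the live HS256 secret in the server config
atticadm make-token --sub "ci" --validity "1y" --pull "blog-cache" --push "blog-cache"
# then re-login on the client side; cache contents in R2 are untouched
attic login local http://127.0.0.1:8083 "$NEW_ATTIC_TOKEN"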

R2 Storage

Single bucket attic-cache serves both logical caches. No Cloudflare UI changes are needed when adding new caches; Atticd routes by key prefix internally.

# Create logical caches (one-time setup)
attic cache create blog-cache
attic cache create dev-cache

# Verify
attic cache info blog-cache
attic cache info dev-cache

Global deduplication means a 500MB GHC derivation shared between dev-cache and blog-cache is stored once in R2. Storage cost is proportional to unique content, not cache count.

Auto-Cancel Redundant Pipelines

workflow:
  auto_cancel:
    on_new_commit: interruptible

pages:
  interruptible: true

Cancels the older pipeline immediately when a newer commit is pushed. Cachix push is atomic per path; a cancelled mid-push does not corrupt the cache.

See Architecture

The AWS Credential Chain Problem

Attic uses the official aws-sdk-s3 Rust crate (1.96.0) with aws-config (1.8.1). The storage backend initialises via:

let shared_config = aws_config::load_defaults(BehaviorVersion::v2025_01_17()).await;

This runs the full AWS credential provider chain — including an EC2 IMDS lookup at 169.254.169.254. On a laptop or any non-EC2 host, this times out after 5 seconds before falling back to environment variables.

Without AWS_EC2_METADATA_DISABLED=true in your atticd environment file, every credential refresh attempt stalls for 5 seconds hitting a non-existent EC2 metadata endpoint. This manifests as identity resolver timed out after 5s in atticd logs and cascading 500 errors to the push client.
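
The environment file decrypted by agenix at activation ends up looking roughly like this (variable names are the standard AWS SDK ones; the values are placeholders):

# atticd environment file (sketch; real values come from agenix)
AWS_ACCESS_KEY_ID=<r2-access-key-id>
AWS_SECRET_ACCESS_KEY=<r2-secret-access-key>
AWS_EC2_METADATA_DISABLED=true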

The credential config in s3.rs shows why env vars are the right approach for R2:

if let Some(credentials) = &config.credentials {
    builder = builder.credentials_provider(Credentials::new(
        &credentials.access_key_id,
        &credentials.secret_access_key,
        None,
        None,
        "s3",
    ));
}

If credentials is not set in atticd.toml, the SDK falls through to env vars. Either works — but env vars via agenix keep secrets out of the nix store.

Execution Context and Network Identity

This was the least obvious class of failure, and the one that explained the most.

Atticd binds to 0.0.0.0:8083 on node1. The address 0.0.0.0 is a bind directive, not a routable address. What is actually reachable depends on where the request originates:

127.0.0.1 is not a stable address. It is a namespace-relative promise. In a Docker container, localhost is the container. The host is 172.17.0.1. In the shell executor or on the host, localhost is the machine running atticd.

This distinction is invisible in configuration files and only becomes apparent when the same substituters entry works in one context and silently fails in another.

The consequence was that dev-cache and blog-cache, though served by the same atticd instance, behaved as if they were different services:

Cache        Endpoint used      Reachable from
dev-cache    127.0.0.1:8083     host, shell executor
blog-cache   172.17.0.1:8083    Docker CI containers

The caches are logically identical. The difference is entirely in which endpoint was registered at creation time, which determined where the binary cache endpoint was advertised in nix-cache-info.

If a cache is created with atticd reachable at 127.0.0.1, the binary cache endpoint recorded in atticd’s database reflects that address. Nix substituters will attempt that address regardless of where the build is running. Recreating the cache with the correct endpoint is the only fix – there is no in-place address migration.
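
The recreation itself is a destroy-and-create against the endpoint the consumers will actually use. A sketch, assuming the attic CLI's cache subcommands as used elsewhere in this post and that the logical cache's contents are expendable:

# log in against the address the Docker CI containers will actually resolve
attic login local http://172.17.0.1:8083 "$ATTIC_TOKEN"
attic cache destroy blog-cache    # removes the logical cache
attic cache create blog-cache
attic use local:blog-cache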

401 Is Not a Failure

A 401 Unauthorized from the cache endpoint is correct behavior for a private cache. Nix substitution (downloads) does not require authentication. attic push and API operations do. A 401 from curl confirms the server is reachable and auth is functioning. It is not a signal that the cache is broken.
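
A quick reachability probe from inside the relevant context; the /blog-cache/nix-cache-info path assumes attic's usual per-cache layout:

# from a Docker CI container: a 401 here means "up, private, auth enforced"
curl -i http://172.17.0.1:8083/blog-cache/nix-cache-info
# connection refused or 404 are the responses that actually indicate a problem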

The failure modes that actually indicate a problem are connection refused (wrong address or atticd not running), 404 (cache does not exist), and silent substituter fallback (Nix tried the cache, got no response, fell back to Cachix without logging why).

The substituter fallback is the hardest to diagnose because Nix does not emit a warning when it skips a substituter. --verbose or --debug on nix build will show which substituters were tried and which paths were pulled from where. Without this, a misconfigured atticd address looks identical to a cache miss.
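
Forcing that visibility looks something like this, reusing the blog CI invocation from above; the grep pattern just narrows the debug noise:

# --debug can be added to any nix build/develop invocation
nix develop .#ci --print-out-paths --debug 2>&1 | grep -iE 'substitut|querying|copying path'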

Root vs User Token Split

nixos-rebuild switch runs as root. The attic token lives at ~/.config/attic/config.toml for the invoking user. Root has no token. This means attic push from a nixos-rebuild context silently fails – the build succeeds, the cache is not populated, and the next CI run pulls from Cachix instead.

The fix is either to run attic login as root separately, or to push explicitly from the CI job as the user who has the token. The latter is preferable: it keeps the push path consistent and does not require managing root credentials.
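
Both options in shell form, with the endpoints and cache names already used above:

# option 1: give root its own credentials (writes root's attic config)
sudo attic login local http://127.0.0.1:8083 "$ATTIC_TOKEN"

# option 2 (preferred): keep the push in the CI job, running as the token-holding user
nix path-info --recursive "$spec" | xargs attic push -j 2 dev-cache || true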


The Three Gaps

Looking across all the failures, they fall into three categories:

Declared vs operational state. Nix captures what should exist. It cannot capture what is currently running, what credentials are live, or what the OOM killer removed. journalctl, systemctl, and dmesg are the tools that close this gap, and they have to be used reactively.

Network identity vs network address. 127.0.0.1 means different things in different execution contexts. Configuration written for one context silently fails in another. The fix is to treat network endpoints as context-dependent and make the execution environment explicit in CI configuration rather than assuming a shared namespace.

Expected errors vs actual failures. 401 from atticd, a dropped Tailscale connection during nix copy, and a 429 from an RSS feed all look like failures. Only one of them is. Distinguishing expected behavior from actual failure modes requires knowing what the system is supposed to do, which is not always documented in error messages.

The gap between declared and operational state is where all the debugging happened. The gap between network identity and network address is where the most time was lost. The gap between expected errors and failures is where the most false fixes were applied.

See Architecture
