From Fragile CI Scripts to Coherent Deployment Infrastructure

2026-03-25

Introduction

GitLab CI/CD can be made deterministic and reproducible by keeping pipeline configuration under version control and running jobs in consistent environments, the same properties Nix provides for builds. In practice this means defining pipelines in a .gitlab-ci.yml file and using versioned Docker images so that every build runs against the same toolchain.

Each component may function correctly in isolation while the deployment system as a whole remains unstable.

This work focused on reducing that incoherence.

The goal was transforming a fragile GitLab deployment pipeline into a reproducible and coordinated deployment environment where:

Docker Image
    ↓
GitLab Runner
    ↓
Nix Build
    ↓
Binary Cache
    ↓
Deployment Runtime

operates as a single coherent execution system instead of loosely connected tooling.

The failures were rarely caused by one broken component.

Most failures emerged from mismatched assumptions between components executing in different contexts.


Architecture

Current deployment stack:

VM
→ Docker Image (`blog-ci`)
→ GitLab Runner
→ Nix Daemon
→ Nix Build
→ Attic Binary Cache
→ Cloudflare R2 Storage
→ Deployment Pipeline

The architectural transition was from:

mutable CI scripts

toward:

coherent infrastructure-defined deployment execution

where the runtime environment itself becomes reproducible.


The Core Problem

The deployment stack originally behaved like independent systems stitched together at runtime.

Each layer carried different assumptions about:

This produced recurring failures that appeared unrelated but shared the same root cause:

lack of execution coherence

The Drift Between Components

Docker Image vs Runtime Reality

The CI container originally depended on bind-mounted tooling from the host system.

This created recurring failures such as:

The container image itself was incomplete and depended on runtime patching.

GitLab Runner vs Nix Daemon

The GitLab runner and Nix daemon operated with incompatible assumptions about:

This produced:

CI Scripts vs Infrastructure State

The CI scripts assumed:

But the infrastructure executing those scripts was ephemeral.

This caused:

Most deployment instability came from assumptions leaking across execution boundaries.


Restoring Execution Coherence

The fix was not debugging individual failures indefinitely.

The fix was making the deployment layers coherent with each other.


Immutable CI Image

The first major shift was replacing runtime patching with a deterministic CI image.

packages.x86_64-linux.ci-image =
  pkgs.dockerTools.buildLayeredImage {
    name = "blog-ci";
    tag = "latest";
  };

Instead of mounting binaries from the host into the container, the image itself declared the required runtime:

paths = with pkgs; [
  nix
  cacert
  cachix
  attic-client
  git
  bash
  coreutils
  gnugrep
  gnumake
  openssh
];

This changed the role of the Docker image from:

empty execution shell

to:

declared deployment runtime

The CI container became self-contained and reproducible.
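For illustration, the two fragments above could be combined roughly as follows. This is a sketch, not the exact flake from this project: wrapping the tool list in `pkgs.buildEnv`, the environment name `blog-ci-env`, and the `config.Env` entries are assumptions added here to make the example complete.

```nix
# flake.nix output fragment (sketch). `pkgs` is assumed to be nixpkgs
# instantiated for x86_64-linux.
packages.x86_64-linux.ci-image =
  pkgs.dockerTools.buildLayeredImage {
    name = "blog-ci";
    tag = "latest";

    # Assumption: the tools are merged into one environment so that their
    # bin/ and etc/ directories are linked into the image root.
    contents = [
      (pkgs.buildEnv {
        name = "blog-ci-env"; # hypothetical name
        paths = with pkgs; [
          nix cacert cachix attic-client git bash
          coreutils gnugrep gnumake openssh
        ];
        pathsToLink = [ "/bin" "/etc" ];
      })
    ];

    config = {
      Env = [
        "PATH=/bin"
        # cacert ships its bundle at etc/ssl/certs/ca-bundle.crt
        "SSL_CERT_FILE=/etc/ssl/certs/ca-bundle.crt"
      ];
    };
  };
```

With a definition like this, everything a job needs is listed in one place, so a missing tool shows up as a change to the image definition rather than as a failure at job time.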

The most important improvement was not adding tools.

It was eliminating runtime uncertainty.


GitLab Runner Alignment

The runner configuration was simplified so the container and daemon shared compatible assumptions.

The volume configuration reduced to:

dockerVolumes = [
  "/nix/store:/nix/store:ro"
  "/nix/var/nix/daemon-socket:/nix/var/nix/daemon-socket"
  "/var/run/docker.sock:/var/run/docker.sock"
];

The direct /nix/var/nix/db mount was removed entirely.

Previously, the CI container accessed the host SQLite store database directly, creating lock contention and partially corrupted store state after interrupted jobs.

With:

NIX_REMOTE=daemon

all store operations became coordinated through the host daemon instead of independent filesystem access.
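On NixOS, a runner definition along these lines would express that setup. Treat it as a sketch under assumptions: the service name `blog`, the image reference, and the token file path are hypothetical, and depending on the NixOS release the token option may be `registrationConfigFile` rather than `authenticationTokenConfigFile`.

```nix
# NixOS module fragment (sketch).
services.gitlab-runner = {
  enable = true;
  services.blog = {                       # hypothetical service name
    dockerImage = "blog-ci:latest";
    # Hypothetical path; keep the token out of the Nix store.
    authenticationTokenConfigFile = "/run/secrets/gitlab-runner-token";
    dockerVolumes = [
      "/nix/store:/nix/store:ro"
      "/nix/var/nix/daemon-socket:/nix/var/nix/daemon-socket"
      "/var/run/docker.sock:/var/run/docker.sock"
    ];
    # Jobs talk to the host daemon instead of opening the store database.
    environmentVariables = {
      NIX_REMOTE = "daemon";
    };
  };
};
```

Because the store is mounted read-only and only the daemon socket is writable, every store mutation from a job has to go through the host daemon.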

This restored coherence between the runner container and the host Nix daemon.


Binary Cache as Infrastructure

Caching stopped being treated as a deployment optimization.

It became part of the execution model itself.

The deployment system stabilized around:

Attic + R2

The important shift was architectural.

Previously:

build → deploy

Now:

build → cache → distribute → deploy

The cache became part of deployment orchestration.

This reduced:

while improving reproducibility across runners.

Distributed binary caching reduced both deployment latency and operational drift.
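One way to make the cache part of the declared infrastructure is to configure it as a substituter on every machine that builds or deploys. The fragment below is a sketch only: the cache URL and the public key are placeholders, not values from this deployment.

```nix
# NixOS module fragment (sketch). NixOS adds cache.nixos.org and its key
# by default, so only the project cache is listed here.
nix.settings = {
  substituters = [
    "https://attic.example.org/blog"          # hypothetical Attic endpoint
  ];
  trusted-public-keys = [
    "blog:REPLACE_WITH_ATTIC_PUBLIC_KEY"      # placeholder, not a real key
  ];
};
```

Declaring the substituter this way means every runner resolves store paths against the same cache, rather than depending on per-user Nix configuration.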


MinIO / R2 Consistency Problems

A critical issue emerged from configuration divergence between users.

The same MinIO client behaved differently depending on which user ran it.

This produced errors like:

Access Denied

and:

The AWS Access Key Id you provided does not exist

The issue was not object storage itself.

The issue was inconsistent credential state across execution environments.

The fix standardized:

across all execution contexts.


Network Namespace Drift

One of the least obvious failures involved network identity.

127.0.0.1 meant different things depending on where the process executed.

The same cache endpoint worked in one environment and silently failed in another.

The real issue was not networking itself.

It was incoherent assumptions about execution locality.

localhost is not a universal address.

It is relative to the current execution namespace.
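A common way to make locality explicit is to give the container a stable name for the host instead of relying on 127.0.0.1. The sketch below assumes a Docker version that supports the special `host-gateway` value; the service name `blog`, the variable name `ATTIC_ENDPOINT`, and port 8080 are hypothetical and only illustrate the idea.

```nix
# NixOS module fragment (sketch).
services.gitlab-runner.services.blog = {    # hypothetical service name
  # Map a hostname inside the job container to the Docker host.
  dockerExtraHosts = [ "host.docker.internal:host-gateway" ];
  environmentVariables = {
    # Hypothetical variable: CI scripts read the cache endpoint from here
    # instead of hard-coding http://127.0.0.1:8080.
    ATTIC_ENDPOINT = "http://host.docker.internal:8080";
  };
};
```

The point of the design is that the address of the cache is declared once per execution context, so a script never has to guess which namespace it is running in.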


Deployment State vs Operational State

Nix captures declared infrastructure reproducibly.

What it cannot capture directly is operational runtime state:

This gap between:

declared state

and:

operational state

was where almost all debugging occurred.

The deployment stack became stable only after the execution environments themselves became coordinated.


The Three Infrastructure Gaps

Looking across all failures, they ultimately fell into three categories.

1. Build Definition vs Runtime Environment

The declared build environment differed from the environment actually executing the build.

2. Execution Context vs Network Identity

Services assumed shared locality across isolated namespaces.

3. Cache State vs Deployment State

Caches, credentials, and substituters behaved inconsistently across runners and users.

The fixes worked once those assumptions became explicit infrastructure definitions instead of runtime accidents.


Architecture Outcome

The final system is still not Kubernetes.

It is not Terraform.

It is not a fully autonomous cloud platform.

But the deployment stack now behaves as a coordinated execution system.

The important shift was not merely fixing CI failures.

The important shift was reducing incoherence between the systems executing the deployment.


Conclusion

The original deployment system relied on mutable runtime behavior:

CI scripts patch missing state at execution time

The stabilized system instead moves toward:

deployment infrastructure defines execution coherently

where:

The result is not simply faster deployments.

It is a deployment system whose components execute with shared assumptions instead of accidental compatibility.
