Introduction
LinkedIn is noise. Job boards surface irrelevant roles, bury the ones worth reading, and optimise for engagement rather than signal. I wanted something that ranks postings by relevance to my actual work – Rust, NixOS, distributed systems, Kubernetes – and filters them before I read them.
The constraint I kept running into: filtering written as code is filtering frozen at compile time. Every change to what you care about requires a rebuild. What I actually wanted was a query – something that separates the what from the how.
jobpipe started as a crawler and scorer. It became a query engine.
Two Models of Filtering
There are two filtering modes, and the distinction matters.
A keyword query operates on tokenised text. Terms are matched against title and summary.
AND semantics by default; OR with --loose. It is fast, approximate, good for
exploration. The tradeoff: it is recall-oriented. You find things, but not only the
right things.
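As a sketch, the keyword mode reduces to something like this – the function name and signature are mine, not jobpipe's:

```rust
// Keyword matching over title and summary: AND semantics by default,
// OR with --loose. A sketch; the name and signature are illustrative.
fn matches_keywords(title: &str, summary: &str, terms: &[&str], loose: bool) -> bool {
    let haystack = format!("{} {}", title, summary).to_lowercase();
    let hit = |t: &&str| haystack.contains(&t.to_lowercase());
    if loose {
        terms.iter().any(hit) // OR: any term suffices
    } else {
        terms.iter().all(hit) // AND: every term must appear
    }
}
```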
The DSL operates differently. It is a predicate language – logical operators over
structured fields. Text matching via has, metadata conditions like remote,
numeric thresholds on score. The evaluation is deterministic: a job either satisfies
the predicate or it does not.
The conceptual shift: a keyword query is interpreted as a hint; a DSL expression is evaluated as a specification. One asks “what is roughly relevant”; the other asks “what exactly qualifies”.
This mirrors the distinction in information retrieval between recall-oriented and precision-oriented systems.
The DSL exists because filtering written in code is a configuration problem disguised as a programming problem. Moving it into user-defined queries separates the two.
Why Not a Library or API
The natural alternative would be a Rust API: expose a filter(jobs, predicate) function,
let the caller compose predicates programmatically. This is the library model.
The problem with the library model here is that the predicate is not code – it is data. The thing I want to vary is what I am looking for today, which changes without any change to the tool itself. A library API would require a recompile, or at minimum a new binary. Neither is acceptable for something used interactively.
A DSL is the correct abstraction when:
- the variation is in the query, not the execution
- the user and the author are the same person but at different times
- the cost of a rebuild is higher than the cost of a parser
All three conditions hold here.
The DSL is also not a configuration file. Configuration describes static state. A
predicate describes a computation – it has structure, composition, and evaluation
semantics. The distinction matters because the DSL needs to support negation,
conjunction, and numeric comparison. A flat key-value config cannot express
(and (has "rust") (not (has "wordpress")) (score > 50)) without becoming a parser
in disguise.
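To make “predicate as data” concrete, here is one possible shape for that expression as a Rust value. This is a sketch – the variants and field names are assumptions, not jobpipe's actual types:

```rust
// A possible predicate AST. Illustrative; not jobpipe's real representation.
enum Pred {
    Has(String),      // text match against title and summary
    Remote,           // metadata condition
    ScoreGt(i32),     // numeric threshold
    And(Vec<Pred>),
    Not(Box<Pred>),
}

struct Job {
    title: String,
    summary: String,
    remote: bool,
    score: i32,
}

// Deterministic evaluation: a job either satisfies the predicate or it does not.
fn eval(p: &Pred, job: &Job) -> bool {
    match p {
        Pred::Has(term) => {
            let t = term.to_lowercase();
            job.title.to_lowercase().contains(&t)
                || job.summary.to_lowercase().contains(&t)
        }
        Pred::Remote => job.remote,
        Pred::ScoreGt(n) => job.score > *n,
        Pred::And(ps) => ps.iter().all(|p| eval(p, job)),
        Pred::Not(inner) => !eval(inner, job),
    }
}
```

The expression above is then just And(vec![Has("rust".into()), Not(Box::new(Has("wordpress".into()))), ScoreGt(50)]) – data that can be built at runtime from user input, with no rebuild.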
Accepting this means accepting that the tool is a query engine with a crawler attached, not a CLI utility with a filter flag.
The Type System as Constraint Surface
The crawler and scoring pipeline are written in Rust. The friction during development was not in the DSL – it was in the type system, which enforces correctness at boundaries that are easy to overlook.
The clearest example: struct evolution. Adding a field requires total initialisation
at every construction site. The compiler does not allow partial structs. This is
mechanical, but it means every code path that builds a Job must acknowledge the new
field. No silent omissions.
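For illustration – the field names here are invented, not jobpipe's schema:

```rust
// Adding a field to Job turns every construction site into a compile
// error until the new field is initialised. Names are illustrative.
struct Job {
    title: String,
    summary: String,
    remote: bool, // newly added field
}

fn from_feed_entry(title: String, summary: String) -> Job {
    Job {
        title,
        summary,
        // Without the next line: error[E0063]: missing field `remote`
        remote: false,
    }
}
```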
Haskell enforces a similar guarantee, but its failure surface is smaller: construction is typically centralised via record update syntax or smart constructors. Rust exposes more construction sites explicitly, which means more friction and more precision.
Ownership produced the other class of errors. The rule is simple: a value is moved unless borrowed. The consequence is that calling a function by value consumes the value; calling by reference does not. This distinction is encoded in the type and enforced at every call site. You cannot accidentally use a value after passing it to a function that consumed it.
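A minimal demonstration of the rule:

```rust
struct Job { title: String }

fn inspect(job: &Job) { println!("{}", job.title); } // borrows
fn consume(job: Job) { println!("{}", job.title); }  // takes ownership

fn main() {
    let job = Job { title: "Rust engineer".into() };
    inspect(&job); // borrowed: job is still usable afterwards
    consume(job);  // moved: ownership transferred into the function
    // inspect(&job); // error[E0382]: borrow of moved value: `job`
}
```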
Option<T> forced the same pattern at the data level. No implicit null. Absence is
a type, and handling it is not optional. Every field that might be missing must be
matched or defaulted explicitly.
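For example, with a hypothetical optional field:

```rust
struct Job {
    title: String,
    salary: Option<u32>, // hypothetical field: absence is a type, not a null
}

fn describe(job: &Job) -> String {
    // The compiler rejects any match that ignores the None case.
    match job.salary {
        Some(s) => format!("{} ({} EUR)", job.title, s),
        None => format!("{} (salary unlisted)", job.title),
    }
}
```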
Taken together, these constraints are not stylistic. They eliminated a class of bugs that are invisible in languages with nullable types, implicit moves, or partial construction: use-after-move, missing-field initialisation, silent null propagation.
What HTTP Actually Is
The crawler exposes a less tractable constraint: HTTP is adversarial, not neutral.
Many RSS endpoints return 403, 429, or silently serve HTML error pages with a
200 status. A parser receiving non-XML input fails at the root element, not at the
request. This means validation must happen before parsing – check status code, check
content type, check the first character of the body.
Setting a descriptive User-Agent is not optional. Many servers block generic or empty
client identifiers. A named client with a repository URL signals legitimate non-abusive
usage. Without it, a significant fraction of feeds silently fail.
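Put together, the fetch path looks roughly like this. A sketch assuming reqwest for transport; the User-Agent string and error handling are illustrative:

```rust
use reqwest::blocking::Client;
use reqwest::header::CONTENT_TYPE;

fn fetch_feed(url: &str) -> Result<String, String> {
    let client = Client::builder()
        // Named client with a repository URL (placeholder here);
        // generic or empty identifiers get blocked by many servers.
        .user_agent("jobpipe/0.1 (+https://example.com/jobpipe)")
        .build()
        .map_err(|e| e.to_string())?;

    let resp = client.get(url).send().map_err(|e| e.to_string())?;

    // 1. Check status: 403 and 429 are common for feed endpoints.
    if !resp.status().is_success() {
        return Err(format!("HTTP {}", resp.status()));
    }
    // 2. Check content type: some servers send HTML with a 200 status.
    let ct = resp
        .headers()
        .get(CONTENT_TYPE)
        .and_then(|v| v.to_str().ok())
        .unwrap_or("");
    if !ct.contains("xml") {
        return Err(format!("unexpected content type: {}", ct));
    }
    // 3. Check the first character of the body before handing it to
    //    the XML parser.
    let body = resp.text().map_err(|e| e.to_string())?;
    if !body.trim_start().starts_with('<') {
        return Err("body is not XML".into());
    }
    Ok(body)
}
```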
The implication for the DSL: results are only as good as the crawl. A predicate that
filters for remote and score > 50 is meaningless if half the feeds are returning
blocked or malformed responses. Feed quality is part of the query context.
Scoring as a Pure Function
Scoring is a pure function from Job to i32. Keywords add or subtract weight;
title matches outweigh summary matches. The function is in score.rs and has no side
effects.
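A sketch of the shape – the weights and the title multiplier are illustrative, not the values in score.rs:

```rust
// Pure scoring: keyword weights, title matches outweigh summary matches.
// Weights are illustrative; in jobpipe they could come from train.py.
const WEIGHTS: &[(&str, i32)] = &[
    ("rust", 30),
    ("nixos", 20),
    ("kubernetes", 15),
    ("wordpress", -40),
];

fn score(title: &str, summary: &str) -> i32 {
    let title = title.to_lowercase();
    let summary = summary.to_lowercase();
    WEIGHTS
        .iter()
        .map(|(kw, w)| {
            let mut s = 0;
            if title.contains(kw) {
                s += w * 2; // title match counts double
            }
            if summary.contains(kw) {
                s += w;
            }
            s
        })
        .sum()
}
```

No I/O, no state: the same input always produces the same score, which is what makes the function testable in isolation and replaceable by a trained weight table.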
This is deliberate. Scoring must be separable from crawling, filterable by the DSL,
and replaceable by trained weights from train.py without changing the pipeline
structure. A pure function can be tested in isolation, composed with the DSL predicate,
and updated by replacing a weight table.
The SQL alternative – computing score in a query with LIKE weights – works for simple
cases. It breaks when scoring requires co-occurrence, position weighting, or integration
with trained output. A typed function is easier to reason about and easier to replace.
SQL is used for storage and retrieval. Scoring happens in Rust. The DSL filters the result of both.
What Changed
The tool evolved along three axes:
Data correctness was enforced by the type system. Struct evolution, ownership, and
Option produced compiler errors that were mechanical to fix and prevented whole
categories of runtime failure.
Transport robustness came from HTTP reality. Feeds are unreliable. Validation before
parsing, a descriptive User-Agent, and status filtering made the crawler stable
against the actual web rather than an idealised one.
Filtering became structured evaluation. The shift from keyword matching to a predicate DSL is the shift from string interpretation to logical composition. The DSL does not add features to the tool – it moves filtering out of the tool and into the user’s hands.
The natural next step is an AST parser. The current DSL is string-based: expressions are parsed at runtime but without a proper grammar. An AST would allow better error messages, typed field references, and eventually compiled predicates. The semantics are already correct – the representation is not yet.
Conclusion
The original problem was simple: filter job feeds before reading them. The solution that emerged was not a smarter filter – it was a separation of concerns. Crawling, scoring, and filtering are now distinct stages. The DSL is the interface between the user’s intent and the pipeline’s execution.
Rust enforced correctness at the boundaries that matter: memory, construction, and absence. The DSL enforced a cleaner architecture by making filtering declarative. The web enforced realism by being unreliable by default.
The tool is not a CLI utility with flags. It is a query engine. The flags are the shell interface to a predicate evaluator. That distinction changes what it is reasonable to build on top of it.