A Hybrid Ranking System for Job Feeds

2026-05-02

Introduction

LinkedIn’s boolean search on the free tier still drops results silently and personalises ranking. Boolean queries there surface more signal than free text, but recall is not deterministic the way jobpipe plus the Greenhouse API is. Use them for discovery and cold outreach, not as a reliable feed.

Two constraints shaped the design. Filtering written as code is filtering frozen at compile time: every change requires a rebuild. And relevance is not static – a fixed weight table cannot capture how what I want from a role shifts with context. The system needed a query layer I could change without recompiling and a scoring layer that learned from choices I was already making.

jobpipe started as a crawler and scorer. It became a hybrid ranking system: a custom DSL with a parser and AST evaluator, NLP-based feature extraction aligned to a skill ontology, and a calibrated learning loop that tightens on labelled results.

The system has three layers. The design tension was keeping them consistent with each other.

crawl -> parse -> tokenize -> concepts -> weak labels -> filter -> score -> DSL -> output

The key architectural shift: Rust owns the deterministic feature authority layer. Python owns probabilistic ranking and calibration. The boundary is explicit by design.


The DSL: Parser, AST, Evaluator

The DSL is a prefix, fully parenthesised predicate language – structurally similar to Lisp S-expressions but restricted to a fixed operator set:

Expr :=
    (and Expr+)
  | (or Expr+)
  | (not Expr)
  | (contains FIELD STRING)
  | (remote)
  | (score OP NUMBER)

Parsing and evaluation are separate passes. The parser produces an untyped Expr AST. The evaluator interprets that AST against a typed Job:

eval : Expr -> Job -> Bool

Invalid field references are not rejected at parse time – they collapse to false at evaluation. This is deliberate: the DSL is permissive at the query surface; the data model underneath is not.

A DSL is the correct abstraction here because the predicate is data, not code. What I am looking for today changes without any change to the tool itself. A library API requires a recompile. A config file cannot express negation or numeric comparison.

(and
  (contains text "rust")
  (not (contains text "wordpress"))
  (score > 50)
)

Each AST node maps to a function in the evaluator. The structure is explicit, not inferred. The (score > 50) predicate operates on the same value in both rule-based and ML modes – the evaluator does not need to know which regime produced it.
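A minimal sketch of the evaluator half, assuming a hypothetical Job with title, text, remote, and score fields, a comparison set of >, >=, <, <=, =, and case-insensitive substring matching. None of these details are confirmed by the grammar above, and parsing is omitted, so the AST is built by hand.

```rust
/// AST for the predicate language; one variant per grammar production.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Not(Box<Expr>),
    Contains { field: String, needle: String },
    Remote,
    Score { op: String, value: f64 },
}

/// Hypothetical typed job record; the field names are assumptions.
pub struct Job {
    pub title: String,
    pub text: String,
    pub remote: bool,
    pub score: f64,
}

/// Resolve a DSL field name to job data; unknown names yield None.
fn field<'a>(job: &'a Job, name: &str) -> Option<&'a str> {
    match name {
        "title" => Some(job.title.as_str()),
        "text" => Some(job.text.as_str()),
        _ => None,
    }
}

/// eval : Expr -> Job -> Bool
pub fn eval(expr: &Expr, job: &Job) -> bool {
    match expr {
        Expr::And(args) => args.iter().all(|e| eval(e, job)),
        Expr::Or(args) => args.iter().any(|e| eval(e, job)),
        Expr::Not(inner) => !eval(inner, job),
        // An invalid field reference collapses to false instead of erroring.
        Expr::Contains { field: name, needle } => field(job, name)
            .map(|hay| hay.to_lowercase().contains(&needle.to_lowercase()))
            .unwrap_or(false),
        Expr::Remote => job.remote,
        Expr::Score { op, value } => match op.as_str() {
            ">" => job.score > *value,
            ">=" => job.score >= *value,
            "<" => job.score < *value,
            "<=" => job.score <= *value,
            "=" => job.score == *value,
            _ => false, // unknown operator: same permissive rule as fields
        },
    }
}
```

One consequence of the permissive rule is worth knowing: because the collapse to false happens at the leaf, (not (contains nosuchfield "x")) evaluates to true in this sketch.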


Rust Abstractions: From Heuristic to Production-Grade ML Input

The pipeline did not start clean. Early versions had hidden mutation, partial moves, clone-heavy synchronisation, and feature drift between scoring and training. The abstractions that fixed this also made the ML layer work properly.

Ownership-oriented stage separation clarified which phases read, which mutate, and which consume and export.

This removed implicit shared state between the DSL scoring layer and the feature extractor – the most common source of silent failures where the model trains on different features than it scores against.

The result: Stage<T> orchestration with explicit ownership transfer, config-driven scoring, and stable feature extraction boundaries. Train/inference feature consistency – the property that train.py sees exactly the same concept representation that score.rs produces – became a structural guarantee rather than a convention.
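The read / mutate / consume split maps directly onto Rust's three ways of passing a value. The sketch below is not the actual Stage<T> type, whose definition is not shown here; the Job struct and the function names are hypothetical and exist only to show how each phase's access level appears in its signature.

```rust
/// Hypothetical job record used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Job {
    pub title: String,
    pub concepts: Vec<String>,
    pub score: f64,
}

/// Read-only phase: a shared borrow, so the job cannot be changed here.
pub fn extract_concepts(job: &Job, vocabulary: &[&str]) -> Vec<String> {
    let title = job.title.to_lowercase();
    vocabulary
        .iter()
        .filter(|c| title.contains(**c))
        .map(|c| (*c).to_string())
        .collect()
}

/// Mutating phase: the exclusive borrow makes the write visible in the type.
pub fn annotate(job: &mut Job, concepts: Vec<String>) {
    // Toy score: number of matched concepts.
    job.score = concepts.len() as f64;
    job.concepts = concepts;
}

/// Consuming phase: takes ownership, so the job is unusable after export.
pub fn export(job: Job) -> String {
    format!("{}\t{}\t{}", job.title, job.score, job.concepts.join(","))
}
```

Calling export(job) and then touching job again is a compile error, which is the kind of guarantee that replaces convention with structure.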

The core representation layers after abstraction:

Layer        Representation
DSL/rules    symbolic
concepts     ontology
embeddings   vector
ranking      probabilistic

The ontology layer maps symbolic concepts – rust, kubernetes, distributed_systems, sre, platform_engineering – into stable feature IDs consumed by both the DSL evaluator and train.py:

concept      feature_id
rust         381
kubernetes   912

Stable IDs mean the sparse vector produced at crawl time and the sparse vector produced at train time are the same object, not two separately maintained approximations of each other.
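A small sketch of why stable IDs make the two vectors identical, assuming the sparse vector is stored as a sorted list of feature IDs (the storage format is an assumption; only the two IDs come from the table above).

```rust
use std::collections::BTreeMap;

/// Concept name -> stable feature ID. The two entries are the ones listed
/// above; a real table would be loaded from the ontology.
pub fn feature_ids() -> BTreeMap<&'static str, u32> {
    BTreeMap::from([("rust", 381), ("kubernetes", 912)])
}

/// Encode concepts as a sparse binary vector: the sorted, de-duplicated IDs
/// of the concepts that are present. Unknown concepts are dropped.
pub fn sparse_vector(concepts: &[&str], ids: &BTreeMap<&'static str, u32>) -> Vec<u32> {
    let mut out: Vec<u32> = concepts
        .iter()
        .filter_map(|c| ids.get(*c).copied())
        .collect();
    out.sort_unstable();
    out.dedup();
    out
}
```

Because the encoding depends only on the ID table, the crawl-time and train-time callers cannot disagree as long as they share that table.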


Feature Extraction: Text to Concept Vectors

Unstructured job descriptions are not directly usable by a ranking model. The feature layer converts them into structured signals by aligning tokens against profile.json – a hand-maintained ontology of skills, domains, and their synonyms.

The extraction pipeline:

  1. Tokenise the job text (title + summary)
  2. Expand tokens through the synonym table to canonical concept names
  3. Match against the concept vocabulary in profile.json
  4. Emit a concept presence vector

Canonical concept names mean “k8s”, “kubernetes”, and “kube” all map to a single feature. A job “Rust Kubernetes Engineer” becomes:

\[ x = [1, 1, 1, 0, 0, 0, \dots] \]

The sparse representation preserves symbolic structure while remaining ML-ready. Ownership of feature semantics lives in Rust – the DSL and the ML layer share a single ground truth for what a concept is and how it is named.

The naive implementation had two hidden \(O(n^2)\) patterns:

// nested scan: O(tokens x synonyms)
for token in tokens {
    for synonym in synonyms { check_match(token, synonym); }
}

// Vec::contains() is O(n); called once per token it becomes O(n^2)
for t in &tokens {
    if terms.contains(t) { /* handle match */ }
}

A related bug: lowercasing was applied to job.title and job.summary but not to query tokens, so "Rust" != "rust" produced silent false negatives.

Fix: build a HashMap<Token, Concepts> index once and normalise case on both sides, then look each token up in expected \(O(1)\). The extraction pass drops from \(O(n^2)\) to \(O(n)\) in the number of tokens.
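A minimal sketch of the indexed lookup, assuming a flat synonym table of (token, canonical concept) pairs and a simple split on non-alphanumeric characters. Both are simplifications of whatever profile.json actually holds, and the function names are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

/// Build the token -> canonical concept index once, lowercasing every key.
pub fn build_index(synonyms: &[(&str, &str)]) -> HashMap<String, String> {
    synonyms
        .iter()
        .map(|(token, concept)| (token.to_lowercase(), concept.to_string()))
        .collect()
}

/// Map job text to canonical concepts with one hash lookup per token.
pub fn extract(text: &str, index: &HashMap<String, String>) -> BTreeSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        // Lowercase the job side too, so "Rust" and "rust" hit the same key.
        .filter_map(|t| index.get(&t.to_lowercase()).cloned())
        .collect()
}
```

The index is built once per run and shared by reference, so each job costs one pass over its tokens regardless of how large the synonym table grows.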

The profile.json ontology is the shared schema between the DSL and the learning pipeline. If starting over, define this schema first and generate both the DSL evaluator and the feature extractor from it. Everything else is recoverable.


The Learning Loop and Embedding Space

Concept vectors and labelled examples feed a logistic regression model. Labels are stored in SQLite as want = 1 or want = 0 and accumulated across sessions.

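import json
import pickle
import sqlite3

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

conn = sqlite3.connect("jobs.db")  # assumed path to the labels database
df = pd.read_sql("SELECT id, want, concepts FROM jobs WHERE want IS NOT NULL", conn)
df["concepts"] = df["concepts"].apply(lambda x: json.loads(x) if x else [])

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(df["concepts"].tolist())
y = df["want"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("mlb.pkl", "wb") as f:
    pickle.dump(mlb, f)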

MultiLabelBinarizer converts concept lists into binary presence vectors – one column per concept seen in the labelled data, so a concept that never appears in a labelled job gets no column unless the full vocabulary is passed in explicitly. class_weight="balanced" compensates for label imbalance. The serialised weights replace the hand-tuned keyword table in score.rs without changing the architecture.

Sentence-transformer embeddings project jobs into a geometric vector space:

\[ x_i \in \mathbb{R}^{d} \]

where \(d\) is the embedding dimension. Semantically similar jobs become geometrically close. Similarity is measured by cosine similarity:

\[ \cos(\theta) = \frac{x \cdot y}{\|x\| \|y\|} \]
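The formula translates to a few lines of code. This is a plain sketch over f64 slices, not the pipeline's actual implementation; it returns None for mismatched lengths or a zero-norm vector, where the quotient is undefined.

```rust
/// Cosine similarity of two vectors: (x . y) / (|x| |y|).
/// Returns None when the lengths differ or either vector has zero norm.
pub fn cosine_similarity(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() {
        return None;
    }
    let dot: f64 = x.iter().zip(y).map(|(a, b)| a * b).sum();
    let norm_x = x.iter().map(|a| a * a).sum::<f64>().sqrt();
    let norm_y = y.iter().map(|b| b * b).sum::<f64>().sqrt();
    if norm_x == 0.0 || norm_y == 0.0 {
        return None;
    }
    Some(dot / (norm_x * norm_y))
}
```

Parallel vectors score 1, orthogonal vectors score 0, and the result ignores vector length, which is why it suits embeddings whose magnitudes carry no meaning.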

The ranking model learns:

\[ P(y=1 \mid x) = \sigma(w^T x + b), \quad \sigma(z) = \frac{1}{1+e^{-z}} \]

where \(w\) is the learned preference direction and \(b\) is the bias term. The model infers preference structure from feedback history rather than manually declared rules.
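For concreteness, here is the scoring rule as code: a sketch of the forward pass only, with hypothetical function names. It assumes the weights and bias have already been exported from the trained model into plain slices.

```rust
/// Logistic function: sigma(z) = 1 / (1 + e^(-z)).
pub fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

/// P(y = 1 | x) = sigma(w^T x + b) for a single feature vector.
/// Panics if the weight and feature vectors differ in length.
pub fn predict(w: &[f64], b: f64, x: &[f64]) -> f64 {
    assert_eq!(w.len(), x.len(), "weight/feature length mismatch");
    let z = w.iter().zip(x).map(|(wi, xi)| wi * xi).sum::<f64>() + b;
    sigmoid(z)
}
```

A job exactly on the decision boundary (w^T x + b = 0) scores 0.5; moving along w pushes the score towards 1, and moving against it pushes the score towards 0.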

Score semantics:

Range         Meaning
0.90 – 1.00   Highly relevant
0.50 – 0.80   Uncertain, mixed signal
< 0.20        Likely irrelevant

Focus labelling effort on 0.4 < score < 0.6. These are the samples where the model is least certain and where a single label produces the largest update to the decision boundary. Thirty to forty consistent labels produce a usably calibrated model.
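Selecting that band is a filter and a sort. The sketch below assumes scored jobs arrive as (id, probability) pairs, which is a hypothetical shape, and orders the candidates so the ones closest to 0.5 are labelled first.

```rust
/// Return the IDs of jobs whose score lies strictly inside (lo, hi),
/// ordered by distance from 0.5 so the least certain come first.
pub fn uncertain_first(scored: &[(u64, f64)], lo: f64, hi: f64) -> Vec<u64> {
    let mut picked: Vec<(u64, f64)> = scored
        .iter()
        .copied()
        .filter(|&(_, p)| p > lo && p < hi)
        .collect();
    picked.sort_by(|a, b| (a.1 - 0.5).abs().total_cmp(&(b.1 - 0.5).abs()));
    picked.into_iter().map(|(id, _)| id).collect()
}
```

With a small labelling budget per session, taking the first few IDs from this list spends it where the model is least decided.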

The loop is: crawl -> extract -> score -> rank -> label uncertain results -> retrain. Each cycle tightens the model on actual preferences rather than a static approximation.


The Manifold Hypothesis in Practice

The manifold hypothesis states that high-dimensional real-world data tends to lie on lower-dimensional structured surfaces embedded inside larger vector spaces. Formally, although the embeddings live in \(\mathbb{R}^{768}\) (that is, \(d = 768\) here), the semantically meaningful data occupies a much smaller effective region:

\[ \mathcal{M} \subset \mathbb{R}^{768} \]

where \(\mathcal{M}\) is the semantic manifold and nearby points share semantic structure.

This is not just theory for this pipeline – it is what made it work as a daily driver. After the Rust abstraction cleanup stabilised ontology extraction and sparse feature generation, observed prediction statistics converged strongly:

class      mean prediction
want = 0   0.11
want = 1   0.88

The classifier boundary \(w^T x + b = 0\) stabilised geometrically. Predictions near \(1.0\) lie deep inside positive semantic regions; predictions near \(0.0\) lie deep inside negative regions.

The positives formed strong semantic neighbourhoods around a small group of related job titles.

These titles differ significantly in surface form but cluster on a locally coherent semantic surface – the preference manifold the model discovered through labelling history. This explains why semantically related roles become geometrically close even when their titles differ, and why nearest-neighbour retrieval and cosine similarity become effective once the embedding geometry stabilises.

The convergence from noisy heuristic scoring to stable geometric separation was not primarily a model improvement – it followed from fixing the Rust feature extraction layer. Stable features produce stable embeddings. Stable embeddings produce a stable manifold. The ML layer cannot compensate for feature drift; it can only exploit feature stability.


Concurrency and the Rust Memory Model

The crawler runs two levels of concurrency. Feed-level parallelism spawns an async task per source. Request-level throttling uses a semaphore to bound simultaneous HTTP connections:

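use std::sync::Arc;
use tokio::sync::Semaphore;

let sem = Arc::new(Semaphore::new(5)); // at most 5 requests in flight

for feed in feeds {
    let sem = Arc::clone(&sem);
    let client = client.clone(); // reqwest::Client clones share one connection pool
    tokio::spawn(async move {
        // The permit is released when _permit is dropped at the end of the task.
        let _permit = sem.acquire().await.expect("semaphore closed");
        if let Err(err) = client.get(...).send().await {
            eprintln!("fetch failed: {err}");
        }
    });
}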

Arc provides shared ownership across tasks without copying. .clone() on an Arc increments a reference count – it is not a deep copy. move transfers ownership into each task, required because a task may outlive the stack frame that spawned it.

The performance model:

Before: O(total_feeds x network_latency) + burst traffic -> failures
After:  O(ceil(total_feeds / limit) x latency), at most limit requests in flight
        where limit = Semaphore permits (5 above)

Performance here is controlled IO throughput, not raw CPU. The semaphore avoids retry storms, keeps TCP connections warm, and smooths network load against rate-limiting servers.

HTTP is adversarial. Many RSS endpoints return 403, 429, or silently serve HTML with a 200 status. Validate before parsing: check status code, content type, and the first byte of the body. A named User-Agent with a repository URL is necessary – a significant fraction of feeds block generic client identifiers. Encoding failures (invalid UTF-8, emoji in feed titles) are handled with String::from_utf8_lossy. reqwest 0.12 decompresses gzip and brotli transparently when its gzip and brotli Cargo features are enabled.
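The validation step can be written as a pure function over the three things worth checking, which keeps it testable without a network. This is a sketch under assumptions: the FeedCheck enum and check_feed are hypothetical names, the Content-Type header is treated as advisory (only an explicit text/html is rejected), and a UTF-8 byte-order mark or leading whitespace before the first '<' is tolerated.

```rust
#[derive(Debug, PartialEq)]
pub enum FeedCheck {
    Ok,
    BadStatus(u16),
    HtmlContentType,
    NotXml,
}

/// Decide whether a response is worth handing to the feed parser.
pub fn check_feed(status: u16, content_type: Option<&str>, body: &[u8]) -> FeedCheck {
    if !(200..300).contains(&status) {
        return FeedCheck::BadStatus(status);
    }
    // Many feeds send a wrong or missing Content-Type, so reject only HTML.
    if let Some(ct) = content_type {
        if ct.trim().to_ascii_lowercase().starts_with("text/html") {
            return FeedCheck::HtmlContentType;
        }
    }
    // Skip a UTF-8 byte-order mark, then leading whitespace.
    let body = if body.starts_with(&[0xEF, 0xBB, 0xBF]) { &body[3..] } else { body };
    let rest = match body.iter().position(|b| !b.is_ascii_whitespace()) {
        Some(i) => &body[i..],
        None => return FeedCheck::NotXml, // empty or whitespace-only body
    };
    // The first meaningful byte of any XML feed is '<'.
    if rest[0] != b'<' {
        return FeedCheck::NotXml;
    }
    // Catch HTML pages served with a 200 status and a misleading header.
    let head: String = rest.iter().take(64).map(|b| b.to_ascii_lowercase() as char).collect();
    if head.starts_with("<!doctype html") || head.starts_with("<html") {
        return FeedCheck::NotXml;
    }
    FeedCheck::Ok
}
```

Returning a reason rather than a bool makes it possible to count failures per feed and to drop sources that keep serving HTML.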


What This System Actually Is

It is not a job scraper. It is not a static filter. It is a feedback-driven ranking system where the DSL handles the cases where you know exactly what you want, and the learned model handles the cases where you know it when you see it.

The architecture combines symbolic reasoning – DSL predicates, ontology traversal, synonym propagation, deterministic scoring – with vector semantics: embeddings, geometric similarity, probabilistic ranking. This resembles modern hybrid neuro-symbolic retrieval systems, arrived at by iterating on a daily driver rather than by design.

The shift is from string -> features -> classifier to symbolic ontology -> structured feature graph -> vector space projection -> calibrated ranker. Bias is now explicit in the Rust scoring layer. Overfitting risk is reduced by separation of concerns. The model is no longer purely statistical – it is a constrained learning system operating over a deterministic feature authority.

The unresolved tension is schema drift: profile.json, the DSL concept vocabulary, and the feature extractor are maintained separately. The correct fix is a unified schema from which the DSL vocabulary and the feature extractor are both derived, plus a graph store for ontology traversal as the concept graph grows. That is the thing I would do first if starting over.
