Methodology — DevCard

The thesis

Candidate-job matching is deterministic. The math is fixed code, not an LLM prompt. Given the same evidence, the same job, and the same scoring code version, you get the same score every time.

LLMs are used only to translate unstructured input — a resume, a project note, a job posting — into structured ontology data. Once the data is structured, the matching pipeline never calls an LLM again. There is no LLM in the matching loop.

This is a deliberate architectural commitment. It means scores are replayable, auditable, and explainable. Every number on a candidate scorecard traces to specific structured evidence assertions you can inspect.

The 4-dimension scorecard

A single blended fit score is not the truth object. The scorecard expresses four distinct dimensions, each with its own confidence signal. The dimensional view stays primary — any simplified summary is derivative.

Direct Readiness (DR)

Can this person do this role with minimal ramp? Specialization family alignment, current environment fluency, current work layer, recent direct evidence, topic proximity, and same-family archetype floor.

Transferable Capability (TC)

How strongly does durable capability transfer when tools do not match? Adjacent specialization families, problem-pattern similarity, ontology transferability, domain adjacency, concept primitive overlap.

Delivery Strength (DS)

How strongly does the evidence suggest this person carries work from ambiguity to safe shipped reality? Shipping reality, failure exposure, review depth, ambiguity handling, production accountability, ownership signals.

Motivation Alignment (MA)

Does this role align with what the candidate wants and can sustain? Work attraction, work aversion, direction of growth, environment sustainability, theme-based semantic matching.

Each dimension carries its own confidence — how much evidence backs the assessment. High capability with low confidence reads as "promising but verify." Recruiters always know where the uncertainty is.
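A minimal sketch of how a four-dimension scorecard with per-dimension confidence could be represented. The class names, the 0.0 to 1.0 confidence scale, and both thresholds are assumptions for illustration, not DevCard's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    score: float       # 0-100 dimension score
    confidence: float  # hypothetical 0.0-1.0 scale: how much evidence backs the score


@dataclass(frozen=True)
class Scorecard:
    direct_readiness: DimensionScore
    transferable_capability: DimensionScore
    delivery_strength: DimensionScore
    motivation_alignment: DimensionScore


def reading(dim: DimensionScore, high_score: float = 70.0,
            low_confidence: float = 0.5) -> str:
    """Label a dimension; both thresholds are illustrative, not DevCard's values."""
    if dim.score >= high_score and dim.confidence < low_confidence:
        return "promising but verify"
    return "as scored"
```

Keeping confidence next to each score, rather than folding it into the score, is what lets a reader see where the uncertainty sits.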

Accuracy against ground truth

We validate the scoring model against a 20×20 ground truth matrix — 20 expert-curated candidate profiles, 20 expert-curated job postings, all 400 candidate-job pairs scored on each of the four dimensions by a human evaluator. That is 1,600 expert scores. We compare DevCard's deterministic output to those scores on three metrics:

MAE (Mean Absolute Error): average absolute difference between DevCard's score and the expert score, on a 0–100 scale. Lower is better.
Spearman: rank correlation between DevCard's ordering and the expert ordering. 1.0 is perfect agreement. Measures whether candidates are ordered correctly across the whole list, even if the absolute scores differ.
NDCG@10: Normalized Discounted Cumulative Gain at top 10. Measures whether the top-of-list ordering is right, weighted toward the highest positions.

Dimension	GT MAE	Spearman	NDCG@10
Direct Readiness	5.45	0.6918	0.8737
Transferable Capability	13.11	0.6463	0.8623
Delivery Strength	8.77	0.6901	0.9553
Motivation Alignment	7.99	0.4682	0.6229
Composite	—	0.7091	0.9275

A DR MAE of 5.45 means DevCard's Direct Readiness score differs from the expert score by about 5.5 points on average, on a 0–100 scale. A Spearman of 0.69 means the order DevCard ranks candidates in agrees strongly, though not perfectly, with the order an expert would have ranked them.
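The three metrics can be sketched in plain Python as follows. This is an illustrative implementation, not DevCard's evaluation code. The NDCG gain function is not stated here, so the sketch assumes linear gain equal to the expert score.

```python
import math


def mae(pred, truth):
    """Mean absolute error between two equal-length score lists."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def _ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(pred, truth):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rp, rt = _ranks(pred), _ranks(truth)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    sp = math.sqrt(sum((a - mp) ** 2 for a in rp))
    st = math.sqrt(sum((b - mt) ** 2 for b in rt))
    if sp == 0 or st == 0:
        return 0.0  # undefined for constant input; 0.0 is a convention chosen here
    return cov / (sp * st)


def ndcg_at_k(pred, truth, k=10):
    """NDCG@k: rank items by predicted score, use the expert score as linear gain."""
    order = sorted(range(len(pred)), key=lambda i: pred[i], reverse=True)[:k]
    dcg = sum(truth[i] / math.log2(pos + 2) for pos, i in enumerate(order))
    ideal = sorted(truth, reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

MAE is sensitive to calibration, while Spearman and NDCG@10 only see ordering, which is why a dimension can have a high MAE and still rank well.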

These numbers are read-only. Every milestone keystone advances a frozen test anchor (currently e21511b8) via mechanical file-diff verification across the scoring service tree. If a refactor would have shifted any of these numbers, the anchor would not advance and the change would not ship.
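One way to detect that a file tree has changed is to hash every file under it in a stable order. The sketch below assumes that approach. How the real anchor is computed is not stated here, and tree_digest and scoring_tree_unchanged are hypothetical names.

```python
import hashlib
from pathlib import Path


def tree_digest(root) -> str:
    """SHA-256 over relative paths and contents of every file under root."""
    root = Path(root)
    h = hashlib.sha256()
    # Sort so the digest does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def scoring_tree_unchanged(root, frozen_digest: str) -> bool:
    """True when the tree still matches the digest recorded at the last anchor."""
    return tree_digest(root) == frozen_digest
```

Including the relative path in the hash means a rename counts as a change even when file contents are identical.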

The ground truth set

The 20×20 set covers two archetypal blocks. The first 10 candidates and 10 jobs are non-FAANG developer archetypes — PHP/Laravel, generalist web, .NET enterprise, WordPress/CMS, embedded/firmware, application security, game dev, developer relations, Shopify/e-commerce, database engineering. The second 10 are FAANG and unicorn tier — backend infra, payments, ML, data, mobile, etc.

The diagonal pairs — same-archetype candidates and jobs — should score high. The cross-block pairs should score low. The interior cross-archetype pairs are where matching has to be honest about both readiness and transfer.

Every diagonal canary, every cross-family cap, and several specific archetype pairs are locked in test code as regression guards. A change that improves a number in one place at the cost of breaking a canary elsewhere fails CI.
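A regression guard of this kind can be sketched as a list of locked pairs checked against a tolerance band. The Canary type, the identifiers, and the numbers in the usage note are hypothetical. The real canaries and their bands are not listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Canary:
    candidate_id: str
    job_id: str
    expected: float   # locked expected score
    tolerance: float  # allowed absolute deviation


def failed_canaries(canaries, score_fn):
    """Return (canary, actual_score) for every pair outside its tolerance band."""
    failures = []
    for canary in canaries:
        actual = score_fn(canary.candidate_id, canary.job_id)
        if abs(actual - canary.expected) > canary.tolerance:
            failures.append((canary, actual))
    return failures
```

For example, a diagonal canary might be written as Canary("c01", "j01", expected=90.0, tolerance=5.0), and a CI job would fail whenever failed_canaries returns a non-empty list.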

The holdout protocol

We maintain a separate holdout set that the scoring code never sees during development. Its only purpose is to detect overfitting against the ground truth set.

We never tune against holdout. The moment we did, it would become a second ground truth set and lose its value as an overfitting detector.

The workflow is strictly one-directional: tune against ground truth, lock the changes, run holdout, report the number. If ground truth MAE improves but holdout MAE regresses, the change is overfitting and gets reverted — even if the GT improvement was meaningful in isolation.
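The revert rule above can be written down as a small decision function. This is a sketch of the rule as stated, with hypothetical names. It only flags the overfitting pattern and does not feed anything back into tuning.

```python
def holdout_verdict(gt_mae_before: float, gt_mae_after: float,
                    holdout_mae_before: float, holdout_mae_after: float) -> str:
    """Flag the overfitting pattern: ground truth improves while holdout regresses."""
    gt_improved = gt_mae_after < gt_mae_before              # lower MAE is better
    holdout_regressed = holdout_mae_after > holdout_mae_before
    if gt_improved and holdout_regressed:
        return "revert"
    return "no overfitting signal"
```

The function takes only the reported MAE values, so it never needs access to the holdout data itself.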

No file in the application code, configuration, or database migration tree may reference holdout data. The only consumer is a single test file that reports MAE without ever influencing the application surface.
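A rule like this can be enforced mechanically by scanning the protected trees for any mention of holdout data. The sketch below assumes a plain case-insensitive substring search. The function name and the default needle are hypothetical.

```python
from pathlib import Path


def holdout_references(roots, needle: str = "holdout"):
    """List files under the given roots whose text mentions the needle."""
    hits = []
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file():
                continue
            # Ignore undecodable bytes so binary files do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
            if needle in text.lower():
                hits.append(path)
    return hits
```

A guard test would pass the application, configuration, and migration directories as roots and assert that the result is empty.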

The cross-family alignment gate

A persistent failure mode in match scoring is "evidence-side leakage" — a candidate accumulates broad, weak overlap with a role outside their actual practice family, and the score inflates. To prevent this, DevCard installs an independent signal at the gate.

Work-family classification is derived from declared role titles, tenure, and recency — not from the assertion derivation pipeline that feeds individual scores. Two structurally separate signals: one classifies the candidate, one scores the evidence. The classifier multiplies the score at the gate.

Same-family pairs pass through with a 1.0 multiplier. Adjacent-family pairs are softened. Distant pairs are heavily damped. Unrelated pairs are floored. The gate is the structural primitive that keeps cross-family false positives from inflating regardless of evidence-side signal.
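The gate can be pictured as a lookup from family relation to multiplier. In the sketch below only the 1.0 same-family multiplier comes from the text. The other three values are placeholders, and modelling "floored" as the smallest multiplier is an assumption.

```python
from enum import Enum


class FamilyRelation(Enum):
    SAME = "same"
    ADJACENT = "adjacent"
    DISTANT = "distant"
    UNRELATED = "unrelated"


# Only SAME = 1.0 is stated in the text; the rest are hypothetical placeholders.
GATE_MULTIPLIER = {
    FamilyRelation.SAME: 1.0,
    FamilyRelation.ADJACENT: 0.8,
    FamilyRelation.DISTANT: 0.4,
    FamilyRelation.UNRELATED: 0.1,
}


def gated_score(evidence_score: float, relation: FamilyRelation) -> float:
    """Apply the family gate to a score derived from evidence."""
    return evidence_score * GATE_MULTIPLIER[relation]
```

Because the relation comes from a separate classifier, a large evidence_score cannot raise the multiplier. That separation is what keeps broad, weak overlap from inflating a cross-family match.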

Ontology and concept primitives

DevCard is ontology-first. About 800 typed ontology items — topics, frameworks, languages, tools, platforms — form a structured graph. Hierarchy, sibling relations, and ecosystem links allow semantic resolution beyond keyword matching: MySQL and PostgreSQL surface as siblings, React and Vue share component patterns, Laravel and Rails share MVC primitives.

Underneath the ontology sits a layer of 315 concept primitives across 12 categories. A primitive is a cognitive pattern — "stateful caching coordination," "schema-driven validation," "incremental migration discipline" — that abstracts above any specific tool. Two technologies that share primitives transfer credibly even when their names do not match.
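As an illustration of how shared primitives could be quantified, the sketch below uses Jaccard overlap between two primitive sets. The choice of Jaccard, the primitive assignments, and the "mvc request routing" label are assumptions. DevCard's actual transfer formula is not given here.

```python
def primitive_overlap(a, b) -> float:
    """Jaccard overlap between two sets of concept primitives (0.0 to 1.0)."""
    a, b = frozenset(a), frozenset(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical assignments, for illustration only.
LARAVEL = {
    "mvc request routing",
    "schema-driven validation",
    "incremental migration discipline",
}
RAILS = {
    "mvc request routing",
    "incremental migration discipline",
    "stateful caching coordination",
}

overlap = primitive_overlap(LARAVEL, RAILS)  # 2 shared of 4 distinct primitives
```

The two framework names never match as strings, yet the overlap is non-zero because the comparison runs on primitives rather than tool names.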

When evidence enters the system, an LLM resolves the technology name into structured ontology coordinates. Once resolved, all downstream comparison is deterministic against the graph and primitive set.

Why we do not use LLMs in the matching loop

Three reasons:

1. Replayability. A score that depends on an LLM call is non-deterministic. Two requests may differ. A score from last week may not reproduce this week. Deterministic math reproduces exactly.
2. Auditability. A deterministic score traces to specific structured inputs. If a recruiter asks "why did this person score this?", there is a real answer. An LLM-driven score collapses that chain.
3. Bias accountability. Bias in deterministic code lives in inspectable formulas, weights, and ontology mappings. We can audit and correct it. Bias in an LLM's implicit ranking lives in opaque weights we did not train and cannot inspect.

LLMs translate. Math ranks. The two layers stay separate by design.

Known limitations

What the numbers above do not say:

Corpus skew. The 20×20 ground truth set leans toward FAANG-tier engineering archetypes in the second block. Coverage of developer relations, game dev, security engineering, and embedded firmware is thinner. Scores in those archetypes are reasonable but carry less validation density.
Diagonal compression. Top archetype matches still under-score versus expert calibration in some cases. The same-family DR floor partially closes this; a residual ceiling remains. We document where this is structural rather than tunable.
Motivation Alignment ranking quality. MA has the weakest rank correlation of the four dimensions (Spearman 0.47). The primary cause is theme-resolution failures on non-canonical phrasings of motivation signals. This is a structural data issue we are tracking, not a formula-tuning issue.
Data quality dominates. A bare resume produces lower confidence than a guided studio profile, which produces lower confidence than a corpus-quality enriched profile. The scoring ceiling is bounded by extraction quality, not formula tuning.
Outside-the-archetype edge cases. Multi-archetype generalists, non-implementer modes (DevRel, OSS-maintainer, staff-mentor), and emerging specializations are explicitly modeled, but coverage is uneven. We do not flatten DevRel work into "weaker implementation."

Bias auditing

Numbers are about job descriptions (JDs), never about people. DevCard scores capability against a role context, never against an inferred personal characteristic.

We review ontology mappings, score distributions, and disparate-impact risk on a structured cadence. The deterministic architecture means a bias finding can be traced to a specific weight, mapping, or formula and corrected at the source.

DevCard does not predict whether someone is replaceable by AI, whether their career is in decline, or any "obsolescence risk" signal. Forward-effect framing only. AI must reduce uncertainty about relevance, not increase fear of obsolescence.

A mismatch with a role is communicated as context mismatch, never as personal deficiency. One role is not destiny.

Validation infrastructure

The numbers above are not periodic snapshots. They are CI gates. Every commit runs:

Canary pairs — 11 specific candidate-job combinations with locked expected scores and tolerance bands.
Rank correlation guard — Spearman against ground truth must not regress.
NDCG@10 guard — top-of-list ordering quality must not regress.
Holdout reporter — informational, never a fail gate, never tuned against.
Scoring isolation guard — file-diff verification across the scoring service tree. Anchor SHA advances only when intended.
Determinism guard — 30 runs of the same input produce bit-identical output.

A change that lifts one number at the cost of another is visible immediately and gated mechanically. The scoring code is treated as read-only across milestones; the discipline is the architecture.
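A determinism guard of the kind listed above can be sketched by serializing each run's output canonically and comparing digests. The function names and the JSON serialization are assumptions. Only the 30-run count comes from the list.

```python
import hashlib
import json


def output_digest(result) -> str:
    """Hash a canonical JSON serialization of a scoring result."""
    payload = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_deterministic(score_fn, candidate, job, runs: int = 30) -> bool:
    """True when every run yields byte-identical serialized output."""
    digests = {output_digest(score_fn(candidate, job)) for _ in range(runs)}
    return len(digests) == 1
```

Sorting keys before hashing keeps dictionary ordering from producing false alarms, while any real difference in a score value changes the digest.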

DevCard Concepts

How DevCard scores work, and how we know.