Methodology
Evidence over narrative — every score traces to real work. Transparent scoring — no black box, no hidden enrichment. This page documents the math, the validation set, the accuracy numbers, and the limitations.
Last updated: 2026-05-07. Numbers below are post-v32 baselines locked in test code at anchor e21511b8.
Candidate-job matching is deterministic. The math is fixed code, not an LLM prompt. Given the same evidence, the same job, and the same scoring code version, you get the same score every time.
LLMs are used only to translate unstructured input — a resume, a project note, a job posting — into structured ontology data. Once the data is structured, the matching pipeline never calls an LLM again. There is no LLM in the matching loop.
This is a deliberate architectural commitment. It means scores are replayable, auditable, and explainable. Every number on a candidate scorecard traces to specific structured evidence assertions you can inspect.
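One way to make the replay guarantee checkable is to fingerprint the three inputs named above (evidence, job, scoring code version) and store the fingerprint beside each score. The sketch below is illustrative only: `replay_key`, the dictionary shapes, and the choice of SHA-256 over canonical JSON are assumptions, not a description of DevCard's actual audit mechanism.

```python
import hashlib
import json


def replay_key(evidence: dict, job: dict, scoring_version: str) -> str:
    """Fingerprint the inputs that determine a score (hypothetical helper)."""
    # sort_keys yields a canonical serialization, so key order cannot change the hash.
    payload = json.dumps(
        {"evidence": evidence, "job": job, "scoring_version": scoring_version},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative inputs; every field name here is an assumption.
evidence = {"assertions": ["shipped payments API"], "candidate": "c-01"}
job = {"id": "j-07", "family": "backend"}

key_a = replay_key(evidence, job, "v32")
# Same evidence with its keys in reverse order: the fingerprint should not change.
key_b = replay_key(dict(reversed(list(evidence.items()))), job, "v32")
# A different scoring version should produce a different fingerprint.
key_c = replay_key(evidence, job, "v33")
```

Under this scheme, two scores recorded with the same key should be identical. A mismatch would point to nondeterminism somewhere in the matching path.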
A single blended fit score is not the truth object. The scorecard expresses four distinct dimensions, each with its own confidence signal. The dimensional view stays primary — any simplified summary is derivative.
Direct Readiness: Can this person do this role with minimal ramp? Specialization family alignment, current environment fluency, current work layer, recent direct evidence, topic proximity, and same-family archetype floor.
Transferable Capability: How strongly does durable capability transfer when tools do not match? Adjacent specialization families, problem-pattern similarity, ontology transferability, domain adjacency, concept primitive overlap.
Delivery Strength: How strongly does the evidence suggest this person carries work from ambiguity to safe shipped reality? Shipping reality, failure exposure, review depth, ambiguity handling, production accountability, ownership signals.
Motivation Alignment: Does this role align with what the candidate wants and can sustain? Work attraction, work aversion, direction of growth, environment sustainability, theme-based semantic matching.
Each dimension carries its own confidence — how much evidence backs the assessment. High capability with low confidence reads as "promising but verify." Recruiters always know where the uncertainty is.
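A minimal sketch of the dimensional view as a data structure may help here. The four dimension names come from this page; the class names, the 0 to 1 confidence scale, and the two thresholds are assumptions chosen for illustration, not DevCard's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    score: float       # 0-100, the scale used in the accuracy table
    confidence: float  # 0-1; this scale is an assumption


@dataclass(frozen=True)
class Scorecard:
    direct_readiness: DimensionScore
    transferable_capability: DimensionScore
    delivery_strength: DimensionScore
    motivation_alignment: DimensionScore


def reading(dim: DimensionScore, high_score: float = 70.0,
            low_confidence: float = 0.4) -> str:
    """Label one dimension; both thresholds are hypothetical."""
    if dim.score >= high_score and dim.confidence < low_confidence:
        return "promising but verify"
    return "as scored"


# Hypothetical scorecard: strong delivery evidence, thinly backed readiness.
card = Scorecard(
    direct_readiness=DimensionScore(score=85.0, confidence=0.2),
    transferable_capability=DimensionScore(score=60.0, confidence=0.7),
    delivery_strength=DimensionScore(score=78.0, confidence=0.9),
    motivation_alignment=DimensionScore(score=55.0, confidence=0.5),
)
```

Keeping confidence beside each score, rather than folding it into the score, is what lets a reader see where the uncertainty sits.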
We validate the scoring model against a 20×20 ground truth matrix — 20 expert-curated candidate profiles, 20 expert-curated job postings, all 400 candidate-job pairs scored on each of the four dimensions by a human evaluator. That is 1,600 expert scores. We compare DevCard's deterministic output to those scores on three metrics:
| Dimension | MAE vs. ground truth (points, 0–100) | Spearman | NDCG@10 |
|---|---|---|---|
| Direct Readiness | 5.45 | 0.6918 | 0.8737 |
| Transferable Capability | 13.11 | 0.6463 | 0.8623 |
| Delivery Strength | 8.77 | 0.6901 | 0.9553 |
| Motivation Alignment | 7.99 | 0.4682 | 0.6229 |
| Composite | — | 0.7091 | 0.9275 |
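A Direct Readiness MAE (mean absolute error) of 5.45 means that DevCard's Direct Readiness score differs from the expert score by 5.45 points on average, on a 0–100 scale. A Spearman correlation of 0.69 means the order in which DevCard ranks candidates agrees strongly with the order an expert would have ranked them. NDCG@10 measures how closely the top ten positions of DevCard's ranking match the expert ordering, with 1.0 as a perfect match.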
These numbers are read-only. Every milestone keystone advances a frozen test anchor (currently e21511b8) via mechanical file-diff verification across the scoring service tree. If a refactor would have shifted any of these numbers, the anchor would not advance and the change would not ship.
The 20×20 set covers two archetypal blocks. The first 10 candidates and 10 jobs are non-FAANG developer archetypes — PHP/Laravel, generalist web, .NET enterprise, WordPress/CMS, embedded/firmware, application security, game dev, developer relations, Shopify/e-commerce, database engineering. The second 10 are FAANG and unicorn tier — backend infra, payments, ML, data, mobile, etc.
The diagonal pairs — same-archetype candidates and jobs — should score high. The cross-block pairs should score low. The interior cross-archetype pairs are where matching has to be honest about both readiness and transfer.
Every diagonal canary, every cross-family cap, and several specific archetype pairs are locked in test code as regression guards. A change that improves a number in one place at the cost of breaking a canary elsewhere fails CI.
We maintain a separate holdout set that the scoring code never sees during development. Its only purpose is to detect overfitting against the ground truth set.
We never tune against holdout. The moment we did, it would become a second ground truth set and lose its value as an overfitting detector.
The workflow is strictly one-directional: tune against ground truth, lock the changes, run holdout, report the number. If ground truth MAE improves but holdout MAE regresses, the change is overfitting and gets reverted — even if the GT improvement was meaningful in isolation.
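The revert rule can be written down as a single predicate. This is a sketch of the rule as stated, with a hypothetical function name and made-up MAE values. It says nothing about how the real check is wired into the test suite.

```python
def is_overfitting(gt_mae_before: float, gt_mae_after: float,
                   holdout_mae_before: float, holdout_mae_after: float) -> bool:
    """True when ground-truth error improves while holdout error gets worse.

    Lower MAE is better, so "improves" means the number goes down.
    """
    gt_improved = gt_mae_after < gt_mae_before
    holdout_regressed = holdout_mae_after > holdout_mae_before
    return gt_improved and holdout_regressed


# Made-up numbers: ground truth gets better, holdout gets worse.
should_revert = is_overfitting(
    gt_mae_before=6.0, gt_mae_after=5.5,
    holdout_mae_before=9.0, holdout_mae_after=9.6,
)
```

The holdout numbers are only an input to this check. Nothing in the predicate feeds them back into tuning, which matches the one-directional workflow.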
No file in the application code, configuration, or database migration tree may reference holdout data. The only consumer is a single test file that reports MAE without ever influencing the application surface.
A persistent failure mode in match scoring is "evidence-side leakage" — a candidate accumulates broad, weak overlap with a role outside their actual practice family, and the score inflates. To prevent this, DevCard installs an independent signal at the gate.
Work-family classification is derived from declared role titles, tenure, and recency — not from the assertion derivation pipeline that feeds individual scores. Two structurally separate signals: one classifies the candidate, one scores the evidence. The classifier multiplies the score at the gate.
Same-family pairs pass through with a 1.0 multiplier. Adjacent-family pairs are softened. Distant pairs are heavily damped. Unrelated pairs are floored. The gate is the structural primitive that keeps cross-family false positives from inflating regardless of evidence-side signal.
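The gate can be pictured as a lookup from family relation to multiplier, applied to the evidence-side score. In the sketch below only the 1.0 same-family multiplier comes from this page. The other three values, the reading of "floored" as a minimum multiplier, and all names are placeholders chosen to show the ordering, not DevCard's actual weights.

```python
from enum import Enum


class FamilyRelation(Enum):
    SAME = "same"
    ADJACENT = "adjacent"
    DISTANT = "distant"
    UNRELATED = "unrelated"


# Only SAME = 1.0 is stated in the text; the rest are hypothetical placeholders.
GATE_MULTIPLIER = {
    FamilyRelation.SAME: 1.0,
    FamilyRelation.ADJACENT: 0.8,   # "softened"
    FamilyRelation.DISTANT: 0.4,    # "heavily damped"
    FamilyRelation.UNRELATED: 0.1,  # "floored": the lowest multiplier in this sketch
}


def apply_gate(evidence_score: float, relation: FamilyRelation) -> float:
    """Scale an evidence-side score by the work-family gate."""
    return evidence_score * GATE_MULTIPLIER[relation]


# The same evidence score under each relation, for comparison.
gated = {rel: apply_gate(80.0, rel) for rel in FamilyRelation}
```

The relation is decided by the separate work-family classifier. However much weak overlap accumulates on the evidence side, it cannot raise the multiplier.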
DevCard is ontology-first. About 800 typed ontology items — topics, frameworks, languages, tools, platforms — form a structured graph. Hierarchy, sibling relations, and ecosystem links allow semantic resolution beyond keyword matching: MySQL and PostgreSQL surface as siblings, React and Vue share component patterns, Laravel and Rails share MVC primitives.
Underneath the ontology sits a layer of 315 concept primitives across 12 categories. A primitive is a cognitive pattern — "stateful caching coordination," "schema-driven validation," "incremental migration discipline" — that abstracts above any specific tool. Two technologies that share primitives transfer credibly even when their names do not match.
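To illustrate how shared primitives could be turned into a number, the sketch below uses Jaccard similarity over primitive sets. Jaccard is a stand-in chosen for simplicity. This page does not say which overlap measure DevCard uses, and the tool-to-primitive assignments (including the fourth primitive name) are invented for the example.

```python
def primitive_overlap(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: shared primitives divided by all primitives seen."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical assignments; "event-driven fan-out" is an invented primitive name.
tool_a = frozenset({
    "stateful caching coordination",
    "schema-driven validation",
    "incremental migration discipline",
})
tool_b = frozenset({
    "schema-driven validation",
    "incremental migration discipline",
    "event-driven fan-out",
})

overlap = primitive_overlap(tool_a, tool_b)  # 2 shared out of 4 distinct
```

Because the comparison runs on sets of primitives, two tools with no lexical similarity in their names can still score a high overlap.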
When evidence enters the system, an LLM resolves the technology name into structured ontology coordinates. Once resolved, all downstream comparison is deterministic against the graph and primitive set.
Three reasons:
LLMs translate. Math ranks. The two layers stay separate by design.
What the numbers above do not say:
Numbers are about job descriptions (JDs), never about people. DevCard scores capability against a role context, never against an inferred personal characteristic.
We review ontology mappings, score distributions, and disparate-impact risk on a structured cadence. The deterministic architecture means a bias finding can be traced to a specific weight, mapping, or formula and corrected at the source.
DevCard does not predict whether someone is replaceable by AI, whether their career is in decline, or any "obsolescence risk" signal. Forward-effect framing only. AI must reduce uncertainty about relevance, not increase fear of obsolescence.
A mismatch with a role is communicated as context mismatch, never as personal deficiency. One role is not destiny.
The numbers above are not periodic snapshots. They are CI gates. Every commit runs:
A change that lifts one number at the cost of another is visible immediately and gated mechanically. The scoring code is treated as read-only across milestones; the discipline is the architecture.