When two libraries disagree about a token

Here is the observation the whole project rests on. Take one token, one key, one algorithm allowlist. Feed that identical input to two well-known JWT libraries. If one accepts it and the other rejects it, something is wrong, and the wrongness is exploitable. One of them is misimplementing the spec, or they are reading it differently. Either way, an application that passes tokens between services written in different languages can end up with one verifier accepting a token that another rejects. Whenever the accepting verifier guards something the rejecting one was supposed to protect, that split is an authentication bypass.

Real microservice fleets verify the same JWTs in Node and Go and Python. That is not a contrived setup. It is Tuesday. So the question I wanted to answer was concrete: across the libraries a real fleet actually uses, where do they disagree, and which disagreements hand an attacker a forged token that some service in the chain will honor.

Live, not static

Wycheproof ships static test vectors, and they are good. But static vectors test the libraries as they were when the vectors were written. I wanted the libraries running, in matched containers, against a corpus I can keep growing. Each of five JWT libraries runs as a minimal Docker container exposing one endpoint: post a token, get back whether it is valid, plus the error if there was one. An async orchestrator fans every corpus case out to all five in parallel and groups the responses by their valid-or-invalid verdict.
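
The fan-out step can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the names fan_out and group_by_verdict, the reply shape {"valid": ..., "error": ...}, and the stub verifiers are all assumptions, and a real orchestrator would replace the stubs with HTTP calls to each container. Recording a crash or timeout as "no verdict" (None) rather than as a reject is likewise a design choice made for the sketch.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

# Assumed reply shape from one verifier: {"valid": bool | None, "error": str | None}
Reply = Dict[str, object]
Verifier = Callable[[str], Awaitable[Reply]]


async def fan_out(token: str, verifiers: Dict[str, Verifier],
                  timeout: float = 5.0) -> Dict[str, Reply]:
    """Send one token to every verifier concurrently; collect replies by name."""

    async def ask(name: str, verify: Verifier):
        try:
            reply = await asyncio.wait_for(verify(token), timeout)
        except Exception as exc:
            # A crash or timeout is recorded as "no verdict", not as a reject.
            reply = {"valid": None, "error": f"harness: {type(exc).__name__}"}
        return name, reply

    pairs = await asyncio.gather(*(ask(n, v) for n, v in verifiers.items()))
    return dict(pairs)


def group_by_verdict(replies: Dict[str, Reply]) -> Dict[Optional[bool], List[str]]:
    """Group verifier names by their verdict (True, False, or None)."""
    groups: Dict[Optional[bool], List[str]] = {}
    for name, reply in replies.items():
        groups.setdefault(reply["valid"], []).append(name)
    return groups


# Stub verifiers standing in for the containers (hypothetical).
async def stub_accept(token: str) -> Reply:
    return {"valid": True, "error": None}


async def stub_reject(token: str) -> Reply:
    return {"valid": False, "error": "bad signature"}


if __name__ == "__main__":
    replies = asyncio.run(fan_out("a.b.c", {"lib_a": stub_accept, "lib_b": stub_reject}))
    print(group_by_verdict(replies))
```

Injecting the verifiers as plain async callables keeps the fan-out logic testable without any containers running.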

The comparison, and one thing it gets right

The rule is simple: if the accept set and the reject set are both non-empty, the case is a disagreement and therefore a finding. No model judges it. No heuristic scores it. The verdict is the set of boolean answers from the libraries themselves, which means a finding is something you can re-run and check rather than something you have to trust me about.
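
As a minimal sketch, the rule fits in a few lines. The names Comparison and compare are hypothetical, and the input is assumed to be a mapping from library name to its boolean verdict.

```python
from typing import Dict, NamedTuple, Tuple


class Comparison(NamedTuple):
    accepted: Tuple[str, ...]  # libraries that said the token is valid
    rejected: Tuple[str, ...]  # libraries that said it is not

    @property
    def is_finding(self) -> bool:
        # A finding is any case where both sets are non-empty.
        return bool(self.accepted) and bool(self.rejected)


def compare(verdicts: Dict[str, bool]) -> Comparison:
    """Split library names into accept and reject sets, sorted for stable output."""
    accepted = tuple(sorted(name for name, ok in verdicts.items() if ok))
    rejected = tuple(sorted(name for name, ok in verdicts.items() if not ok))
    return Comparison(accepted, rejected)


if __name__ == "__main__":
    result = compare({"lib_a": True, "lib_b": False, "lib_c": False})
    print(result, result.is_finding)
```

Unanimous acceptance and unanimous rejection both come back with is_finding set to False, so the positive controls and the correctly rejected attacks drop out on their own.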

The detail I am quietly proud of is that error strings get bucketed rather than compared as strings. Five libraries phrase the same rejection five different ways. If I compared error text I would drown in false positives where everyone correctly rejected a token but worded it differently. Comparing the boolean verdict, and only bucketing errors for triage context, keeps the signal clean.
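
One way to do the bucketing is a short ordered list of patterns. Everything here is illustrative: the bucket names, the regular expressions, and the sample messages are assumptions, not the strings any particular library emits, and a real table would be built from the errors the containers actually return.

```python
import re
from typing import Optional

# Ordered (bucket, pattern) pairs; the first match wins. Patterns are hypothetical.
BUCKETS = [
    ("algorithm", re.compile(r"\balg(orithm)?\b", re.IGNORECASE)),
    ("expired", re.compile(r"expir", re.IGNORECASE)),
    ("signature", re.compile(r"signature", re.IGNORECASE)),
    ("malformed", re.compile(r"malformed|decode|parse", re.IGNORECASE)),
]


def bucket_error(message: Optional[str]) -> str:
    """Map a free-form error string to a coarse bucket for triage context."""
    if not message:
        return "none"
    for name, pattern in BUCKETS:
        if pattern.search(message):
            return name
    return "other"


if __name__ == "__main__":
    for msg in ["invalid signature", "Signature verification failed",
                "alg not allowed", "token is expired", None]:
        print(repr(msg), "->", bucket_error(msg))
```

The bucket never feeds the accept-or-reject comparison. It only labels a disagreement so that triage can see at a glance why the rejecting libraries said no.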

The corpus is the product

The seed corpus pairs positive controls against a growing set of known JWT bug classes: algorithm confusion, where an HS256 token is signed using the RSA public key as the HMAC secret; kid injection with SQL and path-traversal payloads; jku spoofing that points at an attacker-controlled JWKS; handling of the RFC 7515 crit (critical) header parameter; JWE-into-JWS confusion; ECDSA edge cases; and header JSON quirks like duplicate keys and NUL bytes.
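
To make the first of those classes concrete, here is a rough sketch of how an algorithm-confusion case could be generated with only the standard library. The function name, the claims, and the PEM bytes are placeholders (the PEM is deliberately not a real key). The sketch assumes the vulnerable verifier uses the raw PEM bytes as the HMAC secret, and since the exact bytes differ between libraries, a real corpus would likely need several encodings of the same key.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as JWS compact serialization requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def forge_hs256_with_public_key(public_key_pem: bytes, claims: dict) -> str:
    """Build an HS256 token whose HMAC secret is the RSA public key bytes."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    mac = hmac.new(public_key_pem, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(mac)


# Placeholder only: NOT a real key. Substitute the fleet's actual public key PEM.
PLACEHOLDER_PEM = b"-----BEGIN PUBLIC KEY-----\nPLACEHOLDER\n-----END PUBLIC KEY-----\n"

if __name__ == "__main__":
    print(forge_hs256_with_public_key(PLACEHOLDER_PEM, {"sub": "admin"}))
```

A verifier that is handed the public key and honors the token's own alg header instead of enforcing an RS256-only allowlist will compute the same HMAC and accept the token. A verifier that pins the algorithm will reject it, which is exactly the kind of split the harness looks for.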

Each disagreement that survives triage becomes a written advisory with a reproducing proof of concept. That last step is the discipline I hold every finding to: nothing is a finding until it reproduces. A differential harness that flagged disagreements I could not reproduce would be a worse-than-useless noise generator. The point is to find the one library in a fleet that will accept a token the others reject, before an attacker does, and to be able to prove it.