Calibration. How rule weights move when outcomes disagree.
Every rule on a scan card has a weight. The data has to justify it. How signals gain weight, lose it, and what RULE_VERSION on metrics tracks.

Every rule on a scan card has a weight in front of it. That weight is not a guess. It's a number the calibration loop has had to justify against real, resolved outcomes, and it changes when the data tells it to. This post is the loop itself: where the raw material comes from, how a rule earns weight (or loses it), and what the RULE_VERSION string on /metrics is actually tracking.
The weight-calibration cron is the ML. The model is a list of rules with numbers in front of them, and the training loop is a daily job that re-fits those numbers against what happened to the tokens we already scanned. No neural net. No black box. A feedback loop with receipts.
The verdict words on each scan card are unpacked in the verdicts piece. The page-level tour of every signal is in how to read a doomer scan. The pipeline that grades each scan five times is in the five checkpoints. This piece is the layer underneath all three: the part that decides whether the rules we used last week deserve to keep their weights this week.
What the loop is fed
Calibration starts with one input: a corpus of resolved outcomes. Every scan we run gets followed for thirty days, with checkpoints at +1h, +6h, +24h, +7d, and +30d. By the time a scan reaches +24h (the canonical accuracy horizon) we have the full row: which rules fired at scan, what the cope score landed at, which verdict we issued, and the four-state label the outcome pipeline stamped at the checkpoint.
That row is the training example. Features are the rules that fired. Label is the outcome. Cohort is every resolved scan in a rolling window. Everything the calibrator does, it does on top of that table.
The window matters. We don't fit weights against all-time data because the on-chain meta shifts faster than calibration does. New launchpad mechanics, new bundler patterns, new tax-flip strategies. A rule that predicted downside in February with 80% reliability might predict it in May at 40%, not because the rule got worse, but because the cohort it was catching went and learned to evade it. A rolling window lets the loop notice that and re-weight before the model gets stale.
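For concreteness, here is a minimal sketch of that table and the rolling-window cohort in TypeScript. The field names are illustrative, not the production schema.

```typescript
// Hypothetical shape of one resolved-outcome row; field names are
// illustrative, not the production schema.
interface ResolvedOutcome {
  scanId: string;
  scannedAt: number;      // unix ms at scan time
  firedRules: string[];   // rule ids that fired at scan
  copeScore: number;      // weighted sum the scan produced
  verdict: string;        // verdict word the score landed in
  outcome: string;        // four-state label stamped by the outcome pipeline
  ruleVersion: string;    // RULE_VERSION the scan was scored under
}

// The calibration cohort: every resolved scan inside a rolling window,
// so the fit tracks the recent meta instead of all-time history.
function buildCohort(
  rows: ResolvedOutcome[],
  windowDays: number,
  now = Date.now()
): ResolvedOutcome[] {
  const cutoff = now - windowDays * 24 * 60 * 60 * 1000;
  return rows.filter((r) => r.scannedAt >= cutoff);
}
```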
The unit of measurement
For every rule, the loop computes two numbers in the window.
Fire rate. What fraction of scans had this rule fire? A rule firing on 60% of scans does different work from one firing on 0.5%. The 60% rule is barely separating the cohort, because the base downside rate sits around 52% already; firing or not firing tells you almost nothing. The 0.5% rule is making a sharp claim about a specific structural pattern, and we judge it on whether the claim holds.
Hit rate. Of the scans where the rule fired, what fraction resolved downside at the horizon we care about? This is the rule's predictive value, measured against the cohort base rate. Above base rate, the rule is pulling weight. At or below base rate, the rule is contributing nothing.
The gap from base rate, scaled by how confident we are in the measurement given the rule's fire count, is what the weight update reads. Rare-but-accurate gains weight. Common-but-barely-better-than-coinflip loses it. Rare-and-inaccurate drifts toward zero and queues for retirement.
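A sketch of both measurements and the weight update they feed, on top of the row shape above. The learning rate, the fire-count scale, and reading "downside" as the positive label are assumptions, not production constants.

```typescript
// Per-rule tallies over the cohort, reusing the ResolvedOutcome shape above.
interface RuleStats {
  ruleId: string;
  fires: number;
  hits: number;      // fires that resolved downside at the horizon
  fireRate: number;  // fires / cohort size
  hitRate: number;   // hits / fires
}

function ruleStats(cohort: ResolvedOutcome[], ruleId: string): RuleStats {
  const fired = cohort.filter((r) => r.firedRules.includes(ruleId));
  const hits = fired.filter((r) => r.outcome === "downside").length;
  return {
    ruleId,
    fires: fired.length,
    hits,
    fireRate: cohort.length ? fired.length / cohort.length : 0,
    hitRate: fired.length ? hits / fired.length : 0,
  };
}

// One way to turn the gap from base rate into a weight update: scale the
// gap by a confidence factor that grows with fire count. The constants are
// placeholders, not the production values.
function updatedWeight(prev: number, stats: RuleStats, baseRate: number): number {
  const confidence = Math.min(1, stats.fires / 50); // full trust needs ~50 fires
  const gap = stats.hitRate - baseRate;             // above base rate pulls weight up
  const learningRate = 0.25;
  return Math.max(0, prev + learningRate * confidence * gap);
}
```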
The loop is willing to retire its own rules. Recent deploy. Modest holder count. Low absolute volume. Even rules that have been in the set for cycles. If a rule stops separating the cohort, its weight shrinks automatically and it eventually falls out of the live scoreset. The list of rules contributing to the cope score this month is not the same as the list that contributed last month, and that's by design.
Shrinkage, briefly
A rule that fired 11 times and hit 9 times looks like an 82% hit rate. It's also one bad week away from being a 50% hit rate, because the sample is tiny. The loop shrinks the apparent hit rate back toward the base rate proportional to how thin the data is. Once the rule accumulates more fires (and they keep landing the same way) the shrinkage relaxes and the full weight kicks in. New rules that look thrilling in early shadow-fire reports often look ordinary by the time the loop has measured them properly. That's the loop doing its job.
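One common way to implement that shrinkage is phantom fires pinned at the base rate; the prior size here is a placeholder.

```typescript
// Shrink a thin hit rate toward the base rate by adding phantom fires that
// hit exactly at the base rate. With few real fires the prior dominates;
// as fires accumulate, the observed rate takes over. `priorFires` is a
// placeholder, not the production value.
function shrunkHitRate(
  hits: number,
  fires: number,
  baseRate: number,
  priorFires = 30
): number {
  return (hits + priorFires * baseRate) / (fires + priorFires);
}

// 9 hits out of 11 fires against a ~52% base rate:
//   raw    = 9 / 11                       ≈ 0.82
//   shrunk = (9 + 30 * 0.52) / (11 + 30)  ≈ 0.60
// Still above base rate, just no longer thrilling until more fires land.
```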
Bin boundaries shift too
The cope score is the weighted sum of every triggered rule. The verdict word is what bin the score lands in. Those bin boundaries are not fixed.
Three things move them:
- Rule weights moved, so the score distribution shifted. If scores landing in RISKY were being inflated by a rule calibration later retired, the post-retirement RISKY scores look different from the pre-retirement ones, and the boundary has to move to keep the bin's labeled behavior consistent.
- The on-chain meta moved. If the share of tokens going downside in the window drifts up, the SAFU bin needs to be a touch more conservative to keep the label honest. In a cleaner market, it can relax slightly.
- The rule set changed. New rules added, old rules retired. The score scale is different on either side of the change, so the boundaries get re-anchored.
Boundary moves are smaller than weight moves and they happen less often. They do happen.
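A sketch of what movable bin boundaries look like in code. The boundary numbers and the third tier name are placeholders; SAFU and RISKY are the only verdict words named above.

```typescript
// Score-to-verdict binning with movable boundaries. Calibration re-anchors
// the real boundaries when weights, the meta, or the rule set move.
interface VerdictBin {
  verdict: string;
  maxScore: number; // inclusive upper edge of the bin
}

const BIN_BOUNDARIES: VerdictBin[] = [
  { verdict: "SAFU", maxScore: 20 },
  { verdict: "RISKY", maxScore: 60 },
  { verdict: "DOOMED", maxScore: Infinity }, // placeholder tier name
];

function verdictFor(copeScore: number, bins: VerdictBin[] = BIN_BOUNDARIES): string {
  return bins.find((b) => copeScore <= b.maxScore)!.verdict;
}
```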
RULE_VERSION as the receipt
Every meaningful change to the live scoring bumps RULE_VERSION. The major number moves when the rule set or scoring math changes structurally, the minor when rules are added or removed, the patch when weights move enough to materially alter scores. The version lives in src/lib/rule-version.ts and gets read into every enriched cache entry, every outcome doc, every metrics row.
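An illustrative shape for that file and the stamping step, not the real contents:

```typescript
// Illustrative shape of src/lib/rule-version.ts, not the real contents.
// major: rule set or scoring math changes structurally
// minor: rules added or removed
// patch: weights moved enough to materially alter scores
export const RULE_VERSION = "8.11.0";

// Everything downstream records the version it was scored under, so a later
// score change can be attributed to the chain moving or to the math moving.
export function stampVersion<T extends object>(doc: T): T & { ruleVersion: string } {
  return { ...doc, ruleVersion: RULE_VERSION };
}
```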
The version is the receipt. When a token's score changes between scans, you can tell whether the underlying signals changed (real on-chain state moved) or the scoring math changed (the rule version bumped). Two very different meanings. A token rescored higher because liquidity dropped is the scanner reacting. A token rescored higher because the rule version bumped is the model getting more pessimistic about its tier.
We never quietly edit weights without bumping the version. If a scoring number moved, the version moved. That's the contract. A test enforces it: any code change touching live scoring math without bumping the version fails rule-version.test.ts and refuses to land.
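One way such a guard can be written, sketched with assumed names (vitest, a LIVE_RULE_WEIGHTS export); the real test may enforce the contract differently:

```typescript
// rule-version.test.ts, sketched. Pin a hash of the live weight table to the
// current version: editing a weight changes the hash, and the test fails
// until RULE_VERSION moves with it.
import { createHash } from "node:crypto";
import { expect, test } from "vitest";
import { RULE_VERSION } from "../src/lib/rule-version";
import { LIVE_RULE_WEIGHTS } from "../src/lib/scoring"; // hypothetical export

test("scoring math changes bump RULE_VERSION", () => {
  const hash = createHash("sha256")
    .update(JSON.stringify(LIVE_RULE_WEIGHTS))
    .digest("hex");
  // Snapshot pins version + hash together; update both or the test fails.
  expect({ version: RULE_VERSION, hash }).toMatchSnapshot();
});
```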
The metrics page lets you filter accuracy by rule version. If 8.10 looked better than 8.11 at the +24h horizon, the version filter exposes it, and the next version has to fix the regression it created.
What the calibrator IS doing, and what it isn't
The calibrator is doing one thing: re-fitting weights on existing rules and (less often) moving bin boundaries.
It is not doing the rest of what people often expect when they hear "ML model." There is no separate price-prediction model. No sequence model on candle data. No reinforcement-learning agent trading the chain. The risk engine is a rule engine. The calibrator gives the rule engine its numbers. Treating it like a trading model misses the point.
This is the right design for the problem. The downside outcomes we're predicting are discrete events with clear evidence trails. Rules with calibrated weights are the cleanest tool for that. A black-box model would be harder to debug, harder to explain to anyone reading the risks list, harder to audit on a wrong call.
The flip side: if a pattern doesn't decompose cleanly into rule shapes, the rule engine can't catch it. Slow bleed is the canonical example. The loop is honest about it because the loop only weights rules. If no rule fires, the loop has nothing to say. Patterns that need richer features need a different layer.
Adding a rule
The pipeline is short and deliberately so.
- A new rule lands in shadow mode. It fires on scans but doesn't contribute to the cope score. The risks list doesn't surface it. The outcome doc records its fire/no-fire state.
- After a window of resolved outcomes, the loop measures the rule's hit rate alongside the existing weighted rules.
- If the hit rate is materially above base rate after shrinkage, the rule graduates. RULE_VERSION bumps. The risks list now surfaces it.
- The loop re-fits the new rule's weight every cycle. No special treatment after graduation.
The longest part of the pipeline is the shadow window. We don't ship rules on a hunch. Almost-good ideas get the same shadow treatment as obviously-good ideas. Both bundlers and snipers went through this loop before their weights settled where they are now.
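Under the earlier sketches' assumptions, the graduation check is a few lines; the shadow flag, fire floor, and margin are placeholders.

```typescript
// Graduation check, on top of the earlier sketches. A shadow rule records
// fires but contributes zero weight until its shrunk hit rate clears the
// base rate by a margin.
interface RuleConfig {
  ruleId: string;
  weight: number;   // 0 while in shadow mode
  shadow: boolean;  // fires recorded, cope score untouched
}

function shouldGraduate(
  stats: RuleStats,   // from the per-rule tally sketch above
  baseRate: number,
  minFires = 50,
  margin = 0.1
): boolean {
  if (stats.fires < minFires) return false;
  return shrunkHitRate(stats.hits, stats.fires, baseRate) >= baseRate + margin;
}
```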
How we don't game ourselves
The biggest risk in a loop like this one is fitting to the cohort instead of to reality. A few defenses:
- Rolling out-of-sample evaluation. Weights are fit on the past; performance is read on the present. If a weight update only helps on the cohort it was fit to, the evaluation catches it.
- Minimum fire counts. Per-rule fire counts have to clear thresholds before updates touch their weights. We'd rather wait another week than re-weight on noisy small samples.
- Versioning. All updates are versioned. Every version's behavior is visible on metrics. If a version performs worse than the one it replaced, the regression is visible and the next version has to fix it.
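A sketch of the first two defenses, with placeholder thresholds:

```typescript
// Weights are fit on the older slice of the window and read on the newest
// slice, and a rule's weight is frozen until its fire count clears a floor.
function splitForEvaluation(cohort: ResolvedOutcome[], holdoutDays = 7, now = Date.now()) {
  const cut = now - holdoutDays * 24 * 60 * 60 * 1000;
  return {
    fit: cohort.filter((r) => r.scannedAt < cut),       // weights are fit here
    holdout: cohort.filter((r) => r.scannedAt >= cut),  // performance is read here
  };
}

function canUpdateWeight(stats: RuleStats, minFires = 50): boolean {
  // Thin samples wait another cycle rather than moving a live weight.
  return stats.fires >= minFires;
}
```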
None of this guarantees we won't blow a call. The loop is a tool for staying calibrated, not for being right. Misses are still real and they still get published. We don't quietly edit history.
What you see on metrics
The metrics page is the consumer view of all of this. Per-tier accuracy at +6h, +24h, +7d, and +30d. Misses included. The numbers come from the same resolved-outcome corpus the calibrator uses for weight updates. The rule version filter shows which iteration produced what you're looking at, so a regression between versions is visible.
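Roughly the aggregation behind that view, again on the sketched row shape; counting "resolved downside" as the hit is an assumption that fits the risk tiers, not the real per-verdict definition.

```typescript
// Per-tier downside rate, optionally filtered by rule version.
function tierDownsideRate(
  rows: ResolvedOutcome[],
  verdict: string,
  ruleVersion?: string
): { n: number; downsideRate: number } {
  const inTier = rows.filter(
    (r) => r.verdict === verdict && (!ruleVersion || r.ruleVersion === ruleVersion)
  );
  const downs = inTier.filter((r) => r.outcome === "downside").length;
  return { n: inTier.length, downsideRate: inTier.length ? downs / inTier.length : 0 };
}
```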
If the public metrics ever disagree with what the loop reports internally, the loop has the bug. The page is the source of truth.
TL;DR
Calibration is the daily job that turns resolved outcomes into weight updates. Fire rate, hit rate, and shrinkage give the new weight. Rules that stop predicting lose their weight. New rules earn theirs only after a shadow window. Bin boundaries move when tier behavior moves. RULE_VERSION is the receipt for every change. Metrics is the public view of the result. There is no black-box model. There is a list of rules with numbers in front of them and a loop that keeps the numbers honest.
If any of that ever stops being true, you'll see it on metrics before anywhere else.
// faq
- What does RULE_VERSION on the metrics page mean?
- It's the semver-shaped receipt for the live scoring math. The version bumps any time rule weights move, rules are added or retired, or bin boundaries shift. Every cached scan, outcome doc, and metrics row records the version it was scored under, so a regression between versions is visible.
- Is the calibration loop a machine learning model?
- The risk engine is a rule engine, and the calibration loop is the part that fits the per-rule weights against real resolved outcomes. There's no separate neural net or trading model. The model is a list of rules with numbers in front of them, and calibration is the feedback loop that keeps the numbers honest.
- How does a new rule get added to the live scoring set?
- New rules land in shadow mode. They fire on scans but don't contribute to the cope score. After a window of resolved outcomes, calibration measures the hit rate. If it's materially above the cohort base rate after shrinkage, the rule graduates, RULE_VERSION bumps, and the risks list starts surfacing it.
// read next

One hop back. When 'this dev is new' really means 'this funder is the same.'
New dev wallets are easy to spawn. The wallet that funded them isn't. How one hop back from deployer to funder catches most of the 93.8%.

The 2.1% we don't catch. Slow bleeds, abandoned pools, and what's on the scoring backlog.
No malicious event. No coordinated dump. The token just stops. The failure mode we catch worst, and what would change to flag it at scan.