Mizan · the proof
The proof — a measured backtest
The backtest uses a labelled synthetic population in which each borrower’s true hidden BNPL stack is known. The same reconstruction the product runs is executed blind on realistically noisy transactions, so we can measure exactly what the pipeline recovers and what the forward view adds.
n = 500 synthetic borrowers · seed 42 · generated 2026-06-02 · reproducible
What this proves — and what it doesn’t
What this proves
That the measurement pipeline survives noise: from messy, mis-tagged transactions it recovers most of the true hidden stack (95.3% plan recall) and reaches the same forward verdict it would on the truth in 90.4% of cases. This is a reconstruction and decisioning result.
What it does not prove
That forward affordability predicts real-world defaults better than a point-in-time DBR. The oracle here is Mizan’s own forward model on the true stack, so proving the thesis from it would be circular. The thesis rests instead on the mechanism plus published external evidence (FinRegLab, CFPB, JPMorgan Chase Institute), which the “How it works” page covers.
Does the pipeline recover the true hidden stack?
Plan recall
95.3%
1053/1105 true plans recovered
Plan precision
96.2%
1095 detected
Installment exact
99.7%
of matched plans
High recovery under injected noise — the agent’s tagging plus the deterministic engine reconstruct what a single provider can’t see.
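The recall and precision figures above can be reproduced from the reported counts. The sketch below assumes that the 1,053 recovered plans are also the numerator of precision (the variable names are illustrative); the arithmetic agrees with the published 96.2%, which supports that reading.

```python
# Counts taken from the stat cards above.
true_plans = 1105      # plans in the labelled ground truth
matched_plans = 1053   # true plans the pipeline recovered
detected_plans = 1095  # plans the pipeline reported in total

recall = matched_plans / true_plans          # share of true plans recovered
precision = matched_plans / detected_plans   # share of detections that were real (assumed numerator)
missed = true_plans - matched_plans          # true plans the pipeline did not recover
spurious = detected_plans - matched_plans    # detections with no matching true plan

print(f"recall={recall:.1%} precision={precision:.1%} "
      f"missed={missed} spurious={spurious}")
# recall=95.3% precision=96.2% missed=52 spurious=42
```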
When it flags a forward cliff, is it right — and how often does it catch them?
Measured over the 434 applications that the status-quo rules (DBR ≤ 45% + SAR 10,000 cap) approved, which is the only place a forward layer can change the call.
Confusion matrix over the 434 rule-approved applications: 38 true positives · 11 false negatives · 18 false positives · 367 true negatives.
Recall (catch-rate)
77.6%
38 of 49 forward cliffs caught
Precision
67.9%
38 of 56 flags were real cliffs
Over-caution
4.7%
18 of 385 good customers flagged
Most of its flags are real cliffs (38 of 56), and it catches more than three-quarters of the cliffs that exist. Where it is cautious, the next section shows the trade still pays.
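For readers who want to check the three rates, here is a minimal sketch that rebuilds the confusion matrix from the counts quoted above. The `Confusion` class is a hypothetical helper, not part of the product; "over-caution" is read here as the false-positive rate among good customers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # forward cliffs that were flagged
    fn: int  # forward cliffs that were missed
    fp: int  # good customers that were flagged
    tn: int  # good customers that were cleared

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def over_caution(self) -> float:
        # False-positive rate: flagged good customers / all good customers.
        return self.fp / (self.fp + self.tn)

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn


# 49 cliffs, 56 flags, 385 good customers, 38 flags that were real cliffs.
cm = Confusion(tp=38, fn=49 - 38, fp=56 - 38, tn=385 - (56 - 38))

print(cm.total)                   # 434
print(f"{cm.recall:.1%}")         # 77.6%
print(f"{cm.precision:.1%}")      # 67.9%
print(f"{cm.over_caution:.1%}")   # 4.7%
```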
Why a missed cliff costs far more than a cautious call
The two errors are not symmetric. A missed cliff becomes a default: a near-total loss on unsecured BNPL that’s hard to recover. A cautious counter-offer on a good customer only forgoes the margin on the slice it deferred; the customer still buys. Under the illustrative figures used here (SAR 2,450 lost per default against SAR 100 per cautious call), a false negative costs ~25× a false positive.
Loss avoided
SAR 215,600
88 cliffs caught × SAR 2,450 loss/default
Over-caution cost
− SAR 4,100
41 counter-offers × SAR 100
Net, per 1,000 apps
≈ SAR 212,000
prevented loss, rule-passing apps
Even at 5× the over-caution cost, the layer is still net-positive (≈ SAR 195,000 per 1,000) — the asymmetry alone carries it; the solid precision is upside.
Illustrative figures (not from the backtest): exposure SAR 3,500 · LGD 70% · over-caution SAR 100/case. Stated so they can be challenged; the conclusion holds across a wide range.
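The per-1,000 arithmetic can be laid out as a short calculation so the inputs are easy to challenge. This is a sketch using the illustrative figures above; the function names and the break-even calculation are additions for illustration, not part of the backtest.

```python
def per_thousand(count: int, population: int) -> int:
    """Scale a backtest count to a book of 1,000 rule-passing applications."""
    return round(count * 1000 / population)


def net_benefit(caught: int, cautious: int, exposure: float = 3500.0,
                lgd: float = 0.70, caution_cost: float = 100.0) -> float:
    """Loss avoided on caught cliffs minus the cost of cautious calls (SAR)."""
    loss_avoided = caught * exposure * lgd
    caution_total = cautious * caution_cost
    return loss_avoided - caution_total


caught = per_thousand(38, 434)     # 88 cliffs caught per 1,000
cautious = per_thousand(18, 434)   # 41 good customers flagged per 1,000

base = net_benefit(caught, cautious)                          # about SAR 211,500
stressed = net_benefit(caught, cautious, caution_cost=500.0)  # about SAR 195,100

# Over-caution cost per case at which the layer would only break even.
break_even_cost = caught * 3500.0 * 0.70 / cautious

print(f"{base:,.0f} {stressed:,.0f} {break_even_cost:,.0f}")
```

On these inputs the over-caution cost would have to exceed roughly SAR 5,000 per case before the net turned negative, which is one way to read "holds across a wide range".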
Decision quality & escalation
Agreement vs the true stack
90.4%
verdict on reconstructed vs true stack
Referred to a human
15.6%
model unsure (low confidence / unresolved) — escalated, not bluffed
This population is deliberately stress-weighted (over-sampling stacked, thin-file and variable-income borrowers to test the engine), so its 15.6% escalation runs higher than a mixed production book — it is not the lender-wide “straight-through” rate.
Where this is heading — a calibrated probabilistic engine
The shipped engine answers yes/no (does any month breach capacity?). The next level models monthly income as a distribution and returns a probability — so the confidence number falls out of the math, and calibration becomes measurable on a labelled population.
Calibration plot, engine default (stability prior): points on the dashed line are perfectly calibrated; dot size ∝ number of borrowers.
Brier score
0.081
vs 0.199 no-skill → real skill
Calibration error
0.042
mean |predicted − actual| (ECE)
Base survival
73%
n = 4,000 synthetic
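As a reference for how these two metrics are usually computed, here is a minimal sketch. The equal-width binning in `expected_calibration_error` is an assumption (the page only says "mean |predicted − actual|"), and the no-skill reference assumes a constant forecast at the base rate. With the rounded 73% base rate that reference comes out near 0.197, close to the quoted 0.199; the small gap is presumably rounding of the base rate.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def no_skill_brier(base_rate):
    """Brier score of always predicting the base rate: p * (1 - p)."""
    return base_rate * (1.0 - base_rate)


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Size-weighted mean |mean predicted - observed frequency| over bins.

    Equal-width bins are an assumption; the source does not state its scheme.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_p - mean_y)
    return ece


# Tiny hand-made example: four borrowers predicted at 0.25, one survived.
probs = [0.25, 0.25, 0.25, 0.25]
outcomes = [1, 0, 0, 0]
print(brier_score(probs, outcomes))
print(expected_calibration_error(probs, outcomes))
print(round(no_skill_brier(0.73), 3))
```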
We tested three ways to estimate income variance
| estimator | ECE | Brier |
|---|---|---|
| Stability proxy — engine default | 0.042 | 0.081 |
| Shrinkage | 0.057 | 0.088 |
| Observed CV | 0.079 | 0.106 |
The counter-intuitive result: the raw observed CV is the worst. At 2–3 income months it’s a noisy small-sample estimate; the stability prior is lower-variance and better-calibrated. Observed variance pays off only with more income history — until then, the prior wins.
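The small-sample effect can be illustrated with a short simulation: draw a few months of lognormal income with a known CV and look at how widely the observed CV scatters. This is a sketch under stated assumptions, not the spike's code; the mean income, trial count, and the `shrunk_cv` blend with its `prior_weight` are hypothetical choices for illustration.

```python
import math
import random
import statistics


def lognormal_params(mean, cv):
    """Return (mu, sigma) of a lognormal with the given mean and CV."""
    sigma2 = math.log(1.0 + cv ** 2)      # CV^2 = exp(sigma^2) - 1
    mu = math.log(mean) - sigma2 / 2.0    # mean = exp(mu + sigma^2 / 2)
    return mu, math.sqrt(sigma2)


def observed_cv(sample):
    """Sample standard deviation over sample mean (needs at least 2 points)."""
    return statistics.stdev(sample) / statistics.fmean(sample)


def shrunk_cv(observed, prior, n_months, prior_weight=6.0):
    """Hypothetical shrinkage: weight the observation by n / (n + prior_weight)."""
    w = n_months / (n_months + prior_weight)
    return w * observed + (1.0 - w) * prior


def cv_estimate_spread(true_cv, n_months, trials=2000, seed=42):
    """Mean and standard deviation of the observed CV across simulated borrowers."""
    rng = random.Random(seed)
    mu, sigma = lognormal_params(10_000.0, true_cv)
    estimates = [
        observed_cv([rng.lognormvariate(mu, sigma) for _ in range(n_months)])
        for _ in range(trials)
    ]
    return statistics.fmean(estimates), statistics.stdev(estimates)


mean3, sd3 = cv_estimate_spread(true_cv=0.30, n_months=3)
mean24, sd24 = cv_estimate_spread(true_cv=0.30, n_months=24)
print(f"3 months:  mean={mean3:.3f} sd={sd3:.3f}")
print(f"24 months: mean={mean24:.3f} sd={sd24:.3f}")
```

The spread at 3 months is several times the spread at 24, which is the sense in which a fixed prior can beat a noisy observation until more history accumulates.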
Spike, not shipped: a Monte-Carlo layer over the same pure engine, on a synthetic DGP (income modelled lognormal with an explicit true CV; the few observed months are a small sample of it). It measures whether an estimator yields calibrated probabilities — and which estimator to trust — the honest next claim beyond catch-rate.
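A Monte-Carlo layer of the kind described might look like the sketch below: income is drawn per month from a lognormal with a chosen CV, and the survival probability is the share of simulated paths in which no month breaches capacity. The capacity rule (obligations at most 45% of that month's income) is an assumption borrowed from the DBR gate, and all names here are hypothetical rather than the engine's own.

```python
import math
import random


def survival_probability(mean_income, cv, obligations,
                         capacity_ratio=0.45, trials=5000, seed=42):
    """Share of simulated income paths with no month over capacity.

    obligations: SAR due in each future month (one entry per month).
    capacity_ratio: assumed share of monthly income available for debt service.
    """
    rng = random.Random(seed)             # seeded, so results are repeatable
    sigma2 = math.log(1.0 + cv ** 2)      # lognormal with the requested CV
    mu = math.log(mean_income) - sigma2 / 2.0
    sigma = math.sqrt(sigma2)

    survived = 0
    for _ in range(trials):
        # A path survives if every month's due amount fits within capacity.
        if all(due <= capacity_ratio * rng.lognormvariate(mu, sigma)
               for due in obligations):
            survived += 1
    return survived / trials


light = survival_probability(10_000.0, 0.30, [2_000.0] * 6)
heavy = survival_probability(10_000.0, 0.30, [4_000.0] * 6)
print(f"light stack: {light:.2f}  heavy stack: {heavy:.2f}")
```

The yes/no engine corresponds to asking whether any month breaches at the expected income; the probabilistic version asks how often a breach occurs once income is allowed to vary.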
Method & honest scope
- Each synthetic borrower has a known true stack; it emits noisy tagged transactions (entity-resolution slips, mis-tags, confidence variation) injected independently of the reconstruction.
- The oracle verdict = the forward simulation on the true stack; Mizan’s verdict = the same simulation on the reconstructed stack. The backtest measures recovery + agreement, not real-default prediction.
- The engine is a transparent, coarse method: monthly buckets (real plans are often biweekly), a linear volatility haircut, a heuristic remaining-count. It demonstrates carrying-capacity; it is not a calibrated production scorecard. Read cliffs as “within ~a month or two”, not exact dates.
- Cleared by the rules = passes the point-in-time DBR gate (modelled at SAMA’s 45%-of-income consumer limit) and the SAR 10,000 cap. Seeded PRNG → identical numbers every run.
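To make the contrast between the point-in-time gate and the forward view concrete, here is a minimal sketch of both checks on monthly buckets. It is an illustration under assumptions, not the shipped engine: the `Plan` fields, the reading of the SAR 10,000 cap as a limit on the requested amount, and the fixed monthly capacity are all hypothetical, and the volatility haircut is omitted.

```python
from dataclasses import dataclass

DBR_LIMIT = 0.45       # debt-burden ratio limit quoted in the text
AMOUNT_CAP = 10_000    # SAR cap quoted in the text (interpretation assumed)


@dataclass(frozen=True)
class Plan:
    installment: float  # SAR due per month
    start: int          # months until the first installment (0 = this month)
    remaining: int      # number of monthly installments left


def cleared_by_rules(monthly_income, current_debt_service, requested_amount):
    """Point-in-time gate: today's DBR and the amount cap only."""
    dbr = current_debt_service / monthly_income
    return dbr <= DBR_LIMIT and requested_amount <= AMOUNT_CAP


def monthly_obligations(plans, horizon):
    """Total SAR due in each of the next `horizon` monthly buckets."""
    return [
        sum(p.installment for p in plans
            if p.start <= month < p.start + p.remaining)
        for month in range(horizon)
    ]


def first_cliff(plans, monthly_capacity, horizon=12):
    """Index of the first month whose obligations exceed capacity, else None."""
    for month, due in enumerate(monthly_obligations(plans, horizon)):
        if due > monthly_capacity:
            return month
    return None


stack = [Plan(1500.0, 0, 6), Plan(900.0, 1, 3), Plan(1200.0, 2, 4)]
print(cleared_by_rules(10_000.0, 1500.0, 3_000.0))  # True: DBR is 15% today
print(monthly_obligations(stack, 4))                # [1500.0, 2400.0, 3600.0, 3600.0]
print(first_cliff(stack, monthly_capacity=3000.0))  # 2
```

The borrower passes the gate today, yet the overlapping plans exceed the assumed capacity two months out, which is the situation the forward layer is meant to surface.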
Validates the engine’s robustness and the value of the forward view — not real-world default prediction; production re-calibrates on real outcomes.