Mizan · the proof

The proof — a measured backtest

A labelled synthetic population where the true hidden BNPL stack is known. The same reconstruction the product runs is executed blind on realistically noisy transactions — so we can measure exactly what the pipeline recovers and what the forward view adds.

n = 500 synthetic borrowers · seed 42 · generated 2026-06-02 · reproducible

What this proves — and what it doesn’t

What this proves

That the measurement pipeline survives noise: from messy, mis-tagged transactions it recovers 95.3% of the true hidden plans and reaches the same forward verdict it would on the truth in 90.4% of cases. This is a reconstruction + decisioning result.

What it does not prove

That forward affordability predicts real-world defaults better than a point-in-time DBR. The oracle here is Mizan’s own forward model on the true stack — proving the thesis from it would be circular. The thesis rests on the mechanism + published external evidence (FinRegLab, CFPB, JPMorgan Chase Institute). How it works covers this →

Does the pipeline recover the true hidden stack?

Plan recall

95.3%

1053/1105 true plans recovered

Plan precision

96.2%

1095 detected

Installment exact

99.7%

of matched plans

High recovery under injected noise — the agent’s tagging plus the deterministic engine reconstruct what a single provider can’t see.
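The two headline rates follow directly from the three counts quoted above. A minimal sketch of that arithmetic, assuming the 1053 recovered plans are also the matched plans used for precision (the function and argument names are illustrative, not taken from the product):

```python
def plan_recovery(true_plans: int, detected: int, matched: int) -> dict:
    """Recall and precision of plan reconstruction from raw counts."""
    return {
        "recall": matched / true_plans,    # share of true plans recovered
        "precision": matched / detected,   # share of detected plans that are real
    }

rates = plan_recovery(true_plans=1105, detected=1095, matched=1053)
print(f"recall {rates['recall']:.1%}, precision {rates['precision']:.1%}")
# prints: recall 95.3%, precision 96.2%
```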

When it flags a forward cliff, is it right — and how often does it catch them?

Measured over the 434 applications that the status-quo rules (DBR ≤ 45% + SAR 10,000 cap) approved, which is the only place a forward layer can change the call.

Actual outcome	Mizan flagged	Mizan passed
Truly cliffs (49)	38 caught (true positive)	11 missed → default (false negative)
Truly affordable (385)	18 cautious counter-offer (false positive)	367 correctly passed (true negative)

Recall (catch-rate)

77.6%

38 of 49 forward cliffs caught

Precision

67.9%

38 of 56 flags were real cliffs

Over-caution

4.7%

18 of 385 good customers flagged

Most of its flags are real cliffs (38 of 56), and it catches more than three-quarters of the cliffs that exist. Where it is cautious, the next section shows the trade still pays.
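For readers who want to re-derive the three rates, here is a small sketch that computes them from the four confusion-matrix cells reported above. The class and property names are illustrative assumptions; only the counts come from the backtest.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Confusion:
    tp: int  # cliffs flagged
    fn: int  # cliffs passed (missed)
    fp: int  # affordable applications flagged
    tn: int  # affordable applications passed

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def over_caution(self) -> float:
        # False-positive rate: good customers who were flagged.
        return self.fp / (self.fp + self.tn)

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn

cm = Confusion(tp=38, fn=11, fp=18, tn=367)
print(cm.total)                  # 434
print(f"{cm.recall:.1%}")        # 77.6%
print(f"{cm.precision:.1%}")     # 67.9%
print(f"{cm.over_caution:.1%}")  # 4.7%
```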

Why a missed cliff costs far more than a cautious call

The two errors are not symmetric. A missed cliff becomes a default: a near-total loss on unsecured BNPL that is hard to recover. A cautious counter-offer on a good customer only forgoes the margin on the slice it deferred; the customer still buys. On the illustrative figures below (SAR 2,450 lost per default against SAR 100 per counter-offer), a false negative costs ~25× a false positive.

Loss avoided

SAR 215,600

88 cliffs caught per 1,000 (38 of 434, scaled) × SAR 2,450 loss/default

Over-caution cost

− SAR 4,100

41 counter-offers per 1,000 (18 of 434, scaled) × SAR 100

Net, per 1,000 apps

≈ SAR 212,000

prevented loss, rule-passing apps

Even at 5× the over-caution cost, the layer is still net-positive (≈ SAR 195,000 per 1,000) — the asymmetry alone carries it; the solid precision is upside.

Illustrative figures (not from the backtest): exposure SAR 3,500 · LGD 70% · over-caution SAR 100/case. Stated so they can be challenged; the conclusion holds across a wide range.
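The per-1,000 figures can be reproduced from the backtest counts and the illustrative cost inputs. The sketch below assumes the counts are scaled from the 434 rule-passing applications to 1,000 and rounded to whole cases before costing; the function name and the `cost_multiplier` parameter are hypothetical, added only to show the 5× stress case.

```python
def asymmetry_per_1000(tp: int, fp: int, approved: int,
                       exposure: float = 3500.0, lgd: float = 0.70,
                       overcaution_cost: float = 100.0,
                       cost_multiplier: float = 1.0) -> dict:
    """Scale backtest counts to 1,000 rule-passing apps and net the two costs."""
    caught = round(tp / approved * 1000)      # cliffs caught per 1,000
    countered = round(fp / approved * 1000)   # cautious counter-offers per 1,000
    loss_per_default = exposure * lgd         # illustrative: SAR 3,500 x 70%
    loss_avoided = caught * loss_per_default
    caution_cost = countered * overcaution_cost * cost_multiplier
    return {
        "caught": caught,
        "countered": countered,
        "loss_avoided": loss_avoided,
        "caution_cost": caution_cost,
        "net": loss_avoided - caution_cost,
        "cost_ratio": loss_per_default / overcaution_cost,
    }

base = asymmetry_per_1000(tp=38, fp=18, approved=434)
stressed = asymmetry_per_1000(tp=38, fp=18, approved=434, cost_multiplier=5.0)
print(base["caught"], base["countered"])   # 88 41
print(f"{base['net']:,.0f}")               # 211,500
print(f"{stressed['net']:,.0f}")           # 195,100
print(f"{base['cost_ratio']:.1f}")         # 24.5
```

Changing `exposure`, `lgd` or `overcaution_cost` is the quickest way to challenge the illustrative inputs and see how far the conclusion holds.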

Decision quality & escalation

Agreement vs the true stack

90.4%

verdict on reconstructed vs true stack

Referred to a human

15.6%

model unsure (low confidence / unresolved) — escalated, not bluffed

This population is deliberately stress-weighted (over-sampling stacked, thin-file and variable-income borrowers to test the engine), so its 15.6% escalation runs higher than a mixed production book — it is not the lender-wide “straight-through” rate.

Where this is heading — a calibrated probabilistic engine

The shipped engine answers yes/no (does any month breach capacity?). The next level models monthly income as a distribution and returns a probability — so the confidence number falls out of the math, and calibration becomes measurable on a labelled population.

Calibration plot: engine default (stability prior). Points on the dashed line are perfectly calibrated · dot size ∝ borrowers.

Brier score

0.081

vs 0.199 no-skill → real skill

Calibration error

0.042

mean |predicted − actual| (ECE)

Base survival

73%

n = 4,000 synthetic
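As a reference for how these two scores are usually computed, here is a minimal sketch. The Brier score and the no-skill baseline (always predicting the base rate) are standard definitions; the equal-width ten-bin scheme for ECE is an assumption, since the page does not say how the bins were built. With the rounded 73% base rate the baseline comes to about 0.197; the quoted 0.199 presumably uses the unrounded rate.

```python
def brier(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def no_skill_brier(base_rate: float) -> float:
    """Brier score of a model that always predicts the base rate."""
    return base_rate * (1.0 - base_rate)

def ece(probs: list[float], outcomes: list[int], bins: int = 10) -> float:
    """Expected calibration error with equal-width bins (binning is assumed)."""
    buckets: dict[int, list[int]] = {}
    for i, p in enumerate(probs):
        b = min(int(p * bins), bins - 1)   # p == 1.0 falls in the last bin
        buckets.setdefault(b, []).append(i)
    n = len(probs)
    total = 0.0
    for idx in buckets.values():
        mean_pred = sum(probs[i] for i in idx) / len(idx)
        mean_actual = sum(outcomes[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(mean_pred - mean_actual)
    return total

# Toy data only: five borrowers predicted at 80%, four of whom survive.
toy_probs = [0.8] * 5
toy_outcomes = [1, 1, 1, 1, 0]
print(round(brier(toy_probs, toy_outcomes), 3))   # 0.16
print(round(ece(toy_probs, toy_outcomes), 3))     # 0.0
print(round(no_skill_brier(0.73), 3))             # 0.197
```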

We tested three ways to estimate income variance

estimator	ECE	Brier
Stability proxy — engine default	0.042	0.081
Shrinkage	0.057	0.088
Observed CV	0.079	0.106

The counter-intuitive result: the raw observed CV is the worst. With only 2–3 months of income it is a noisy small-sample estimate, while the stability prior is lower-variance and better calibrated. Observed variance pays off only with more income history; until then, the prior wins.
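To make the comparison concrete, the sketch below shows one common way to build the observed and shrinkage estimators. The page does not give the formulas, so the weighting rule `n / (n + prior_strength)`, the default `prior_strength` and all function names are assumptions for illustration only.

```python
import statistics

def observed_cv(incomes: list[float]) -> float:
    """Sample coefficient of variation; noisy when only 2-3 months exist."""
    if len(incomes) < 2:
        raise ValueError("need at least two income months")
    return statistics.stdev(incomes) / statistics.mean(incomes)

def shrunk_cv(incomes: list[float], prior_cv: float,
              prior_strength: float = 6.0) -> float:
    """Blend observed CV with a prior; more history moves weight to the data.

    prior_strength is a hypothetical tuning knob: the number of months at
    which data and prior get equal weight.
    """
    n = len(incomes)
    weight = n / (n + prior_strength)
    return weight * observed_cv(incomes) + (1.0 - weight) * prior_cv

# Three identical months look perfectly stable to the raw estimator,
# while the shrunk estimate stays close to the prior.
months = [5000.0, 5000.0, 5000.0]
print(observed_cv(months))                        # 0.0
print(round(shrunk_cv(months, prior_cv=0.3), 3))  # 0.2
```

The toy case shows the failure mode described above: a short, lucky run of income months drives the raw CV to zero, which a prior-weighted estimate resists.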

The verdict becomes a probability. At SAR 5,699 Rashed has a 32% chance of staying within capacity over six months; the deterministic “safe” SAR 4,599 corresponds to 62%. The haircut was an implicit confidence bar; a probabilistic engine makes it a dial the lender sets.

Spike, not shipped: a Monte-Carlo layer over the same pure engine, on a synthetic DGP (income modelled lognormal with an explicit true CV; the few observed months are a small sample of it). It measures whether an estimator yields calibrated probabilities — and which estimator to trust — the honest next claim beyond catch-rate.
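A Monte-Carlo layer of this kind can be sketched in a few lines. The lognormal income model with an explicit CV comes from the description above; everything else is an assumption for illustration: the function name, the `capacity_share` parameter (borrowed from the 45% DBR gate), and the rule that a path survives only if every month's dues fit within capacity. It does not reproduce the engine or the figures quoted for Rashed.

```python
import math
import random

def survival_probability(mean_income: float, cv: float,
                         obligations: list[float],
                         capacity_share: float = 0.45,
                         n_sims: int = 10_000, seed: int = 42) -> float:
    """Share of simulated income paths where no month breaches capacity."""
    # Lognormal parameters that reproduce the requested mean and CV.
    sigma2 = math.log(1.0 + cv ** 2)
    sigma = math.sqrt(sigma2)
    mu = math.log(mean_income) - sigma2 / 2.0

    rng = random.Random(seed)   # seeded, so every run gives identical numbers
    survived = 0
    for _ in range(n_sims):
        # Draw the whole path first so the random stream does not depend
        # on the obligations being tested.
        incomes = [rng.lognormvariate(mu, sigma) for _ in obligations]
        if all(due <= capacity_share * inc
               for due, inc in zip(obligations, incomes)):
            survived += 1
    return survived / n_sims

# Hypothetical borrower: SAR 10,000 mean income, CV 0.25, six monthly dues.
p_small = survival_probability(10_000, 0.25, [3000.0] * 6)
p_large = survival_probability(10_000, 0.25, [4000.0] * 6)
print(p_small >= p_large)   # True: larger dues can only lower survival
```

Because each path is drawn before it is checked, two schedules evaluated with the same seed see the same incomes, so the comparison between them is not blurred by sampling noise.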

Method & honest scope

· Each synthetic borrower has a known true stack; it emits noisy tagged transactions (entity-resolution slips, mis-tags, confidence variation) injected independently of the reconstruction.
· The oracle verdict = the forward simulation on the true stack; Mizan’s verdict = the same simulation on the reconstructed stack. The backtest measures recovery + agreement, not real-default prediction.
· The engine is a transparent, coarse method — monthly buckets (real plans are often biweekly), a linear volatility haircut, a heuristic remaining-count. It demonstrates carrying-capacity; it is not a calibrated production scorecard. Read cliffs as “within ~a month or two”, not exact dates.
· Cleared by the rules = passes the point-in-time DBR gate (modelled at SAMA’s 45%-of-income consumer limit) and the SAR 10,000 cap. Seeded PRNG → identical numbers every run.
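The difference between the point-in-time gate and the forward view can be illustrated with a toy version of both checks. This is a sketch under stated assumptions, not the engine: the function names, the way the haircut scales capacity, and the use of 45% of income as forward capacity are all hypothetical, and the borrower figures are made up.

```python
from typing import Optional

def passes_status_quo(monthly_income: float, current_monthly_debt: float,
                      amount: float, dbr_limit: float = 0.45,
                      cap: float = 10_000.0) -> bool:
    """Point-in-time gate: today's debt burden ratio and the amount cap."""
    return current_monthly_debt / monthly_income <= dbr_limit and amount <= cap

def first_cliff_month(monthly_income: float, schedule: list[float],
                      capacity_share: float = 0.45,
                      haircut: float = 0.0) -> Optional[int]:
    """First monthly bucket whose total dues exceed capacity, else None.

    haircut shrinks capacity linearly to allow for income volatility
    (an assumed form of the linear volatility haircut).
    """
    capacity = monthly_income * capacity_share * (1.0 - haircut)
    for month, due in enumerate(schedule, start=1):
        if due > capacity:
            return month
    return None

# Made-up borrower: passes the gate today, but stacked plans peak in month 2.
income = 8000.0
schedule = [2000.0, 3900.0, 3900.0, 2000.0]   # total dues per monthly bucket
print(passes_status_quo(income, current_monthly_debt=2000.0, amount=6000.0))  # True
print(first_cliff_month(income, schedule))                                    # 2
```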

Validates the engine’s robustness and the value of the forward view — not real-world default prediction; production re-calibrates on real outcomes.