Statistical Tests

This page explains the statistical tests used to validate die value randomness quality. These tests run continuously on rolling windows of beacon output.

Note: For tests comparing rng.dev against drand and NIST beacons (hash byte comparison), see Benchmark Tests.


Understanding the Results

Effect Size (Primary Metric)

We use effect size rather than p-values to determine pass/fail status. Effect size measures how large a deviation is, not just whether it's statistically detectable.

Effect Size (Cramér's V) | Interpretation | Status
-------------------------|----------------|--------
< 0.05 | Negligible | ✓ PASS
0.05 - 0.10 | Weak | ✓ PASS
0.10 - 0.20 | Moderate | ⚠ WATCH
> 0.20 | Strong | ✗ INVESTIGATE

Why effect size instead of p-values?

P-values become misleading at large sample sizes. With 100,000 rounds, even a 0.1% deviation produces p < 0.001 — but such tiny deviations have no practical impact on randomness quality.

Effect size (Cramér's V) is sample-size invariant: a V of 0.02 means the same thing whether you have 100 or 100,000 samples.
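That invariance can be checked directly. The sketch below assumes the common goodness-of-fit form V = sqrt(χ² / (n · (k − 1))); the page does not spell out the exact formula the dashboard uses, so treat this as illustrative:

```python
import math

def cramers_v(counts):
    """Cramér's V for a one-way (goodness-of-fit) chi-squared test.

    Assumes V = sqrt(chi2 / (n * (k - 1))) -- an assumption, since the
    dashboard's exact formula is not specified here.
    """
    n, k = sum(counts), len(counts)
    exp = n / k
    chi2 = sum((o - exp) ** 2 / exp for o in counts)
    return math.sqrt(chi2 / (n * (k - 1)))

small = [20, 15, 17, 16, 18, 14]       # 100 rolls
large = [c * 1000 for c in small]      # 100,000 rolls, same proportions
print(round(cramers_v(small), 4))      # 0.0529
print(round(cramers_v(large), 4))      # identical: V is sample-size invariant
```

Scaling the counts by 1000 leaves V unchanged, even though the raw chi-squared statistic (and hence the p-value) grows by the same factor.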

P-Values (For Reference)

Tests also report p-values for those familiar with traditional statistical analysis:

P-Value | Meaning
--------|--------
> 0.05 | No statistically significant deviation
0.01 - 0.05 | Borderline — may be worth monitoring
< 0.01 | Statistically significant deviation detected

Important: At large sample sizes, low p-values are expected even for excellent randomness. Always check effect size for the practical interpretation.

Expected Failures

Random data will sometimes produce borderline results. This is not a bug — it's mathematics.

  • Running multiple tests means occasional anomalies are expected
  • This is why we apply Bonferroni correction (adjust significance for multiple tests)
  • Effect size thresholds are calibrated to avoid false alarms

Occasional WATCH status on the dashboard is expected behavior, not a sign of system problems. Only persistent INVESTIGATE status indicates potential bias.


Test Descriptions

1. Chi-Squared Distribution Test

What it tests: Are all die faces appearing with equal frequency?

How it works:

  1. Count occurrences of each face (1-6)
  2. Compare observed counts to expected counts (n/6 each)
  3. Calculate chi-squared statistic: χ² = Σ (observed - expected)² / expected

What it detects:

  • Biased die (one face appears more often)
  • Manufacturing defects in physical dice
  • Software bugs favoring certain values

Interpretation:

  • High p-value (>0.05): Distribution looks uniform
  • Low p-value (<0.01): Some faces appear too often or too rarely

Example:

1000 rolls, expected 166.7 per face

Face | Observed | Expected | Contribution
-----|----------|----------|-------------
1 | 158 | 166.7 | 0.45
2 | 172 | 166.7 | 0.17
3 | 165 | 166.7 | 0.02
4 | 170 | 166.7 | 0.07
5 | 168 | 166.7 | 0.01
6 | 167 | 166.7 | 0.00
─────────
χ² = 0.72

p-value = 0.98 → PASS (no bias detected)
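The statistic from the worked example can be reproduced in a few lines. This sketch omits the p-value step (e.g. a chi-squared survival function with 5 degrees of freedom) to stay dependency-free:

```python
def chi_squared_uniform(counts):
    """Pearson chi-squared statistic against a uniform expectation."""
    n = sum(counts)
    exp = n / len(counts)
    return sum((o - exp) ** 2 / exp for o in counts)

# face counts from the worked example above (1000 rolls)
chi2 = chi_squared_uniform([158, 172, 165, 170, 168, 167])
print(round(chi2, 2))   # 0.72, matching the table
```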

2. Runs Test (Odd/Even Parity)

What it tests: Do odd and even values alternate appropriately?

How it works:

  1. Convert each die value to parity: odd (1,3,5) → 1, even (2,4,6) → 0
  2. Count "runs" — consecutive sequences of same parity
  3. Compare run count to expected value for random sequence

What it detects:

  • Too much alternation (odd-even-odd-even pattern)
  • Too much clustering (odd-odd-odd-odd pattern)
  • Predictable sequencing

Why odd/even instead of above/below median:

  • Discrete die values split cleanly: exactly 3 odd, 3 even
  • Avoids ambiguity at median (3.5)
  • Better statistical properties for 6-sided dice

Interpretation:

  • High p-value: Runs count is normal for random data
  • Low p-value: Sequence is too clustered or too alternating
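The steps above can be sketched as a standard Wald–Wolfowitz runs test; the beacon's actual implementation may differ in details such as the normal approximation used:

```python
import math

def runs_test_parity(rolls):
    """Wald-Wolfowitz runs test on odd/even parity (a standard sketch;
    the beacon's exact implementation may differ)."""
    bits = [r % 2 for r in rolls]                  # odd -> 1, even -> 0
    n1, n0 = bits.count(1), bits.count(0)
    n = n1 + n0
    # count runs: maximal blocks of identical parity
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    mu = 2 * n1 * n0 / n + 1                       # expected run count
    var = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n ** 2 * (n - 1))
    return runs, (runs - mu) / math.sqrt(var)      # z-score

# perfectly alternating parity: far too many runs, large positive z
runs, z = runs_test_parity([1, 2, 3, 4, 5, 6] * 50)
print(runs, round(z, 1))   # 300 17.2
```

A z-score this far outside ±1.96 is exactly the "too much alternation" failure mode described above.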

3. Streak Distribution

What it tests: Do consecutive same-value sequences follow expected lengths?

How it works:

  1. Find all "streaks" — consecutive identical values
  2. Count streaks of each length (1, 2, 3, 4+)
  3. Compare to theoretical distribution

Expected distribution for fair die:

P(streak length k) = (1/6)^(k-1) × (5/6)

Length | Probability | Per 1000 streaks
-------|-------------|------------------
1 | 83.3% | 833
2 | 13.9% | 139
3 | 2.3% | 23
4 | 0.4% | 4
5+ | 0.1% | 1

What it detects:

  • Sticky behavior (too many long streaks)
  • Anti-sticky behavior (too few repeats)
  • Memory in the random source

Interpretation:

  • High p-value: Streak lengths are normal
  • Low p-value: Streaks are suspiciously long or short
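The mechanics can be sketched as follows, with the theoretical probabilities computed from the formula above (function names are illustrative):

```python
def streak_lengths(rolls):
    """Lengths of maximal runs of identical values."""
    lengths, run = [], 1
    for prev, cur in zip(rolls, rolls[1:]):
        if cur == prev:
            run += 1
        else:
            lengths.append(run)
            run = 1
    lengths.append(run)
    return lengths

# theoretical P(streak length = k) for a fair d6, from the formula above
probs = [(1 / 6) ** (k - 1) * (5 / 6) for k in range(1, 5)]
print([round(p, 3) for p in probs])            # [0.833, 0.139, 0.023, 0.004]
print(streak_lengths([3, 3, 3, 1, 5, 5, 2]))   # [3, 1, 2, 1]
```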

4. Transition Matrix Test

What it tests: Does the next value depend on the current value?

How it works:

  1. Build 6×6 matrix counting transitions (e.g., "1 followed by 4")
  2. For each row, run chi-squared test against uniform distribution
  3. Report 6 separate p-values (one per starting value)

Expected behavior:

P(next = j | current = i) = 1/6 for all i, j

Transition matrix should look roughly like:
→ 1 2 3 4 5 6
1 16.7% 16.7% 16.7% 16.7% 16.7% 16.7%
2 16.7% 16.7% 16.7% 16.7% 16.7% 16.7%
...

What it detects:

  • Markov dependencies ("3 is often followed by 5")
  • Mechanical bias in physical dice
  • Pseudo-random generator weaknesses

Why 6 separate tests instead of one:

  • Raw matrix chi-squared violates independence assumptions
  • Row-by-row testing is statistically valid
  • Pinpoints which transitions are problematic

Interpretation:

  • All rows p > 0.05: No transition bias detected
  • One row p < 0.01: That starting value has biased follow-ups
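A minimal sketch of the row-by-row approach (each row statistic has 5 degrees of freedom; the p-value step is omitted):

```python
def transition_chi2_rows(rolls):
    """Per-row chi-squared statistics of the 6x6 transition-count
    matrix against a uniform 1/6 expectation."""
    counts = [[0] * 6 for _ in range(6)]
    for cur, nxt in zip(rolls, rolls[1:]):
        counts[cur - 1][nxt - 1] += 1
    stats = []
    for row in counts:
        n = sum(row)
        if n == 0:
            stats.append(None)     # starting value never observed
            continue
        exp = n / 6
        stats.append(sum((o - exp) ** 2 / exp for o in row))
    return stats

# contrived sequence in which 1 is always followed by 4
rolls = [1, 4] * 30 + [2, 5, 3, 6] * 15
print(transition_chi2_rows(rolls)[0])   # 150.0: row 1 is wildly non-uniform
```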

5. Serial Pair Test

What it tests: Do all consecutive pairs appear with equal frequency?

How it works:

  1. Extract all consecutive pairs: (1,4), (4,2), (2,6), ...
  2. Count occurrences of each of 36 possible pairs
  3. Chi-squared test against expected frequency (n/36 each)

Expected behavior:

36 unique pairs, each with probability 1/36 ≈ 2.78%

In 10,000 rolls → 9,999 consecutive pairs → ~278 expected per pair

What it detects:

  • Subtle sequential bias invisible to single-value tests
  • "1 is more likely to follow 3" patterns
  • State-dependent behavior

Difference from transition matrix:

  • Transition matrix: conditional probabilities (P(next|current))
  • Serial pair test: joint probabilities (P(current AND next))
  • Both are valuable; they catch different anomalies

Interpretation:

  • High p-value: All pairs appear equally often
  • Low p-value: Some pairs are over/under-represented
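The pair-counting bookkeeping can be sketched as follows; the toy sequence is far too short for the statistic to be statistically meaningful and only demonstrates the mechanics:

```python
from collections import Counter

def serial_pair_chi2(rolls):
    """Chi-squared statistic over the 36 ordered consecutive pairs
    (df = 35; the p-value step is omitted)."""
    pairs = list(zip(rolls, rolls[1:]))
    counts = Counter(pairs)
    exp = len(pairs) / 36
    return sum((counts.get((i, j), 0) - exp) ** 2 / exp
               for i in range(1, 7) for j in range(1, 7))

# toy sequence built from the pairs listed above; real windows are far larger
print(round(serial_pair_chi2([1, 4, 2, 6, 1, 4]), 1))   # 45.4
```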

6. Shannon Entropy

What it tests: How much information content is in the output?

How it works:

  1. Calculate frequency of each face: p_i = count_i / total
  2. Compute entropy: H = -Σ p_i × log₂(p_i)
  3. Compare to theoretical maximum

Theoretical values:

Maximum entropy for 6-sided die:
H_max = log₂(6) ≈ 2.585 bits

This occurs when all faces are equally likely (p = 1/6)

What it detects:

  • Low entropy = predictable output
  • Concentrated distribution = fewer effective outcomes
  • Information loss from bias

Interpretation:

Observed Entropy | Meaning
-----------------|--------
~2.585 bits | Perfect — all outcomes equally likely
2.4 - 2.58 bits | Good — minor variation
< 2.4 bits | Concerning — some outcomes dominate
< 2.0 bits | Serious — significant bias present

Note: Entropy is shown as a descriptive metric rather than a pass/fail test, because it provides an intuitive sense of randomness quality.
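The computation is short enough to sketch directly, checked against the theoretical maximum and a biased distribution:

```python
import math

def shannon_entropy_bits(counts):
    """H = -sum(p_i * log2(p_i)) over the six face frequencies."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(shannon_entropy_bits([100] * 6), 3))                       # 2.585
print(round(shannon_entropy_bits([500, 100, 100, 100, 100, 100]), 3))  # 2.161
```

The uniform case hits the log₂(6) maximum; the biased case (one face at 50%) loses nearly half a bit.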


7. Autocorrelation

What it tests: Is there correlation between values at different time lags?

How it works:

  1. For each lag k from 1 to 20, compute the correlation between the sequence and itself shifted by k
  2. Check whether the correlations fall within expected bounds

Expected behavior:

For truly random data:
- Autocorrelation at all lags ≈ 0
- 95% confidence bounds: ±1.96/√n

For n = 1000:
- Bounds ≈ ±0.062
- Values outside bounds suggest correlation

What it detects:

  • Periodic patterns (every 10th value repeats)
  • Trending behavior
  • Poor PRNG with short cycles

Important caveat: Die values (1-6) are categorical, not continuous. Autocorrelation assumes numeric distance matters, but "1 → 6" isn't meaningfully different from "2 → 3" in randomness terms.

Interpretation:

  • Use for visualization and trend detection
  • Don't rely on it as primary randomness test
  • Transition matrix is more appropriate for sequential independence
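A minimal sketch, which deliberately treats the faces as numeric (the caveat above) to show how a periodic sequence exceeds the confidence bounds:

```python
import math

def autocorrelation(xs, k):
    """Lag-k sample autocorrelation; treats faces as numeric, which
    is exactly the caveat noted above."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + k] - mean) for i in range(n - k))
    return cov / var

xs = [1, 2, 3, 4, 5, 6] * 100              # strongly periodic, period 6
bound = 1.96 / math.sqrt(len(xs))          # 95% confidence bound
print(autocorrelation(xs, 6) > bound)      # True: lag 6 matches the period
```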

Rolling Windows

Tests run on multiple window sizes to catch both short-term and long-term anomalies:

Window | Purpose | Update Frequency
-------|---------|------------------
100 | Short-term fluctuations | Every round
1,000 | Medium-term patterns | Every round
10,000 | Stable statistics | Every 10 rounds
100,000 | Long-term validation | Every 100 rounds

Why multiple windows?

  • Small windows: Responsive to recent changes, but noisy
  • Large windows: Stable statistics, but slow to detect new problems
  • Combined view: Best of both worlds

Bonferroni Correction

When running multiple tests, false positives accumulate.

Problem:

  • 6 tests at α = 0.05
  • Probability of at least one false positive: 1 - (0.95)⁶ ≈ 26%

Solution: Bonferroni correction

  • Adjusted α = 0.05 / 6 ≈ 0.0083
  • Each test must achieve p < 0.0083 to be "significant"
  • Family-wise error rate stays at 5%
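The correction itself is a one-liner; the p-values below are made up for illustration:

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which tests remain significant after Bonferroni correction."""
    adjusted = alpha / len(p_values)
    return adjusted, [p < adjusted for p in p_values]

# hypothetical p-values for six tests: 0.03 would be "significant"
# at alpha = 0.05 but is not after correction, while 0.005 still is
adjusted, flags = bonferroni([0.40, 0.03, 0.20, 0.75, 0.005, 0.60])
print(round(adjusted, 4), flags)
```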

Dashboard shows:

  • Individual test p-values
  • Bonferroni-corrected overall status
  • Whether any test is "significant" after correction

What These Tests Cannot Tell You

Statistical tests have limitations:

  1. Cannot prove randomness — only detect specific types of non-randomness
  2. Cannot detect all manipulation — adversary might pass all tests
  3. Cannot predict future values — that's the point
  4. Will occasionally fail — 5% false positive rate is expected

The dashboard is evidence of quality, not proof of perfection.


Further Reading