Statistical Tests
This page explains the statistical tests used to validate the randomness quality of the beacon's die values. These tests run continuously on rolling windows of beacon output.
Note: For tests comparing rng.dev against drand and NIST beacons (hash byte comparison), see Benchmark Tests.
Understanding the Results
Effect Size (Primary Metric)
We use effect size rather than p-values to determine pass/fail status. Effect size measures how large a deviation is, not just whether it's statistically detectable.
| Effect Size (Cramér's V) | Interpretation | Status |
|---|---|---|
| < 0.05 | Negligible | ✓ PASS |
| 0.05 - 0.10 | Weak | ✓ PASS |
| 0.10 - 0.20 | Moderate | ⚠ WATCH |
| > 0.20 | Strong | ✗ INVESTIGATE |
Why effect size instead of p-values?
P-values become misleading at large sample sizes. With hundreds of thousands of rounds, even a deviation of a few tenths of a percentage point produces p < 0.001, yet such tiny deviations have no practical impact on randomness quality.
Effect size (Cramér's V) is sample-size invariant: a V of 0.02 means the same thing whether you have 100 or 100,000 samples.
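For illustration, here is a minimal sketch of the effect-size computation in Python with NumPy and SciPy. The function name and the uniform-expected assumption are ours, not necessarily the production implementation:

```python
import numpy as np
from scipy.stats import chisquare

def cramers_v(counts):
    """Cramér's V for a goodness-of-fit test against a uniform distribution.

    For k categories, V = sqrt(chi2 / (n * (k - 1))); the result stays on
    the same 0..1 scale regardless of the sample size n.
    """
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), len(counts)
    chi2, p_value = chisquare(counts)  # expected counts default to n/k each
    return float(np.sqrt(chi2 / (n * (k - 1)))), p_value

# Similar relative deviations at very different sample sizes give similar V
v_small, _ = cramers_v([14, 18, 16, 17, 17, 18])                    # 100 rolls
v_large, _ = cramers_v([16600, 16700, 16650, 16700, 16650, 16700])  # 100,000 rolls
```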
P-Values (For Reference)
Tests also report p-values for those familiar with traditional statistical analysis:
| P-Value | Meaning |
|---|---|
| > 0.05 | No statistically significant deviation |
| 0.01 - 0.05 | Borderline — may be worth monitoring |
| < 0.01 | Statistically significant deviation detected |
Important: At large sample sizes, low p-values are expected even for excellent randomness. Always check effect size for the practical interpretation.
Expected Failures
Random data will sometimes produce borderline results. This is not a bug — it's mathematics.
- Running multiple tests means occasional anomalies are expected
- This is why we apply Bonferroni correction (adjust significance for multiple tests)
- Effect size thresholds are calibrated to avoid false alarms
Occasional WATCH status on the dashboard is expected behavior, not a sign of system problems. Only a persistent INVESTIGATE status indicates potential bias.
Test Descriptions
1. Chi-Squared Distribution Test
What it tests: Are all die faces appearing with equal frequency?
How it works:
- Count occurrences of each face (1-6)
- Compare observed counts to expected counts (n/6 each)
- Calculate the chi-squared statistic: χ² = Σ (observed - expected)² / expected
- Compare against the chi-squared distribution with 5 degrees of freedom (6 faces − 1)
What it detects:
- Biased die (one face appears more often)
- Manufacturing defects in physical dice
- Software bugs favoring certain values
Interpretation:
- High p-value (>0.05): Distribution looks uniform
- Low p-value (<0.01): Some faces appear too often or too rarely
Example (1,000 rolls, expected 166.7 per face):

| Face | Observed | Expected | Contribution |
|---|---|---|---|
| 1 | 158 | 166.7 | 0.45 |
| 2 | 172 | 166.7 | 0.17 |
| 3 | 165 | 166.7 | 0.02 |
| 4 | 170 | 166.7 | 0.07 |
| 5 | 168 | 166.7 | 0.01 |
| 6 | 167 | 166.7 | 0.00 |

χ² = 0.72, p-value = 0.98 → PASS (no bias detected)
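A minimal sketch of this test, assuming SciPy's `chisquare` with uniform expected counts (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import chisquare

def chi_squared_uniformity(rolls):
    """Chi-squared goodness-of-fit test for die faces 1..6."""
    counts = np.bincount(rolls, minlength=7)[1:]  # occurrences of faces 1..6
    chi2, p_value = chisquare(counts)             # expected n/6 each, df = 5
    v = np.sqrt(chi2 / (counts.sum() * 5))        # Cramér's V with k - 1 = 5
    return chi2, p_value, v

rolls = np.random.default_rng().integers(1, 7, size=1000)
chi2, p, v = chi_squared_uniformity(rolls)
```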
2. Runs Test (Odd/Even Parity)
What it tests: Do odd and even values alternate appropriately?
How it works:
- Convert each die value to parity: odd (1,3,5) → 1, even (2,4,6) → 0
- Count "runs" — consecutive sequences of same parity
- Compare run count to expected value for random sequence
What it detects:
- Too much alternation (odd-even-odd-even pattern)
- Too much clustering (odd-odd-odd-odd pattern)
- Predictable sequencing
Why odd/even instead of above/below median:
- Discrete die values split cleanly: exactly 3 odd, 3 even
- Avoids ambiguity at median (3.5)
- Better statistical properties for 6-sided dice
Interpretation:
- High p-value: Runs count is normal for random data
- Low p-value: Sequence is too clustered or too alternating
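A sketch of the runs test using the standard Wald-Wolfowitz normal approximation; the implementation here is ours, for illustration only:

```python
import numpy as np
from scipy.stats import norm

def runs_test_parity(rolls):
    """Wald-Wolfowitz runs test on the odd/even parity of die values."""
    bits = np.asarray(rolls) % 2                  # odd (1,3,5) -> 1, even -> 0
    n = len(bits)
    n1 = int(bits.sum())                          # number of odd values
    n0 = n - n1                                   # number of even values
    runs = 1 + int(np.count_nonzero(bits[1:] != bits[:-1]))
    expected = 2 * n1 * n0 / n + 1
    variance = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n ** 2 * (n - 1))
    z = (runs - expected) / np.sqrt(variance)
    return z, 2 * norm.sf(abs(z))                 # two-sided p-value
```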
3. Streak Distribution
What it tests: Do consecutive same-value sequences follow expected lengths?
How it works:
- Find all "streaks" — consecutive identical values
- Count streaks of each length (1, 2, 3, 4+)
- Compare to theoretical distribution
Expected distribution for a fair die: P(streak length = k) = (1/6)^(k-1) × (5/6)

| Length | Probability | Per 1,000 streaks |
|---|---|---|
| 1 | 83.3% | 833 |
| 2 | 13.9% | 139 |
| 3 | 2.3% | 23 |
| 4 | 0.4% | 4 |
| 5+ | 0.1% | 1 |
What it detects:
- Sticky behavior (too many long streaks)
- Anti-sticky behavior (too few repeats)
- Memory in the random source
Interpretation:
- High p-value: Streak lengths are normal
- Low p-value: Streaks are suspiciously long or short
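A sketch of the streak comparison under the bucketing above (1, 2, 3, 4+); illustrative only:

```python
import numpy as np
from itertools import groupby
from scipy.stats import chisquare

def streak_distribution_test(rolls):
    """Compare observed streak lengths to P(length = k) = (1/6)**(k-1) * (5/6)."""
    lengths = [len(list(group)) for _, group in groupby(rolls)]
    observed = np.zeros(4)
    for length in lengths:
        observed[min(length, 4) - 1] += 1        # bucket lengths as 1, 2, 3, 4+
    probs = np.array([5/6, 5/36, 5/216, 1/216])  # P(4+) = (1/6)**3
    return chisquare(observed, probs * len(lengths))
```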
4. Transition Matrix Test
What it tests: Does the next value depend on the current value?
How it works:
- Build 6×6 matrix counting transitions (e.g., "1 followed by 4")
- For each row, run chi-squared test against uniform distribution
- Report 6 separate p-values (one per starting value)
Expected behavior:
P(next = j | current = i) = 1/6 for all i, j
The transition matrix should look roughly like:

| → | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 16.7% | 16.7% | 16.7% | 16.7% | 16.7% | 16.7% |
| 2 | 16.7% | 16.7% | 16.7% | 16.7% | 16.7% | 16.7% |
| … | … | … | … | … | … | … |
What it detects:
- Markov dependencies ("3 is often followed by 5")
- Mechanical bias in physical dice
- Pseudo-random generator weaknesses
Why 6 separate tests instead of one:
- Raw matrix chi-squared violates independence assumptions
- Row-by-row testing is statistically valid
- Pinpoints which transitions are problematic
Interpretation:
- All rows p > 0.05: No transition bias detected
- One row p < 0.01: That starting value has biased follow-ups
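A minimal sketch of the row-by-row procedure (illustrative; very small windows can leave rows nearly empty, which production code would need to guard against):

```python
import numpy as np
from scipy.stats import chisquare

def transition_matrix_test(rolls):
    """One chi-squared uniformity test per row of the 6x6 transition matrix."""
    rolls = np.asarray(rolls)
    matrix = np.zeros((6, 6))
    for current, nxt in zip(rolls[:-1], rolls[1:]):
        matrix[current - 1, nxt - 1] += 1
    # Six separate p-values, one per starting value; df = 5 for each row
    return [chisquare(row).pvalue for row in matrix]
```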
5. Serial Pair Test
What it tests: Do all consecutive pairs appear with equal frequency?
How it works:
- Extract all consecutive pairs: (1,4), (4,2), (2,6), ...
- Count occurrences of each of 36 possible pairs
- Chi-squared test against expected frequency (n/36 each)
Expected behavior:
- 36 unique pairs, each with probability 1/36 ≈ 2.78%
- In 10,000 rolls there are 9,999 overlapping pairs, so each pair is expected roughly 278 times
What it detects:
- Subtle sequential bias invisible to single-value tests
- "1 is more likely to follow 3" patterns
- State-dependent behavior
Difference from transition matrix:
- Transition matrix: conditional probabilities (P(next|current))
- Serial pair test: joint probabilities (P(current AND next))
- Both are valuable; they catch different anomalies
Interpretation:
- High p-value: All pairs appear equally often
- Low p-value: Some pairs are over/under-represented
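A sketch of the pair-counting approach (the pair encoding is ours, for illustration):

```python
import numpy as np
from scipy.stats import chisquare

def serial_pair_test(rolls):
    """Chi-squared test over all 36 consecutive (overlapping) pairs."""
    rolls = np.asarray(rolls)
    pair_index = (rolls[:-1] - 1) * 6 + (rolls[1:] - 1)  # encode pairs as 0..35
    counts = np.bincount(pair_index, minlength=36)
    return chisquare(counts)  # expected (n - 1) / 36 per pair
```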
6. Shannon Entropy
What it tests: How much information content is in the output?
How it works:
- Calculate frequency of each face: p_i = count_i / total
- Compute entropy: H = -Σ p_i × log₂(p_i)
- Compare to theoretical maximum
Theoretical values:
- Maximum entropy for a 6-sided die: H_max = log₂(6) ≈ 2.585 bits
- The maximum occurs when all faces are equally likely (p = 1/6)
What it detects:
- Low entropy = predictable output
- Concentrated distribution = fewer effective outcomes
- Information loss from bias
Interpretation:
| Observed Entropy | Meaning |
|---|---|
| ~2.585 bits | Perfect — all outcomes equally likely |
| 2.4 - 2.58 bits | Good — minor variation |
| < 2.4 bits | Concerning — some outcomes dominate |
| < 2.0 bits | Serious — significant bias present |
Note: Entropy is always shown as a metric rather than a pass/fail test, because it gives an intuitive sense of randomness quality rather than a binary verdict.
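For reference, a minimal sketch of the entropy computation (illustrative):

```python
import numpy as np

def shannon_entropy(rolls):
    """Shannon entropy (in bits) of the observed face distribution."""
    counts = np.bincount(rolls, minlength=7)[1:]
    p = counts / counts.sum()
    p = p[p > 0]                      # skip faces that never appeared
    return -np.sum(p * np.log2(p))    # maximum is log2(6) ~ 2.585 bits
```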
7. Autocorrelation
What it tests: Is there correlation between values at different time lags?
How it works:
- For each lag k (1, 2, 3, …, 20), compute the correlation between the sequence and itself shifted by k
- Check if correlations fall within expected bounds
Expected behavior:
For truly random data:
- Autocorrelation at all lags ≈ 0
- 95% confidence bounds: ±1.96/√n
For n = 1000:
- Bounds ≈ ±0.062
- Values outside bounds suggest correlation
What it detects:
- Periodic patterns (every 10th value repeats)
- Trending behavior
- Poor PRNG with short cycles
Important caveat: Die values (1-6) are categorical, not continuous. Autocorrelation assumes numeric distance matters, but "1 → 6" isn't meaningfully different from "2 → 3" in randomness terms.
Interpretation:
- Use for visualization and trend detection
- Don't rely on it as primary randomness test
- Transition matrix is more appropriate for sequential independence
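A sketch of the lag-k computation, with the ±1.96/√n bounds from above (illustrative only):

```python
import numpy as np

def autocorrelation(rolls, max_lag=20):
    """Sample autocorrelation at lags 1..max_lag, plus the 95% bound."""
    x = np.asarray(rolls, dtype=float)
    x -= x.mean()
    total = np.dot(x, x)                    # n * variance of the sequence
    acf = np.array([np.dot(x[:-k], x[k:]) / total
                    for k in range(1, max_lag + 1)])
    bound = 1.96 / np.sqrt(len(x))          # e.g. ~0.062 for n = 1000
    return acf, bound
```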
Rolling Windows
Tests run on multiple window sizes to catch both short-term and long-term anomalies:
| Window | Purpose | Update Frequency |
|---|---|---|
| 100 | Short-term fluctuations | Every round |
| 1,000 | Medium-term patterns | Every round |
| 10,000 | Stable statistics | Every 10 rounds |
| 100,000 | Long-term validation | Every 100 rounds |
Why multiple windows?
- Small windows: Responsive to recent changes, but noisy
- Large windows: Stable statistics, but slow to detect new problems
- Combined view: Best of both worlds
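A minimal sketch of how windows like these can be maintained (class and parameter names are ours, purely illustrative):

```python
from collections import deque

class RollingWindow:
    """Fixed-size window of recent die values."""
    def __init__(self, size, update_every=1):
        self.values = deque(maxlen=size)  # oldest values drop off automatically
        self.update_every = update_every
        self.rounds = 0

    def add(self, value):
        """Record a new round; return True when tests should be re-run."""
        self.values.append(value)
        self.rounds += 1
        return self.rounds % self.update_every == 0

windows = [RollingWindow(100), RollingWindow(1_000),
           RollingWindow(10_000, update_every=10),
           RollingWindow(100_000, update_every=100)]
```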
Bonferroni Correction
When running multiple tests, false positives accumulate.
Problem:
- 6 tests at α = 0.05
- Probability of at least one false positive: 1 - (0.95)⁶ ≈ 26%
Solution: Bonferroni correction
- Adjusted α = 0.05 / 6 ≈ 0.0083
- Each test must achieve p < 0.0083 to be "significant"
- Family-wise error rate stays at 5%
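The arithmetic behind these numbers, as a short sketch (the `significant` helper is illustrative):

```python
# Family-wise false positive rate without correction: 6 tests at alpha = 0.05
fwer_uncorrected = 1 - (1 - 0.05) ** 6    # ~0.265, i.e. about 26%

# Bonferroni correction: divide alpha by the number of tests
alpha_corrected = 0.05 / 6                # ~0.0083

def significant(p_values, alpha=0.05):
    """Flag tests that remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```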
Dashboard shows:
- Individual test p-values
- Bonferroni-corrected overall status
- Whether any test is "significant" after correction
What These Tests Cannot Tell You
Statistical tests have limitations:
- Cannot prove randomness — only detect specific types of non-randomness
- Cannot detect all manipulation — adversary might pass all tests
- Cannot predict future values — that's the point
- Will occasionally fail — 5% false positive rate is expected
The dashboard is evidence of quality, not proof of perfection.
Further Reading
- Benchmark Tests — Comparing rng.dev against drand and NIST beacons
- NIST SP 800-22 — Statistical test suite for random number generators
- Diehard Tests — Classic battery of randomness tests
- How It Works — Beacon generation process
- Threat Model — Security assumptions and limitations