Statistical Tests
This page explains the statistical tests used to validate the randomness quality of the beacon's die values. These tests run continuously on rolling windows of beacon output.
Note: For tests comparing rng.dev against drand and NIST beacons (hash byte comparison), see Benchmark Tests.
Understanding the Results
Effect Size (Primary Metric)
We use effect size rather than p-values to determine pass/fail status. Effect size measures how large a deviation is, not just whether it's statistically detectable.
| Effect Size (Cramér's V) | Interpretation | Status |
|---|---|---|
| < 0.05 | Negligible | ✓ PASS |
| 0.05 - 0.10 | Weak | ✓ PASS |
| 0.10 - 0.20 | Moderate | ⚠ WATCH |
| > 0.20 | Strong | ✗ INVESTIGATE |
Why effect size instead of p-values?
P-values become misleading at large sample sizes. With hundreds of thousands of rounds, even a deviation of a few tenths of a percentage point produces p < 0.001, yet such tiny deviations have no practical impact on randomness quality.
Effect size (Cramér's V) is sample-size invariant: a V of 0.02 means the same thing whether you have 100 or 100,000 samples.
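For illustration, here is a minimal sketch of the effect-size computation in Python with NumPy and SciPy. The function name and the uniform-expected assumption are ours, not necessarily the production implementation:

```python
import numpy as np
from scipy.stats import chisquare

def cramers_v(counts):
    """Cramér's V for a goodness-of-fit test against a uniform distribution.

    For k categories, V = sqrt(chi2 / (n * (k - 1))); the result stays on
    the same 0..1 scale regardless of the sample size n.
    """
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), len(counts)
    chi2, p_value = chisquare(counts)  # expected counts default to n/k each
    return float(np.sqrt(chi2 / (n * (k - 1)))), p_value

# Similar relative deviations at very different sample sizes give similar V
v_small, _ = cramers_v([14, 18, 16, 17, 17, 18])                    # 100 rolls
v_large, _ = cramers_v([16600, 16700, 16650, 16700, 16650, 16700])  # 100,000 rolls
```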
P-Values (For Reference)
Tests also report p-values for those familiar with traditional statistical analysis:
| P-Value | Meaning |
|---|---|
| > 0.05 | No statistically significant deviation |
| 0.01 - 0.05 | Borderline — may be worth monitoring |
| < 0.01 | Statistically significant deviation detected |
Important: At large sample sizes, low p-values are expected even for excellent randomness. Always check effect size for the practical interpretation.
Expected Failures
Random data will sometimes produce borderline results. This is not a bug — it's mathematics.
- Running multiple tests means occasional anomalies are expected
- This is why we apply Bonferroni correction (adjust significance for multiple tests)
- Effect size thresholds are calibrated to avoid false alarms
Occasional WATCH status on the dashboard is expected behavior, not a sign of system problems. Only a persistent INVESTIGATE status indicates potential bias.
Test Descriptions
1. Chi-Squared Distribution Test
What it tests: Are all die faces appearing with equal frequency?
How it works:
- Count occurrences of each face (1-6)
- Compare observed counts to expected counts (n/6 each)
- Calculate the chi-squared statistic: χ² = Σ (observed - expected)² / expected
- Compare against the chi-squared distribution with 5 degrees of freedom (6 faces − 1)
What it detects:
- Biased die (one face appears more often)
- Manufacturing defects in physical dice
- Software bugs favoring certain values
Interpretation:
- High p-value (>0.05): Distribution looks uniform
- Low p-value (<0.01): Some faces appear too often or too rarely
Example (1,000 rolls, expected 166.7 per face):

| Face | Observed | Expected | Contribution |
|---|---|---|---|
| 1 | 158 | 166.7 | 0.45 |
| 2 | 172 | 166.7 | 0.17 |
| 3 | 165 | 166.7 | 0.02 |
| 4 | 170 | 166.7 | 0.07 |
| 5 | 168 | 166.7 | 0.01 |
| 6 | 167 | 166.7 | 0.00 |

χ² = 0.72, p-value = 0.98 → PASS (no bias detected)
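A minimal sketch of this test, assuming SciPy's `chisquare` with uniform expected counts (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import chisquare

def chi_squared_uniformity(rolls):
    """Chi-squared goodness-of-fit test for die faces 1..6."""
    counts = np.bincount(rolls, minlength=7)[1:]  # occurrences of faces 1..6
    chi2, p_value = chisquare(counts)             # expected n/6 each, df = 5
    v = np.sqrt(chi2 / (counts.sum() * 5))        # Cramér's V with k - 1 = 5
    return chi2, p_value, v

rolls = np.random.default_rng().integers(1, 7, size=1000)
chi2, p, v = chi_squared_uniformity(rolls)
```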
2. Runs Test (Odd/Even Parity)
What it tests: Do odd and even values alternate appropriately?
How it works:
- Convert each die value to parity: odd (1,3,5) → 1, even (2,4,6) → 0
- Count "runs" — consecutive sequences of same parity
- Compare run count to expected value for random sequence
What it detects:
- Too much alternation (odd-even-odd-even pattern)
- Too much clustering (odd-odd-odd-odd pattern)
- Predictable sequencing
Why odd/even instead of above/below median:
- Discrete die values split cleanly: exactly 3 odd, 3 even
- Avoids ambiguity at median (3.5)
- Better statistical properties for 6-sided dice
Interpretation:
- High p-value: Runs count is normal for random data
- Low p-value: Sequence is too clustered or too alternating
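A sketch of the runs test using the standard Wald-Wolfowitz normal approximation; the implementation here is ours, for illustration only:

```python
import numpy as np
from scipy.stats import norm

def runs_test_parity(rolls):
    """Wald-Wolfowitz runs test on the odd/even parity of die values."""
    bits = np.asarray(rolls) % 2                  # odd (1,3,5) -> 1, even -> 0
    n = len(bits)
    n1 = int(bits.sum())                          # number of odd values
    n0 = n - n1                                   # number of even values
    runs = 1 + int(np.count_nonzero(bits[1:] != bits[:-1]))
    expected = 2 * n1 * n0 / n + 1
    variance = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n ** 2 * (n - 1))
    z = (runs - expected) / np.sqrt(variance)
    return z, 2 * norm.sf(abs(z))                 # two-sided p-value
```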
3. Streak Distribution
What it tests: Do consecutive same-value sequences follow expected lengths?
How it works:
- Find all "streaks" — consecutive identical values
- Count streaks of each length (1, 2, 3, 4+)
- Compare to theoretical distribution
Expected distribution for a fair die: P(streak length = k) = (1/6)^(k-1) × (5/6)

| Length | Probability | Per 1,000 streaks |
|---|---|---|
| 1 | 83.3% | 833 |
| 2 | 13.9% | 139 |
| 3 | 2.3% | 23 |
| 4 | 0.4% | 4 |
| 5+ | 0.1% | 1 |
What it detects:
- Sticky behavior (too many long streaks)
- Anti-sticky behavior (too few repeats)
- Memory in the random source
Interpretation:
- High p-value: Streak lengths are normal
- Low p-value: Streaks are suspiciously long or short
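A sketch of the streak comparison under the bucketing above (1, 2, 3, 4+); illustrative only:

```python
import numpy as np
from itertools import groupby
from scipy.stats import chisquare

def streak_distribution_test(rolls):
    """Compare observed streak lengths to P(length = k) = (1/6)**(k-1) * (5/6)."""
    lengths = [len(list(group)) for _, group in groupby(rolls)]
    observed = np.zeros(4)
    for length in lengths:
        observed[min(length, 4) - 1] += 1        # bucket lengths as 1, 2, 3, 4+
    probs = np.array([5/6, 5/36, 5/216, 1/216])  # P(4+) = (1/6)**3
    return chisquare(observed, probs * len(lengths))
```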
4. Transition Matrix Test
What it tests: Does the next value depend on the current value?
How it works:
- Build 6×6 matrix counting transitions (e.g., "1 followed by 4")
- For each row, run chi-squared test against uniform distribution
- Report 6 separate p-values (one per starting value)
Expected behavior:
P(next = j | current = i) = 1/6 for all i, j
The transition matrix should look roughly like:

| → | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 16.7% | 16.7% | 16.7% | 16.7% | 16.7% | 16.7% |
| 2 | 16.7% | 16.7% | 16.7% | 16.7% | 16.7% | 16.7% |
| … | … | … | … | … | … | … |
What it detects:
- Markov dependencies ("3 is often followed by 5")
- Mechanical bias in physical dice
- Pseudo-random generator weaknesses
Why 6 separate tests instead of one:
- Raw matrix chi-squared violates independence assumptions
- Row-by-row testing is statistically valid
- Pinpoints which transitions are problematic
Interpretation:
- All rows p > 0.05: No transition bias detected
- One row p < 0.01: That starting value has biased follow-ups
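A minimal sketch of the row-by-row procedure (illustrative; very small windows can leave rows nearly empty, which production code would need to guard against):

```python
import numpy as np
from scipy.stats import chisquare

def transition_matrix_test(rolls):
    """One chi-squared uniformity test per row of the 6x6 transition matrix."""
    rolls = np.asarray(rolls)
    matrix = np.zeros((6, 6))
    for current, nxt in zip(rolls[:-1], rolls[1:]):
        matrix[current - 1, nxt - 1] += 1
    # Six separate p-values, one per starting value; df = 5 for each row
    return [chisquare(row).pvalue for row in matrix]
```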
5. Serial Pair Test
What it tests: Do all consecutive pairs appear with equal frequency?
How it works:
- Extract all consecutive pairs: (1,4), (4,2), (2,6), ...
- Count occurrences of each of 36 possible pairs
- Chi-squared test against expected frequency (n/36 each)
Expected behavior:
- 36 unique pairs, each with probability 1/36 ≈ 2.78%
- In 10,000 rolls there are 9,999 overlapping pairs, so each pair is expected roughly 278 times
What it detects:
- Subtle sequential bias invisible to single-value tests
- "1 is more likely to follow 3" patterns
- State-dependent behavior
Difference from transition matrix:
- Transition matrix: conditional probabilities (P(next|current))
- Serial pair test: joint probabilities (P(current AND next))
- Both are valuable; they catch different anomalies
Interpretation:
- High p-value: All pairs appear equally often
- Low p-value: Some pairs are over/under-represented
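A sketch of the pair-counting approach (the pair encoding is ours, for illustration):

```python
import numpy as np
from scipy.stats import chisquare

def serial_pair_test(rolls):
    """Chi-squared test over all 36 consecutive (overlapping) pairs."""
    rolls = np.asarray(rolls)
    pair_index = (rolls[:-1] - 1) * 6 + (rolls[1:] - 1)  # encode pairs as 0..35
    counts = np.bincount(pair_index, minlength=36)
    return chisquare(counts)  # expected (n - 1) / 36 per pair
```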
6. Shannon Entropy
What it tests: How much information content is in the output?
How it works:
- Calculate frequency of each face: p_i = count_i / total
- Compute entropy: H = -Σ p_i × log₂(p_i)
- Compare to theoretical maximum
Theoretical values:
- Maximum entropy for a 6-sided die: H_max = log₂(6) ≈ 2.585 bits
- The maximum occurs when all faces are equally likely (p = 1/6)
What it detects:
- Low entropy = predictable output
- Concentrated distribution = fewer effective outcomes
- Information loss from bias
Interpretation:
| Observed Entropy | Meaning |
|---|---|
| ~2.585 bits | Perfect — all outcomes equally likely |
| 2.4 - 2.58 bits | Good — minor variation |
| < 2.4 bits | Concerning — some outcomes dominate |
| < 2.0 bits | Serious — significant bias present |
Note: Entropy is always shown as a metric rather than a pass/fail test, because it gives an intuitive sense of randomness quality rather than a binary verdict.
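For reference, a minimal sketch of the entropy computation (illustrative):

```python
import numpy as np

def shannon_entropy(rolls):
    """Shannon entropy (in bits) of the observed face distribution."""
    counts = np.bincount(rolls, minlength=7)[1:]
    p = counts / counts.sum()
    p = p[p > 0]                      # skip faces that never appeared
    return -np.sum(p * np.log2(p))    # maximum is log2(6) ~ 2.585 bits
```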
7. Autocorrelation
What it tests: Is there correlation between values at different time lags?
How it works:
- For each lag k (1, 2, 3, …, 20), compute the correlation between the sequence and itself shifted by k
- Check if correlations fall within expected bounds
Expected behavior:
For truly random data:
- Autocorrelation at all lags ≈ 0
- 95% confidence bounds: ±1.96/√n
For n = 1000:
- Bounds ≈ ±0.062
- Values outside bounds suggest correlation
What it detects:
- Periodic patterns (every 10th value repeats)
- Trending behavior
- Poor PRNG with short cycles
Important caveat: Die values (1-6) are categorical, not continuous. Autocorrelation assumes numeric distance matters, but "1 → 6" isn't meaningfully different from "2 → 3" in randomness terms.
Interpretation:
- Use for visualization and trend detection
- Don't rely on it as primary randomness test
- Transition matrix is more appropriate for sequential independence
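A sketch of the lag-k computation, with the ±1.96/√n bounds from above (illustrative only):

```python
import numpy as np

def autocorrelation(rolls, max_lag=20):
    """Sample autocorrelation at lags 1..max_lag, plus the 95% bound."""
    x = np.asarray(rolls, dtype=float)
    x -= x.mean()
    total = np.dot(x, x)                    # n * variance of the sequence
    acf = np.array([np.dot(x[:-k], x[k:]) / total
                    for k in range(1, max_lag + 1)])
    bound = 1.96 / np.sqrt(len(x))          # e.g. ~0.062 for n = 1000
    return acf, bound
```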
Rolling Windows
Tests run on multiple window sizes to catch both short-term and long-term anomalies:
| Window | Purpose | Update Frequency |
|---|---|---|
| 100 | Short-term fluctuations | Every round |
| 1,000 | Medium-term patterns | Every round |
| 10,000 | Stable statistics | Every 10 rounds |
| 100,000 | Long-term validation | Every 100 rounds |
Why multiple windows?
- Small windows: Responsive to recent changes, but noisy
- Large windows: Stable statistics, but slow to detect new problems
- Combined view: Best of both worlds
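A minimal sketch of how windows like these can be maintained (class and parameter names are ours, purely illustrative):

```python
from collections import deque

class RollingWindow:
    """Fixed-size window of recent die values."""
    def __init__(self, size, update_every=1):
        self.values = deque(maxlen=size)  # oldest values drop off automatically
        self.update_every = update_every
        self.rounds = 0

    def add(self, value):
        """Record a new round; return True when tests should be re-run."""
        self.values.append(value)
        self.rounds += 1
        return self.rounds % self.update_every == 0

windows = [RollingWindow(100), RollingWindow(1_000),
           RollingWindow(10_000, update_every=10),
           RollingWindow(100_000, update_every=100)]
```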
Bonferroni Correction
When running multiple tests, false positives accumulate.
Problem:
- 6 tests at α = 0.05
- Probability of at least one false positive: 1 - (0.95)⁶ ≈ 26%
Solution: Bonferroni correction
- Adjusted α = 0.05 / 6 ≈ 0.0083
- Each test must achieve p < 0.0083 to be "significant"
- Family-wise error rate stays at 5%
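The arithmetic behind these numbers, as a short sketch (the `significant` helper is illustrative):

```python
# Family-wise false positive rate without correction: 6 tests at alpha = 0.05
fwer_uncorrected = 1 - (1 - 0.05) ** 6    # ~0.265, i.e. about 26%

# Bonferroni correction: divide alpha by the number of tests
alpha_corrected = 0.05 / 6                # ~0.0083

def significant(p_values, alpha=0.05):
    """Flag tests that remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```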
Dashboard shows:
- Individual test p-values
- Bonferroni-corrected overall status
- Whether any test is "significant" after correction
What These Tests Cannot Tell You
Statistical tests have limitations:
- Cannot prove randomness — only detect specific types of non-randomness
- Cannot detect all manipulation — adversary might pass all tests
- Cannot predict future values — that's the point
- Will occasionally fail — 5% false positive rate is expected
The dashboard is evidence of quality, not proof of perfection.
Further Reading
- Benchmark Tests — Comparing rng.dev against drand and NIST beacons
- NIST SP 800-22 — Statistical test suite for random number generators
- Diehard Tests — Classic battery of randomness tests
- How It Works — Beacon generation process
- Threat Model — Security assumptions and limitations