RealStats: Real-Only Statistical Framework for Fake Image Detection

AISTATS 2026 · Spotlight
Bar-Ilan University · Shaham Lab

Abstract

As generative models continue to advance, detecting AI-generated images remains a significant challenge. Existing detection methods can be effective, but they often lack formal interpretability and rely on implicit assumptions about fake content, which may limit their adaptability under distribution shifts. In this work, we introduce RealStats, a statistically rigorous, training-free framework for fake-image detection that produces probability scores interpretable with respect to the real-image population. RealStats leverages the strengths of multiple existing detectors by combining strong, training-free statistical measures. Specifically, we compute p-values across a range of test statistics and aggregate them using classical statistical ensembling to assess how well an image aligns with a unified real-image distribution. This framework is generic, flexible, and entirely training-free, making it well-suited for robust fake-image detection in diverse and continually evolving settings.

Limitations of Existing Training-Free Detectors

Current training-free fake-image detectors face two fundamental challenges that motivate the RealStats framework; each is illustrated below.

Illustration of the score interpretability gap.
Top: classifier scores separate classes but lack statistical meaning; they are often overconfident and uncalibrated. Bottom: RealStats produces calibrated p-values with a clear interpretation: the probability of observing such a statistic if the image were real. Supervised models output uncalibrated scores, whereas RealStats yields statistically meaningful p-values that enable principled decisions at a standard significance level.

Panels: Manifold Curvature (SDXL); Manifold Curvature (StyleGAN2); Permutation-Based Features (StyleGAN2, f = CLIP, λ = 0.1); Permutation-Based Features (StyleGAN2, f = CLIP, λ = 0.01).

Illustration of the adaptability gap.
Top: Manifold curvature flips ordering between SDXL and StyleGAN2, showing generator-dependent behavior. Bottom: Permutation-based features depend strongly on λ; real-vs-fake separation reverses as λ changes. These shifts show that handcrafted statistics rely on assumptions that fail under distribution shifts.

The Three Pillars of RealStats

RealStats is built upon three foundational principles that together create a statistically interpretable, modular, and competitive framework for fake-image detection. These pillars structure the design of our method and guide the experimental sections that follow.

Interpretability

Every RealStats output is a calibrated p-value, a statistically meaningful, distribution-grounded measure of how consistent an image is with real data. This replaces opaque “realness scores” with transparent hypothesis testing, enabling explicit uncertainty quantification and principled decision thresholds.
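Concretely, a calibrated p-value supports a textbook decision rule. For a chosen significance level α (a standard choice on our part, not a value prescribed by the paper), declare an image fake when

\[ p \le \alpha, \qquad \text{which guarantees } \Pr(p \le \alpha \mid \text{image is real}) \le \alpha, \]

so the false-alarm rate on real images is controlled by construction.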

Adaptability

RealStats is fully modular: any new scalar detector computed on real images can be added without retraining. Through independence validation and classical multi-test aggregation, the framework seamlessly incorporates new statistics, such as ManifoldBias, to adapt to evolving generative models.

Competitive Performance

Despite prioritizing interpretability and modularity, RealStats achieves performance comparable to state-of-the-art training-free detectors. By aggregating diverse, training-free statistics, it maintains strong AUC and AP while avoiding the instability observed in single-statistic baselines.

Interpretability Does Not Come at the Price of Performance

Bar chart comparing average AUC and AP against training-free baselines.
Radar plot comparing AUC per generator across methods.

Left: mean AUC and AP with standard deviations for several training-free detectors.
Right: per-generator AUC values shown as a radar plot comparing performance across methods.

A central aim of RealStats is interpretable and reliable detection. By returning p-values rather than raw scores, it provides outputs with clear statistical meaning. Although it shows some weakness on generators such as CycleGAN, it maintains more balanced performance across datasets, aided by multi-RIGID variants that guard against failure when a single statistic underperforms. Overall, RealStats matches the performance of existing training-free approaches while improving consistency.

Adaptability in Action: Improving Performance on Challenging Generators

Adding the ManifoldBias statistic improves performance on the GauGAN, CycleGAN, and SAN generators.

p-value distributions on the entire evaluation set before and after integrating ManifoldBias.

Some generators, such as GauGAN, CycleGAN, and SAN, are more challenging for our detector, whereas ManifoldBias handles them better. To illustrate the adaptability of RealStats, we incorporate this statistic under the same setup, which yields clear improvements (left figure). The effect also scales to the full benchmark dataset, increasing the overall AUC and the separation between p-values (right figure), demonstrating the method's ability to strengthen itself by selectively integrating new independent statistics when needed.

Qualitative Comparison of Interpretability Across Methods

Qualitative interpretability comparison across methods

Figure: Qualitative comparison of interpretability across detection methods. While baseline detectors often assign similar scores to visually diverse fake images, RealStats produces calibrated p-values that reflect each sample's statistical consistency with the real-image distribution. This enables a principled, graded notion of realism: higher p-values indicate compatibility with the null model, whereas low values signal meaningful deviation.

Overview of the RealStats Pipeline

RealStats operates in two phases. During null distribution modeling, we compute diverse training-free statistics on real images, estimate their empirical CDFs, and select an independent subset using chi-squared-based tests. At inference, incoming images are mapped to p-values through cached ECDFs and aggregated via Stouffer or Min-p rules, yielding interpretable decisions with provable error control.

Null distribution modeling: statistics computed on real data are converted into ECDFs, pruned by independence tests, verified through a uniformity test, and arranged into a maximal clique that guarantees valid multi-test aggregation.
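The sketch below illustrates this phase in Python under simplifying assumptions: statistics are already computed on real images, and the uniformity check is a plain Kolmogorov-Smirnov test. The helper names and the α = 0.05 threshold are our illustrative choices, not the paper's.

# Minimal sketch of the null-modeling phase; helper names and thresholds
# are illustrative, not the authors' implementation.
import numpy as np
from scipy.stats import kstest

def fit_ecdf(values):
    """Return a callable ECDF fitted on a statistic's real-image values."""
    sorted_vals = np.sort(np.asarray(values))
    n = len(sorted_vals)
    def ecdf(t):
        # Fraction of real calibration values <= t
        return np.searchsorted(sorted_vals, t, side="right") / n
    return ecdf

def uniformity_ok(pvals, alpha=0.05):
    """KS check: p-values on held-out real images should be ~ Uniform(0, 1)."""
    return kstest(pvals, "uniform").pvalue > alpha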

Inference: selected statistics are evaluated on a test image, transformed into p-values using stored ECDFs, and aggregated into a single interpretable probability of the image being real.
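A matching inference sketch, using classical textbook forms of the two combination rules (the paper's exact variants may differ):

# Minimal inference sketch: statistics -> two-sided empirical p-values -> one
# aggregated p-value. Classical Stouffer and Sidak-corrected Min-p forms are
# standard choices here, not necessarily the paper's exact ones.
import numpy as np
from scipy.stats import norm

def two_sided_pvalue(ecdf, t):
    """Two-sided empirical p-value of statistic value t under the real-image null."""
    u = ecdf(t)
    return min(1.0, 2.0 * min(u, 1.0 - u))

def stouffer(pvals):
    """Stouffer rule: average probit-transformed p-values, then map back."""
    p = np.clip(np.asarray(pvals, dtype=float), 1e-12, 1 - 1e-12)
    z = norm.ppf(1.0 - p).sum() / np.sqrt(len(p))
    return 1.0 - norm.cdf(z)

def min_p(pvals):
    """Min-p rule with a Sidak correction (valid for independent p-values)."""
    return 1.0 - (1.0 - min(pvals)) ** len(pvals)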

Formal Validity

Calibrated two-sided empirical p-values are computed under an assumed real-image distribution, providing a statistically interpretable basis for evaluation.
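One standard finite-sample construction consistent with this description (the paper's exact definition may differ) uses the statistic's values T_1, ..., T_n on n real calibration images:

\[ \hat p(t) \;=\; \min\!\Bigl(1,\; 2\min\Bigl(\tfrac{1+\#\{i:\,T_i \le t\}}{n+1},\; \tfrac{1+\#\{i:\,T_i \ge t\}}{n+1}\Bigr)\Bigr), \]

where the +1 terms make the p-value valid (stochastically no smaller than uniform) in finite samples.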

Modular Statistics

Any scalar detector computed on real images can be added, ECDF-modeled, and incorporated into the framework without retraining.

Evidence Sensitivity

Stouffer and Min-p aggregation adapt to two evidence modes: several weak signals that align, or a single statistic that stands out strongly.
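In their classical forms, with k independent p-values p_1, ..., p_k (a sketch of the standard rules, which the paper may refine):

\[ Z = \tfrac{1}{\sqrt{k}} \sum_{i=1}^{k} \Phi^{-1}(1 - p_i), \quad p_{\text{Stouffer}} = 1 - \Phi(Z), \qquad p_{\text{Min-p}} = 1 - \bigl(1 - \min_i p_i\bigr)^{k}. \]

Stouffer rewards many mildly small p-values that point the same way, while Min-p rewards a single very small one; running both covers both evidence modes.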

Practical Deployability: Runtime, Scalability, and Memory Efficiency

Efficient Statistic Extraction

Statistic extraction runtime under different worker configurations

Parallel workers reduce extraction time for 16 statistics from 15+ minutes (1 worker) to under 8 minutes (4 workers).

All experiments use a single A100 GPU to ensure fair comparison with baselines, which cannot utilize multiple GPUs. Within this constraint, RealStats exploits multiple workers to maximize GPU utilization and reduce CPU-GPU synchronization overhead. Runtime grows predictably with the number of statistics, and parallelism yields substantial speedups. At inference, RealStats processes 2,000 images in 5.5 minutes (0.165s per image), significantly faster than ManifoldBias (2.4s) and AEROBLADE (5.1s).
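As an illustration of the worker pattern (a minimal sketch; the statistic callables, thread pool, and worker count below are our placeholders, not the actual extraction code):

# Minimal sketch: overlapping statistic extraction across worker threads.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def extract_all(images, statistics, n_workers=4):
    """Compute every scalar statistic for every image, in parallel.

    images:     array of shape (n_images, ...)
    statistics: dict name -> callable(image) -> float
    returns:    dict name -> np.ndarray of shape (n_images,)
    """
    def run(stat_fn):
        return np.array([stat_fn(img) for img in images])

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {name: pool.submit(run, fn) for name, fn in statistics.items()}
        return {name: fut.result() for name, fut in futures.items()}

# Toy usage with two dummy statistics on random "images":
imgs = np.random.rand(8, 32, 32, 3)
stats = {"mean_intensity": lambda x: float(x.mean()),
         "high_freq_energy": lambda x: float(np.abs(np.fft.fft2(x.mean(-1))).mean())}
scores = extract_all(imgs, stats)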

Fast Independence Testing and Clique Selection

Runtime of independence testing and clique extraction

Independence testing and clique selection remain under 200 ms even for 32 statistics.

Independence validation is essential for guaranteeing valid multi-statistic aggregation. Both χ² tests and maximal clique extraction scale efficiently, adding negligible overhead even as the number of statistics grows. This enables RealStats to incorporate new detectors on the fly without slowing down the pipeline.
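A minimal sketch of this step, assuming pairwise χ² tests on binned p-values and a maximal-clique search over the resulting independence graph (the bin count, α, and the networkx-based search are our illustrative choices):

# Minimal sketch of independence screening and clique selection; the exact
# binning and thresholds in the paper may differ. Requires scipy and networkx.
import numpy as np
import networkx as nx
from scipy.stats import chi2_contingency

def pairwise_independent(p_i, p_j, n_bins=4, alpha=0.05):
    """Chi-squared test of independence on jointly binned p-values."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bi = np.clip(np.digitize(p_i, edges) - 1, 0, n_bins - 1)
    bj = np.clip(np.digitize(p_j, edges) - 1, 0, n_bins - 1)
    table = np.zeros((n_bins, n_bins))
    for a, b in zip(bi, bj):
        table[a, b] += 1
    _, p_val, _, _ = chi2_contingency(table + 1e-9)  # ridge avoids empty marginals
    return p_val > alpha  # fail to reject independence

def largest_independent_clique(pvals):
    """pvals: dict name -> array of that statistic's p-values on real images."""
    g = nx.Graph()
    g.add_nodes_from(pvals)
    names = list(pvals)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pairwise_independent(pvals[a], pvals[b]):
                g.add_edge(a, b)
    # The largest maximal clique gives a set of mutually independent statistics.
    return max(nx.find_cliques(g), key=len)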

Memory Efficiency

Peak GPU memory usage across methods

RealStats uses only 7-22GB (1-4 workers), compared to 40GB for ManifoldBias and 76GB for AEROBLADE.

Despite using multiple statistics, RealStats remains memory efficient. With batch size 128 and up to four workers, GPU memory stays within 7-22GB. In contrast, AEROBLADE reaches 76GB, and ManifoldBias (with 8 perturbations) consumes 40GB at batch size 1 and cannot scale further. ECDF storage is also minimal, about 0.25MB for 32 statistics.

These results show that RealStats offers a statistically principled framework with practical runtime, strong scalability under single-GPU constraints, and substantially lower memory usage than competing training-free baselines.

Conclusion

RealStats provides an interpretable, training-free framework for fake-image detection by modeling real-image statistics, enforcing independence among detectors, and aggregating calibrated p-values. It achieves competitive performance with state-of-the-art training-free methods while remaining modular and easily extensible, allowing new statistics to be incorporated without retraining. Together with its efficient runtime and scalable multi-statistic design, RealStats offers a principled and adaptable foundation for detecting fake images in evolving generative settings.

Citation

@article{zisman2026realstats,
  title={RealStats: A Real-Only Statistical Framework for Fake Image Detection},
  author={Haim Zisman and Uri Shaham},
  journal={arXiv preprint},
  eprint={2601.18900},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  year={2026}
}