NeoChainDaily
29.12.2025 • 15:29 Research & Innovation

Synthetic Formula-Generated Images Match Large-Scale Real Datasets in Vision Transformer Pretraining

A team of researchers posted a paper on arXiv in June 2022 showing that formula-driven supervised learning (FDSL) can achieve vision transformer (ViT) performance comparable to pre‑training on massive real‑image collections such as ImageNet‑21k and JFT‑300M, without using any real images, human‑provided labels, or self‑supervision.

Background

FDSL relies on images generated algorithmically from mathematical formulas. Prior work has demonstrated that large‑scale datasets improve ViT accuracy, but they also raise concerns about privacy, copyright, labeling costs, and dataset bias. The new study investigates whether synthetic data can address these issues while maintaining competitive performance.
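To make the idea concrete, the sketch below renders a FractalDB-style synthetic image from a random iterated function system (IFS) via the "chaos game". All parameter choices (number of maps, points, image size) are illustrative assumptions, not the paper's exact generation recipe.

```python
import numpy as np

def render_ifs_fractal(n_points=20000, n_maps=3, size=64, seed=0):
    """Render a random IFS fractal as a binary image.

    Each map is a random affine transform x -> A @ x + b; at every
    step one map is picked at random and applied (the chaos game).
    Entries are kept small so the maps are contractive and the
    orbit stays bounded.
    """
    rng = np.random.default_rng(seed)
    A = rng.uniform(-0.5, 0.5, size=(n_maps, 2, 2))
    b = rng.uniform(-0.5, 0.5, size=(n_maps, 2))
    pts = np.empty((n_points, 2))
    x = np.zeros(2)
    for i in range(n_points):
        k = rng.integers(n_maps)
        x = A[k] @ x + b[k]
        pts[i] = x
    # Normalise the orbit into [0, 1)^2 and rasterise onto a grid.
    pts -= pts.min(axis=0)
    pts /= pts.max(axis=0) + 1e-9
    img, _, _ = np.histogram2d(pts[:, 0], pts[:, 1],
                               bins=size, range=[[0, 1], [0, 1]])
    return (img > 0).astype(np.uint8)

img = render_ifs_fractal()
```

Because every image is a deterministic function of the sampled IFS parameters, the class label comes for free: images drawn from the same parameter family share a label, with no human annotation involved.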

Performance Benchmarks

The authors report that a ViT‑Base model pre‑trained on ImageNet‑21k achieved 83.0% top‑1 accuracy and the same architecture pre‑trained on JFT‑300M reached 84.1% when fine‑tuned on ImageNet‑1k. Under comparable training settings, the FDSL approach attained 83.8% top‑1 accuracy, effectively matching the ImageNet‑21k result and approaching the JFT‑300M benchmark.

Efficiency Gains

One synthetic dataset, ExFractalDB‑21k, required roughly one‑fourteenth of the images used for JFT‑300M (a ×14.2 reduction) to achieve similar downstream performance, highlighting a substantial reduction in computational and storage demands.
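A quick sanity check on that reduction factor, assuming round dataset sizes (JFT‑300M at about 300 million images; ExFractalDB‑21k built as 21,000 classes with on the order of 1,000 instances each — these counts are assumptions for illustration):

```python
# Rough sanity check on the reported x14.2 reduction.
# Assumed round sizes: JFT-300M ~ 300M images,
# ExFractalDB-21k ~ 21,000 classes x 1,000 instances = 21M images.
jft_images = 300_000_000
exfractaldb_images = 21_000 * 1_000
ratio = jft_images / exfractaldb_images
print(f"JFT-300M / ExFractalDB-21k = {ratio:.1f}x")  # ~14.3x
```

The round-number ratio lands close to the paper's reported ×14.2, which suggests the synthetic set is simply an order of magnitude smaller rather than differently counted.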

Benefits of Synthetic Data

Because the images are generated algorithmically, they sidestep privacy and copyright constraints, eliminate labeling errors, and reduce the financial and labor costs associated with curating real‑world image collections. The authors suggest that these attributes give synthetic datasets strong potential for pre‑training general‑purpose models.

Investigated Hypotheses

To understand the source of the performance, the study examined two hypotheses: (i) that object contours are the primary factor driving success in FDSL datasets, and (ii) that increasing the complexity of label creation improves pre‑training outcomes.

Contour Dataset Results

The researchers constructed a dataset composed solely of simple object‑contour combinations. Experimental results indicated that this contour‑only dataset matched the performance of more elaborate fractal databases, supporting the notion that edge information is a key contributor.
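The sketch below is a toy stand-in for such a contour-only image: a random closed polygon whose outline is drawn with no interior texture or colour. The vertex-sampling scheme and line rasterisation are simplifications assumed for illustration, not the authors' construction.

```python
import numpy as np

def render_contour_image(n_vertices=8, size=64, seed=0):
    """Draw the outline of a random closed polygon as a binary image.

    Only the object's edge is rendered, mimicking a contour-only
    training example: no fill, no texture, no photometric detail.
    """
    rng = np.random.default_rng(seed)
    # Star-shaped polygon: sorted angles paired with random radii.
    angles = np.sort(rng.uniform(0, 2 * np.pi, n_vertices))
    radii = rng.uniform(0.2, 0.45, n_vertices)
    xs = 0.5 + radii * np.cos(angles)
    ys = 0.5 + radii * np.sin(angles)
    img = np.zeros((size, size), dtype=np.uint8)
    # Rasterise each edge by densely sampling points along it.
    for i in range(n_vertices):
        x0, y0 = xs[i], ys[i]
        x1, y1 = xs[(i + 1) % n_vertices], ys[(i + 1) % n_vertices]
        t = np.linspace(0.0, 1.0, 4 * size)
        px = np.clip(((x0 + t * (x1 - x0)) * size).astype(int), 0, size - 1)
        py = np.clip(((y0 + t * (y1 - y0)) * size).astype(int), 0, size - 1)
        img[py, px] = 1
    return img

img = render_contour_image()
```

That an edge-only signal like this can rival fractal imagery is the evidence behind the first hypothesis: the transformer appears to learn chiefly from object contours during pre-training.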

Impact of Label Complexity

When the difficulty of the synthetic labeling task was heightened, the models consistently achieved higher fine‑tuning accuracy, confirming that more challenging pre‑training objectives can translate into better downstream performance.

Future Directions

The findings suggest that formula‑driven synthetic imagery could serve as a scalable, low‑cost alternative to traditional large‑scale image collections, potentially reshaping pre‑training strategies for computer‑vision models while mitigating ethical and legal concerns.

This report is based on the abstract of the research paper, distributed via arXiv as an open-access preprint; the full text is available on arXiv.
