Fei-Fei Li's GPIC: The Open Dataset Built to Replace ImageNet

For more than a decade, ImageNet served as the defining benchmark of computer vision research. From AlexNet and VGG to ResNet and Vision Transformers, nearly every major breakthrough in visual recognition was measured against the same dataset.

That era may now be coming to a close.

A new project led by Stanford researchers, including renowned AI pioneer Fei-Fei Li, aims to provide the next-generation benchmark for visual generation research. Called the Giant Permissive Image Corpus (GPIC), the dataset contains 100 million image-text pairs, approximately 28 trillion pixels, and a completely redesigned evaluation framework built specifically for the age of generative AI.

More than just another dataset, GPIC represents an attempt to solve some of the most significant challenges facing modern image generation research: benchmark saturation, closed training data, reproducibility issues, and outdated evaluation metrics.

🚀 Why ImageNet Is No Longer Enough

To understand GPIC’s importance, it is necessary to understand how dramatically computer vision has evolved.

When ImageNet was introduced in 2009, the dominant challenge was image classification. Researchers trained models to answer questions such as:

Is this a dog or a cat?
Is this a car or a truck?
Which object category does this image belong to?

ImageNet was perfectly suited for that era because it consisted of millions of images labeled with predefined categories.

Today’s frontier AI systems operate in a fundamentally different paradigm.

Modern models:

Generate images from text prompts
Create photorealistic scenes
Produce artwork and videos
Learn from massive image-text corpora
Understand multimodal relationships

Evaluating these systems using benchmarks designed for classification increasingly resembles using an outdated exam to measure entirely new skills.

As image generation technology has advanced, many commonly used metrics have begun reaching their limits.

📈 Benchmark Saturation Has Become a Serious Problem

One of the clearest signs that current benchmarks are losing effectiveness is the growing saturation of evaluation scores.

Several recent papers have reported Fréchet Inception Distance (FID) scores that are lower, and therefore nominally better, than the scores real images achieve under the same evaluation.

This creates a paradox.

If synthetic images score “better” than real images according to the benchmark, the benchmark is no longer faithfully measuring actual visual quality.

In practical terms, researchers face a troubling question:

Are models genuinely becoming better, or are they simply becoming better at optimizing for the benchmark?

When a benchmark can no longer distinguish meaningful improvements from metric optimization, scientific progress becomes harder to measure.

The field needs a new ruler.

🏛️ A New Dataset from the Team Behind ImageNet

GPIC comes from a group uniquely qualified to address this challenge.

The project includes researchers from Stanford University, with contributions from:

Fei-Fei Li
Jiajun Wu
Keshigeyan Chandrasegaran
Kyle Sargent
Other Stanford AI researchers

Fei-Fei Li’s involvement is particularly notable.

As the creator of ImageNet and one of the central figures behind the modern deep learning revolution, she helped establish the benchmark infrastructure that defined computer vision research for more than a decade.

Today, she remains a leader within Stanford’s AI research ecosystem and is also a co-founder of the spatial intelligence company World Labs.

In many ways, GPIC can be viewed as a successor project from the same research lineage that originally brought ImageNet to the world.

🖼️ What Makes GPIC Different?

GPIC was designed specifically for modern visual generation systems.

The final release contains:

100 million training image-text pairs
200,000 validation samples
1 million test images
Approximately 12.9 TB of data
Roughly 28 trillion pixels
8,000 streaming-ready shards

The scale alone places it among the largest openly available visual generation datasets ever released.
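A quick back-of-the-envelope calculation helps put these totals in perspective. The sketch below only divides the figures quoted above; it assumes that the pixel and byte totals are dominated by the 100 million training pairs and that "TB" means 10^12 bytes, neither of which is confirmed here.

```python
import math

# Figures quoted in the release summary above.
TRAIN_PAIRS = 100_000_000
TOTAL_PIXELS = 28e12        # "roughly 28 trillion pixels"
TOTAL_BYTES = 12.9e12       # "approximately 12.9 TB" (assumed decimal terabytes)
SHARDS = 8_000

# Assumption: totals are dominated by the training split.
pixels_per_image = TOTAL_PIXELS / TRAIN_PAIRS      # average pixels per image
square_side = math.sqrt(pixels_per_image)          # side length if images were square
bytes_per_image = TOTAL_BYTES / TRAIN_PAIRS        # average stored size per pair
bytes_per_pixel = TOTAL_BYTES / TOTAL_PIXELS       # below 1 suggests compressed storage
images_per_shard = TRAIN_PAIRS / SHARDS
gb_per_shard = TOTAL_BYTES / SHARDS / 1e9

print(f"average pixels per image : {pixels_per_image:,.0f}")
print(f"equivalent square side   : {square_side:.0f} px")
print(f"average bytes per image  : {bytes_per_image / 1e3:.0f} KB")
print(f"bytes per pixel          : {bytes_per_pixel:.2f}")
print(f"images per shard         : {images_per_shard:,.0f}")
print(f"approx. shard size       : {gb_per_shard:.2f} GB")
```

Under these assumptions, the average image is roughly 280,000 pixels (about 529 pixels on a side if square), and each shard holds about 12,500 pairs in roughly 1.6 GB.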

More importantly, GPIC was designed around principles of legality, accessibility, reproducibility, and evaluation quality.

⚖️ Built Entirely from Legally Permissible Sources

One of the most controversial aspects of modern generative AI concerns training data.

Many leading commercial systems rely on datasets that remain unpublished, proprietary, or legally disputed.

GPIC takes a different approach.

The research team sourced images exclusively from:

Flickr
Wikimedia Commons

Only images with clearly permissive licenses were included:

CC BY
CC0
Public Domain
No Known Restrictions

This licensing strategy provides significantly stronger legal clarity than many existing datasets while enabling both academic and commercial research.
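An allowlist like the one above is straightforward to express in code. The following is a minimal sketch, not GPIC's actual pipeline: the record fields (`id`, `source`, `license`) and the normalization rule are hypothetical and chosen only for illustration.

```python
# The four license categories named in the article.
ALLOWED_LICENSES = {"CC BY", "CC0", "Public Domain", "No Known Restrictions"}


def normalize_license(raw: str) -> str:
    """Collapse whitespace and case so that 'cc0' matches 'CC0'."""
    return " ".join(raw.split()).casefold()


_ALLOWED_NORMALIZED = {normalize_license(name) for name in ALLOWED_LICENSES}


def is_permissive(record: dict) -> bool:
    """Return True only for an exact match against the allowlist.

    Records with a missing license field are rejected.
    """
    return normalize_license(record.get("license", "")) in _ALLOWED_NORMALIZED


def filter_permissive(records):
    return [r for r in records if is_permissive(r)]


# Hypothetical metadata records (illustrative schema, not GPIC's).
records = [
    {"id": "a", "source": "flickr", "license": "CC BY"},
    {"id": "b", "source": "flickr", "license": "CC BY-NC"},
    {"id": "c", "source": "wikimedia", "license": "cc0"},
    {"id": "d", "source": "flickr", "license": "All Rights Reserved"},
]
kept = filter_permissive(records)
print([r["id"] for r in kept])  # ['a', 'c']
```

Matching exactly against an allowlist, rather than excluding known-bad licenses, means that an unfamiliar or missing license string is dropped by default.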

The initial collection phase yielded approximately 110 million images, with the majority originating from Flickr.

🔍 Rigorous Quality Control and Deduplication

Large internet datasets often contain substantial amounts of noise.

GPIC underwent several filtering stages designed to improve overall quality.

Content Quality Filtering

Researchers used the vision-language model Qwen3-VL-4B to identify and remove:

Extremely low-resolution images
Severely blurred content
Overexposed images
Nearly blank images
Unsafe content

Although the percentage of removed images was relatively small, the enormous scale of the dataset meant hundreds of thousands of problematic samples were eliminated.
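To make the filtering criteria concrete, here is a small sketch that flags three of the listed problems with simple pixel statistics. This is a deliberately simplified stand-in and not the Qwen3-VL-4B approach described above; the thresholds are hypothetical, and blur and safety checks are left out because they need more than basic statistics.

```python
from statistics import fmean, pvariance

# Hypothetical thresholds, chosen only for illustration.
MIN_SIDE = 64            # minimum side length in pixels
BLANK_VARIANCE = 4.0     # below this variance, treat the image as nearly blank
OVEREXPOSED_MEAN = 245   # mean brightness on a 0-255 scale


def quality_issues(gray):
    """Return a list of issue labels for a grayscale image.

    `gray` is a list of rows, each a list of 0-255 intensity values.
    """
    height, width = len(gray), len(gray[0])
    pixels = [value for row in gray for value in row]
    issues = []
    if min(height, width) < MIN_SIDE:
        issues.append("low_resolution")
    if pvariance(pixels) < BLANK_VARIANCE:
        issues.append("nearly_blank")
    if fmean(pixels) > OVEREXPOSED_MEAN:
        issues.append("overexposed")
    return issues


def checkerboard(size, dark, bright):
    """Build a size x size test pattern alternating two intensities."""
    return [[dark if (x + y) % 2 == 0 else bright for x in range(size)]
            for y in range(size)]


print(quality_issues(checkerboard(64, 50, 200)))    # []
print(quality_issues(checkerboard(8, 50, 200)))     # ['low_resolution']
print(quality_issues(checkerboard(64, 128, 128)))   # ['nearly_blank']
print(quality_issues(checkerboard(64, 250, 255)))   # ['overexposed']
```

A vision-language model can judge properties such as blur or unsafe content that fixed thresholds like these cannot capture, which is presumably why the team relied on one.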

Duplicate Removal

Internet-scale datasets frequently contain:

Reposts
Near-identical copies
Burst photography sequences
Slightly modified duplicates

To address this issue, the team employed the SSCD image-copy detection framework.

After similarity analysis, more than one million duplicate images were removed, helping ensure higher dataset diversity.
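Copy-detection models such as SSCD map each image to a descriptor vector, so near-duplicates can be found by comparing descriptors. The sketch below shows one simple way to use such vectors, assuming they have already been computed: a greedy pass that drops any image whose cosine similarity to an already kept image exceeds a threshold. The threshold value, the greedy strategy, and the toy vectors are all assumptions for illustration, not the team's documented procedure.

```python
import numpy as np


def deduplicate(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedy near-duplicate removal over descriptor vectors.

    Keeps an image unless its cosine similarity to some already kept
    image is at or above `threshold` (a hypothetical value).
    Returns the indices of the kept rows.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # L2-normalize each row
    kept: list[int] = []
    for i in range(len(unit)):
        if kept and np.max(unit[kept] @ unit[i]) >= threshold:
            continue  # too close to an image we already kept
        kept.append(i)
    return kept


# Toy 3-D "descriptors": rows 0/1 and rows 2/3 are near-identical pairs.
descriptors = np.array([
    [1.00, 0.00, 0.00],
    [0.99, 0.05, 0.00],
    [0.00, 1.00, 0.00],
    [0.00, 0.98, 0.10],
    [0.00, 0.00, 1.00],
])
print(deduplicate(descriptors))  # [0, 2, 4]
```

This pairwise loop is quadratic in the number of images; at the scale of 100 million images, an approximate nearest-neighbor index would likely be needed instead.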

✍️ AI-Generated Captions for Every Image

Traditional image datasets often suffer from poor metadata quality.

Many captions consist of:

File names
Generic labels
Incomplete descriptions
Missing contextual information

GPIC addresses this problem by generating entirely new descriptions for every image.

Using Qwen3-VL-4B, researchers created captions at four levels of detail:

Tag
Short description
Medium description
Long description

Generating these captions required approximately 1,500 NVIDIA H100 GPU hours.

The result is a substantially richer multimodal dataset that better aligns with the needs of modern text-to-image systems.
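One way to picture a sample with four caption levels is as a small record type. The sketch below is illustrative only: the class, field names, file path, and caption text are hypothetical and do not reflect GPIC's actual schema. It also shows one possible use of the levels, which is picking a caption granularity at random during training.

```python
import random
from dataclasses import dataclass

# The four levels of detail named in the article.
CAPTION_LEVELS = ("tag", "short", "medium", "long")


@dataclass(frozen=True)
class CaptionedImage:
    """Hypothetical record layout for one image and its four captions."""
    image_path: str
    tag: str
    short: str
    medium: str
    long: str

    def caption(self, level: str) -> str:
        if level not in CAPTION_LEVELS:
            raise ValueError(f"unknown caption level: {level!r}")
        return getattr(self, level)


def sample_caption(sample: CaptionedImage, rng: random.Random):
    """Pick one caption level uniformly at random (an assumed strategy)."""
    level = rng.choice(CAPTION_LEVELS)
    return level, sample.caption(level)


# Made-up example content, for illustration only.
sample = CaptionedImage(
    image_path="example/000001.jpg",
    tag="lighthouse",
    short="A lighthouse on a cliff.",
    medium="A white lighthouse stands on a rocky cliff above the sea.",
    long="A white lighthouse with a red roof stands on a rocky cliff "
         "above a calm sea, under a partly cloudy sky at midday.",
)
level, text = sample_caption(sample, random.Random(0))
print(level, "->", text)
```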

📊 Introducing FD-DINOv2: A New Evaluation Metric

The dataset itself is only part of the contribution.

GPIC also introduces a new evaluation framework built around FD-DINOv2.

The Problem with FID

FID has become the dominant metric for image generation research.

However, FID relies on feature representations extracted from Inception-v3, a classification network introduced in 2015.

The issue is that Inception-v3 was never designed to evaluate generated images.

Researchers have increasingly observed situations where:

Lower FID scores do not correspond to better visual quality.
Models learn to optimize specifically for FID.
Human preferences diverge from benchmark rankings.

Why DINOv2 Matters

FD-DINOv2 replaces Inception-based features with representations derived from DINOv2, Meta’s self-supervised vision model.

DINOv2 offers several advantages:

Stronger semantic understanding
Better feature representations
Improved alignment with human perception
Greater robustness for visual similarity evaluation

Because FD-DINOv2 is a distance, lower scores are better. Initial experiments indicate that current generation models remain well short of the best achievable score, suggesting the metric has substantial headroom before it saturates.
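FID and FD-DINOv2 share the same underlying computation: fit a Gaussian to the features of real images and another to the features of generated images, then measure the Fréchet distance between the two. Only the feature extractor changes. The sketch below implements that distance with NumPy on placeholder feature arrays; in a real evaluation the features would come from Inception-v3 or DINOv2, and the function name and toy data here are assumptions for illustration.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    d^2 = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2))

    Each input has shape (num_samples, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((C_a C_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C_a @ C_b, which are real and non-negative for
    # covariance matrices (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    cov_term = float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
    return mean_term + cov_term


# Placeholder "features": random vectors standing in for extractor outputs.
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
shifted = real + 0.5  # same spread, every feature mean moved by 0.5

print(frechet_distance(real, real))     # approximately 0
print(frechet_distance(real, shifted))  # approximately 8 * 0.5**2 = 2.0
```

Because the distance depends entirely on the feature space, swapping the extractor changes what counts as "close", which is the motivation for moving from Inception-v3 to DINOv2 features.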

🧪 A Better Test Set Improves Scientific Rigor

Another important innovation involves how evaluation is performed.

Many existing benchmarks compare generated images against training distributions.

This creates a serious loophole.

A model can achieve impressive scores simply by memorizing training examples rather than learning meaningful generalizations.

GPIC addresses this issue by evaluating against an independent one-million-image test set.

This design encourages:

Generalization
Fair comparisons
More reliable benchmarking
Stronger scientific reproducibility

By separating training and evaluation more rigorously, GPIC reduces the risk of benchmark overfitting.
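The memorization loophole can be demonstrated with a toy example. The sketch below uses one-dimensional numbers in place of image features and a "model" that simply replays its training data; the Fréchet distance between one-dimensional Gaussians stands in for a full feature-space metric. Everything here is a simplified illustration, not GPIC's evaluation code.

```python
import random
from statistics import fmean, pstdev


def frechet_1d(a, b):
    """Fréchet distance between 1-D Gaussians fitted to two samples."""
    return (fmean(a) - fmean(b)) ** 2 + (pstdev(a) - pstdev(b)) ** 2


rng = random.Random(0)
# Train and test sets are drawn independently from the same distribution.
train = [rng.gauss(0.0, 1.0) for _ in range(200)]
test = [rng.gauss(0.0, 1.0) for _ in range(200)]

# A "model" that has memorized its training set and replays it verbatim.
memorized = list(train)

score_vs_train = frechet_1d(memorized, train)  # exactly 0.0: a "perfect" score
score_vs_test = frechet_1d(memorized, test)    # greater than 0 on held-out data

print(f"memorizer vs training set: {score_vs_train:.6f}")
print(f"memorizer vs held-out set: {score_vs_test:.6f}")
```

Against the training set the memorizer earns a perfect score without having learned anything. Against held-out data that artificial advantage disappears, because it is now measured like any other sampler of the distribution.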

🤖 A Baseline Model for Future Research

To help researchers compare results consistently, the Stanford team also trained a reference model on GPIC.

The baseline uses:

JiT (Just image Transformers)
Flow matching training
1.1 billion parameters
Transformer architecture
256×256 image resolution
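For readers unfamiliar with flow matching, the sketch below shows the generic linear-path version of the objective in NumPy: blend noise and data at a random time, then regress the velocity that carries noise to data. The time convention (t = 0 is noise, t = 1 is data), the function names, and the toy predictor are assumptions; the JiT baseline's exact parameterization and loss weighting may differ.

```python
import numpy as np


def flow_matching_batch(data: np.ndarray, rng: np.random.Generator):
    """Build one training batch for linear-path flow matching.

    Returns the interpolated input x_t, the sampled times t, and the
    target velocity (data - noise), which is constant along each path.
    """
    noise = rng.standard_normal(data.shape)
    # One time per sample, shaped to broadcast over the image dimensions.
    t = rng.uniform(size=(data.shape[0],) + (1,) * (data.ndim - 1))
    x_t = (1.0 - t) * noise + t * data
    target_velocity = data - noise
    return x_t, t, target_velocity


def flow_matching_loss(predict_velocity, data, rng) -> float:
    """Mean squared error between predicted and target velocities."""
    x_t, t, target = flow_matching_batch(data, rng)
    return float(np.mean((predict_velocity(x_t, t) - target) ** 2))


rng = np.random.default_rng(0)
images = rng.standard_normal((4, 3, 8, 8))  # toy stand-in for an image batch


def zero_predictor(x_t, t):
    """Placeholder for a neural network: always predicts zero velocity."""
    return np.zeros_like(x_t)


print(flow_matching_loss(zero_predictor, images, rng))
```

At sampling time, a trained velocity predictor is integrated from t = 0 to t = 1, turning a noise sample into an image.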

Training was performed using:

Eight NVIDIA H100 GPUs
Approximately 40 hours of training
Roughly one full pass through the dataset

The resulting model achieved an FD-DINOv2 score of 76.25 under its optimal guidance settings.

The absolute number matters less than what it provides: a common baseline against which future experiments can be compared.

🏗️ Scalable Versions for Different Research Budgets

Recognizing that not every lab has access to large GPU clusters, the team released multiple dataset sizes:

GPIC-Nano

1 million image-text pairs

GPIC-Lite

10 million image-text pairs

GPIC-Full

100 million image-text pairs

This tiered approach enables both academic groups and industrial research teams to participate regardless of available compute resources.
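The baseline figures reported earlier (eight H100 GPUs, about 40 hours, roughly one pass over 100 million pairs) allow a rough estimate of what one pass over each tier might cost. The sketch below assumes the same model and 256×256 resolution and that cost scales linearly with the number of images, which is an idealization; actual throughput will vary with hardware and data loading.

```python
# Baseline training figures quoted in the article.
BASELINE_GPUS = 8
BASELINE_HOURS = 40.0
BASELINE_IMAGES = 100_000_000  # roughly one pass over GPIC-Full

# Implied throughput, assuming all eight GPUs were busy for the full 40 hours.
images_per_gpu_hour = BASELINE_IMAGES / (BASELINE_GPUS * BASELINE_HOURS)

TIERS = {
    "GPIC-Nano": 1_000_000,
    "GPIC-Lite": 10_000_000,
    "GPIC-Full": 100_000_000,
}


def gpu_hours_for_one_pass(num_images: int) -> float:
    """Estimated H100 GPU-hours for one pass, assuming linear scaling."""
    return num_images / images_per_gpu_hour


for name, size in TIERS.items():
    print(f"{name}: ~{gpu_hours_for_one_pass(size):.1f} H100 GPU-hours per pass")
# GPIC-Nano: ~3.2 H100 GPU-hours per pass
# GPIC-Lite: ~32.0 H100 GPU-hours per pass
# GPIC-Full: ~320.0 H100 GPU-hours per pass
```

By this estimate, a single pass over GPIC-Nano fits within a few hours on one H100, which illustrates why the smaller tiers matter for labs with limited compute.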

🌍 Building Open Infrastructure for Generative AI

The release of GPIC arrives during an intense period of competition in generative AI.

Leading image and video generation systems continue to improve at an extraordinary pace. Yet much of this progress occurs behind closed doors, with organizations training on proprietary datasets and evaluating models with internal methodologies.

Historically, scientific advancement has depended on shared infrastructure.

Natural language processing benefited enormously from standardized benchmarks such as:

GLUE
SuperGLUE
BIG-bench

These frameworks created common evaluation standards that accelerated progress across the entire field.

Visual generation has lacked a comparable foundation.

🔮 A Potential Turning Point for Computer Vision

GPIC is more than a large dataset. It is an attempt to redefine how visual generation research is conducted.

By combining:

Open licensing
Massive scale
High-quality image-text pairs
Modern evaluation metrics
Reproducible benchmarks
Public accessibility

the project addresses several of the most pressing challenges facing contemporary generative AI research.

Perhaps most importantly, it represents a continuation of a philosophy that helped transform computer vision once before: scientific progress accelerates when researchers share common tools, common benchmarks, and common standards.

More than fifteen years after helping launch the ImageNet era, Fei-Fei Li and her collaborators are once again attempting to build the infrastructure that may define the next chapter of AI research. Whether GPIC ultimately becomes the new standard remains to be seen, but it clearly marks one of the most ambitious efforts yet to create an open and reproducible foundation for the future of visual generative modeling.