PixelDiT: NVIDIA Just Killed the VAE With a Best Paper Finalist That Generates Straight Onto Pixels

Almost every image model you have used, Stable Diffusion and FLUX included, cheats. It does not actually paint pixels. It paints inside a compressed shorthand called a latent, then hands the result to a VAE decoder that reconstructs a full image from it. That reconstruction is lossy, and it is where much of the mush comes from: the smeared eyelashes, the melted text, the plasticky skin. NVIDIA just shipped a model that throws the whole shorthand away and generates straight onto pixels, and it was named a CVPR 2026 Best Paper Finalist for the trouble.

Extreme macro of a human eye generated by PixelDiT, showing per-pixel iris and eyelash detail — A 1024px macro generated directly in pixel space — no latent, no VAE, no reconstruction floor. Source: PixelDiT project page.

The Story

Meet PixelDiT (Pixel Diffusion Transformers), out of NVIDIA and the University of Rochester, by Yongsheng Yu, Wei Xiong (project lead), Weili Nie, Yichen Sheng, Shiqiu Liu and Jiebo Luo. For years most of the diffusion field has been built on a compromise: training in pixel space is brutally expensive, so nearly everyone moved into the compressed latent space of a pretrained autoencoder. It made models fast and cheap. It also welded a hard ceiling onto how much real detail any of them could ever produce, because the VAE that decompresses the latent back into an image is itself lossy. No prompt fixes that. Detail the autoencoder cannot represent never reaches the final image, however good the diffusion model is.

Diagram contrasting latent diffusion's VAE-based pipeline with PixelDiT's single-stage pixel-space pipeline — Latent diffusion (top) routes everything through a lossy autoencoder. PixelDiT (bottom) is a single end-to-end model on raw pixels. Source: PixelDiT, arXiv 2511.20645.

PixelDiT’s trick is a dual-level transformer. A patch-level pathway handles the big picture: composition, semantics, “what is in the scene.” A pixel-level pathway then does dense texture refinement, the actual grain and micro-detail, with the two stitched together by a per-pixel adaptive normalization (Pixel-wise AdaLN). The obvious problem is cost: a 1024px image has more than a million pixels, and self-attention scales with the square of the token count. A Pixel Token Compaction step tames this by shrinking the number of tokens that attention has to process. The upshot is a single-stage model you train and sample end-to-end, with nothing standing between the network and the canvas.
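To put the attention cost in numbers, here is a rough back-of-the-envelope sketch. It only counts tokens and token pairs; it says nothing about how PixelDiT actually implements compaction. The patch size of 16 is an assumption chosen for illustration and is not stated in this article.

```python
def token_counts(side: int, patch: int) -> dict:
    """Count tokens and attention pairs for a square image.

    side:  image width/height in pixels
    patch: patch width/height in pixels (assumed value, for illustration)
    """
    pixel_tokens = side * side                # one token per pixel
    patch_tokens = (side // patch) ** 2       # one token per patch
    return {
        "pixel_tokens": pixel_tokens,
        "patch_tokens": patch_tokens,
        # Full self-attention compares every token with every token.
        "pixel_attention_pairs": pixel_tokens ** 2,
        "patch_attention_pairs": patch_tokens ** 2,
    }


counts = token_counts(side=1024, patch=16)
for name, value in counts.items():
    print(f"{name}: {value:,}")

# How much cheaper attention gets when it runs on patches instead of pixels.
ratio = counts["pixel_attention_pairs"] // counts["patch_attention_pairs"]
print(f"pair reduction: {ratio:,}x")
```

Under these assumptions a 1024px image is 1,048,576 pixel tokens but only 4,096 patch tokens, and the number of attention pairs drops by a factor of 65,536. That gap is why naive per-pixel attention is off the table and some form of token reduction is needed.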

PixelDiT dual-level architecture: a patch-level pathway for global semantics and a pixel-level pathway for texture detail — The dual-level design: patch-level semantics + pixel-level texture, joined by Pixel-wise AdaLN. Source: PixelDiT project page.
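For intuition, here is a minimal NumPy sketch of what a per-pixel adaptive normalization could look like. This is not the authors' code. The shapes, the single linear projection, and the `(1 + scale)` modulation form are assumptions borrowed from the standard AdaLN used in DiT-style models; all names here are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pixelwise_adaln(pixel_feats, patch_tokens, weight, bias):
    """Modulate each pixel's features with its own scale and shift.

    pixel_feats:  (num_patches, pixels_per_patch, d_pix)
    patch_tokens: (num_patches, d_sem)   semantic token per patch
    weight:       (d_sem, pixels_per_patch * 2 * d_pix)   assumed projection
    bias:         (pixels_per_patch * 2 * d_pix,)
    """
    n, p2, d = pixel_feats.shape
    # One patch token is expanded into a separate (scale, shift) pair per pixel.
    mod = (patch_tokens @ weight + bias).reshape(n, p2, 2, d)
    scale, shift = mod[:, :, 0, :], mod[:, :, 1, :]
    return layer_norm(pixel_feats) * (1.0 + scale) + shift


rng = np.random.default_rng(0)
num_patches, patch, d_pix, d_sem = 4, 4, 8, 32
pixels_per_patch = patch * patch

pixel_feats = rng.normal(size=(num_patches, pixels_per_patch, d_pix))
patch_tokens = rng.normal(size=(num_patches, d_sem))
weight = rng.normal(size=(d_sem, pixels_per_patch * 2 * d_pix)) * 0.02
bias = np.zeros(pixels_per_patch * 2 * d_pix)

out = pixelwise_adaln(pixel_feats, patch_tokens, weight, bias)
print(out.shape)  # (4, 16, 8)
```

The point of the sketch is the shape of the conditioning: the patch-level token does not apply one shared scale and shift to its whole patch, it produces a distinct pair for every pixel inside it.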

And the numbers hold up. On ImageNet it posts a 1.61 gFID at 256px and 1.80 at 512px — beating prior pixel-space models and matching the latent heavyweights. For text-to-image at 1024px the released 1.3B checkpoint scores 0.74 on GenEval (prompt-following), ahead of FLUX-dev’s 0.67, and 83.5 on DPG-bench. For a model a fraction of the size of the frontier giants, generating directly on pixels, that is not supposed to be possible.

Why You Should Care

If you only generate hero shots from scratch, the VAE ceiling is easy to ignore. The moment you start editing, it bites. Every latent round-trip — inpaint a face, swap a background, tweak a logo — pays the reconstruction tax again, so unchanged regions drift, fine text degrades, and identity slips frame to frame. Pixel-space sidesteps the whole thing: there is no encode/decode to corrupt the parts you did not touch. The team’s FlowEdit results lean hard on exactly this — surgical edits that leave the rest of the image genuinely untouched.
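The reconstruction tax is easy to see in a toy setting. The sketch below is an illustration under loud assumptions: the "autoencoder" is just 8x average pooling followed by nearest-neighbour upsampling, standing in for a real VAE, and the "edit" is a flat fill inside a mask. Neither is PixelDiT or FlowEdit.

```python
import numpy as np


def lossy_round_trip(img, factor=8):
    """Toy stand-in for VAE encode + decode: average-pool, then upsample."""
    h, w = img.shape
    latent = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(latent, factor, axis=0), factor, axis=1)


rng = np.random.default_rng(0)
image = rng.random((64, 64))          # fine-grained "detail"

mask = np.zeros((64, 64), dtype=bool)
mask[16:32, 16:32] = True             # region we intend to edit
edit = np.ones((64, 64))              # new content for the masked region

# Latent-style edit: the whole image passes through the lossy codec.
latent_result = np.where(mask, edit, lossy_round_trip(image))
# Pixel-style edit: pixels outside the mask are never re-encoded.
pixel_result = np.where(mask, edit, image)

# Mean absolute change in the region we did NOT ask to edit.
latent_drift = np.abs(latent_result - image)[~mask].mean()
pixel_drift = np.abs(pixel_result - image)[~mask].mean()

print(f"drift outside mask, latent-style: {latent_drift:.4f}")
print(f"drift outside mask, pixel-style:  {pixel_drift:.4f}")
```

In the pixel-style path the drift outside the mask is exactly zero, because those pixels are copied through untouched. In the latent-style path it is nonzero after a single pass, which is the effect the paragraph above describes.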

PixelDiT FlowEdit examples showing targeted edits that preserve untouched regions of the image — Pixel-space editing preserves what you did not change — no latent round-trip to smear it. Source: PixelDiT project page.

For this audience that matters in concrete ways: cleaner texture maps and reference plates with no VAE haze baked in; logo and typography work that survives a generation pass; and a research direction that points at the same upgrade for video and 3D, where lossy latents are arguably an even bigger sin. It is also a reminder that the “just scale the latent model” era is not the only road — the architecture underneath is still very much up for grabs.

Ukiyo-e style illustration of a whale generated by PixelDiT at 1024px — Style and linework hold together at 1024px straight off the model. Source: PixelDiT project page.

Try It

The best part: this is not a paper-only flex. PixelDiT landed natively in ComfyUI v0.23.0 on June 1, 2026, so you can run it today.

Models: grab pixeldit_1300m_1024px_bf16.safetensors (into ComfyUI/models/diffusion_models/) and the gemma_2_2b_it text encoder (into ComfyUI/models/text_encoders/).
Workflow: load the official ComfyUI PixelDiT example — ResolutionSelector → Text-to-Image subgraph → SaveImage. At ~1.3B params it runs on consumer GPUs.
Weights & code: Hugging Face model card and the NVlabs/PixelDiT GitHub (training + inference).
Read the paper: arXiv 2511.20645 and the project page gallery.
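Before loading the workflow, a quick script can confirm the files landed where the steps above say they should. This is a small convenience sketch, not part of ComfyUI. The diffusion model filename and both folder paths come from the list above; matching the text encoder by the `gemma_2_2b_it` prefix is an assumption, since the exact encoder filename is not given here.

```python
from pathlib import Path

DIFFUSION_FILE = "pixeldit_1300m_1024px_bf16.safetensors"
TEXT_ENCODER_PREFIX = "gemma_2_2b_it"   # assumed filename prefix


def check_models(comfy_root):
    """Report whether the PixelDiT files sit in the expected ComfyUI folders."""
    root = Path(comfy_root)
    diffusion_path = root / "models" / "diffusion_models" / DIFFUSION_FILE
    encoder_dir = root / "models" / "text_encoders"

    encoders = []
    if encoder_dir.is_dir():
        # Collect any file whose name starts with the assumed prefix.
        encoders = sorted(p.name for p in encoder_dir.glob(TEXT_ENCODER_PREFIX + "*"))

    return {
        "diffusion_model": diffusion_path.is_file(),
        "text_encoders": encoders,
    }


if __name__ == "__main__":
    status = check_models("ComfyUI")   # adjust to your install location
    print("diffusion model found:", status["diffusion_model"])
    print("text encoder files:", status["text_encoders"] or "none found")
```

If either check comes back empty, recheck the download location before debugging the workflow itself.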

IK3D Lab Take

Let’s be honest about scale: 1.3B parameters is small, and on raw aesthetics a maxed-out FLUX.2 or Midjourney run will still out-pretty it on plenty of prompts. PixelDiT is not here to dethrone them this week. What it is, and why we think it earned that Best Paper Finalist nod, is the first genuinely convincing argument that the VAE was a crutch, not a law of physics. It removes a ceiling the whole field had quietly agreed to live under, and it does it without blowing up the compute budget. The detail purists, the texture artists, the people who edit instead of re-roll: this is your roadmap. If the next FLUX or the next open video model ships pixel-native, remember you saw the proof-of-concept here first. Go break it in ComfyUI and tell us what you find.
