Every image model you have ever used — Stable Diffusion, FLUX, the lot — cheats. It does not actually paint pixels. It paints inside a compressed shorthand called a latent, then hands the result to a VAE that guesses its way back to a real image. That guess is where the mush comes from: the smeared eyelashes, the melted text, the plasticky skin. NVIDIA just shipped a model that throws the whole shorthand away and generates straight onto pixels — and it was named a CVPR 2026 Best Paper Finalist for the trouble.
The Story
Meet PixelDiT (Pixel Diffusion Transformers), out of NVIDIA and the University of Rochester — Yongsheng Yu, Wei Xiong (project lead), Weili Nie, Yichen Sheng, Shiqiu Liu and Jiebo Luo. For three years the entire diffusion industry has been built on a compromise: training in pixel space is brutally expensive, so everyone moved into the compressed latent space of a pretrained autoencoder. It made models fast and cheap. It also welded a hard ceiling onto how much real detail any of them could ever produce, because the VAE that decompresses the latent back into an image is itself lossy. No prompt fixes that. The information is simply gone before generation even finishes.
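To put a rough number on that compromise, here is a back-of-the-envelope sketch of how much a typical latent encode shrinks an image. The figures are assumptions for illustration, not from the PixelDiT paper: an 8x spatial downsample into 4 latent channels, as in the original Stable Diffusion VAE. Other autoencoders use different settings, so the ratio will vary.

```python
# Back-of-the-envelope: how many numbers survive a typical latent encode.
# Assumed figures (not from the PixelDiT paper): 8x spatial downsample
# into 4 latent channels, as in the original Stable Diffusion VAE.

def latent_compression(height, width, downsample=8, latent_channels=4, image_channels=3):
    """Return (pixel_values, latent_values, ratio) for one image."""
    pixel_values = height * width * image_channels
    latent_values = (height // downsample) * (width // downsample) * latent_channels
    return pixel_values, latent_values, pixel_values / latent_values

pixels, latents, ratio = latent_compression(1024, 1024)
print(pixels, latents, ratio)  # 3145728 65536 48.0
```

Under those assumptions the decoder has to rebuild roughly 48 output values from every latent value it receives, which is why fine detail is the first thing to go.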
PixelDiT’s trick is a dual-level transformer. A patch-level pathway handles the big picture — composition, semantics, “what is in the scene.” A pixel-level pathway then does dense texture refinement, the actual grain and micro-detail, with the two stitched together by a per-pixel adaptive normalization (Pixel-wise AdaLN). The obvious problem — running attention over a million-plus pixels is ruinous — is tamed by a Pixel Token Compaction step that keeps the math affordable. The upshot is a single-stage model you train and sample end-to-end, with nothing standing between the network and the canvas.
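For intuition, the per-pixel modulation idea can be sketched in a few lines of NumPy. This is a generic adaptive layer norm in which each patch token is projected into a separate scale and shift for every pixel in its patch. The shapes, the single linear projection, and the names `pixelwise_adaln`, `w` and `b` are all hypothetical, chosen for illustration. The released model's actual layers may differ.

```python
import numpy as np

def pixelwise_adaln(pixel_feats, patch_tokens, w, b, eps=1e-6):
    """
    Illustrative per-pixel adaptive layer norm (not the official implementation).

    pixel_feats:  (num_patches, p*p, c)  per-pixel features grouped by patch
    patch_tokens: (num_patches, d)       one conditioning token per patch
    w, b:         (d, p*p*2*c) and (p*p*2*c,)  hypothetical projection weights
    """
    n, pp, c = pixel_feats.shape
    # Plain layer norm over the channel axis, with no learned affine.
    mean = pixel_feats.mean(axis=-1, keepdims=True)
    var = pixel_feats.var(axis=-1, keepdims=True)
    normed = (pixel_feats - mean) / np.sqrt(var + eps)
    # Project each patch token to its own (scale, shift) pair for every pixel.
    mod = (patch_tokens @ w + b).reshape(n, pp, 2, c)
    scale, shift = mod[:, :, 0, :], mod[:, :, 1, :]
    return normed * (1.0 + scale) + shift

rng = np.random.default_rng(0)
num_patches, p, c, d = 4, 2, 8, 16          # toy sizes, assumed
x = rng.normal(size=(num_patches, p * p, c))
s = rng.normal(size=(num_patches, d))
w = rng.normal(size=(d, p * p * 2 * c)) * 0.02
b = np.zeros(p * p * 2 * c)
out = pixelwise_adaln(x, s, w, b)
print(out.shape)  # (4, 4, 8)
```

The point of the design is that the conditioning is not one scale and shift shared across the patch: every pixel gets its own, so the coarse pathway can steer texture at full resolution.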
And the numbers hold up. On ImageNet it posts a 1.61 gFID at 256px and 1.80 at 512px — beating prior pixel-space models and matching the latent heavyweights. For text-to-image at 1024px the released 1.3B checkpoint scores 0.74 on GenEval (prompt-following), ahead of FLUX-dev’s 0.67, and 83.5 on DPG-bench. For a model a fraction of the size of the frontier giants, generating directly on pixels, that is not supposed to be possible.
Why You Should Care
If you only generate hero shots from scratch, the VAE ceiling is easy to ignore. The moment you start editing, it bites. Every latent round-trip (inpaint a face, swap a background, tweak a logo) pays the reconstruction tax again, so unchanged regions drift, fine text degrades, and identity slips from one edit to the next. Pixel-space generation sidesteps the whole thing: there is no encode/decode to corrupt the parts you did not touch. The team’s FlowEdit results lean hard on exactly this: surgical edits that leave the rest of the image genuinely untouched.
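The reconstruction tax is easy to see in a toy. The sketch below is not how FlowEdit or PixelDiT work. It stands in for a VAE with a crude average-pool-and-upsample codec (an assumption made purely for illustration) and compares how far the untouched pixels move when an edit goes through that codec versus when it is written directly in pixel space.

```python
import numpy as np

def lossy_round_trip(img, factor=4):
    """Stand-in for a VAE encode/decode: average-pool down, then repeat back up."""
    h, w = img.shape
    small = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(0)
img = rng.random((64, 64))             # toy grayscale image full of fine detail
mask = np.zeros((64, 64), dtype=bool)
mask[16:32, 16:32] = True              # the region we intend to edit
patch = np.ones((64, 64))              # hypothetical new content

# Latent-style edit: the whole image passes through the lossy codec.
latent_edit = lossy_round_trip(np.where(mask, patch, img))
# Pixel-space edit: only masked pixels are written, the rest are copied as-is.
pixel_edit = np.where(mask, patch, img)

# Largest change to any pixel we never meant to touch.
drift_latent = np.abs(latent_edit - img)[~mask].max()
drift_pixel = np.abs(pixel_edit - img)[~mask].max()
print(drift_latent > 0, drift_pixel == 0)  # True True
```

A real VAE is far better than average pooling, but the asymmetry is the same in kind: one path rewrites every pixel, the other leaves the untouched ones bit-identical.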
For this audience that matters in concrete ways: cleaner texture maps and reference plates with no VAE haze baked in; logo and typography work that survives a generation pass; and a research direction that points at the same upgrade for video and 3D, where lossy latents are arguably an even bigger sin. It is also a reminder that the “just scale the latent model” era is not the only road — the architecture underneath is still very much up for grabs.
Try It
The best part: this is not a paper-only flex. PixelDiT landed natively in ComfyUI v0.23.0 on June 1, 2026, so you can run it today.
- Models: grab pixeldit_1300m_1024px_bf16.safetensors (into ComfyUI/models/diffusion_models/) and the gemma_2_2b_it text encoder (into ComfyUI/models/text_encoders/).
- Workflow: load the official ComfyUI PixelDiT example: ResolutionSelector → Text-to-Image subgraph → SaveImage. At ~1.3B params it runs on consumer GPUs.
- Weights & code: Hugging Face model card and the NVlabs/PixelDiT GitHub (training + inference).
- Read the paper: arXiv 2511.20645 and the project page gallery.
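Before launching the workflow, a quick sanity check that the files from the list above landed in the right folders can save a confusing error. This is a minimal sketch, not part of ComfyUI. The helper name `missing_models` is hypothetical, the ComfyUI root path is whatever your install uses, and because the text encoder's file extension is not stated here, the check matches on the name prefix only.

```python
from pathlib import Path

# Folder and file names as given in the list above. The ComfyUI root location
# and the text encoder's file extension are assumptions.
EXPECTED = {
    "diffusion_models": "pixeldit_1300m_1024px_bf16.safetensors",
    "text_encoders": "gemma_2_2b_it",
}

def missing_models(comfy_root):
    """Return the expected entries not found under <comfy_root>/models/."""
    models = Path(comfy_root) / "models"
    missing = []
    for folder, stem in EXPECTED.items():
        directory = models / folder
        # Match on the name prefix so any extension on the encoder is accepted.
        found = directory.is_dir() and any(
            p.name.startswith(stem) for p in directory.iterdir()
        )
        if not found:
            missing.append(f"{folder}/{stem}")
    return missing

# Point this at your own install; "ComfyUI" is a placeholder path.
print(missing_models("ComfyUI"))
```

An empty list means both files were found. Anything else names the folder you still need to fill.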
IK3D Lab Take
Let’s be honest about scale: 1.3B parameters is small, and on raw aesthetics a maxed-out FLUX.2 or Midjourney run will still out-pretty it on plenty of prompts. PixelDiT is not here to dethrone them this week. What it is, and why we think it earned that Best Paper Finalist nod, is the first genuinely convincing argument that the VAE was a crutch, not a law of physics. It removes a ceiling the whole field had quietly agreed to live under, and it does it without blowing up the compute budget. The detail purists, the texture artists, the people who edit instead of re-roll: this is your roadmap. If the next FLUX or the next open video model ships pixel-native, remember you saw the proof-of-concept here first. Go break it in ComfyUI and tell us what you find.



