Netflix VOID — The First Open Object-Eraser That Deletes the Shadow, the Reflection, and the Domino It Knocked Over

Erasing a person from a video clip has been a solved-ish problem for years. Erasing the shadow they cast, the reflection they left in the window, and the stack of dominoes their hand just toppled? That has been pure fantasy — until Netflix open-sourced VOID. It is the streaming giant’s first public AI model, it ships under Apache 2.0, and as of 14 May it runs natively inside ComfyUI.

VOID method overview diagram showing the quadmask pipeline and CogVideoX diffusion process
VOID’s pipeline: a four-value “quadmask” tells the diffusion model not just what to delete, but what the deletion will physically affect. Source: VOID project page

The Story

VOID — Video Object and Interaction Deletion — comes out of a collaboration between Netflix’s AI team and INSAIT in Sofia, led by Saman Motamed. The paper dropped on arXiv in early April; the weights went up on Hugging Face the same week; and a month later the ComfyUI team wired in native support alongside Google’s Gemma 4 and BiRefNet.

The clever bit is the input. Most inpainting models take a binary mask: white means “paint over this,” black means “leave it alone.” VOID takes a quadmask — a grayscale map with four values. 0 marks the object to delete, 63 marks overlap regions, 127 marks zones that will be physically affected by the removal (the dominoes that should now stay standing, the water that should never have splashed), and 255 marks the background to protect. That single design choice turns inpainting from a pixel-filling task into a cause-and-effect reasoning task.

VOID removing a hand and reverting the dominoes it knocked over
Remove the hand, and the dominoes it pushed stand back up. VOID reasons about the interaction, not just the silhouette. Source: VOID project page

Under the hood it is a fine-tune of Alibaba’s CogVideoX-Fun-V1.5-5b-InP, a 5-billion-parameter 3D-transformer video model. Netflix trained it on two synthetic, physics-accurate datasets: HUMOTO (motion-capture human-object interactions re-simulated and rendered in Blender) and Google’s Kubric (object-collision scenes with ground-truth “counterfactual” videos showing what the scene looks like with an object never present). Because the training pairs are synthetic, the model gets a perfect before/after for every interaction — something you could never capture by hand.

There is a two-pass architecture, too. Pass 1 does the base removal. Pass 2 is optional and fixes “object morphing” — the warping artifacts that creep in over long clips — by re-initializing from optical-flow-warped latents of Pass 1. In a user study the team ran VOID against ProPainter, DiffuEraser, MiniMax-Remover, ROSE, Gen-Omnimatte and Runway, and it came out on top for temporal consistency.

VOID example: removing a cat from a scene with a Jenga tower
One of VOID’s torture tests: pull the cat out, and the Jenga tower it was about to topple has to behave. Source: VOID project page

Why You Should Care

If you do VFX, cleanup, or any kind of plate prep, this is the model that finally understands that the world reacts to the thing you are deleting. Rig removal, crew reflections in a car door, a stand-in actor casting a shadow on the hero — these are the jobs that eat roto hours, and VOID takes a real swing at all of them. It is also a quietly huge signal: Netflix shipping an open, Apache-2.0 production tool rather than locking it behind an internal pipeline is the kind of move that drags the whole industry forward.

For the ComfyUI crowd, the native integration means you can drop VOID into the same graph as your segmentation, upscaling, and color nodes — no separate environment, no glue scripts. Pair it with SAM3 to auto-generate the masks and you have a near-push-button removal workflow.

VOID example: bowling scene object removal with physics-aware results
Physics-aware removal in motion — the ball is gone, and so is the chain reaction it would have caused. Source: VOID project page

Try It / Follow Them

The catch: VOID wants 40 GB+ of VRAM (think A100). It runs at 384×672, up to 197 frames, BF16 with FP8 quantization. Local-first creators on a single consumer GPU will be leaning on the Hugging Face Space or Comfy Cloud for now — but the architecture is public, and quantized community builds are exactly the kind of thing that follows an open release.

IK3D Lab Take

The quadmask is the idea worth stealing here. By encoding consequence directly into the conditioning signal, Netflix sidestepped the usual approach of bolting a physics simulator onto a video model — they just taught the diffusion model what “affected region” means and let it generalize. That trick will show up far beyond object removal. For now, VOID is the most genuinely useful open video tool we have seen this month, and the 40 GB barrier is the only thing standing between it and your timeline. Watch the community quantizations — that wall comes down fast.

 

Sharing is caring!

Leave a Reply

Your email address will not be published. Required fields are marked *