For two years, the dirty secret of AI character animation was the middleman. To make a still image dance, you first had to rip a skeleton out of a driving video — OpenPose stick figures, DensePose maps, depth passes — and pray the rig survived the translation. SCAIL-2, just open-sourced under Apache 2.0 by Z.ai (Zhipu AI), throws the whole pipeline in the bin. You hand it one reference image and one driving video, and it transfers the motion directly. No skeleton. No keypoints. No masks.
The Story
SCAIL-2 (the “2” follows an earlier SCAIL-Preview) is a 14-billion-parameter model built on top of the Wan video ecosystem — it reuses Wan’s VAE and T5 text encoder, so it slots neatly into the open tooling everyone already runs. The headline idea is what the paper calls a Unified Motion Transfer Interface: instead of treating “animate this character,” “replace this character,” and “drive with an animal” as separate problems with separate preprocessing, SCAIL-2 folds them all into one model using dedicated masking channels and a mode-specific RoPE design. One network, many jobs.
The other clever move is in the data. Clean motion-transfer training pairs barely exist in the wild, so the team synthesized them: roughly 60,000 pairs (they call the set MotionPair-60K) generated using off-the-shelf models including SCAIL-Preview, Wan-Animate and MoCha, then refined with a “Bias-Aware DPO” step to sharpen detail. The payoff is a model that does things its skeleton-bound predecessors structurally could not — because it never extracts keypoints, the source and target don’t have to share a body plan.
That unlocks the genuinely fun stuff: cross-identity replacement (drop a polar bear into the exact choreography a human performed), Any2Any driving where a cartoon cat copies a real cat’s pounce, multi-character scenes that don’t collapse when two bodies overlap, and zero-shot generalization to inputs the model never trained on — animals, egocentric “GoPro on your forehead” footage, even SAM3D mesh renders as a control signal.
Why You Should Care
If you’ve ever tried to puppet a stylized character with a pose-based pipeline, you know the failure mode: the rig is built for a human silhouette, so the moment your character has a tail, four legs, or proportions that aren’t 7-heads-tall, everything smears. By killing the skeleton step, SCAIL-2 makes non-human and non-realistic characters first-class citizens. For 3D artists and animators that’s the unlock — your Blender creature, your mascot, your concept-art hero can inherit a real performance from any video clip without a single bone.
It’s also refreshingly honest about its limits. It runs at 512p and 704p, with width and height each divisible by 32, so this is a previs and pre-production weapon, not a final-frame 4K renderer, at least not yet. But it’s Apache 2.0, the weights are on Hugging Face, and it’s already wired into ComfyUI. That combination is how a research drop becomes a Tuesday-afternoon workflow.
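If you are prepping your own clips, the divisible-by-32 rule is worth handling before anything reaches the model. Here is a minimal sketch of one way to pick a legal size. It assumes "704p" and "512p" refer to the shorter side of the frame, and that rounding to the nearest multiple (rather than cropping or padding) is acceptable. The function names are ours, not part of the SCAIL-2 code.

```python
def snap(value: int, multiple: int = 32) -> int:
    """Round to the nearest multiple, never going below one multiple."""
    return max(multiple, (value + multiple // 2) // multiple * multiple)


def fit_resolution(src_w: int, src_h: int, short_side: int = 704,
                   multiple: int = 32) -> tuple[int, int]:
    """Scale so the shorter side is about `short_side`, then snap both
    dimensions to a multiple of `multiple`.

    Assumption: "704p" means the shorter side is 704 pixels.
    """
    scale = short_side / min(src_w, src_h)
    return (snap(round(src_w * scale), multiple),
            snap(round(src_h * scale), multiple))


if __name__ == "__main__":
    # A 1080p landscape driving clip, targeted at the 704p tier.
    print(fit_resolution(1920, 1080, 704))  # (1248, 704)
    # A vertical phone clip, targeted at the 512p tier.
    print(fit_resolution(1080, 1920, 512))  # (512, 896)
```

Snapping shifts the aspect ratio slightly (1248x704 is not exactly 16:9), so resize the reference image and the driving video with the same target size to keep them aligned.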
Try It / Follow Them
- Code & weights: github.com/zai-org/SCAIL-2 and huggingface.co/zai-org/SCAIL-2
- Paper: SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning (arXiv 2606.10804)
- Demos: the project page is wall-to-wall video comparisons — watch the multi-character clips
- ComfyUI: the Comfy-Org diffusion-model build is on Hugging Face, with community workflows from Kijai (test graph) and RunningHub (infinite-length). Pair it with a SAM 3.1 checkpoint for masked replacement.
IK3D Lab Take
We’ve spent months at the Lab watching Gaussian splatting eat the 3D world. SCAIL-2 is a useful reminder that the character side of the pipeline is moving just as fast — and in a more artist-friendly direction. The skeleton was always a compromise, a lossy intermediate we tolerated because the models couldn’t reason about motion directly. Watching that crutch get removed, open-sourced under Apache 2.0, and dropped straight into ComfyUI in the same week is exactly the kind of moment the Lab exists to flag. Grab a driving clip, point it at your weirdest non-human character, and see how far 704p takes you. We suspect “the skeleton step” is about to sound as quaint as “the render farm.”



