How to Install and Play with the Sonic Model: A Friendly Guide
Hey RUN GEN friends,
I’m back with another handy AI gem to share! If you caught my earlier post, “Exciting News: Introducing the Sonic Open-Source Digital Human Model!”, you already know Sonic’s a standout. For the uninitiated, Sonic is a clever open-source digital human model from Tencent and Zhejiang University that transforms a static image plus an audio clip into a smooth talking-head video. Today, I’m breaking down how to get it up and running—perfect for our How-to Articles section. Let’s jump right in!
Getting Sonic Set Up with ComfyUI
This workflow depends on a few custom-node plugins (ComfyUI_Sonic among them). Please make sure the plugins and their dependencies are installed, or install any missing nodes with ComfyUI-Manager after downloading my Sonic workflow.
Downloading the Sonic Models
Sonic needs its model files to do its thing, and honestly, the official download method can be a bit of a hassle. So, I’ve cooked up two handy options—one for Colab users like me, and one for local setups—to grab all the required models in one go and make sure the paths are spot-on. Pick the one that works for you!
For Colab Users
# Step 1: Create the folder structure as required by the author
!mkdir -p /content/ComfyUI/models/sonic/whisper-tiny
!mkdir -p /content/ComfyUI/models/sonic/RIFE
# Step 2: Install Git LFS and clone the Hugging Face repository to a temporary directory
!git lfs install
!git clone https://huggingface.co/LeonJoe13/Sonic /tmp/Sonic
# Step 3: Move all necessary files to their correct locations
!mv /tmp/Sonic/Sonic/unet.pth /content/ComfyUI/models/sonic/unet.pth
!mv /tmp/Sonic/Sonic/audio2bucket.pth /content/ComfyUI/models/sonic/audio2bucket.pth
!mv /tmp/Sonic/Sonic/audio2token.pth /content/ComfyUI/models/sonic/audio2token.pth
!mv /tmp/Sonic/yoloface_v5m.pt /content/ComfyUI/models/sonic/yoloface_v5m.pt
!mv /tmp/Sonic/RIFE/flownet.pkl /content/ComfyUI/models/sonic/RIFE/flownet.pkl
# Step 4: Download Whisper-Tiny model files from Hugging Face
!wget -O /content/ComfyUI/models/sonic/whisper-tiny/config.json https://huggingface.co/openai/whisper-tiny/resolve/main/config.json
!wget -O /content/ComfyUI/models/sonic/whisper-tiny/model.safetensors https://huggingface.co/openai/whisper-tiny/resolve/main/model.safetensors
!wget -O /content/ComfyUI/models/sonic/whisper-tiny/preprocessor_config.json https://huggingface.co/openai/whisper-tiny/resolve/main/preprocessor_config.json
# Step 5: Clean up the temporary directory
!rm -rf /tmp/Sonic
# Step 6: Download the Stable Video Diffusion model using a Hugging Face User Access Token
# Note: The token below is a placeholder; replace it with an actual token from Hugging Face (called "User Access Token")
# ComfyUI loads checkpoints from models/checkpoints, so create that folder and save the file there
!mkdir -p /content/ComfyUI/models/checkpoints
!curl -L -H "Authorization: Bearer hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" "https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt-1-1/resolve/main/svd_xt_1_1.safetensors?download=true" -o /content/ComfyUI/models/checkpoints/svd_xt_1_1.safetensors
For Local Runners
import os
import subprocess
import requests
import shutil
# Define the base directory relative to ComfyUI/models/
BASE_DIR = "ComfyUI/models"
SONIC_DIR = os.path.join(BASE_DIR, "sonic")
WHISPER_DIR = os.path.join(SONIC_DIR, "whisper-tiny")
RIFE_DIR = os.path.join(SONIC_DIR, "RIFE")
TMP_DIR = "tmp/Sonic"
# Step 1: Create the folder structure as required by the author
os.makedirs(WHISPER_DIR, exist_ok=True)
os.makedirs(RIFE_DIR, exist_ok=True)
# Step 2: Install Git LFS and clone the Hugging Face repository to a temporary directory
subprocess.run(["git", "lfs", "install"], check=True)
subprocess.run(["git", "clone", "https://huggingface.co/LeonJoe13/Sonic", TMP_DIR], check=True)
# Step 3: Move all necessary files to their correct locations relative to ComfyUI/models/
shutil.move(os.path.join(TMP_DIR, "Sonic/unet.pth"), os.path.join(SONIC_DIR, "unet.pth"))
shutil.move(os.path.join(TMP_DIR, "Sonic/audio2bucket.pth"), os.path.join(SONIC_DIR, "audio2bucket.pth"))
shutil.move(os.path.join(TMP_DIR, "Sonic/audio2token.pth"), os.path.join(SONIC_DIR, "audio2token.pth"))
shutil.move(os.path.join(TMP_DIR, "yoloface_v5m.pt"), os.path.join(SONIC_DIR, "yoloface_v5m.pt"))
shutil.move(os.path.join(TMP_DIR, "RIFE/flownet.pkl"), os.path.join(RIFE_DIR, "flownet.pkl"))
# Step 4: Download Whisper-Tiny model files from Hugging Face
urls = {
"config.json": "https://huggingface.co/openai/whisper-tiny/resolve/main/config.json",
"model.safetensors": "https://huggingface.co/openai/whisper-tiny/resolve/main/model.safetensors",
"preprocessor_config.json": "https://huggingface.co/openai/whisper-tiny/resolve/main/preprocessor_config.json"
}
for filename, url in urls.items():
    response = requests.get(url)
    response.raise_for_status()  # fail loudly instead of writing an error page to disk
    with open(os.path.join(WHISPER_DIR, filename), "wb") as f:
        f.write(response.content)
# Step 5: Clean up the temporary directory
shutil.rmtree(TMP_DIR)
# Step 6: Download Stable Video Diffusion model using a Hugging Face User Access Token
# Replace 'your_huggingface_token_here' with your actual token from Hugging Face
HF_TOKEN = "your_huggingface_token_here"
url = "https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt-1-1/resolve/main/svd_xt_1_1.safetensors?download=true"
headers = {"Authorization": f"Bearer {HF_TOKEN}"}
# ComfyUI loads checkpoints from models/checkpoints, so save the file there
CHECKPOINTS_DIR = os.path.join(BASE_DIR, "checkpoints")
os.makedirs(CHECKPOINTS_DIR, exist_ok=True)
# Stream the download: the file is several GB, so don't hold it all in memory
with requests.get(url, headers=headers, stream=True) as response:
    response.raise_for_status()
    with open(os.path.join(CHECKPOINTS_DIR, "svd_xt_1_1.safetensors"), "wb") as f:
        for chunk in response.iter_content(chunk_size=1 << 20):
            f.write(chunk)
print("Download and setup completed successfully!")
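Wrong paths are the number-one cause of red nodes in ComfyUI, so before launching anything I like to run a quick sanity check that every file landed where it should. This little helper is my own (the name `find_missing` and the file list simply mirror the downloads above; it isn't part of any plugin):

```python
import os

# Files the workflow expects, relative to ComfyUI/models/
EXPECTED_FILES = [
    "sonic/unet.pth",
    "sonic/audio2bucket.pth",
    "sonic/audio2token.pth",
    "sonic/yoloface_v5m.pt",
    "sonic/RIFE/flownet.pkl",
    "sonic/whisper-tiny/config.json",
    "sonic/whisper-tiny/model.safetensors",
    "sonic/whisper-tiny/preprocessor_config.json",
]

def find_missing(base_dir):
    """Return the expected files that are absent under base_dir."""
    return [p for p in EXPECTED_FILES if not os.path.isfile(os.path.join(base_dir, p))]

missing = find_missing("ComfyUI/models")  # use "/content/ComfyUI/models" on Colab
if missing:
    print("Missing files:", *missing, sep="\n  ")
else:
    print("All Sonic model files are in place!")
```

If anything shows up as missing, rerun the matching download step before opening the workflow.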
My Sonic Workflow for Video Fun
Now for the fun part—making videos! I’ve bundled my workflow into a JSON file for you to use.
How It Works:
- Drag the JSON into ComfyUI—it’ll load up in a snap.
- Add a clear portrait photo and a solid audio file (more on audio quality in a bit).
- Set your paths, adjust to 512 resolution (my sweet spot), and hit run.
- Sit tight for about 900 seconds (roughly 15 minutes for a 10-second clip), and watch your character come to life!
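To set expectations before you hit run, here’s the back-of-the-napkin math from my own timings: roughly 90 seconds of rendering per second of output at 512 resolution on my Colab setup. Treat the factor as a ballpark from one machine, not a guarantee:

```python
def estimated_render_seconds(video_seconds, seconds_per_output_second=90):
    """Rough render-time estimate at 512 short-side resolution.

    The 90x factor comes from my own timing (a 10-second video took
    about 900 seconds on Colab); your hardware will differ.
    """
    return video_seconds * seconds_per_output_second

print(estimated_render_seconds(10))  # my 10-second test clip: 900 s
print(estimated_render_seconds(30))  # a 30-second clip: 2700 s, i.e. 45 minutes
```

Handy for deciding whether that two-minute clip is really worth queuing before bed.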
It’s straightforward, even if you’re new to this. Give it a whirl and share your results in Generated Videos!
What I Think of Sonic
After some quality time with Sonic, here’s my honest take—straight from one enthusiast to another.
It Delivers Impressive Results
The latest ComfyUI_Sonic version handles any aspect ratio, which is a massive step up. No more awkward square crops—your whole image gets to shine, and the videos look sharp and natural. It’s a real treat for creators.
Resolution Sweet Spot at 512
On my Colab paid plan, 512 (short side) is the way to go. A 512x768 video, 10 seconds long, uses about 15G of VRAM and takes 900 seconds to render. Push it higher—like 768x1152—and my 22G VRAM gives up. Stick to 512, and you’ll avoid the dreaded crash.
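If your source image isn’t already near that sweet spot, a small resize helper keeps the short side at 512 while preserving aspect ratio. Rounding both dimensions to a multiple of 64 is my own precaution (diffusion models generally prefer dimensions divisible by 64); `fit_short_side` is just an illustrative name, not a Sonic function:

```python
def fit_short_side(width, height, short_side=512, multiple=64):
    """Scale (width, height) so the short side lands near short_side,
    rounding both dimensions to the nearest multiple of `multiple`."""
    scale = short_side / min(width, height)
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h

print(fit_short_side(1024, 1536))  # portrait 2:3 -> (512, 768)
print(fit_short_side(1920, 1080))  # landscape -> (896, 512)
```

Feed the resulting dimensions into the workflow’s resolution settings and you should stay clear of the VRAM cliff.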
Multilingual, but Audio’s the Key
Sonic’s a champ with languages—it syncs lips to English, Chinese, and more with ease. The trick? Use clear audio. Muddy voices or background noise can leave your character’s mouth stuck shut. Crisp sound makes all the difference.
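Since muddy audio is the usual culprit when a character’s lips won’t move, I run my clips through ffmpeg first. A sketch of how I do it, assuming ffmpeg is installed; the filter cutoffs (80 Hz / 8 kHz) are my own starting point, not anything Sonic requires, and `build_cleanup_cmd` is just my helper name:

```python
import subprocess

def build_cleanup_cmd(src, dst):
    """Build an ffmpeg command that converts audio to clean mono 16 kHz WAV,
    trimming rumble below 80 Hz and hiss above 8 kHz."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ac", "1",       # mono
        "-ar", "16000",   # 16 kHz sample rate
        "-af", "highpass=f=80,lowpass=f=8000",  # cut rumble and hiss
        dst,
    ]

# Uncomment to actually run the conversion:
# subprocess.run(build_cleanup_cmd("raw_voice.mp3", "clean_voice.wav"), check=True)
print(" ".join(build_cleanup_cmd("raw_voice.mp3", "clean_voice.wav")))
```

A clean, dry voice track in, a properly moving mouth out—it really is that sensitive.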
Wrap-Up: Sonic’s a Keeper
Sonic’s a fantastic find—open-source, powerful, and just plain cool for us AI lovers. It’s not tough to set up, and the results are worth every second. Whether you’re crafting videos or experimenting, it’s a tool you’ll want to keep handy.
Got questions? Drop them below or pop into Help Needed. Made something awesome? Show it off in Generated Videos; we’re all here to share the fun. RUN GEN’s the best spot for geeking out together, so let’s keep it rolling!
Happy creating!