- plus, a Nerdy Transformer Shuffle node
- New (best) SAE-informed Long-CLIP model with 90% ImageNet/ObjectNet accuracy.
- Code is here, model is at my HF 🤗: https://huggingface.co/zer0int/LongCLIP-SAE-ViT-L-14
- To clarify: put only the node folder into `ComfyUI/custom_nodes`; if you cloned the entire repo, you'll need to move it. Only that folder should be in `ComfyUI/custom_nodes`; you should have an `__init__.py` in your `ComfyUI/custom_nodes/ComfyUI-HunyuanVideo-Nyan` folder. If you see a `README.md` there, that's wrong.
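Quick way to check (a helper sketch, assuming ComfyUI's default directory layout; not part of the node):

```python
# Quick sanity check for the install location (assumes ComfyUI's default
# directory layout; this script is just a helper, not part of the node).
from pathlib import Path

node_dir = Path("ComfyUI/custom_nodes/ComfyUI-HunyuanVideo-Nyan")
if (node_dir / "__init__.py").is_file():
    print("Layout OK.")
else:
    print("Wrong layout: move the node folder itself into custom_nodes.")
```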
- The CLIP model doesn't seem to matter much? True for default Hunyuan Video, False with this node! ✨
- Simply put the `ComfyUI...` folder from this repo into `ComfyUI/custom_nodes`.
- See the example workflow; it's really easy to use, though. The node replaces the loader node.
- Recommended CLIP: huggingface.co/zer0int/CLIP-SAE-ViT-L-14
- Takes 248 tokens; new @ 19/DEC/24 🤗: https://huggingface.co/zer0int/LongCLIP-SAE-ViT-L-14
- Requires kijai/ComfyUI-HunyuanVideoWrapper
- ⚠️ If something breaks because the wrapper is WIP: temporarily fall back to my fork for compatibility.
- Uses the HunyuanVideoWrapper loader node implementation. All credits to the original author!
- My code = only the 2 different 'Nyan nodes' in `hynyan.py`.
- The loader is necessary because the mod changes model buffers; changes are cumulative if the model is not re-loaded.
- You can choose to re-load from file, or from a RAM deepcopy (faster, but may require >64 GB RAM).
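Here's a minimal sketch of why that is, with hypothetical helper names (not the node's actual code):

```python
import copy

# Hypothetical helper names, for illustration only: the node re-loads a
# pristine model because the mod edits buffers in place, so the edits
# would otherwise stack across runs.
def get_fresh_model(pristine_model, load_from_file, use_ram_copy: bool):
    if use_ram_copy:
        # Faster: deepcopy an unmodified model kept in RAM
        # (may require >64 GB of system RAM at HunyuanVideo scale).
        return copy.deepcopy(pristine_model)
    # Slower but memory-cheap: re-read unmodified weights from disk.
    return load_from_file()
```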
- Q: What does it do, this `Factor` for scaling CLIP & LLM? 🤔
- A: Here are some examples, including a 'do NOT set BOTH the CLIP and LLM factors >1' example; a rough sketch of the idea follows the first demo below.
- Prompt:
high quality nature video of a red panda balancing on a bamboo stick while a bird lands on the panda's head, there's a waterfall in the background
- SAE: The bird at least flies (though it takes off rather than landing), and the panda's feet are better (vs. OpenAI).
demo-default.mp4
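Rough mental model of the `Factor` (names and mechanism are illustrative, not the node's real API):

```python
import torch

# Illustrative only; not the node's actual API. The idea: scale each
# text encoder's conditioning before it reaches the video transformer.
# Setting BOTH factors > 1 compounds the boost, hence the warning above.
def apply_factors(clip_emb: torch.Tensor, llm_emb: torch.Tensor,
                  clip_factor: float = 1.0, llm_factor: float = 1.0):
    return clip_emb * clip_factor, llm_emb * llm_factor
```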
- These are all my CLIP models from huggingface.co/zer0int; SAE is best.
- Note the differences in the legs, blurriness, and coherence of small details.
demo-spider-cafe.mp4
Long-CLIP update @ 19/DEC/24: The original CLIP model has a 77-token max input, but only ~20 tokens of effective length; see the original Long-CLIP paper for details. HunyuanVideo demo:
- 69 tokens, normal scene:
- Lens: 16mm. Aperture: f/2.8. Color Grading: Blue-green monochrome. Lighting: Low-key with backlit silhouettes. Background: Gothic cathedral at night, stained glass windows breaking. Camera angle: Over the shoulder of a ninja, tracking her mid-air leap as she lands on a rooftop.
- 52 tokens, OOD (Out-of-Distribution) scene: superior consistency and prompt-following despite the OOD concept.
- In this surreal nightmare documentary, a sizable spider with a human face is peacefully savoring her breakfast at a diner. The spider has a spider body, but a lady's face on the front, and regular human hands at the end of the spider legs.
demo-long-sae-short.mp4
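If you want to count tokens yourself, here's a small sketch using the standard Hugging Face CLIP tokenizer (the Long-CLIP checkpoint may ship its own tokenizer config; 248 is just the limit quoted above):

```python
from transformers import CLIPTokenizer

# Count tokens with the standard CLIP tokenizer; 248 is the Long-CLIP
# limit quoted above, 77 is vanilla CLIP's hard cap.
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prompt = ("In this surreal nightmare documentary, a sizable spider with a "
          "human face is peacefully savoring her breakfast at a diner.")

print(len(tok(prompt)["input_ids"]))  # well past the ~20-token effective length

clip_ids = tok(prompt, truncation=True, max_length=77)["input_ids"]   # vanilla CLIP
long_ids = tok(prompt, truncation=True, max_length=248)["input_ids"]  # Long-CLIP
```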
- Q: And what does this confusing, gigantic node for nerds do? 🤔
- A: You can glitch the transformer (video model) by shuffling or skipping MLP and Attention layers:
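For the curious, a minimal sketch of the idea (illustrative only, not the node's actual implementation; it assumes the transformer keeps its layers in an `nn.ModuleList`, generically named `blocks` here):

```python
import random
import torch.nn as nn

# Illustrative sketch only; not the node's actual implementation. Assumes
# the video transformer stores its layers in an nn.ModuleList, generically
# called 'blocks' here.
def glitch_transformer(model: nn.Module, shuffle: bool = True,
                       skip_prob: float = 0.0, seed: int = 0) -> nn.Module:
    rng = random.Random(seed)
    blocks = list(model.blocks)
    if shuffle:
        rng.shuffle(blocks)  # permute the order of MLP/Attention blocks
    # Randomly skip (drop) some blocks entirely:
    blocks = [b for b in blocks if rng.random() >= skip_prob]
    model.blocks = nn.ModuleList(blocks)
    return model
```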