Composing Concepts from Images and Videos via Concept-prompt Binding

Xianghao Kong¹, Zeyu Zhang¹, Yuwei Guo², Zhuoran Zhao¹,³, Songchun Zhang¹, Anyi Rao¹

¹ HKUST, ² CUHK, ³ HKUST(GZ)
CVPR 2026 (Highlight)

Input Image 1

Image Prompt: "A vibrant Minecraft landscape featuring a flowing river, lush trees, a cascading waterfall, sheep grazing, and blocky clouds under a bright blue sky."

Input Image 2

Image Prompt: "A dynamic volcano erupts, spewing vibrant red lava and creating a dramatic ash cloud against a serene blue sky, with molten lava pooling on the black rocks below."

Input Video

Video Prompt: "A beautiful butterfly rests on a vibrant yellow flower, flapping its wings softly against a backdrop of lush green leaves."

BiCo (Ours)

Composed Prompt: "A beautiful butterfly rests on a vibrant yellow flower, flapping its wings surrounded by a vibrant Minecraft landscape featuring a dynamic volcano which erupts, spewing vibrant red lava and creating a dramatic ash cloud against a serene blue sky, with molten lava pooling on the black rocks below."

Input Image

Image Prompt: "A beagle dog wearing a collar stands on a pathway surrounded by grass and grassland."

Input Video

Video Prompt: "A bartender in a black shirt skillfully mixes a drink in a shaker at a bar, surrounded by a cityscape visible through a large window."

BiCo (Ours)

Composed Prompt: "A beagle dog wearing a collar mixes a drink vigorously using a shaker with its dog's paws at a bar, surrounded by a cityscape visible through a large window."

We introduce Bind & Compose (BiCo), a one-shot method that enables flexible visual concept composition by binding visual concepts with the corresponding prompt tokens and composing the target prompt with bound tokens from various sources.

Abstract

Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
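The composition step described above, building a target prompt out of tokens bound to different sources, can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration: `toy_embed` stands in for a real text encoder, `make_offset_binder` stands in for a trained per-source binder (in the paper the binders are hierarchical modules attached to cross-attention conditioning in a Diffusion Transformer, which this sketch does not model), and the per-token source tags are written by hand. The absorbent token and the two-stage video training are also left out.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
Binder = Callable[[Vector], Vector]


def toy_embed(word: str, dim: int = 4) -> Vector:
    """Deterministic stand-in for a text encoder (hypothetical)."""
    code = sum(ord(c) for c in word)
    return [((code * (i + 1)) % 97) / 97.0 for i in range(dim)]


def make_offset_binder(offset: float) -> Binder:
    """Stand-in for a trained binder: simply shifts the embedding.

    A real binder would be learned from one image or video; the constant
    offset only makes the effect of each source visible in this toy.
    """
    return lambda vec: [x + offset for x in vec]


def compose(
    tagged_prompt: List[Tuple[str, Optional[str]]],
    binders: Dict[str, Binder],
) -> List[Vector]:
    """Embed each token and pass it through the binder of its source.

    Tokens tagged with None are not tied to any source and keep their
    plain text embedding.
    """
    composed = []
    for word, source in tagged_prompt:
        vec = toy_embed(word)
        if source is not None:
            vec = binders[source](vec)
        composed.append(vec)
    return composed


# One binder per visual input (hypothetical source names).
binders = {
    "image": make_offset_binder(1.0),   # e.g. the beagle photo
    "video": make_offset_binder(-1.0),  # e.g. the bartender clip
}

# Hand-tagged fragment of a composed prompt: the subject comes from the
# image, the action from the video, and the article from neither.
tagged_prompt = [
    ("beagle", "image"),
    ("dog", "image"),
    ("mixes", "video"),
    ("a", None),
    ("drink", "video"),
]

composed = compose(tagged_prompt, binders)
```

The point of the sketch is only the routing: each token in the composed prompt carries the conditioning learned from its own source, so concepts from images and videos can be mixed at the token level.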

Qualitative Results

Input Image

Image Prompt: "A happy Akita dog with its tongue out, stands in a grassy area with flowers and leaves, bathed in sunlight."

Input Video

Video Prompt: "A man in a red plaid shirt and black headphones raises his arms excitedly while holding a gaming controller, deeply engaged in a game in a cozy living room setting."

BiCo (Ours)

Composed Prompt: "A happy Akita dog in a red plaid shirt and black headphones raises its paws excitedly while holding a gaming controller, deeply engaged in a game in a cozy living room setting."

Input Image 1

Image Prompt: "A husky dog with striking blue eyes stands in the snow, gazing directly at the camera."

Input Image 2

Image Prompt: "A poised black Doberman Pinscher stands on a wooden floor against a backdrop of vibrant green foliage."

Input Video

Video Prompt: "A man in a suit stands still on a rooftop, looking around as another person appears behind him, with a cityscape and blue sky dotted with clouds in the background."

BiCo (Ours)

Composed Prompt: "A husky dog with striking blue eyes stands still on a rooftop, as a poised black Doberman Pinscher appears behind, with a cityscape and blue sky dotted with clouds in the background."

Input Image

Image Prompt: "A fluffy white dog curled up on soft, white bedding in a sunlit room."

Input Video

Video Prompt: "A woman in a red dress sits still, her tattooed arm visible, as she poses for a camera on a tripod under softbox lighting in a living room setting."

BiCo (Ours)

Composed Prompt: "A fluffy white dog in a red dress sits still, as it poses for a camera on a tripod under softbox lighting in a living room setting."

Input Image

Image Prompt: "A close-up of a German Shepherd dog outdoors, sitting on grass under natural light."

Input Video

Video Prompt: "A focused individual practices boxing in a gym, punching a heavy bag with intensity while the camera follows their movements."

BiCo (Ours)

Composed Prompt: "A German Shepherd dog individual practices boxing in a gym, punching a heavy bag with intensity while the camera follows their movements."

Input Image

Image Prompt: "A happy Akita dog with its tongue out, stands in a grassy area with flowers and leaves, bathed in sunlight."

Input Video

Video Prompt: "A man swings in an aerial yoga hammock in a fitness studio, twisting mid-air before landing on a mat, with mirrors reflecting the scene and barrels and flowers visible in the background."

BiCo (Ours)

Composed Prompt: "A happy Akita dog with its tongue out, swings in an aerial yoga hammock in a fitness studio, twisting mid-air before landing on a mat, with mirrors reflecting the scene and barrels and flowers visible in the background."

Input Image

Image Prompt: "A close-up of a German Shepherd dog outdoors, sitting on grass under natural light."

Input Video

Video Prompt: "A muscular man wearing a black tank top works out on fitness equipment in a gym, pulling and lifting with intensity."

BiCo (Ours)

Composed Prompt: "A German Shepherd dog wearing a black tank top works out on fitness equipment in a gym, pulling and lifting with intensity."

Input Image

Image Prompt: "A happy dog with gray and white fur, blue eyes, and a cheerful expression lies outdoors on pavement."

Input Video

Video Prompt: "A man in a mint green suit and hat energetically points upward while holding a trumpet, set against the backdrop of a peach-colored building with windows."

BiCo (Ours)

Composed Prompt: "A happy dog with gray and white fur, blue eyes, in a mint green suit and hat energetically points upward while holding a trumpet, set against the backdrop of a peach-colored building with windows."

Input Image

Image Prompt: "A relaxed dog with a white chest and brown patches lies outdoors on concrete near grass and mountains."

Input Video

Video Prompt: "A man in a striped shirt and cap stands still playing a guitar against a concrete wall with yellow graffiti, while another person walks away into the distance in an underground tunnel."

BiCo (Ours)

Composed Prompt: "A relaxed dog with a white chest and brown patches in a striped shirt and cap stands still playing a guitar with its dog's paws against a concrete wall with yellow graffiti, while another person walks away into the distance in an underground tunnel."

Input Video 1

Video Prompt: "A man with long hair and beard in a striped shirt and black jeans and a cap, plays the guitar against a concrete wall with yellow graffiti, while another person walks away into the distance in an underground tunnel."

Input Video 2

Video Prompt: "A man in a mint green suit and hat energetically points upward while holding a trumpet, set against the backdrop of a peach-colored building with windows."

BiCo (Ours)

Composed Prompt: "A man with long hair and beard in a striped shirt and black jeans and a cap, plays the guitar, while another man in a mint green suit and hat energetically points upward while holding a trumpet, set against the backdrop of a peach-colored building with windows."

Input Image 1

Image Prompt: "A happy Akita dog with its tongue out, stands in a grassy area with flowers and leaves, bathed in sunlight."

Input Image 2

Image Prompt: "A stylish outfit featuring a tan trench coat, white pants, a black bag, sunglasses, a white shirt, and black shoes arranged against a white background."

Input Image 3

Image Prompt: "A joyful woman wearing a striped blue and white hat smiles on a sunny beach with the ocean and sky in the background."

Input Video

Video Prompt: "A woman sits on a wooden bench against a wooden wall, engrossed in reading a book, occasionally flipping its pages and looking up thoughtfully."

BiCo (Ours)

Composed Prompt: "A happy Akita dog sits on a wooden bench against a wooden wall, engrossed in reading a book, wearing a tan trench coat, white pants, sunglasses, a white shirt, black shoes, and a striped blue and white hat."

Comparisons to Prior Works

Input Image

Image Prompt: "A young monkey perches on a tree branch surrounded by lush green leaves in a forest setting."

Input Video

Video Prompt: "A woman in a black coat stands in a forest, putting on her headphones as she looks around at fallen leaves covering the ground."

BiCo (Ours)

Composed Prompt: "A young monkey with no hair in a black coat stands in a forest, putting on its headphones as it looks around at fallen leaves covering the ground."

Text-Inv

DB-LoRA

DreamVideo

DualReal

Input Image

Image Prompt: "An elephant stands near water at sunset, with trees and dirt in the background."

Input Video

Video Prompt: "A turtle swims gracefully through the blue ocean, bubbles rising around it as it glides underwater."

BiCo (Ours)

Composed Prompt: "An elephant swims gracefully through the blue ocean, bubbles rising around it as it glides underwater."

Text-Inv

DB-LoRA

DreamVideo

DualReal

Input Image

Image Prompt: "A young monkey perches on a tree branch surrounded by lush green leaves in a forest setting."

Input Video

Video Prompt: "A squirrel sits on dirt, nibbling on something while bathed in sunlight, with its tail curled behind it."

BiCo (Ours)

Composed Prompt: "A young monkey sits on dirt, nibbling on something while bathed in sunlight."

Text-Inv

DB-LoRA

DreamVideo

DualReal

Case Studies

Input Image

Image Prompt: "A pixel art interpretation of The Starry Night by Vincent van Gogh, featuring a night sky filled with stars and a crescent moon, set against a cityscape with a prominent building silhouette."

Input Video

Video Prompt: "A bird soars gracefully through a clear blue sky."

BiCo (Ours)

Composed Prompt: "A pixel art interpretation of The Starry Night by Vincent van Gogh, featuring a bird soars gracefully."

Baseline

Baseline + Hierarchical

Baseline + Hierarchical + Diversification

Baseline + Hierarchical + Diversification + Absorbing

Baseline + Hierarchical + Diversification + TDS

Baseline + Hierarchical (w/o two-stage inverted training) + Diversification + Absorbing + TDS

Input Image

Image Prompt: "An elephant stands near water at sunset, with trees and dirt in the background."

Input Video

Video Prompt: "A brown bear walks along a rocky terrain, surrounded by a stone wall and green foliage, exploring its surroundings."

BiCo (Ours)

Composed Prompt: "An elephant walks along a rocky terrain, surrounded by a stone wall and green foliage, exploring its surroundings."

Baseline

Baseline + Hierarchical

Baseline + Hierarchical + Diversification

Baseline + Hierarchical + Diversification + Absorbing

Baseline + Hierarchical + Diversification + TDS

Baseline + Hierarchical (w/o two-stage inverted training) + Diversification + Absorbing + TDS

BibTeX

@InProceedings{Kong_2026_BiCo,
    author    = {Kong, Xianghao and Zhang, Zeyu and Guo, Yuwei and Zhao, Zhuoran and Zhang, Songchun and Rao, Anyi},
    title     = {Composing Concepts from Images and Videos via Concept-prompt Binding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {14800--14810}
}