
糖心破解版 News

In AI System, Images Evolve from Random Snapshots into Coherent Visual Stories

Namrata Patel, a Katz School Ph.D. student in Mathematics in the Graduate Department of Computer Science and Engineering, will present her research at the IEEE International Conference on Multimedia and Expo (ICME) in July.

By Dave DeFusco

For many people, today's image-generating AI tools feel almost magical. You type a sentence, "a dog running in a park," and within seconds a lifelike image appears. If you ask for a sequence of images that shows what happens next, however, things fall apart. The dog may suddenly change shape, color or even species from one image to the next. The story doesn't hold together.

That challenge is at the heart of a research paper, "CausalDreamer: Temporal Image Generation through Causal Intervention Attention with Identity Preservation," led by Namrataben Patel, a Katz School Ph.D. student in Mathematics in the Graduate Department of Computer Science and Engineering. She will present her research at the IEEE International Conference on Multimedia and Expo (ICME) in July.

In her paper, Patel set out to solve a simple but important problem: how to make AI-generated images behave more like a story instead of a collection of unrelated snapshots.

"Most of the existing models are designed to generate one image at a time," said Patel. "When I provide a text prompt, it generates one image. But when I provide the next prompt, it won't necessarily get the same object in that next frame."

In other words, today's systems don't really understand time or continuity. They don't "remember" what came before. As Patel puts it, "it will randomly pick up the best matched object in the next frame," which is why characters often change unexpectedly.

CausalDreamer approaches the problem differently. Instead of treating each image as separate, it tries to model how events unfold step by step, more like a storyboard than a single photograph. To do this, Patel trained the system using only text descriptions of cause and effect. For example: "a dog digging a hole" might lead to "the ground breaks," then "the hole gets deeper." These sequences teach the model how one action leads to another.
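The training signal described above, chains of text prompts where each step causes the next, can be pictured as a small sketch. The data format and function names below are illustrative assumptions, not the paper's actual pipeline:

```python
# Hypothetical cause-and-effect prompt chains, echoing the article's example.
# The list structure and helper below are assumptions for illustration only.
cause_effect_chains = [
    ["a dog digging a hole", "the ground breaks", "the hole gets deeper"],
    ["a cup drops to the floor", "the cup breaks"],
]

def to_training_pairs(chains):
    """Flatten each chain into consecutive (cause, effect) prompt pairs."""
    pairs = []
    for chain in chains:
        for cause, effect in zip(chain, chain[1:]):
            pairs.append((cause, effect))
    return pairs

# Each pair teaches the model how one described action leads to the next.
pairs = to_training_pairs(cause_effect_chains)
```

Training on consecutive pairs rather than isolated prompts is what lets the model learn a notion of "what happens next" from text alone.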

Initially, she struggled to generate consistent sequences: keeping the same subject while meaningfully evolving the scene from one image to the next was difficult. The breakthrough came when she moved beyond simple prompts and introduced a learned "semantic shift" between steps. By training a lightweight adapter and injecting this learned progression directly into the image generation process, she was able to guide how each frame evolves, enabling controlled and consistent transitions across images.
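A minimal sketch of the "semantic shift" idea: take the direction between consecutive prompt embeddings and inject that offset into the conditioning for the next frame, so the scene advances while the subject stays anchored. Everything here, the function names, the toy 3-dimensional "embeddings" and the scaling knob, is a hypothetical simplification, not the paper's adapter:

```python
# Toy sketch of a semantic shift between generation steps.
# Real systems would operate on high-dimensional text-encoder embeddings;
# plain Python lists stand in for them here.

def semantic_shift(prev_embedding, next_embedding, strength=1.0):
    """Scaled difference vector between consecutive prompt embeddings."""
    return [strength * (b - a) for a, b in zip(prev_embedding, next_embedding)]

def shifted_conditioning(current, shift):
    """Inject the learned progression into the current frame's conditioning."""
    return [c + s for c, s in zip(current, shift)]

prev = [0.2, 0.5, 0.1]   # e.g. embedding of "a dog digging a hole"
nxt = [0.4, 0.5, 0.3]    # e.g. embedding of "the hole gets deeper"
shift = semantic_shift(prev, nxt, strength=0.5)
conditioned = shifted_conditioning(prev, shift)
```

The `strength` parameter illustrates the balance the article describes: too small and nothing changes between frames, too large and the subject's identity drifts.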

This idea connects to the concept of causality, the idea that actions lead to consequences. Patel drew inspiration from this principle to guide how the AI "pays attention" when generating images.

"When we hear the term causal, we think there is an action and an effect," she said. "If a cup drops to the floor, it will break. But current models don't think this way. They take the prompt as it is. They will not predict the after-effects."

CausalDreamer begins to change that. Given a prompt like "two cars crashing," the system can start to imagine not just the crash but what follows: damage, movement and consequences in the scene.

Another key challenge was balance. Some AI models can keep a character consistent, but nothing changes from frame to frame. Others create variety but lose the character entirely. Patel aimed for both.

"Some models were preserving identity, but they were not changing the actions," she said. "For our model, it is like a balance where we can see that there is a change in action while persisting the identity."

The result is a system that can generate a sequence of images showing meaningful progression, like a child playing with a ball or a dog digging a hole, while keeping the same subject recognizable throughout.

One surprising feature of CausalDreamer is that it is relatively lightweight and accessible. Instead of retraining massive AI systems, Patel built a small add-on component.

"Scalability is very easy," she said. "You can add as many actions and effects as you want using a simple file."
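The article does not specify the file format, so the JSON schema below is purely an assumed illustration of how a "simple file" mapping actions to their effects might look, with a small lookup helper:

```python
import json

# Hypothetical action/effect file contents; the schema is an assumption
# for illustration, not the project's actual format.
config_text = """
{
  "actions": {
    "a dog digging a hole": ["the ground breaks", "the hole gets deeper"],
    "a cup drops to the floor": ["the cup breaks"]
  }
}
"""

config = json.loads(config_text)

def effects_of(action, config):
    """Look up the listed consequences of an action; empty if unknown."""
    return config["actions"].get(action, [])
```

Under this sketch, extending the system to new scenarios would mean appending entries to the file rather than retraining the model.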

She has also shared the project publicly, making it easier for other researchers to build on the work. The system works especially well with people and animals, where identity is easier to recognize. It struggles more with generic objects like "a bottle" or "a hat," where distinguishing features are limited. Patel sees this as an area for future improvement.

Looking ahead, she hopes to extend the work into full video generation and educational tools. "If I am giving a lecture, I may want to explain it using some images," she said. "In the future, I may be able to do that."

Honggang Wang, senior author of the paper and chair of the department, noted that teaching AI to understand sequences and cause-and-effect relationships is a major step forward for applications like storytelling, training and scientific visualization.

"By helping machines move from single images to meaningful sequences," said Wang, "CausalDreamer opens the door to AI systems that don't just create pictures but begin to tell coherent visual stories."
