How to write Midjourney video prompts that actually work

Stop leaving your animations to chance

Jul 11, 2025

Film directors shout 'Action!' AI directors write prompts.

Prompting for Midjourney video

Midjourney's video feature opens up exciting possibilities for creators who want more control over their animations. While you can simply animate an image without any prompting, writing effective prompts gives you the power to guide exactly how your video unfolds.

This guide focuses on the art and science of video prompting, turning your static images into compelling animated stories.

If you're new to Midjourney video, check out this article to cover the basics. Now let's look at the details that'll make your videos stand out.

How Midjourney video actually works

Understanding the technical foundation helps you craft better prompts. The video feature uses a diffusion video model that works quite differently from what you might expect.

Instead of creating frames one by one, the model generates all frames simultaneously, treating them as a unified volume of data that represents both space (the contents of the frame) and time (the sequence of the frames).

This parallel processing approach has important implications:

The entire sequence gets processed at once, not frame-by-frame
Providing comprehensive context in your prompt helps since the model considers everything simultaneously

In practice, your prompting strategy should account for the entire video sequence, not just individual moments.

Other details to know:

• Videos output at 24 fps for smooth playback

• The social media download is encoded with the H.264 (MPEG-4 AVC) codec in planar 4:2:0 YUV color format, for broad compatibility with web and streaming platforms

Starting frame essentials

Your starting frame sets the foundation for everything that follows. Get this right, and your video generation becomes much more predictable. Your starting frame should contain every element you want to see throughout the video.

Here's what matters most:

Skip the upscaling since --video 1 only supports 480p (the bot will downscale high-resolution images, creating unwanted artifacts)
Style carries through consistently, meaning illustrations stay illustrated, photographs stay photographic
When subjects move out of view and return, describe their visual characteristics again and reinforce with phrases like "the same cup appears again"
Character states matter for animation flow (want your character to open their eyes? Start with eyes closed, not already open)

Timing strategy

Think backwards from your desired action. Prompt a starting image that shows the moment seconds before the peak action, not during or after it.

The "about to" technique works particularly well:

Use this keyword to signal the model to prepare for imminent action
Create an image of a dog looking at a hamburger, about to take a bite. When animated, the dog will complete the eating motion

cinematic photo of an excited dog is about to bite a hamburger given by his owner --ar 16:9 --profile oaefodl jqfuczz  --v 7

the dog swallows the burger in one bite --ar 91:51 --motion high --video 1

Interestingly, if the video prompt doesn't tell the dog to eat the burger, the dog may refuse to eat it! Apparently, the image generation prompt alone is insufficient to guide the action (laugh).

cinematic photo of an excited dog is about to bite a hamburger --ar 91:51 --motion high --video 1

Sorry…Doggo not gonna eat that, human!!

Prompt the action explicitly. As the dog example shows, the image generation prompt alone isn't enough to guide the motion
Work within the 5.2 second limit by keeping actions simple and achievable
For complex actions, try extending the video to see if the bot continues the last action
Time travel is possible: you can prompt the video to show what happened before your starting frame, so choose that first moment carefully
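
To get a feel for how little room a single clip gives you, here is a rough sketch that combines the 5.2 second limit above with the 24 fps output rate mentioned earlier. The function names are made up for illustration, and the even split across action beats is only a planning simplification; the model does not promise even pacing.

```python
# Back-of-the-envelope frame budget for one clip.
# Both numbers come from this article and may change over time.
FPS = 24             # output frame rate
CLIP_SECONDS = 5.2   # length limit of the initial generation


def frame_budget(seconds: float = CLIP_SECONDS, fps: int = FPS) -> int:
    """Return the approximate number of frames in a clip."""
    return round(seconds * fps)


def seconds_per_beat(beats: int, seconds: float = CLIP_SECONDS) -> float:
    """Split the clip evenly across a number of action beats (a simplification)."""
    if beats < 1:
        raise ValueError("need at least one action beat")
    return seconds / beats


print(frame_budget())                  # 125
print(round(seconds_per_beat(3), 2))   # 1.73
```

With three actions in one clip, each gets well under two seconds, which is a good reason to keep actions simple.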

Describing actions like a director

Specificity transforms basic movements into compelling animations. Think like a film director giving precise instructions. Instead of "he picks up an apple," write "he picks up the apple with his left hand."

Pro Tip: Layer your actions for more dynamic results:

Primary action (what's happening)
Secondary action (what happens as a result)
Background activity (what's happening in the environment)

Without clear action prompts, the bot tends to rotate or spin geometric patterns and static images.

Prompting techniques that deliver results

Video prompting differs significantly from image prompting. The focus shifts to motion, sequence, and temporal elements rather than static visual details.

Your video prompts should prioritize:

Motion and movement patterns
Sequence of actions
Object permanence (keeping things visible)
Camera movement instructions

Start with a clean slate. Remove the leftover image generation prompt because it doesn't help with video. Focus purely on motion and sequence instead.

For action prompts, use this simple formula: Subject + active action verb + (adverb). For example: "A rabbit hopping rapidly along the pathway."
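
If you write a lot of prompts, the formula is easy to turn into a tiny helper. This is only a sketch: `action_prompt` is a made-up name, and the optional `detail` argument is my own addition so the example can carry a phrase like "along the pathway".

```python
def action_prompt(subject: str, verb: str, adverb: str = "", detail: str = "") -> str:
    """Assemble Subject + active action verb + (adverb), skipping empty parts."""
    parts = [subject, verb, adverb, detail]
    return " ".join(p.strip() for p in parts if p.strip())


print(action_prompt("A rabbit", "hopping", "rapidly", "along the pathway"))
# A rabbit hopping rapidly along the pathway
```

Leaving out the adverb still gives a valid prompt, for example `action_prompt("She", "waves")` returns "She waves".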

Building sequences becomes crucial for longer videos. Link actions using connective keywords like: first, then, next, after that, suddenly, eventually, as, while, at the same time, finally.
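
Here is a minimal sketch of how a list of actions could be chained with those connectives. The helper name `build_sequence` and the way connectives are assigned (first, a rotating middle word, finally) are assumptions for illustration, not a rule from Midjourney.

```python
MIDDLE_CONNECTIVES = ["then", "next", "after that"]


def build_sequence(actions: list[str]) -> str:
    """Join actions in order, prefixing each one with a connective keyword."""
    if not actions:
        raise ValueError("need at least one action")
    if len(actions) == 1:
        return actions[0]
    steps = []
    for i, action in enumerate(actions):
        if i == 0:
            word = "first"
        elif i == len(actions) - 1:
            word = "finally"
        else:
            # Cycle through the middle connectives for longer sequences.
            word = MIDDLE_CONNECTIVES[(i - 1) % len(MIDDLE_CONNECTIVES)]
        steps.append(f"{word} {action}")
    return ", ".join(steps)


print(build_sequence(["she picks up the cup", "she takes a sip", "she sets it down"]))
# first she picks up the cup, then she takes a sip, finally she sets it down
```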

If an object moves in and out of view, provide more details so the model can keep track of it. A good example: "She turns in a circle, holding a cup with one hand" (this maintains object presence throughout the turn).

Character consistency gets easier once you understand the rules. You don't need to re-describe characters since the bot picks up details from the starting frame. However, if you edit the image and change the prompt significantly, provide more context. For complex characters, simplify your prompts: "Long-haired, blue-eyed elf in green dress" becomes just "she" + action.

Pro Tip: Simplify your subject/object reference by using “he/she/it…”

Environmental reinforcement helps when using camera movement. Describe surroundings like "a messy and dirty place" to maintain that atmosphere throughout the video.

A few technical tips that make a difference:

Use --raw for closer prompt adherence
The order in which actions appear in the prompt doesn't matter, but stating their sequence (first, then, finally) does
Describe what you can see, not internal character feelings
Physics interactions (waves, wind, impacts) aren't reliable yet
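
To keep the parameters consistent across many prompts, you could assemble the final string with a small helper like the one below. It only uses flags that appear in this article (--motion, --raw, --video 1). The function name, the flag order, and the restriction to "low" and "high" (the only motion values shown in the examples here) are assumptions.

```python
def video_prompt(text: str, motion: str = "low", raw: bool = False) -> str:
    """Append the video parameters used in this article to a motion description."""
    if motion not in ("low", "high"):
        raise ValueError("motion must be 'low' or 'high'")
    flags = [f"--motion {motion}"]
    if raw:
        flags.append("--raw")  # closer prompt adherence, per the tip above
    flags.append("--video 1")
    return f"{text.strip()} {' '.join(flags)}"


print(video_prompt("the dog swallows the burger in one bite", motion="high", raw=True))
# the dog swallows the burger in one bite --motion high --raw --video 1
```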

Introducing new elements mid-video

Adding elements that weren't in your starting frame presents unique challenges. It's significantly harder to introduce new elements if they're not in the first frame, but it's possible with the right approach.

When introducing new subjects, describe them thoroughly including their type (illustration/photography). Based on testing, new elements often work better when introduced during video extensions rather than in the initial generation.

For example, a cat (new element) is introduced into the video via video extension:

cinematic photograph of a black cat with blue eyes taking shelter from the rain. The cat at far is looking at the droids --ar 52:29 --motion low --video 1

Another example: it is easier to bring in a new subject that the model is familiar with, such as a cat. Note that the first frame of this video does not contain a cat.

A black cat with blue eyes walking pass the robots leisurely --ar 91:51 --motion high --video 1

The video model has no idea what size the new subject (cat) is! The size may not be proportional to other existing subjects/objects. The first frame of this video also does not contain the cat.

a black cat with blue eyes come to greet the robots. The robot pet the black cat --ar 91:51 --motion high --video 1

Camera movement experimentation

Midjourney recognizes various camera movements, but results vary significantly. The available movement types include pan, tilt, zoom, dolly, tracking, crane, pedestal, steadicam, handheld, POV, and rack focus, plus setups where one side holds still (the subject moves while the camera doesn't, or the camera moves while the subject stays still).

Not all camera keywords work as expected, so experimentation is essential.

For example, try "tracking view from a distance showing [subject] moving through [environment]" for a tracking shot. Results may vary, but when it works, the effect can be quite striking.

Here is a tracking view from a distance:

sudden wind blows up dust and rubbish while the cat walk away from the camera, rubbish everywhere. Camera tracking view from behind the cat following the cat from a distance --ar 52:29 --motion high --video 1

Quality management

Understanding quality limitations helps you plan better video projects. Video quality decreases with each extension, especially after the third extension.

The first 5 seconds are your best footage, so plan your most important content for the initial generation.

Maintaining consistency in longer videos

Creating cohesive, longer videos requires strategic planning. For character management, set up character libraries using Omni Reference and document key actions, movements, and poses. Choose characters with signature visual attributes (like distinctive blue eyes) for better consistency.

Style control matters just as much. Establish Style Reference early to maintain color, worldbuilding, and style consistency throughout your project. Generate separate images for each scene rather than relying solely on extensions.

Also, limit video extensions to a maximum of 2 whenever possible. Both quality and consistency benefit from this restraint.
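
One way to plan a longer video under this limit is to group your action beats so that each starting image carries one initial generation plus at most two extensions. The sketch below assumes one beat per generation or extension, which is my own simplification; `plan_clips` is a hypothetical helper, not a Midjourney feature.

```python
MAX_EXTENSIONS = 2  # limit suggested in this article


def plan_clips(beats: list[str], max_extensions: int = MAX_EXTENSIONS) -> list[dict]:
    """Group beats into clips: one initial generation plus up to max_extensions extensions."""
    if max_extensions < 0:
        raise ValueError("max_extensions cannot be negative")
    per_clip = 1 + max_extensions
    clips = []
    for start in range(0, len(beats), per_clip):
        chunk = beats[start:start + per_clip]
        # Each clip needs its own freshly generated starting image.
        clips.append({"initial": chunk[0], "extensions": chunk[1:]})
    return clips


beats = ["cat enters", "cat greets robot", "robot pets cat", "rain starts", "both take shelter"]
for clip in plan_clips(beats):
    print(clip)
# {'initial': 'cat enters', 'extensions': ['cat greets robot', 'robot pets cat']}
# {'initial': 'rain starts', 'extensions': ['both take shelter']}
```

Five beats end up as two clips, so the fourth beat gets a new starting image instead of a third extension.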

Thanks for reading Geeky Curiosity! This post is public so feel free to share it.
