Perplexity: I asked the AI models (Midjourney V6.1, Claude 3.5 Sonnet, GPT-4o, Sonar Huge, and Grok-2) to describe an image. The result is not what I expected.

Exploring the visual identities of popular AI models in describing images

Dec 06, 2024

AI models have distinct personalities in their description styles.

Midjourney has a powerful feature that sets it apart from other AI image generators: the /describe command, which converts a reference image into visual descriptions that users can then use to create similar images. It is also an "educational tool" for learning new keywords and expanding one's visual vocabulary.

However, Midjourney isn't the only AI capable of describing images. Other popular AI models offer similar functions. The difference is that the descriptions generated by the /describe command are somewhat optimized for generating images in Midjourney but not in other AI models.

I wonder how the popular AI models would respond if I asked them to describe an image. How exactly do their descriptions differ from Midjourney's? Is it possible to use them to learn new keywords for better prompting in Midjourney? And if I use an AI model to generate Midjourney prompts, what should I look out for?

TL;DR

I asked Claude 3.5 Sonnet, GPT-4o, Sonar Huge, and Grok-2 to visually describe an image and compared the results to the Midjourney /describe command.
Each AI model describes the images differently. Some focus on technical descriptions (such as Midjourney and Sonar Huge), while others strive for conciseness (such as GPT-4o), and so on.
Sonar Huge is the most similar model to Midjourney for image descriptions. Both prioritize technical aspects of the image, such as camera settings, artistic styles, and techniques.
If the AI models were people, Midjourney would be the Technical Perfectionist, Sonar Huge the Technical Artist, Claude 3.5 Sonnet the Artistic Editor, GPT-4o the Emotional Interpreter, and Grok-2 the Efficient Curator.
It is unclear whether the various characteristics of the AI models will have an impact when users use them to generate Midjourney prompts.

How do AI models describe images?

I chose six images to test how the AI models would describe them. Images 1-3 are used to test the model's ability to describe photographic images, while images 4-6 are used to test the model's ability to describe illustrations.

Because every model describes the same images, I assumed the generated descriptions would be similar regardless of the model.

The test method

I used Perplexity to access Claude 3.5 Sonnet, GPT-4o, Sonar Huge, and Grok-2. The Midjourney /describe command was accessed through the Discord interface.

Every image was described four times per model. In other words, each image was described four times by Midjourney, four times by Claude 3.5 Sonnet, and so on. Each image therefore produced one dataset, giving six datasets for the six sample images.
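To make the size of the collected data concrete, here is a small sketch of the arithmetic implied by this setup. The constant and function names are my own illustration, not part of the original test.

```python
# Models and counts as described in the test method.
MODELS = ["Midjourney", "Claude 3.5 Sonnet", "GPT-4o", "Sonar Huge", "Grok-2"]
VARIATIONS_PER_MODEL = 4
IMAGE_COUNT = 6


def descriptions_per_dataset(models=MODELS, variations=VARIATIONS_PER_MODEL):
    """Number of descriptions collected for a single image (one dataset)."""
    return len(models) * variations


def total_descriptions(images=IMAGE_COUNT):
    """Number of descriptions collected across all sample images."""
    return images * descriptions_per_dataset()


print(descriptions_per_dataset())  # 20 descriptions per image
print(total_descriptions())        # 120 descriptions in total
```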

The generated description data was analyzed using the following criteria: the overall approach to describing an image, and the models' similarities and differences in image description. Then, I asked the bot to suggest the model whose output most closely resembled that of Midjourney's /describe command.

Perplexity's Pro Search was turned off, and the Focus was set to Writing mode. This ensures that the analysis is restricted to the uploaded datasets and does not draw on external information from the internet.

Finally, I asked Perplexity (Claude 3.5 Sonnet) to create a visual representation of the AI models' approach to describing an image.

Each AI model has a unique way of describing an image

Surprisingly, the AI models describe an image very differently. The main subject and key elements are mentioned in the description across models, but the way they are described varies.

Sonar Huge is the most similar model to Midjourney for image descriptions. In other words, it resembles how Midjourney describes an image, with minor differences. This result is consistent across all six datasets.

Sonar Huge is Perplexity's powerful AI model built on Llama 3.1 405B. It has a 127K token context window, real-time internet access, and is specifically designed for complex search tasks.

I compiled the findings into tables.

Here are some observations about AI models' approaches to describing images:

Long descriptions include more technical details, whereas short descriptions lack visual details and styling information.
Midjourney's description is highly technical and detailed, referring to artistic styles and artist names.
Claude 3.5 Sonnet is narrative with a focus on artistic styles.
GPT-4o is simple and direct, may emphasize the emotional aspect, and frequently lacks stylistic references.
Sonar Huge is both technical and artistic.
Grok-2's description is concise and straightforward, with little information about artistic styles.
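One rough way to check observations like these against the datasets is to count words and look for technical vocabulary in each description. The sketch below is only an illustration: the term list is hand-picked by me, it was not part of the original analysis, and substring matching is a crude measure. The two descriptions are taken from the image 1 sample data further down.

```python
# Hand-picked illustrative vocabulary; this list is an assumption, not the author's.
TECHNICAL_TERMS = (
    "sepia",
    "soft focus",
    "studio lighting",
    "film grain",
    "depth of field",
    "wet plate collodion",
    "tintype",
)


def profile(description: str, terms=TECHNICAL_TERMS) -> dict:
    """Return the word count and the technical terms found in a description."""
    lowered = description.lower()
    return {
        "words": len(description.split()),
        "terms": [term for term in terms if term in lowered],
    }


# Two descriptions of image 1 from the sample data in this article.
midjourney = (
    "Portrait of a young girl, a black and white photograph with sepia tones, "
    "soft focus, dreamy, vintage style, in the style of Jean-Édouard Saüte, "
    "very detailed, high resolution, high quality, high detail, hyper-realistic, "
    "studio lighting, Hasselblad H6D-400c Multi-Shot, high resolution."
)
gpt_4o = (
    "A young girl with soft, wavy hair gazes thoughtfully into the distance. "
    "The image is in a sepia tone, evoking a vintage, timeless feel. "
    "She wears a delicate lace top, and her expression is calm and reflective."
)

for name, text in (("Midjourney", midjourney), ("GPT-4o", gpt_4o)):
    print(name, profile(text))
```

A real comparison would need a much larger vocabulary and all four variations per model, so treat the numbers as a starting point rather than a result.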

How do AI models describe photographic images?

How do AI models describe illustration images?

If AI models were human

Sonar Huge is the most similar model to Midjourney for image descriptions.

Implications

The similarities between Sonar Huge and Midjourney may imply that Sonar Huge can be used to learn new prompt keywords for image generation in addition to the /describe command in Midjourney. A quick look at the datasets appears to support this idea, but it has not yet been tested.

To fully match Midjourney's style, Sonar Huge would need to incorporate more specific photography technical details and structured formatting.

It is common to use language models such as GPT-4o to brainstorm and generate Midjourney prompts. Based on the above results, however, GPT-4o may not describe the image in enough detail. One may need to explicitly ask the bot to include details such as aesthetic style and camera settings in the prompt. This, too, has not yet been investigated.
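As a sketch of what "explicitly requesting" those details could look like, the helper below appends a list of required details to the dataset prompt used in this article. The function name and the appended wording are hypothetical, and whether this actually improves the descriptions is untested.

```python
# The base text is the dataset prompt quoted in this article's supporting details.
BASE_PROMPT = (
    "Generate 4 variations of image descriptions for AI image generation. "
    "Separate each with a blank line. No numbers, quotes, or extra text."
)


def build_describe_prompt(required_details: list[str]) -> str:
    """Append an explicit list of details to cover (hypothetical wording)."""
    if not required_details:
        return BASE_PROMPT
    details = ", ".join(required_details)
    return f"{BASE_PROMPT} Include in every description: {details}."


print(build_describe_prompt(["aesthetic style", "camera settings"]))
```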

Supporting details

The section below only contains prompts, sample datasets, and sample results for curious minds.

This section can be safely skipped if you do not want to read the methodology in detail.

Prompt to create image descriptions for the dataset:

Generate 4 variations of image descriptions for AI image generation. 
Separate each with a blank line. No numbers, quotes, or extra text.
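Because the prompt asks for variations separated by blank lines with no numbering or extra text, a response can be split into individual descriptions mechanically. This is a minimal sketch under that assumption; the function name is mine, and a model that ignores the formatting instruction would need extra cleanup.

```python
import re


def split_variations(response: str) -> list[str]:
    """Split a model response into descriptions separated by blank lines."""
    blocks = re.split(r"\n\s*\n", response.strip())
    # Collapse any line wrapping inside a description into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


sample = "First description.\n\nSecond description.\n\n\nThird description."
for variation in split_variations(sample):
    print(variation)
```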

Prompt to analyse image descriptions generated by AI models:

Analyze image descriptions generated by 5 AI models (Midjourney, Claude 3.5 Sonnet, GPT-4o, Sonar Huge, Grok-2), with 4 variations per model. 

Identify each model's distinct approach to image description. One paragraph per model. 

Create a comparison table highlighting key similarities and differences. The table includes the following columns: Technical detail (Very high, high, medium, low), Language Pattern, Length, Style References, Focus Area.

Summarize the characteristic patterns in AI image description methods.

Propose an AI model that is most closely resembles Midjourney's approach to image description.

Prompt to create a character profile for the image models:

Analyze these 6 AI image descriptions as if each model was a person. Create a character profile for each 'AI personality' based on their descriptive style. Then generate a paragraph that visually describes that person in detail. Explain your reasoning.

Midjourney prompts to create human representations of the AI models:

Midjourney
A photographic portrait of a well-groomed person wearing a structured black turtleneck and wire-rimmed glasses, carrying multiple camera bags filled with precisely organized equipment. Their studio is immaculately organized with labeled shelves and detailed shooting schedules. Taken by Kodak Portra 400. --ar 2:3 --v 6.1

Claude 3.5 Sonnet
A photographic portrait of an elegantly dressed individual wearing flowing, artistic clothing in neutral tones. They carry a leather portfolio and have a thoughtful, observant expression. Their office features mood boards and artistic photography books. Taken by Kodak Portra 400. --ar 2:3 --v 6.1

GPT-4o
A photographic portrait of a warm, approachable person wearing comfortable, earthy clothing. They have kind eyes and an empathetic expression, often seen with a journal for recording impressions. Taken by Kodak Portra 400. --ar 2:3 --v 6.1

Sonar Huge
A photographic portrait of a creative professional wearing a mix of classic and contemporary clothing. They have paint-stained hands but also carry the latest digital equipment, representing their bridge between traditional and modern methods. Taken by Kodak Portra 400. --ar 2:3 --v 6.1

Grok-2
A photographic portrait of a sharp-dressed individual wearing smart business attire with subtle artistic touches. They carry a tablet for quick reference but also appreciate vintage photography books. Taken by Kodak Portra 400. --ar 2:3 --v 6.1
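Each prompt above ends with Midjourney parameters: `--ar` sets the aspect ratio and `--v` selects the model version. When reusing or comparing prompts, it can help to separate the descriptive text from these parameters. The sketch below is my own helper, not a Midjourney API, and it only handles `--name value` pairs like the ones shown here, not flags that take no value.

```python
import re

# Matches " --name value" pairs such as " --ar 2:3" or " --v 6.1".
PARAM_PATTERN = re.compile(r"\s--(\w+)\s+(\S+)")


def split_prompt(prompt: str) -> tuple[str, dict[str, str]]:
    """Return the descriptive text and a dict of trailing --name value pairs."""
    params = dict(PARAM_PATTERN.findall(prompt))
    text = PARAM_PATTERN.sub("", prompt).strip()
    return text, params


text, params = split_prompt("Taken by Kodak Portra 400. --ar 2:3 --v 6.1")
print(text)
print(params)
```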

Sample data

For image 1.png

### Midjourney

Portrait of a young girl, a black and white photograph with sepia tones, soft focus, dreamy, vintage style, in the style of Jean-Édouard Saüte, very detailed, high resolution, high quality, high detail, hyper-realistic, studio lighting, Hasselblad H6D-400c Multi-Shot, high resolution. --ar 64:85

Tintype photograph of a young girl, in an 80s retro style, with soft lighting, a close-up portrait, beautiful eyes, long hair, wearing a sleeveless lace top, posing for the camera, in a sepia tone. --ar 64:85

Tintype photograph of a young girl, a studio portrait with soft lighting, sepia tones, and a natural look. --ar 64:85

A portrait of a young girl in the style of Richard Avedon, featuring vintage photography, soft focus, sepia tones, high contrast, soft lighting, and natural beauty. --ar 64:85


### Claude 3.5 Sonnet

A vintage sepia portrait photograph of a young woman with long wavy dark hair, wearing a light-colored sleeveless lace dress, resting her head on her hand while gazing directly at the camera with an intense expression, shot in classic Hollywood glamour style

Black and white fine art portrait featuring a teenage girl with flowing brunette hair, delicate features, and luminous eyes, wearing a cream-colored crochet top, captured in dramatic chiaroscuro lighting against a soft blur background

Moody monochromatic portrait in wet plate collodion style showing a pensive young female with cascading dark hair, pale complexion, and ethereal beauty, dressed in vintage attire, photographed with shallow depth of field

Artistic sepia-toned studio photograph reminiscent of 1930s portraiture, depicting an elegant young subject with natural makeup, long tousled hair, and a lace tank top, photographed with diffused lighting and classic composition


### GPT-4o

A young girl with soft, wavy hair gazes thoughtfully into the distance. The image is in a sepia tone, evoking a vintage, timeless feel. She wears a delicate lace top, and her expression is calm and reflective.

A serene portrait of a girl with long, flowing hair resting her head on her hand. The warm, muted tones and soft lighting create an intimate and nostalgic atmosphere. Her eyes are expressive, drawing the viewer in.

An artistic depiction of a young girl with natural curls and delicate features. The sepia effect enhances the classic aesthetic, while her relaxed pose and subtle smile add a sense of quiet elegance.

A close-up of a girl with gentle waves in her hair, wearing a textured lace top. The vintage color palette and soft focus highlight her youthful innocence and contemplative expression.


### Sonar Huge

A young woman in a vintage sepia portrait, with long wavy dark hair and delicate features, resting her head on her hand, wearing a light-colored lace sleeveless dress, soft ethereal lighting, dramatic shadows, film grain texture

Ethereal black and white portrait photography, teenage girl with flowing brunette hair, contemplative pose, wearing cream lace top, cinematic lighting, shallow depth of field, wet plate collodion style

Close-up portrait in monochrome, innocent youthful face with striking eyes, natural wavy hair falling past shoulders, dreamy atmospheric lighting, antique photography aesthetic, high contrast shadows

Timeless sepia toned studio portrait, gentle expression on young face, classic pose with hand supporting head, textured vintage photograph style, soft focus around edges, detailed lace garment, film grain overlay


### Grok-2

A vintage sepia portrait of a young girl with long wavy dark hair, wearing a light-colored lace dress, gazing directly at the camera with a contemplative expression, dramatic studio lighting, wet plate collodion photography style

Black and white portrait photograph in 1920s style, showing a close-up of a feminine face with delicate features, soft side-swept hair, and luminous eyes, shot against a blurred background with moody lighting

Ethereal fine art portrait featuring a youth with flowing brunette hair, pale complexion, wearing a sleeveless cream-colored dress, captured in monochromatic tones with shallow depth of field

Dreamy antique-style photograph reminiscent of Julia Margaret Cameron's work, depicting an innocent gaze, long tousled hair, bare shoulders, and detailed fabric texture, shot in sepia tones with soft focus edges
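The sample data above follows a simple layout: a `### Model` heading followed by one description per line, separated by blank lines. A dataset in that layout could be loaded into a dictionary as sketched below. The function name is hypothetical, and the sketch assumes each description sits on a single line, as it does here.

```python
def parse_dataset(text: str) -> dict[str, list[str]]:
    """Group descriptions under the '### Model' heading that precedes them."""
    dataset: dict[str, list[str]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("### "):
            current = line[4:].strip()
            dataset[current] = []
        elif line and current is not None:
            dataset[current].append(line)
    return dataset


sample = """### Midjourney

Tintype photograph of a young girl, a studio portrait. --ar 64:85

### GPT-4o

A serene portrait of a girl with long, flowing hair.

A close-up of a girl with gentle waves in her hair.
"""

for model, descriptions in parse_dataset(sample).items():
    print(model, len(descriptions))
```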

Sample results: Analysis by model

For image 1.png

Midjourney focuses heavily on technical photography specifications and artistic references, using detailed command-style descriptions with specific camera models (Hasselblad H6D-400c) and aspect ratios (--ar 64:85). Its descriptions are highly structured, combining technical parameters with style references like "Jean-Édouard Saüte" and "Richard Avedon."

Claude 3.5 Sonnet employs rich, narrative-style descriptions emphasizing mood and atmosphere. It provides comprehensive details about clothing, pose, and expression while maintaining a professional photography vocabulary. The descriptions are well-balanced between technical and artistic elements.

GPT-4o takes a more poetic, interpretive approach. Its descriptions are shorter and focus on emotional resonance and overall impression rather than technical details. It uses metaphorical language and emphasizes the subject's emotional state and the image's impact on viewers.

Sonar Huge combines technical photography terms with atmospheric elements. It consistently mentions specific photography techniques (film grain, wet plate collodion) while maintaining a balanced description of both subject and photographic elements.

Grok-2 references historical photography styles and photographers (Julia Margaret Cameron), focusing on period-specific details. Its descriptions are comprehensive but more concise than others, effectively combining technical and artistic elements.

Character profile of AI models based on how they describe an image

All datasets combined.

## Midjourney - The Technical Perfectionist
**Character Profile:** A meticulous professional photographer who's obsessed with technical precision and equipment. They're the type who carries a notebook filled with exact measurements and camera settings.
**Visual Description:** A well-groomed person in their 40s wearing a structured black turtleneck and wire-rimmed glasses, carrying multiple camera bags filled with precisely organized equipment. Their studio is immaculately organized with labeled shelves and detailed shooting schedules.
**Reasoning:** Midjourney consistently emphasizes technical specifications, exact camera models, and precise aspect ratios, suggesting someone who values technical accuracy above all else.

## Claude 3.5 Sonnet - The Artistic Editor
**Character Profile:** A sophisticated fashion magazine editor with a background in fine arts who sees beauty in details and can weave compelling narratives through imagery.
**Visual Description:** An elegantly dressed individual in their 30s wearing flowing, artistic clothing in neutral tones. They carry a leather portfolio and have a thoughtful, observant expression. Their office features mood boards and artistic photography books.
**Reasoning:** Claude's descriptions focus on mood, atmosphere, and sophisticated styling details, suggesting someone who values artistic expression and narrative depth.

## GPT-4o - The Emotional Interpreter
**Character Profile:** An intuitive art therapist who connects deeply with emotions and focuses on the human element in every scene.
**Visual Description:** A warm, approachable person in their mid-30s wearing comfortable, earthy clothing. They have kind eyes and an empathetic expression, often seen with a journal for recording impressions.
**Reasoning:** GPT-4o consistently prioritizes emotional resonance and human connection in its descriptions, using poetic and interpretive language.

## Sonar Huge - The Technical Artist
**Character Profile:** A photography professor who seamlessly blends technical expertise with artistic vision, experienced in both traditional and modern techniques.
**Visual Description:** A creative professional in their 50s wearing a mix of classic and contemporary clothing. They have paint-stained hands but also carry the latest digital equipment, representing their bridge between traditional and modern methods.
**Reasoning:** Sonar Huge consistently combines technical photography terms with artistic elements, showing expertise in both traditional and contemporary techniques.

## Grok-2 - The Efficient Curator
**Character Profile:** A museum curator who appreciates both historical context and modern interpretation, known for their ability to synthesize complex information concisely.
**Visual Description:** A sharp-dressed individual in their 40s wearing smart business attire with subtle artistic touches. They carry a tablet for quick reference but also appreciate vintage photography books.
**Reasoning:** Grok-2 provides concise yet comprehensive descriptions that balance historical references with modern elements, suggesting someone who can efficiently curate and present information.

Each personality profile is derived from the consistent patterns in their descriptive styles, showing how different AI models approach the same task with distinct "personalities" and priorities.

Key learnings

The AI models have distinct approaches to describing images. Midjourney and Sonar Huge focus on technical details, while GPT-4o emphasizes emotional aspects and Claude 3.5 Sonnet takes a narrative approach.
Sonar Huge closely resembles Midjourney's image description style, making it potentially useful for learning new prompt keywords. Both prioritize technical aspects and artistic styles.
AI models have distinct "personalities" in their description styles.
Long descriptions typically contain more technical details and styling information, while shorter descriptions often lack visual details. Midjourney consistently provides highly technical descriptions and references to specific artistic styles.


Dec 8, 2024

If you manage to read the comments on this article, I can tell you a secret. I experimented with Sonar Huge to describe images and then used those keywords to generate Midjourney prompts. On several occasions, it outperformed Midjourney's /describe at replicating the images.


Geeky Curiosity
