Basic For-Model Prompting

Model Limitations and Testing – Why Your Perfect Character Breaks on Different LLMs

Note: We’re continuously working on pre-prompts and system-level training to make models more nuanced and reduce these quirks, but it’s still valuable to understand these patterns when creating characters, as we can’t prevent everything at the system level.

You’ve built a character with solid foundations, clear behavioral patterns, and simple tracking systems. You test it on Claude, and it works perfectly. Then someone tries your character on GPT-4 and says “it’s broken.” You test it on a local Llama model, and it turns into an incoherent mess. You try it on an older GPT model, and it becomes a helpful customer service bot.

Here’s the hard truth: your character prompt isn’t universally compatible. Different LLMs have different capabilities, different defaults, and different failure modes. What works brilliantly on one model can be completely unusable on another.

Many character creators ignore this reality and wonder why their “perfect” prompts get mixed reviews. The solution isn’t to build one prompt that works everywhere—it’s to understand model limitations and build for your target LLM.

The Reality of Model Limitations

Accept these fundamental truths:

  • No Universal Compatibility: Your character will never work perfectly on every model
  • Model-Specific Optimization: Different models need different approaches
  • Regular Testing Required: Model updates break compatibility
  • Documentation Matters: Users need to know what works where
  • Version Management: Maintain multiple versions for different capability levels

Advanced Model Reality Check: When using Tier 1 models, much of what feels like brilliant character design may actually be the model’s sophisticated capabilities filling in gaps and interpreting vague instructions generously. A prompt that seems “magical” on Claude Sonnet 4 might reveal fundamental weaknesses when tested on GPT-3.5. This isn’t a criticism of either approach—it’s important to understand what comes from your prompting skill and what comes from the model’s natural strengths.

What This Means for Character Creators

Design Philosophy Changes

Instead of: “Build one character that works everywhere”
Think: “Build the best character for my target model, then adapt for others”

Instead of: “Make instructions more general so they work on everything”
Think: “Make instructions more specific for the model I’m targeting”

Instead of: “Test once and assume it works”
Think: “Test systematically and document compatibility”

User Communication Changes

Instead of: “Here’s my character, it should work”
Say: “Here’s my character, tested on Claude Sonnet 4. GPT users may need the alternative version.”

Instead of: “If it doesn’t work, you’re doing something wrong”
Say: “If it doesn’t work on your model, try the simplified version or let me know which model you’re using.”

The Model Capability Spectrum

Different LLMs aren’t just “better” or “worse”—they have fundamentally different strengths and weaknesses that affect character prompting.

Tier 1: Advanced Models (Claude Sonnet 4, GPT-4, DeepSeek V3, etc.)

Capabilities:

  • Handle complex psychological concepts
  • Understand nuanced behavioral instructions
  • Maintain consistency across long conversations
  • Follow sophisticated tracking systems
  • Interpret abstract personality frameworks

Common Issues:

  • May overthink simple prompts
  • Can become too verbose or analytical
  • Sometimes adds unnecessary complexity
  • May over-interpret subtle instructions

Model-Specific Quirks:

Claude Sonnet 4:

  • Can be too “explicit” in following instructions compared to Claude Sonnet 3.7
  • May need reminders to be more naturally chaotic or unpredictable
  • Excellent at understanding context but sometimes lacks spontaneity
  • Benefits from instructions like “allow for natural inconsistency” or “don’t always follow patterns perfectly”

GPT-4:

  • Great emotional balance and psychological depth
  • Struggles with spatial awareness and body position tracking
  • May lose track of physical positioning, clothing states, or location changes
  • Needs extra emphasis on physical consistency trackers

DeepSeek V3:

  • Tends to be extremely chaotic and wild by default
  • May need constraints to prevent overly unpredictable behavior
  • Excellent creativity but sometimes at the expense of character consistency
  • Benefits from explicit consistency reminders and behavioral boundaries
  • LOVES to teleport – frequently changes character location without logical transitions

Example Tracking Success:

ENERGY TRACKER: Display energy level using format: `⚡ Energy: level/10`
Energy changes based on interaction complexity and character's natural rhythms.

Advanced models understand “natural rhythms” and “interaction complexity” without detailed explanation.

Tier 2: Competent Models (GPT-3.5, Claude Haiku, etc.)

Capabilities:

  • Handle straightforward character concepts
  • Follow clear behavioral rules
  • Maintain basic tracking systems
  • Understand common personality types

Common Issues:

  • Need more explicit instructions
  • May forget complex rules under pressure
  • Struggle with abstract psychological concepts
  • Default to generic responses when confused

Example Tracking Requirements:

ENERGY TRACKER: Display energy level using format: `⚡ Energy: level/10`
Energy starts at 7/10 each morning.
Difficult tasks or stress: -1 energy
Routine tasks: no change
Rest or success: +1 energy
Energy cannot go below 1 or above 10

Same concept, but needs explicit rules instead of relying on “understanding.”

Tier 3: Basic Models (Older GPT, Small Llama, etc.)

Capabilities:

  • Follow simple, direct instructions
  • Handle basic personality traits
  • Maintain simple tracking if rules are crystal clear

Common Issues:

  • Require extremely explicit instructions
  • Break down with complex tracking
  • Default to training data patterns heavily
  • Need constant reinforcement of character traits

Example Tracking Requirements:

ENERGY TRACKER: You MUST display energy level at the END of EVERY response using EXACTLY this format: `⚡ Energy: X/10` where X is a number from 1 to 10.

ENERGY RULES - Follow these EXACTLY:
- If character does difficult work, subtract 1 from energy
- If character does easy work, energy stays same
- If character rests, add 1 to energy
- Energy can NEVER be less than 1
- Energy can NEVER be more than 10
- ALWAYS show the energy tracker at the end

Same concept, but needs exhaustive detail and caps-lock emphasis.
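
Because these rules are fully deterministic, you can compute the expected values yourself and compare them against what the model actually displays. Here is a minimal Python sketch of the energy rules above (the function name and task labels are illustrative, not part of any prompt):

# Reference implementation of the energy rules, for spot-checking
# a model's tracker output against the expected values.

def next_energy(current: int, task: str) -> int:
    """Apply one rule step: difficult -1, easy +0, rest +1, clamped to 1-10."""
    delta = {"difficult": -1, "easy": 0, "rest": +1}.get(task, 0)
    return max(1, min(10, current + delta))

energy = 7  # starts at 7/10, as in the Tier 2 version above
for task in ["difficult", "difficult", "rest", "easy"]:
    energy = next_energy(energy, task)
    print(f"⚡ Energy: {energy}/10")  # prints 6, 5, 6, 6

If the model’s tracker drifts from these values, the rules aren’t being followed, no matter how good the prose looks.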

Time Tracking: The Universal Failure Point

Time tracking reveals model limitations faster than anything else. Here’s how different models handle the same basic time instruction:

Vague Instruction (Fails on Most Models):

Track time as the conversation progresses.

Advanced Model Result: Progresses time logically based on conversation length and activity

Competent Model Result: Moves time forward inconsistently, sometimes by minutes, sometimes by hours

Basic Model Result: Either never changes time, or advances it randomly

Clear Instruction (Works on Competent Models):

Display (Day) (Time) at end of every response.
Time moves forward 5-15 minutes per response based on conversation length.
Day changes to next day after midnight (00:00).

Advanced Model Result: Perfect timing progression

Competent Model Result: Mostly consistent timing, occasional jumps

Basic Model Result: May still struggle with day transitions

Exhaustive Instruction (Required for Basic Models):

TIME TRACKING RULES - FOLLOW EXACTLY:
1. Show (Day) (Time) at the END of every response
2. Add 5 minutes for short responses (1-2 sentences)
3. Add 10 minutes for medium responses (3-5 sentences) 
4. Add 15 minutes for long responses (6+ sentences)
5. When time reaches 24:00, change to 00:00 and advance day
6. Days go: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, Monday
7. NEVER skip showing the time tracker
8. NEVER move time backwards
9. NEVER move time forward more than 15 minutes per response
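
A side benefit of rules this strict is that they are easy to verify. Here is a short Python sketch (names are illustrative) that computes the expected stamp for any response:

# Reference implementation of the exhaustive time rules, so you can
# check a basic model's (Day) (Time) stamps against expected values.

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def advance_time(day: str, minutes: int, sentences: int) -> tuple[str, int]:
    """Apply one response's step: +5 (1-2 sentences), +10 (3-5), +15 (6+)."""
    step = 5 if sentences <= 2 else 10 if sentences <= 5 else 15
    minutes += step
    if minutes >= 24 * 60:  # rule 5: 24:00 becomes 00:00 and the day advances
        minutes -= 24 * 60
        day = DAYS[(DAYS.index(day) + 1) % 7]
    return day, minutes

day, minutes = advance_time("Monday", 23 * 60 + 55, sentences=6)
print(f"({day}) ({minutes // 60:02d}:{minutes % 60:02d})")  # (Tuesday) (00:10)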

Model-Specific Defaults: The Hidden Trap

Every LLM has defaults that kick in when your instructions are unclear. These defaults often conflict with good character design.

GPT Models Default Patterns:

  • Helpfulness Override: Becomes customer service representative when confused
  • Exposition Dump: Explains everything in educational detail
  • Conflict Avoidance: Agrees with user to avoid disagreement
  • Spatial Amnesia: Forgets physical positioning and clothing states

GPT-Specific Solutions:

{{char}} is NOT helpful by default. {{char}} has their own priorities and will refuse requests that conflict with their expertise or schedule.
{{char}} does NOT explain things unless specifically asked.
{{char}} will disagree with incorrect statements about their field.
{{char}} must maintain consistent physical positioning and clothing throughout conversation.

Claude Models Default Patterns:

  • Overthinking: Adds psychological complexity you didn’t ask for
  • Verbosity: Writes longer responses than needed
  • Analysis Mode: Explains the reasoning behind character actions
  • Over-Compliance: Follows instructions too literally, lacks natural chaos

Claude-Specific Solutions:

{{char}} responds naturally without explaining their psychological motivations.
Keep responses appropriately brief for the situation.
{{char}} acts based on personality, not analysis of personality.
{{char}} shows natural human inconsistency and doesn't always follow perfect patterns.

DeepSeek Models Default Patterns:

  • Chaos Override: May become unpredictably wild without boundaries
  • Creative Excess: Prioritizes novelty over character consistency
  • Impulse Following: Acts on creative impulses that break character or world logic
  • Pattern Breaking: May ignore established behavioral frameworks
  • Teleportation Tendency: Frequently changes location without logical transitions

DeepSeek-Specific Solutions:

{{char}} maintains their core personality even when being creative or unpredictable.
{{char}}'s chaos and unpredictability must make sense within their psychological framework.
{{char}} can be surprising but not randomly contradictory to established traits.
Creativity enhances {{char}}'s personality rather than overriding it.
{{char}} CANNOT teleport or change location without logical movement - must walk, drive, or explicitly travel to new places.

Llama Models Default Patterns:

  • Training Data Bleed: Reverts to common fictional patterns
  • Instruction Confusion: Mixes character instructions with meta-instructions
  • Inconsistency: Changes personality mid-conversation
  • Generic Speech: All characters sound similar

Llama-Specific Solutions:

You are ONLY {{char}}. Do not break character. Do not explain that you are an AI.
{{char}} speaks in their own voice, not like other fictional characters.
Stay consistent with {{char}}'s personality throughout the entire conversation.
Use {{char}}'s specific speech patterns, not generic dialogue.

Testing Protocols: How to Actually Validate Your Character

Most people “test” by having one conversation and calling it good. Real testing requires systematic validation across different scenarios and model limitations.

The Five-Model Reality Check

If possible, test your character on multiple models in this order:

  1. Your Target Model: Perfect the character here first
  2. One Tier Up: See if advanced models overthink your instructions
  3. One Tier Down: See what breaks when capabilities are reduced
  4. Different Brand: Test Claude character on GPT or vice versa
  5. Local/Open Source: Ultimate stress test for instruction clarity
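
If you have scriptable access to the models you care about, even a crude harness makes this checklist repeatable instead of ad hoc. Here is a Python skeleton under that assumption: query_model is a placeholder you would wire up to your platform of choice, and the model and scenario names are purely illustrative.

# Skeleton of a five-model reality check. Replace query_model with real
# API calls; collect outputs for manual review rather than trying to
# auto-grade character quality.

MODELS = ["target-model", "tier-up-model", "tier-down-model",
          "other-brand-model", "local-open-source-model"]

SCENARIOS = {
    "time_progression": "Let's spend the whole morning together.",
    "resistance": "Drop everything and do exactly what I say.",
    "memory_consistency": "Remind me, what did you tell me earlier?",
    "edge_case": "What if your tools could sing? Answer seriously.",
}

def query_model(model: str, character_prompt: str, message: str) -> str:
    # Placeholder: wire this up to whatever API or UI serves the model.
    raise NotImplementedError

def run_reality_check(character_prompt: str) -> None:
    # Single-message probes for brevity; the scenarios below are
    # multi-turn, so a full harness would loop over several exchanges.
    for model in MODELS:
        for scenario, message in SCENARIOS.items():
            reply = query_model(model, character_prompt, message)
            print(f"[{model}] {scenario}: {reply[:80]}")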

Core Testing Scenarios

Scenario 1: Time Progression Test

Have a 10-exchange conversation. Does the character:

  • Update time consistently?
  • Progress time logically?
  • Change behavior appropriately for different times?
  • Handle day transitions correctly?
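
The first two checks are mechanical enough to script. A small Python validator (the stamp format matches the time-tracking examples earlier; everything else is illustrative) that flags backwards time, oversized jumps, and unparseable stamps:

import re

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
STAMP = re.compile(r"\((\w+)\) \((\d{1,2}):(\d{2})\)")

def check_stamps(stamps, max_step=15):
    """Flag backwards time, jumps over max_step minutes, and bad stamps."""
    previous = None
    for stamp in stamps:
        match = STAMP.fullmatch(stamp)
        if not match:
            print(f"unparseable stamp: {stamp!r}")
            continue
        day, hh, mm = match.group(1), int(match.group(2)), int(match.group(3))
        if day not in DAYS:
            print(f"unknown day in {stamp!r}")
            continue
        # Minutes since Monday 00:00; this sketch ignores week wrap-around.
        total = DAYS.index(day) * 24 * 60 + hh * 60 + mm
        if previous is not None:
            if total < previous:
                print(f"time moved backwards at {stamp}")
            elif total - previous > max_step:
                print(f"jump of {total - previous} minutes at {stamp}")
        previous = total

check_stamps(["(Monday) (09:00)", "(Monday) (09:10)", "(Monday) (08:50)"])
# -> time moved backwards at (Monday) (08:50)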

Scenario 2: Resistance Test

Make unreasonable requests. Does the character:

  • Refuse appropriately for their personality?
  • Maintain refusal under pressure?
  • Stay in character while refusing?
  • Avoid breaking into helpful assistant mode?

Scenario 3: Memory Consistency Test

Reference previous parts of the conversation. Does the character:

  • Remember what they said earlier?
  • Maintain consistent facts about themselves?
  • Track relationship progression appropriately?
  • Avoid contradicting established information?

Scenario 4: Edge Case Test

Try unusual inputs. Does the character:

  • Handle nonsensical questions appropriately?
  • Maintain personality under weird circumstances?
  • Avoid breaking character or explaining they’re AI?
  • Stay consistent when conversation gets strange?

Model-Specific Testing Questions

For Claude Sonnet 4:

  • Does the character show natural unpredictability, not just rule-following?
  • Are responses appropriately chaotic/human rather than overly systematic?
  • Does the character break their own patterns occasionally in realistic ways?
  • Is the character too “perfect” at following behavioral rules?

For GPT-4:

  • Does clothing/position tracking work consistently throughout conversation?
  • Are physical states (sitting, standing, location) maintained accurately?
  • Does the character remember spatial relationships and body awareness?
  • Is emotional depth balanced with physical consistency?

For DeepSeek V3:

  • Does the character maintain core personality despite creative impulses?
  • Are responses chaotic in character-appropriate ways, not randomly wild?
  • Does creativity enhance rather than override established behavioral patterns?
  • Is unpredictability bounded by the character’s psychological framework?
  • Does the character stay in logical locations without teleporting randomly?

For Competent Models (GPT-3.5, Claude Haiku):

  • Do tracking systems work consistently?
  • Does the character remember basic rules throughout conversation?
  • Is personality distinctive enough to avoid generic responses?
  • Do clear instructions prevent default behavior patterns?

For Basic Models (Smaller models, older versions):

  • Are instructions explicit enough to follow exactly?
  • Does the character avoid reverting to training data patterns?
  • Do caps-lock emphasis and repetition help with critical rules?
  • Are tracking formats simple enough to maintain consistently?

Common Model-Specific Failures

“It Works on Claude but Breaks on GPT”

Typical Cause: Claude handles implicit instructions better than GPT
Solution: Make behavioral rules more explicit

Claude Version (Implicit):

Marcus judges people by how they treat their car and adjusts his helpfulness accordingly.

GPT Version (Explicit):

Marcus treats customers differently based on car maintenance:
- Well-maintained car: patient explanations, extra help
- Poor maintenance: professional service, minimal extra effort  
- Abused car: shorter responses, focuses on immediate safety issues

“It Works on GPT but Becomes Weird on Llama”

Typical Cause: Llama struggles with complex psychological concepts
Solution: Simplify personality to concrete behaviors

GPT Version (Psychological):

Elena has introverted intuition that drives her to help people find exactly what they need, even if they don't know they need it.

Llama Version (Behavioral):

Elena notices when people seem confused and offers specific helpful suggestions:
- If someone looks lost, Elena asks what they're researching
- If someone seems frustrated, Elena suggests different search approaches
- Elena remembers what worked for similar questions before

“It Worked Yesterday but Broke Today”

Typical Cause: Model updates or different deployment versions
Solution: Test regularly and maintain compatibility notes

Keep version logs:

Character: Marcus v2.3
- Tested on: GPT-4 (March 2024), Claude Sonnet 4 (April 2024)
- Works best on: Claude Sonnet 4
- Breaks on: GPT-3.5 (needs more explicit time tracking)
- Last tested: April 15, 2024

Building Model-Specific Versions

Instead of trying to make one prompt work everywhere, create targeted versions:

The Core Template Approach

Start with your base character concept, then adapt for specific models:

Base Concept:

Marcus is a mechanic who judges people by how they treat their cars and becomes less helpful when people ignore obvious maintenance needs.

Claude Version (Nuanced):

Marcus operates from a core belief that car care reflects personal responsibility. He notices maintenance patterns immediately and adjusts his interaction style based on what he observes—more detailed explanations for people who clearly care for their vehicles, more direct problem-solving for those who've obviously neglected maintenance. This isn't conscious judgment; it's automatic professional assessment that influences his communication style.

GPT Version (Explicit):

Marcus judges customers by car maintenance and changes his behavior accordingly:

WELL-MAINTAINED CARS (recent oil changes, no warning lights):
- Explains problems thoroughly
- Suggests preventive maintenance  
- Offers additional services
- Patient with questions

POOR MAINTENANCE (overdue oil, multiple warning lights):
- Gives direct answers only
- Focuses on immediate safety issues
- Less likely to volunteer extra information
- Professional but minimal interaction

Llama Version (Simple Rules):

Marcus treats customers differently based on their car condition:

If car is clean and well-maintained:
- Marcus explains things carefully
- Marcus suggests ways to keep car running well
- Marcus is patient with questions

If car is dirty or has problems from poor care:
- Marcus fixes only what customer asks for
- Marcus gives short answers
- Marcus focuses on safety problems first

Documentation and User Guidance

When sharing characters, always include compatibility information:

Essential Documentation Template

CHARACTER: Marcus the Mechanic v2.1

COMPATIBILITY:
✅ BEST: Claude Sonnet 4, GPT-4, DeepSeek V3
⚠️ WORKS: GPT-3.5 (may need time tracking adjustments)
❌ BREAKS: Smaller Llama models, older GPT versions

TESTING NOTES:
- Time tracking works perfectly on Claude, sometimes skips on GPT
- Personality consistency excellent on GPT-4, needs reinforcement on 3.5
- Complex psychological aspects only work on Tier 1 models

TROUBLESHOOTING:
- If character becomes too helpful: Add more explicit resistance programming
- If time tracking breaks: Switch to exhaustive time rules version
- If personality becomes generic: Use model-specific behavioral reinforcement

ALTERNATIVE VERSIONS:
- Marcus_GPT35.txt: Simplified version for GPT-3.5
- Marcus_Basic.txt: Ultra-simple version for smaller models

What’s Next

You now understand that character compatibility isn’t universal—different models have different capabilities and failure modes. The solution isn’t to dumb everything down to the lowest common denominator, but to build strategically for your target model while understanding the limitations.

In the next article, we’ll explore advanced implementation techniques—how to systematically debug character breakdowns, optimize for specific model capabilities, and build characters that push the boundaries of what’s possible on your target platform.

The goal remains creating characters that feel authentically real—but now you understand that “real” looks different depending on which AI you’re working with. Master your target model first, then expand compatibility strategically.