Large language models lose the thread in long conversations
“Large language models can provide impressive answers, but they lose their way surprisingly quickly in a long conversation.”
The use of language models as digital colleagues is growing rapidly. However, recent research from Microsoft Research and Salesforce shows that large language models (including GPT-4.1, Claude 3, and Gemini 2.5 Pro) fundamentally fall short in a situation that is perfectly normal for humans: clarifying an assignment step by step during a conversation.
The study shows that large language models perform on average 39% worse when information is spread across multiple conversational turns. These models also get lost when asked to reproduce the same image over and over again.
Research method and background
For this study, researchers simulated over 200,000 conversations with 15 large language models. The core question: do large language models continue to perform as well if you provide the instructions bit by bit rather than all at once?
To test this, they used a new method called sharded prompting: a complete instruction is broken into smaller pieces (shards), which are then revealed one at a time over the course of the conversation. This simulates how people use language models in practice: first a general request, followed by additional context, conditions, or details.
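To make this concrete, the sketch below simulates both conditions for a single, fictional drafting task, assuming an OpenAI-style chat message format; the task text, the shard split, and the send_to_model() placeholder are invented for illustration and are not the researchers' actual setup.

```python
# A minimal sketch of sharded prompting, assuming an OpenAI-style message format.
# The task, the shard split, and send_to_model() are illustrative placeholders.

full_instruction = (
    "Draft a confidentiality clause for a software development agreement. "
    "It must survive termination for five years, exclude publicly available "
    "information, and be governed by Dutch law. "
    "Return the clause as a single numbered article."
)

# The same instruction broken into shards, revealed one turn at a time.
shards = [
    "Draft a confidentiality clause for a software development agreement.",
    "It must survive termination for five years.",
    "Exclude publicly available information.",
    "The clause is governed by Dutch law.",
    "Return the clause as a single numbered article.",
]

def send_to_model(messages):
    """Placeholder for a chat-completion call; returns a dummy reply here."""
    return "model reply"

# Single-turn condition: the full instruction in one user message.
single_turn_reply = send_to_model([{"role": "user", "content": full_instruction}])

# Multi-turn (sharded) condition: each shard is a new user turn,
# and the model's previous answers stay in the running context.
conversation = []
for shard in shards:
    conversation.append({"role": "user", "content": shard})
    reply = send_to_model(conversation)
    conversation.append({"role": "assistant", "content": reply})
```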
What is the Lost in Conversation effect?
The research shows that many large language models stumble over conversations in which information is built up incrementally.
The figures:
- Average performance drop of 39% with fragmented instructions: if you provide an assignment in parts instead of all at once, large language models score on average 39% lower on exactly the same task and deliver noticeably poorer answers.
- Reliability is halved: the gap between the best and the worst outcome for the same task can reach 50 points on a scale of 0 to 100.
- This effect occurs even in conversations consisting of only two turns.
What exactly goes wrong:
- The large language models make a guess too early and then remain stuck in that initial, often incorrect interpretation.
- The large language models overreact to the latest input and forget what was said previously.
- Answers become longer and ‘bloated’, with more irrelevant details and less precision.
Why these findings are relevant for legal professionals
More and more legal professionals are using large language models as co-pilots for drafting advice, analyzing documents, or structuring conversations. However, those who rely on a fluid conversation with a language model run a risk.
Because:
- Legal professionals often work with step-by-step information, such as in case file development, client instructions, or negotiations.
- In practice, assignments change during the conversation. Large language models appear to still handle this poorly.
- Reliable collaboration requires language models that are context-consistent: language models that do not forget previous information and can correctly integrate new input. Such large language models hardly exist yet.
Three levels of instructions for large language models and their risks
- Full instruction at once
  ✓ Highest reliability.
  ✓ Best performance.
- Information as a list (concat)
  ▪ You provide all information at once, but not as one continuous text: the components are listed point by point, for example as bullet points.
  ▪ Performance is slightly lower (around 95% of the full-instruction level), but still stable and usable.
- Conversational build-up (sharded prompts)
  ✗ Information is shared over multiple steps, as in a dialogue. This resembles how humans speak, but it is exactly where large language models lose the thread.
  ✗ Result: greater chance of errors, inconsistencies, and confusion (all three formats are illustrated in the sketch below).
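For comparison, here is a minimal sketch that builds the same fictional legal task in all three formats; the requirement texts and variable names are assumptions made for this article, not taken from the study.

```python
# A minimal sketch of the three input formats, using an invented legal task.

requirements = [
    "Advise on the enforceability of a non-compete clause.",
    "The employee works in the Netherlands.",
    "The clause lasts two years and covers the entire country.",
    "Answer in no more than 300 words, in plain language.",
]

# 1. Full instruction at once: one continuous text.
full_prompt = " ".join(requirements)

# 2. Concat: all information at once, but listed point by point.
concat_prompt = "Complete the following task:\n" + "\n".join(
    f"- {r}" for r in requirements
)

# 3. Sharded: each requirement becomes a separate user turn in a dialogue.
sharded_turns = [{"role": "user", "content": r} for r in requirements]

print(full_prompt)
print(concat_prompt)
print(sharded_turns)
```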
Conclusion: new insights, new skills
What does this mean for legal professionals who want to use language models smartly? You must learn to prompt with structure and clarity. It is not enough to simply ‘have a conversation’ with such a language model.
Five tips from the research:
- Bundle your instructions
  - Provide as complete an assignment as possible in one go.
  - Formulate the objective, context, constraints, and desired format together (see the first sketch after this list).
- State explicitly what the language model should and should not do
  - Start with a clear task description, followed by background information.
  - Avoid vague intermediate sentences such as “what do you think so far?”; they make the assignment unclear.
- Use lists or bullets for complex tasks
  - Lists (as in concat prompts) work better than isolated remarks spread throughout the conversation.
  - For example: “Take into account: 1. Confidentiality, 2. Deadline, 3. Applicable law.”
- Prevent the language model from guessing
  - Large language models readily fill in missing information themselves.
  - Once made, such an assumption persists stubbornly throughout the conversation.
  - It is therefore better to write: “Do not make assumptions about parties or context.”
- Restart your conversation if it gets stuck
  - If the language model remains stuck in a wrong interpretation, start a new session.
  - First, have the language model summarize everything: “What do you know so far?”
  - Then enter that summary as a new, complete prompt (see the second sketch after this list). This prevents earlier errors from seeping through.
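To show what the first four tips can look like in practice, here is a minimal sketch of a “bundled” prompt; the build_prompt() helper, its field names, and the NDA example are invented for this article rather than prescribed by the research.

```python
# A minimal sketch of a "bundled" prompt: objective, context, constraints,
# and desired format combined into one instruction, with guessing discouraged.

def build_prompt(objective, context, constraints, output_format):
    """Combine all parts of an assignment into a single, complete instruction."""
    constraint_lines = "\n".join(
        f"{i}. {c}" for i, c in enumerate(constraints, start=1)
    )
    return (
        f"Task: {objective}\n\n"
        f"Background: {context}\n\n"
        f"Take into account:\n{constraint_lines}\n\n"
        f"Output format: {output_format}\n"
        "Do not make assumptions about parties or context; "
        "ask a clarifying question if information is missing."
    )

prompt = build_prompt(
    objective="Review the attached NDA and flag clauses that deviate from market practice.",
    context="The NDA governs a pilot project between a hospital and a software vendor.",
    constraints=["Confidentiality", "Deadline of 30 June", "Applicable law: Dutch law"],
    output_format="A numbered list of findings, each with a short recommendation.",
)
print(prompt)
```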
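The restart tip can be sketched as a small workflow as well, again assuming an OpenAI-style message format; send_to_model() is a placeholder, not a real API call, and the stuck conversation is invented.

```python
# A minimal sketch of "restart with a summary" when a conversation gets stuck.

def send_to_model(messages):
    """Placeholder for a chat-completion call; returns a dummy reply here."""
    return "summary of everything discussed so far"

stuck_conversation = [
    {"role": "user", "content": "Draft a data processing agreement for ..."},
    {"role": "assistant", "content": "(answer based on a wrong assumption)"},
    # ... further turns in which the model keeps repeating the wrong interpretation
]

# 1. Ask the stuck session what it knows so far.
summary = send_to_model(
    stuck_conversation + [{"role": "user", "content": "What do you know so far?"}]
)

# 2. Start a fresh session and feed that summary back as one complete prompt.
fresh_session = [
    {
        "role": "user",
        "content": f"Context so far:\n{summary}\n\n"
        "Now draft the agreement, taking all of the above into account.",
    }
]
reply = send_to_model(fresh_session)
```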
Good output therefore begins with a clear instruction, especially when collaborating with a language model.