Understanding 6 Key LLM Agent Architectures: From Basic Reflection to Parallel Execution
Large Language Model (LLM) agents have revolutionized how we build AI applications. But understanding the architectural approaches behind these agents can be challenging, especially when code-heavy implementations obscure the intuitive flow of how these systems work.
In this article, we'll break down six main agent architectures for LLM-based applications, from basic reflection to advanced parallel execution frameworks. For consistency, each architecture was tested using GPT-4 Turbo (April 9th release) on the same query: "What are the current trends in digital marketing for tech companies?"
1. Basic Reflection
How It Works
The basic reflection architecture follows a simple flow:
1. The user provides an initial query
2. The LLM generates an initial response
3. A reflection step critiques the response, suggesting improvements
4. A new response is generated based on the critique
5. This process repeats for a set number of iterations or until satisfactory results are achieved
This approach is essentially a loop of generate-reflect-regenerate that progressively improves the output.
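To make the loop concrete, here is a minimal plain-Python sketch. It assumes an OpenAI-style chat model wrapped in a small `llm()` helper; the prompt wording and the fixed three-iteration count are illustrative choices, not the LangChain implementation referenced at the end of this article.

```python
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    """Minimal chat-completion helper (any chat LLM would do)."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def basic_reflection(query: str, iterations: int = 3) -> str:
    # Steps 1-2: generate an initial response to the user query.
    response = llm(f"Answer the following question:\n{query}")
    for _ in range(iterations):
        # Step 3: critique the current response and suggest improvements.
        critique = llm(
            f"Critique this answer to '{query}' and list concrete improvements:\n{response}"
        )
        # Step 4: regenerate the full response using the critique.
        response = llm(
            f"Question: {query}\nPrevious answer: {response}\n"
            f"Critique: {critique}\nWrite an improved answer."
        )
    return response

print(basic_reflection("What are the current trends in digital marketing for tech companies?"))
```

Because the entire answer is regenerated on every pass, the token cost grows quickly, which is exactly the weakness noted in the limitations below.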
Performance
- **Time**: 118 seconds
- **Token usage**: ~18,160 tokens
- **Iterations**: Three reflection cycles
Strengths and Limitations
**Strengths:**
- Very easy to implement
- Straightforward conceptual approach
- No complex dependencies
**Limitations:**
- High token usage as it regenerates the entire response each time
- Slow processing due to multiple full-text generations
- Without grounding tools, it often fabricates information (like claiming "Amazon's personalized marketing techniques increased conversion rates by over 202%")
2. Reflexion
Background
Based on the October 2023 paper "Reflexion: Language Agents with Verbal Reinforcement Learning"
How It Works
Reflexion enhances the basic reflection approach:
1. User query is input
2. Initial response is generated along with self-critique and suggested tool queries
3. The suggested tool queries are executed (e.g., web searches)
4. A "reviser" prompt combines the original response, self-reflection, and new context from executed tools
5. This generates a revised response with updated information, new self-reflection, and potentially new tool queries
6. The process repeats until a satisfactory answer is achieved
The key difference from basic reflection is the separate tool execution step that grounds the model in external data.
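A rough sketch of that flow is shown below. The `web_search()` function is a hypothetical stand-in for a real search tool, and the '###'-delimited output format is an assumption made to keep the parsing simple; real Reflexion implementations typically use structured (function-calling) outputs instead.

```python
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def web_search(query: str) -> str:
    # Hypothetical grounding tool; swap in any real search API.
    return f"[search results for: {query}]"

def reflexion(query: str, revisions: int = 3) -> str:
    # Step 2: initial draft plus self-critique and suggested search queries,
    # separated by '###' so they can be parsed back out.
    draft = llm(
        f"Answer the question, then write a self-critique, then list up to 3 web "
        f"search queries that would improve the answer. Separate the three parts "
        f"with '###'.\nQuestion: {query}"
    )
    for _ in range(revisions):
        parts = draft.split("###")
        answer = parts[0]
        critique = parts[1] if len(parts) > 1 else ""
        queries = parts[2].splitlines() if len(parts) > 2 else []
        # Step 3: execute the suggested tool queries to gather external context.
        context = "\n".join(web_search(q) for q in queries if q.strip())
        # Steps 4-5: the reviser combines the draft, critique, and new evidence.
        draft = llm(
            f"Question: {query}\nPrevious answer: {answer}\nCritique: {critique}\n"
            f"New evidence:\n{context}\n"
            f"Revise the answer with citations, then write a new critique and new "
            f"search queries, separated by '###'."
        )
    return draft.split("###")[0]
```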
Performance
- **Time**: 69 seconds
- **Token usage**: ~24,000 tokens
- **Iterations**: Three revision cycles
Strengths and Limitations
**Strengths:**
- About 40% faster than basic reflection (69 vs. 118 seconds)
- Grounds responses in external data sources
- Provides citations and references from web searches
- More accurate information than basic reflection
**Limitations:**
- Token usage can be high, especially when retrieving large amounts of web content
- Still requires multiple iterations to refine the answer
3. LATS (Language Agent Tree Search)
Background
Based on the December 2023 paper "Language Agent Tree Search: Unifying Reasoning, Acting and Planning in Language Models"
How It Works
LATS uses a Monte Carlo tree search approach, with the same LLM acting as generator, evaluator, and reflector:
1. User query serves as input
2. Initial response or tool execution is generated as the root node
3. LLM generates a reflection on the output and assigns a score
4. If a solution isn't found, multiple candidate nodes (typically 5-6) are generated from the root node
5. Each candidate is scored by the reflection prompt
6. The highest-scoring candidate becomes the new starting point
7. The process repeats until a high enough score is reached or maximum search depth
This creates a tree structure where the agent explores different potential solution paths.
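The sketch below illustrates only the expand-score-select cycle, and in a greedy form; a full LATS implementation keeps the entire tree, backpropagates node values, and uses an upper-confidence selection rule. The prompts and the 0-10 scoring scheme are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def score(query: str, candidate: str) -> float:
    # Steps 3 and 5: a reflection prompt grades a candidate. LLM-as-judge
    # scores tend to be inflated, which is the limitation noted below.
    reply = llm(
        f"Score this answer to '{query}' from 0 to 10. Reply with a number only.\n{candidate}"
    )
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0

def lats_greedy(query: str, n_candidates: int = 5, max_depth: int = 3,
                threshold: float = 8.0) -> str:
    # Step 2: the root node is an initial response to the query.
    best = llm(f"Answer the question:\n{query}")
    for _ in range(max_depth):
        if score(query, best) >= threshold:
            break  # Step 7: good enough, stop searching.
        # Step 4: expand by generating several candidate improvements.
        candidates = [
            llm(f"Question: {query}\nCurrent answer: {best}\nWrite a better answer.")
            for _ in range(n_candidates)
        ]
        # Steps 5-6: score each candidate and continue from the best one.
        best = max(candidates, key=lambda c: score(query, c))
    return best
```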
Performance
- **Time**: 30 seconds
- **Token usage**: ~8,000 tokens
Strengths and Limitations
**Strengths:**
- Very quick processing compared to other reflection methods
- Explores multiple potential solutions simultaneously
**Limitations:**
- Using LLMs as scoring mechanisms is problematic as they tend to overscore
- Often accepts solutions too easily, limiting exploration
- Can generate excessive text if multiple candidate nodes are created at each step
4. Plan and Execute
Background
Based on the May 2023 paper "Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models"
How It Works
This approach focuses on task decomposition:
1. User query is processed
2. Instead of immediate generation, a planning prompt creates a step-by-step approach
3. The first step of the plan is executed (generation or tool use)
4. A "replan" prompt evaluates if the task is complete or requires more steps
5. If more steps are needed, the next step is executed
6. This loop continues until the replan prompt determines the answer is adequate
This method breaks complex problems into smaller, more manageable steps.
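Here is a minimal sketch of the plan-execute-replan loop. The planner, executor, and replanner prompts are illustrative, and the 'FINAL:'/'CONTINUE' convention is an assumption used to keep the control flow readable.

```python
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def plan_and_execute(query: str, max_steps: int = 6) -> str:
    # Step 2: the planner breaks the task into numbered steps instead of
    # answering directly.
    plan = llm(f"Write a short numbered plan to answer:\n{query}")
    results: list[str] = []
    for _ in range(max_steps):
        # Steps 3 and 5: carry out the next unfinished step of the plan.
        step_output = llm(
            f"Task: {query}\nPlan:\n{plan}\nCompleted so far:\n"
            + "\n".join(results)
            + "\nCarry out the next step and report its result."
        )
        results.append(step_output)
        # Step 4: the replanner decides whether the task is complete.
        verdict = llm(
            f"Task: {query}\nResults so far:\n" + "\n".join(results)
            + "\nIf this is enough to answer the task, reply with the final "
              "answer prefixed by 'FINAL:'. Otherwise reply 'CONTINUE'."
        )
        if verdict.startswith("FINAL:"):
            return verdict[len("FINAL:"):].strip()
    return results[-1]
```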
Performance
- **Time**: 24 seconds
- **Token usage**: ~3,000 tokens
Strengths and Limitations
**Strengths:**
- Significantly faster than reflection-based approaches
- More efficient token usage
- Breaking tasks into subtasks makes complex problems more manageable for LLMs
**Limitations:**
- Quality depends on how well the initial plan is constructed
- May not benefit from the iterative improvements that reflection provides
5. ReWOO (Reasoning Without Observation)
Background
Based on the May 2023 paper "ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models"
How It Works
ReWOO optimizes the Plan and Execute approach through variable substitution:
1. User query is input
2. A planner generates a full task list with placeholder variables (E1, E2, etc.)
3. These variables create dependencies between steps (e.g., E2 might depend on the output of E1)
4. Each step is executed sequentially, with outputs substituted into placeholder variables
5. After all steps are complete, a solver prompt uses the plan and evidence to generate the final response
This creates a more efficient execution path by planning the entire task chain upfront.
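The sketch below illustrates the placeholder-variable idea. The `#E<n> = Tool[arg]` plan format loosely follows the paper's notation, but the exact prompts, the regex-based parsing, and the `web_search()` tool are assumptions for illustration.

```python
import re
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def web_search(query: str) -> str:
    # Hypothetical tool; replace with a real search API.
    return f"[search results for: {query}]"

TOOLS = {"Search": web_search, "LLM": llm}

def rewoo(query: str) -> str:
    # Step 2: the planner emits the whole task list up front with #E placeholders,
    # e.g.  #E1 = Search[digital marketing trends for tech companies]
    #       #E2 = LLM[Summarize #E1]
    plan = llm(
        f"Plan how to answer: {query}\n"
        "Write one line per step in the form '#E<n> = Search[...]' or "
        "'#E<n> = LLM[...]'. Later steps may reference earlier #E variables."
    )
    evidence: dict[str, str] = {}
    for var, tool, arg in re.findall(r"(#E\d+)\s*=\s*(\w+)\[(.*)\]", plan):
        # Step 4: substitute previously collected evidence into the argument,
        # then execute the step and store its output.
        for name, value in evidence.items():
            arg = arg.replace(name, value)
        evidence[var] = TOOLS.get(tool, llm)(arg)
    # Step 5: the solver combines the plan and all evidence into the final answer.
    return llm(
        f"Question: {query}\nPlan:\n{plan}\nEvidence:\n"
        + "\n".join(f"{k}: {v}" for k, v in evidence.items())
        + "\nWrite the final answer."
    )
```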
Performance
- **Time**: 21 seconds
- **Token usage**: Slightly more than Plan and Execute due to web search results
Strengths and Limitations
**Strengths:**
- Faster than Plan and Execute despite handling more complex dependency chains
- Creates a clear execution path with explicit dependencies
- Reduces the need for replanning
**Limitations:**
- Token usage can still be high with web search tools
- Sequential execution can be slower than parallel methods for independent tasks
6. LLM Compiler
Background
Based on the February 2024 paper "An LLM Compiler for Parallel Function Calling"
How It Works
LLM Compiler takes ReWOO's concept further by adding parallel processing:
1. User query is input
2. A planner creates a task list with placeholder variables and reasoning
3. A task fetching unit parses the plan and determines interdependencies
4. Tasks without dependencies are executed in parallel
5. Results are fed back to resolve dependencies for remaining tasks
6. This repeats until the plan is fully executed
7. A joiner prompt either determines the final answer or requests additional steps
The key innovation is the parallel execution of independent tasks, inspired by directed acyclic graphs from compiler design.
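A simplified sketch of the parallel scheduling idea follows. It reuses the same `#E<n>` plan format as the ReWOO sketch, runs only a single hypothetical `Search` tool, and uses a thread pool as a stand-in for the task fetching unit; the actual LLM Compiler planner, dependency format, and joiner prompts differ.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def web_search(query: str) -> str:
    # Hypothetical tool; replace with a real search API.
    return f"[search results for: {query}]"

def substitute(arg: str, done: dict[str, str]) -> str:
    # Replace #E placeholders with results that are already available.
    for name, value in done.items():
        arg = arg.replace(name, value)
    return arg

def llm_compiler(query: str) -> str:
    # Step 2: the planner emits tasks in the '#E<n> = Search[arg]' style.
    plan = llm(
        f"Plan how to answer: {query}\n"
        "One line per step in the form '#E<n> = Search[...]'. Later steps may "
        "reference earlier #E variables; independent steps should not."
    )
    tasks = re.findall(r"(#E\d+)\s*=\s*\w+\[(.*)\]", plan)
    done: dict[str, str] = {}
    with ThreadPoolExecutor() as pool:
        # Steps 3-6: on each pass, run every task whose dependencies are
        # already resolved, in parallel.
        while len(done) < len(tasks):
            ready = [
                (var, substitute(arg, done)) for var, arg in tasks
                if var not in done
                and all(dep in done for dep in re.findall(r"#E\d+", arg))
            ]
            if not ready:
                break  # Unresolvable dependency; give up on the remaining tasks.
            for (var, _), result in zip(ready, pool.map(web_search, (a for _, a in ready))):
                done[var] = result
    # Step 7: the joiner turns the collected task results into the final answer.
    return llm(
        f"Question: {query}\nTask results:\n"
        + "\n".join(f"{k}: {v}" for k, v in done.items())
        + "\nWrite the final answer."
    )
```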
Performance
- **Time**: Fastest of all approaches
- **Token usage**: Efficient due to parallel processing
Strengths and Limitations
**Strengths:**
- Optimized for speed with parallel function calling
- Algorithmically determines the most efficient execution path
- Combines the best elements of planning and execution approaches
**Limitations:**
- More complex to implement
- May have overhead for simple queries that don't benefit from parallelization
Conclusion
These six architectures represent an evolution in LLM agent design, from simple reflection loops to sophisticated parallel execution frameworks. Each approach offers different tradeoffs between simplicity, speed, accuracy, and token efficiency.
For simpler applications, basic reflection or Plan and Execute might be sufficient. For more complex, information-dense tasks that require multiple tool calls, advanced architectures like ReWOO or LLM Compiler can provide significant performance benefits.
Understanding these different architectural patterns allows developers to choose the right approach for their specific use case, or even design hybrid approaches that combine the strengths of multiple architectures.
---
*Note: All code implementations mentioned in this article are available via LangChain's platform, which provided many of the graphics and code examples used in testing these architectures.*