From Basic Networks to Multi-Agent Reinforcement Learning
Neural Networks (MLP, RNN, Transformer): These are the building blocks - the actual network architectures that process data.
RL Algorithms (PPO, DDPG, MAPPO, MADDPG): These are training algorithms that USE neural networks as their policy/value functions.
Example: MAPPO is an algorithm that can use either MLP or RNN/LSTM/GRU as its underlying network architecture. So you have "MLP-MAPPO" or "Recurrent-MAPPO".
| Feature | MLP | Vanilla RNN | LSTM | GRU |
|---|---|---|---|---|
| Type | Feedforward | Recurrent | Recurrent | Recurrent |
| Memory | None | Short-term | Long-term | Long-term |
| Gates | 0 | 0 | 3 | 2 |
| Best For | Static data | Short sequences | Long sequences | Medium sequences |
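To make the distinction concrete, here is a minimal sketch (assuming PyTorch; all sizes are illustrative): the MLP maps a single observation to an output with no memory, while the LSTM consumes a sequence and carries hidden state across time steps.

```python
import torch
import torch.nn as nn

# Feedforward MLP: one observation in, one output out, no memory.
mlp = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
out = mlp(torch.randn(1, 8))                      # shape: (1, 4)

# Recurrent LSTM: a whole sequence in, plus hidden/cell states carried across steps.
lstm = nn.LSTM(input_size=8, hidden_size=64, batch_first=True)
seq = torch.randn(1, 10, 8)                       # batch of 1, 10 time steps, 8 features
outputs, (h_n, c_n) = lstm(seq)                   # h_n, c_n hold the memory after step 10
```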
Revolution: Transformers replaced RNNs as the dominant architecture for sequence tasks through the attention mechanism.
Key Idea: Instead of processing sequences step-by-step (like RNNs), transformers process all positions simultaneously using self-attention.
| Feature | RNN/LSTM/GRU | Transformer |
|---|---|---|
| Processing | Sequential (step-by-step) | Parallel (all at once) |
| Memory Mechanism | Hidden state | Self-attention |
| Long Dependencies | Difficult (even with LSTM) | Easy (direct attention) |
| Training Speed | Slow (sequential) | Fast (parallelizable) |
| Parameters | Fewer | Many more |
| Data Requirements | Works with less data | Needs large datasets |
| Examples | Simple NLP tasks | GPT, BERT, ChatGPT |
Embedding Layer: Convert tokens to vectors and add positional encoding
Self-Attention: Each position attends to all other positions: Attention(Q, K, V) = softmax(QK^T / √d_k)V (sketched in code below)
Multi-Head Attention: Run multiple attention mechanisms in parallel (typically 8-16 heads)
Feed-Forward Network: Two-layer MLP applied to each position independently
Residual Connections & Layer Normalization: Stabilize training and enable deep networks (often 12-96+ layers)
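A minimal sketch of the attention formula above (assuming PyTorch; shapes and names are illustrative, not a full transformer layer):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # similarity of every query to every key
    weights = F.softmax(scores, dim=-1)                  # each position attends to all positions
    return weights @ V                                   # weighted sum of values

# Example: one sequence of 5 tokens with embedding size 64.
Q = K = V = torch.randn(1, 5, 64)
out = scaled_dot_product_attention(Q, K, V)              # shape: (1, 5, 64)
```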
Reinforcement Learning algorithms are training methods that teach agents to make decisions. They USE neural networks (MLP, RNN, etc.) as function approximators for policies and value functions.
| Algorithm | Type | Network Used | Key Idea |
|---|---|---|---|
| PPO | Policy Gradient | MLP or RNN | Clip objective to prevent large policy updates |
| DDPG | Actor-Critic | MLP (typically) | Continuous action spaces with deterministic policy |
| MAPPO | Multi-Agent PPO | MLP or RNN | PPO extended to multiple agents with centralized training |
| MADDPG | Multi-Agent DDPG | MLP (typically) | DDPG for multiple agents with centralized critic |
Agent interacts with the environment using the current policy π_θ
A(s,a) = Q(s,a) - V(s): how much better an action is than average
L = min(r(θ)·A, clip(r(θ), 1-ε, 1+ε)·A), where r(θ) = π_θ(a|s)/π_old(a|s)
Train critic network to predict V(s) using MSE loss
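A minimal sketch of the clipped objective above (assuming PyTorch; `new_logp`, `old_logp`, and `advantages` are per-step tensors collected during rollouts, and the names are illustrative):

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """L = -E[min(r(theta) * A, clip(r(theta), 1 - eps, 1 + eps) * A)]."""
    ratio = torch.exp(new_logp - old_logp)                 # r(theta) = pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()           # negate because optimizers minimize
```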
Deterministic policy μ(s) that outputs continuous actions
Q-function Q(s,a) estimates value of state-action pairs
Store transitions in buffer and sample randomly for training
Use slowly-updated target networks for stable learning
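A minimal sketch of the target-network update (assuming PyTorch modules with identical architectures; `tau` is the soft-update rate, and the default shown here is illustrative):

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    for t_param, param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(1.0 - tau).add_(tau * param)
```

The same update is applied to both the target actor and the target critic after each learning step.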
Multiple agents learning simultaneously in shared environment
These are NOT neural networks! They are RL training algorithms that extend single-agent methods (PPO and DDPG) to multi-agent scenarios.
Key Concept: Each algorithm can use EITHER MLP or Recurrent (LSTM/GRU) networks as their underlying architecture.
| Algorithm | Based On | Network Options | Training | Execution |
|---|---|---|---|---|
| MAPPO | PPO | MLP or RNN/LSTM/GRU | Centralized | Decentralized |
| Recurrent-MAPPO | PPO | LSTM/GRU (with hidden states) | Centralized | Decentralized |
| MADDPG | DDPG | MLP (typically) | Centralized Critic | Decentralized Actor |
| Recurrent-MADDPG | DDPG | LSTM/GRU (for partial obs) | Centralized Critic | Decentralized Actor |
MLP-MAPPO: Each agent has MLP policy network (for fully observable states)
Recurrent-MAPPO: Each agent has LSTM/GRU policy network (for partial observability or history dependence)
Centralized value function V(s) sees global state during training
Each agent's policy π_i(a_i|o_i) uses only local observations at test time
Can share parameters across agents (homogeneous) or use separate networks (heterogeneous)
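A minimal sketch of that split (assuming PyTorch; dimensions and names are illustrative), showing one parameter-shared MLP actor over local observations and one centralized critic over the global state:

```python
import torch
import torch.nn as nn

obs_dim, state_dim, act_dim, n_agents = 16, 48, 5, 3

# Decentralized actor, shared by all agents (homogeneous case):
# each agent feeds in only its own local observation o_i.
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

# Centralized critic: sees the global state s, but only during training.
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))

local_obs = torch.randn(n_agents, obs_dim)       # o_1, o_2, o_3
global_state = torch.randn(1, state_dim)         # global information (training only)

action_logits = actor(local_obs)                 # per-agent action distribution parameters
value = critic(global_state)                     # V(s) used for advantage estimation
```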
MLP-MADDPG: Actor and critic use MLP networks (standard)
Recurrent-MADDPG: Actor uses LSTM/GRU for partial observability (less common)
Q_i(s, a_1,...,a_N) - critic sees all agents' states and actions
μ_i(o_i) - each actor only uses its own observation
Best for environments with continuous action spaces (robotics, control)
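A minimal sketch of the centralized critic input (assuming PyTorch; here the global state is simply the concatenated observations, and the same actor is reused for all agents purely for brevity):

```python
import torch
import torch.nn as nn

obs_dim, act_dim, n_agents = 12, 2, 3
state_dim = obs_dim * n_agents                   # assumption: global state = all observations

# Decentralized actor for agent i: mu_i(o_i) -> continuous action in [-1, 1].
actor_i = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                        nn.Linear(64, act_dim), nn.Tanh())

# Centralized critic for agent i: Q_i(s, a_1, ..., a_N).
critic_i = nn.Sequential(nn.Linear(state_dim + act_dim * n_agents, 64),
                         nn.ReLU(), nn.Linear(64, 1))

obs = torch.randn(n_agents, obs_dim)
actions = torch.stack([actor_i(obs[i]) for i in range(n_agents)])  # every agent's action
critic_input = torch.cat([obs.flatten(), actions.flatten()]).unsqueeze(0)
q_value = critic_i(critic_input)                 # scalar estimate of Q_i(s, a_1, ..., a_N)
```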
Agent 1: Policy (MLP or RNN) acting on observation o_1
Agent 2: Policy (MLP or RNN) acting on observation o_2
Agent 3: Policy (MLP or RNN) acting on observation o_3
Use MLP-based (MAPPO/MADDPG) when: the environment is fully observable and decisions do not depend on past observations.
Use Recurrent (LSTM/GRU) when: observations are partial or the task requires remembering history (temporal patterns).
| Name | Category | Memory | Parallelizable | Best Use Case |
|---|---|---|---|---|
| MLP | Neural Network (Feedforward) | None | Yes | Static data, classification |
| Vanilla RNN | Neural Network (Recurrent) | Short-term | No | Short sequences (rarely used) |
| LSTM | Neural Network (Recurrent) | Long-term | No | Long sequences, complex patterns |
| GRU | Neural Network (Recurrent) | Long-term | No | Medium sequences, efficiency |
| Transformer | Neural Network (Attention) | Attention-based | Yes | Large-scale NLP, language models |
| PPO | Single-Agent RL Algorithm | Depends on network | Parallel data collection | General RL, robotics |
| DDPG | Single-Agent RL Algorithm | Depends on network | Parallel data collection | Continuous control, robotics |
| MAPPO | Multi-Agent RL Algorithm | Depends on network | Parallel agents | Cooperative multi-agent tasks |
| Recurrent-MAPPO | Multi-Agent RL Algorithm | Long-term (LSTM/GRU) | Parallel agents | Partial observability, history-dependent tasks |
| MADDPG | Multi-Agent RL Algorithm | Depends on network | Parallel agents | Multi-agent continuous control |
| Recurrent-MADDPG | Multi-Agent RL Algorithm | Long-term (LSTM/GRU) | Parallel agents | Partially observable multi-agent control |
Layer 1 - Neural Networks (MLP, RNN, LSTM, GRU, Transformer): These are the fundamental building blocks. They are network architectures that process and transform data.
Layer 2 - RL Algorithms (PPO, DDPG): These are training methods that USE neural networks as function approximators. PPO might use an MLP or LSTM to represent its policy.
Layer 3 - Multi-Agent Extensions (MAPPO, MADDPG): These extend single-agent RL algorithms to multiple agents. They also USE neural networks (MLP or recurrent).
Key Point: An algorithm like "Recurrent-MAPPO" means: Multi-Agent PPO algorithm using LSTM/GRU networks instead of MLPs.
Neural Networks:
What they are: Data processing architectures
What they do: Transform inputs to outputs
Examples: MLP, LSTM, Transformer
RL Algorithms:
What they are: Training procedures
What they do: Teach agents to make decisions
Examples: PPO, DDPG, MAPPO, MADDPG
How They Fit Together:
Reality: RL algorithms USE neural networks
Example: "Recurrent-MAPPO" = MAPPO algorithm + LSTM network
Choice: Pick an algorithm AND a network architecture
Step 1 - Learning paradigm:
Supervised Learning? → Use Neural Networks (MLP, RNN, Transformer)
Reinforcement Learning? → Continue to Step 2
Step 2 - Number of agents:
Single Agent → PPO or DDPG
Multiple Agents → MAPPO or MADDPG
Step 3 - Network architecture:
Fully observable, no history needed? → MLP
Partial observability or temporal patterns? → LSTM/GRU (Recurrent)
Examples: MLP-PPO, Recurrent-MAPPO, MLP-MADDPG, etc.
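As a minimal sketch, the whole decision can be written as a small configuration helper (hypothetical function and argument names, just to make the flow concrete; the PPO-vs-DDPG split uses the continuous-action rule of thumb from the DDPG/MADDPG sections):

```python
def pick_setup(num_agents: int, fully_observable: bool, continuous_actions: bool) -> str:
    """Follow Steps 2-3 above: pick the algorithm by agent count, the network by observability."""
    if num_agents == 1:
        algo = "DDPG" if continuous_actions else "PPO"   # DDPG for continuous action spaces
    else:
        algo = "MADDPG" if continuous_actions else "MAPPO"
    net = "MLP" if fully_observable else "Recurrent"
    return f"{net}-{algo}"

print(pick_setup(num_agents=4, fully_observable=False, continuous_actions=False))
# -> "Recurrent-MAPPO"
```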
Scenario: Training multiple robots to cooperate in a warehouse
Choice 1 - Algorithm: MAPPO (multi-agent cooperation)
Choice 2 - Network: If robots have full visibility → MLP-MAPPO
If robots have limited sensors → Recurrent-MAPPO (LSTM)
Result: "We're using Recurrent-MAPPO" means multi-agent PPO with LSTM networks for handling partial observability.