Modular Temporal Difference (TD) Learning System
Overview
This module adds a modular, pluggable TD learning system to NEVO while preserving the Nengo neuromorphic implementation. It supports TD(0) and TD(λ), and lets you switch learning rules and value function models at runtime.
Key Features
TD(0) and TD(λ) learning variants.
Pluggable learning rules: Simple, Decaying, Conservative, Adaptive.
Pluggable value models: Linear, Bounded.
Eligibility traces for multi-step credit assignment.
Nengo neuromorphic networks preserved. TD learning operates on top.
Dynamic switching. Change rules and models at runtime.
Backward compatible. Existing code still works.
Architecture
Components
1. Learning Rules (LearningRule ABC)
Define how TD errors are converted to value updates.
Available Rules:
SimpleTDRule: Standard TD(0) update: ΔV = α * δ_t
DecayingTDRule: Decaying updates (exponential, linear, or constant)
ConservativeTDRule: Dampened updates for stability
AdaptiveTDRule: Magnitude-dependent learning rates
2. Value Models (ValueModel ABC)
Store and manage value function estimates.
Available Models:
LinearValueModel: Simple one-per-operator values
BoundedValueModel: Values with adaptive bounds and stability
3. Eligibility Traces (EligibilityTraceManager)
Manages credit assignment across multiple operators using traces that decay over time. These traces are what make the TD(λ) implementation possible.
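A minimal sketch of accumulating traces, assuming the decay factor γλ and the +1 increment that appear in the TD(λ) update formula under Implementation Details. `TraceSketch` is a hypothetical stand-in written for this example; the real `EligibilityTraceManager` interface may differ.

```python
import numpy as np


class TraceSketch:
    """Hypothetical accumulating eligibility traces, one per operator."""

    def __init__(self, n_operators, gamma=0.99, lambda_coeff=0.9):
        self.decay = gamma * lambda_coeff
        self.traces = np.zeros(n_operators)

    def visit(self, operator_idx):
        # Decay every trace, then bump the operator that was just used.
        self.traces *= self.decay
        self.traces[operator_idx] += 1.0
        return self.traces.copy()

    def reset(self):
        self.traces[:] = 0.0


tracker = TraceSketch(n_operators=3)
for idx in [0, 1, 0, 2]:
    current = tracker.visit(idx)

print(np.round(current, 4))
```

With accumulating traces, an operator that is used repeatedly builds up more eligibility than one used once, so in this visit sequence operator 0 ends with a larger trace than the most recently used operator 2.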
4. Temporal Difference Learner (TemporalDifferenceLearner)
Core TD(0)/TD(λ) algorithm integrating rules, models, and traces.
5. Basal Ganglia Selector (Modified)
Now uses TD learning alongside Nengo neuromorphic networks.
Usage
Basic TD(0) Learning
import numpy as np

from nevo.core.basal_ganglia import BasalGangliaSelector
from nevo.core.td_learning import SimpleTDRule, LinearValueModel
from nevo.operators.standard import LevyFlight, ParticleSwarm

# Create operators
operators = [LevyFlight(), ParticleSwarm()]

# Create selector with TD(0) learning
selector = BasalGangliaSelector(
    operators=operators,
    td_enabled=True,
    lambda_coeff=0.0,      # TD(0)
    learning_rate=0.1,
    gamma=0.99,            # Discount factor
    learning_rule=SimpleTDRule(),
    value_model=LinearValueModel(len(operators)),
)

# Start optimization episode
selector.begin_episode()

# Select operators during optimization
best_fitness = 10.0
operator_selection = np.array([1.0, 0.5])  # From basal ganglia
op = selector.select_operator(operator_selection, best_fitness)

# ... use operator, generate candidates, evaluate, get reward ...

# Next selection (TD learning updates values automatically)
better_fitness = 8.0
op = selector.select_operator(operator_selection, better_fitness)
TD(λ) Learning with Eligibility Traces
# TD(λ) with λ=0.9
selector = BasalGangliaSelector(
    operators=operators,
    td_enabled=True,
    lambda_coeff=0.9,      # TD(0.9) - multi-step credit
    learning_rate=0.1,
    gamma=0.99,
)
# With eligibility traces, updates propagate to previously visited operators
# Example: visiting operators [0, 1, 0, 2] assigns credit proportional to recency
Dynamic Learning Rule Switching
from nevo.core.td_learning import ConservativeTDRule, AdaptiveTDRule

# Start with simple rule
selector = BasalGangliaSelector(
    operators=operators,
    learning_rule=SimpleTDRule(),
)

# Switch to conservative rule for stability
selector.set_learning_rule(ConservativeTDRule(stability_weight=0.5))

# Switch to adaptive rule
selector.set_learning_rule(AdaptiveTDRule(window_size=20))
Dynamic Value Model Switching
from nevo.core.td_learning import BoundedValueModel

# Start with linear model
selector = BasalGangliaSelector(
    operators=operators,
    value_model=LinearValueModel(len(operators)),
)

# Switch to bounded model for stability
bounded_model = BoundedValueModel(
    n_operators=len(operators),
    min_bound=0.2,
    max_bound=3.0,
    adapt_bounds=True,
)
selector.set_value_model(bounded_model)
Accessing Learning Information
# Get current TD-learned values
td_values = selector.get_td_values()
print(f"Operator values: {td_values}")
# Get TD learning statistics
stats = selector.get_td_statistics()
print(f"Mean TD error: {stats['mean_td_error']:.4f}")
print(f"Std TD error: {stats['std_td_error']:.4f}")
# Get utility weights (state-dependent)
utility_weights = selector.get_utility_weights()
# Reset TD learning
selector.reset_td_learning()
Adjusting TD Parameters Dynamically
# Change learning rate
selector.td_learner.set_learning_rate(0.05)
# Change λ coefficient (interpolate between TD(0) and MC)
selector.set_td_lambda(0.5) # TD(0.5)
selector.set_td_lambda(0.9) # TD(0.9)
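The comment about interpolating between TD(0) and Monte Carlo can be made concrete with the standard λ-return weighting, in which the n-step return receives weight (1 − λ)λ^(n−1). The helper `lambda_return_weights` below is purely illustrative and is not part of NEVO.

```python
import numpy as np


def lambda_return_weights(lambda_coeff, n_steps):
    """Weights on the 1..n_steps-step returns; the remainder goes to the full return."""
    n = np.arange(1, n_steps + 1)
    return (1.0 - lambda_coeff) * lambda_coeff ** (n - 1)


for lam in (0.0, 0.5, 0.9):
    w = lambda_return_weights(lam, n_steps=5)
    print(lam, np.round(w, 4), "remainder:", round(1.0 - w.sum(), 4))
```

At λ = 0 all of the weight sits on the one-step return, which is TD(0). As λ approaches 1, the weight shifts toward longer returns and the update behaves more like a Monte Carlo estimate.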
Neuromorphic Integration
The Nengo-based basal ganglia network remains unchanged and neuromorphic:
# build_network() creates Nengo ensembles as before
selected_operator_ens = selector.build_network(model, state_ensemble)
# TD learning operates at the decision level (post-simulation).
# It updates value estimates based on rewards from the search.
How TD Learning Works with Nengo
The Nengo network computes utility functions for each operator.
The basal ganglia performs winner-take-all selection based on utilities.
TD learning observes the reward and updates value estimates.
Updated values inform the next Nengo utility computation.
This keeps the neuromorphic computation intact while adding learning capability on top.
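The four steps above can be sketched in plain NumPy. This only illustrates the control flow: random utilities stand in for the Nengo network output, `np.argmax` stands in for the basal ganglia's winner-take-all, and `toy_reward` is a made-up reward function. None of these names come from NEVO.

```python
import numpy as np

rng = np.random.default_rng(0)
n_operators = 3
values = np.zeros(n_operators)   # TD value estimates, one per operator
alpha, gamma = 0.1, 0.99


def toy_reward(op_idx):
    # Hypothetical reward: operator 2 pays the most on average.
    return [0.1, 0.3, 1.0][op_idx] + 0.05 * rng.standard_normal()


for step in range(200):
    # 1. Utilities (stand-in for the Nengo computation), biased by learned values.
    utilities = values + 0.5 * rng.standard_normal(n_operators)
    # 2. Winner-take-all selection.
    chosen = int(np.argmax(utilities))
    # 3. Observe the reward and apply a TD(0) update to the chosen operator.
    reward = toy_reward(chosen)
    td_error = reward + gamma * values.max() - values[chosen]
    values[chosen] += alpha * td_error
    # 4. The updated values feed into the next iteration's utilities.

print(np.round(values, 2))
```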
Learning Rules Comparison
| Rule | Type | Use Case |
|---|---|---|
| SimpleTDRule | Direct | Default, fast learning |
| DecayingTDRule | Decaying | Fade old information |
| ConservativeTDRule | Stable | Prevent wild swings |
| AdaptiveTDRule | Adaptive | Scale with error magnitude |
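To make the table concrete, the functions below give one plausible update formula per rule type. These formulas are assumptions chosen for illustration, and the names `simple_update`, `decaying_update`, `conservative_update`, and `adaptive_update` are hypothetical. NEVO's rule classes may compute their updates differently.

```python
import math


def simple_update(td_error, learning_rate):
    # Standard TD(0) step: alpha * delta.
    return learning_rate * td_error


def decaying_update(td_error, learning_rate, step, decay_rate=0.01):
    # Assumed exponential schedule: the effective learning rate shrinks over time.
    return learning_rate * math.exp(-decay_rate * step) * td_error


def conservative_update(td_error, learning_rate, stability_weight=0.5):
    # Assumed damping: scale every update by (1 - stability_weight).
    return learning_rate * (1.0 - stability_weight) * td_error


def adaptive_update(td_error, learning_rate, scale=1.0):
    # Assumed magnitude dependence: large errors are scaled down.
    return learning_rate * td_error / (1.0 + abs(td_error) / scale)


delta = 2.0
results = {
    "simple": simple_update(delta, 0.1),
    "decaying (step 100)": decaying_update(delta, 0.1, step=100),
    "conservative": conservative_update(delta, 0.1),
    "adaptive": adaptive_update(delta, 0.1),
}
for name, update in results.items():
    print(f"{name:>20}: {update:.4f}")
```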
Value Models Comparison
| Model | Features | Use Case |
|---|---|---|
| LinearValueModel | Simple, fast | Most cases |
| BoundedValueModel | Stable bounds, adaptive | Prevent value explosion |
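The idea of preventing value explosion can be illustrated with simple clipping. `ClippedValues` is a hypothetical class written for this example: it assumes fixed bounds and an initial value of 1.0, and it leaves out the adaptive-bounds behaviour that `BoundedValueModel` advertises.

```python
import numpy as np


class ClippedValues:
    """Hypothetical value store that clips every value to [min_bound, max_bound]."""

    def __init__(self, n_operators, min_bound=0.2, max_bound=3.0, initial=1.0):
        self.min_bound = min_bound
        self.max_bound = max_bound
        self.values = np.full(n_operators, float(initial))

    def update(self, operator_idx, delta):
        # Apply the update, then clamp the result into the allowed range.
        new_value = self.values[operator_idx] + delta
        self.values[operator_idx] = np.clip(new_value, self.min_bound, self.max_bound)
        return self.values[operator_idx]


model = ClippedValues(n_operators=2)
model.update(0, +10.0)   # would reach 11.0 without bounds
model.update(1, -10.0)   # would reach -9.0 without bounds
print(model.values)
```

Both updates are clamped, so the stored values end at the upper and lower bound respectively.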
Implementation Details
TD(0) Update
For each operator i:
δ_t = r_t + γ * max_j(V_j(t+1)) - V_i(t) # TD error
V_i(t+1) = V_i(t) + α * δ_t # Value update
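A worked instance of the two lines above with made-up numbers. The sketch assumes the bootstrap term max_j V_j is read from the estimates held at the moment of the update.

```python
import numpy as np

alpha, gamma = 0.1, 0.99
values = np.array([1.0, 2.0])   # current estimates V_0, V_1
i, reward = 0, 0.5              # operator 0 was used and earned reward 0.5

# TD error: reward plus discounted best estimate, minus the old estimate.
td_error = reward + gamma * values.max() - values[i]
# Value update for the operator that was used.
values[i] += alpha * td_error

print(f"td_error={td_error:.3f}, values={values}")
```

Here δ = 0.5 + 0.99 · 2.0 − 1.0 = 1.48, so V_0 moves from 1.0 to 1.148 while V_1 is left untouched.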
TD(λ) Update
Uses eligibility traces that decay:
e_i(t) = λ * γ * e_i(t-1) + 1[i visited] # Eligibility trace
δ_t = r_t + γ * max_j(V_j(t+1)) - V_i(t) # TD error
V_i(t+1) = V_i(t) + α * δ_t * e_i(t) # Multi-step update
With traces, credit propagates to previously visited operators, enabling multi-step learning.
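A small sketch of how the trace and the update combine. It assumes the TD error is computed for the operator that was just used and is then applied to every operator in proportion to its trace, which is the usual TD(λ) arrangement. `td_lambda_step` is a hypothetical helper, not a NEVO function.

```python
import numpy as np


def td_lambda_step(values, traces, chosen, reward,
                   alpha=0.1, gamma=0.99, lambda_coeff=0.9):
    """One hypothetical TD(lambda) step; updates both arrays in place."""
    # Decay all traces, then mark the operator that was just used.
    traces *= gamma * lambda_coeff
    traces[chosen] += 1.0
    # TD error for the chosen operator, bootstrapping from the best estimate.
    td_error = reward + gamma * values.max() - values[chosen]
    # Every operator is updated in proportion to its trace.
    values += alpha * td_error * traces
    return td_error


values = np.zeros(3)
traces = np.zeros(3)
# Operators 0 and 1 earn nothing; operator 2 earns a reward of 1.0.
for chosen, reward in [(0, 0.0), (1, 0.0), (2, 1.0)]:
    td_lambda_step(values, traces, chosen, reward)

print(np.round(values, 4))
```

The reward arrives only on the third step, yet operators 0 and 1 also receive credit, scaled down by how long ago each was used.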
Configuration in Optimiser
The NEVOptimiser can be configured to use TD learning:
from nevo.core.optimiser import NEVOptimiser
from nevo.core.td_learning import ConservativeTDRule, BoundedValueModel

# f and bounds are placeholders for your own objective function and search bounds
optimiser = NEVOptimiser(
    objective_function=f,
    bounds=bounds,
    dimension=10,
    epsilon=0.1,
    learning_rate=0.1,
    td_enabled=True,
)

# Swap learning rule before or after the first run()
optimiser.bg_selector.set_learning_rule(
    ConservativeTDRule(stability_weight=0.5)
)
optimiser.run(time=10.0)
Testing
Run the test suite:
pytest tests/test_td_learning.py # unit tests
pytest tests/test_td_integration.py # integration tests (nm_dual mode)
Tests cover:
All learning rules
All value models
Eligibility traces
TD(0) and TD(λ) learning
Basal ganglia integration
Dynamic switching
Advanced: Custom Learning Rules
Implement custom rules by extending LearningRule:
from nevo.core.td_learning import LearningRule

class CustomTDRule(LearningRule):
    def compute_update(self, td_error, learning_rate, current_value, **kwargs):
        # Your custom update formula; some_factor is a placeholder scaling term
        some_factor = 0.5
        return learning_rate * td_error * some_factor

    def reset(self):
        # Reset internal state if needed
        pass

# Use it
selector.set_learning_rule(CustomTDRule())
Advanced: Custom Value Models
Implement custom models by extending ValueModel:
from nevo.core.td_learning import ValueModel

class CustomValueModel(ValueModel):
    def get_value(self, operator_idx):
        # Your implementation
        pass

    def set_value(self, operator_idx, value):
        # Your implementation
        pass

    def update(self, operator_idx, delta):
        # Your implementation
        pass

    def reset(self):
        # Reset to initial
        pass

    def get_values_array(self):
        # Return numpy array
        pass

# Use it
selector.set_value_model(CustomValueModel())
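For reference, here is one way the five methods might be filled in. `ValueModelSketch` is a local stand-in that mirrors the method names of the skeleton above so the example runs without NEVO installed. With NEVO available you would subclass `nevo.core.td_learning.ValueModel` instead, and its constructor or abstract methods may differ from what is assumed here.

```python
from abc import ABC, abstractmethod

import numpy as np


class ValueModelSketch(ABC):
    """Local stand-in mirroring the method names of the skeleton above."""

    @abstractmethod
    def get_value(self, operator_idx): ...

    @abstractmethod
    def set_value(self, operator_idx, value): ...

    @abstractmethod
    def update(self, operator_idx, delta): ...

    @abstractmethod
    def reset(self): ...

    @abstractmethod
    def get_values_array(self): ...


class ArrayValueModel(ValueModelSketch):
    """One float per operator, stored in a NumPy array."""

    def __init__(self, n_operators, initial_value=0.0):
        self.initial_value = float(initial_value)
        self.values = np.full(n_operators, self.initial_value)

    def get_value(self, operator_idx):
        return float(self.values[operator_idx])

    def set_value(self, operator_idx, value):
        self.values[operator_idx] = value

    def update(self, operator_idx, delta):
        self.values[operator_idx] += delta

    def reset(self):
        self.values[:] = self.initial_value

    def get_values_array(self):
        # Return a copy so callers cannot mutate internal state by accident.
        return self.values.copy()


model = ArrayValueModel(n_operators=3)
model.update(1, 0.25)
print(model.get_values_array())
```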
Future Extensions
Multi-step bootstrapping: n-step returns instead of 1-step
Importance sampling: For off-policy learning
Function approximation: Neural network value functions
Experience replay: For sample efficiency
Asynchronous TD: Parallel multi-threaded learning
Module Structure
nevo/core/
├── td_learning.py # TD learning algorithms
│ ├── LearningRule # Abstract base
│ ├── SimpleTDRule
│ ├── DecayingTDRule
│ ├── ConservativeTDRule
│ ├── AdaptiveTDRule
│ ├── ValueModel # Abstract base
│ ├── LinearValueModel
│ ├── BoundedValueModel
│ ├── EligibilityTraceManager
│ └── TemporalDifferenceLearner
│
├── basal_ganglia.py # Modified with TD integration
│ ├── UtilityFunction # (unchanged)
│ └── BasalGangliaSelector # (now with TD learning)
│
└── optimiser.py # Uses BasalGangliaSelector
tests/
├── test_td_learning.py # Unit tests
└── test_td_integration.py # Integration tests (nm_dual mode)
Summary
This system provides a modular TD learning layer that:
Preserves Nengo neuromorphic networks. Operators are unchanged.
Adds TD learning. Both TD(0) and TD(λ) are supported.
Allows pluggable rules and models. Easy to extend.
Enables dynamic switching. Change algorithms at runtime.
Is backward compatible. Existing code continues to work.