nevo.core.td_learning
Temporal Difference (TD) Learning Algorithms
This module implements modular TD learning variants for adaptive operator selection. It supports TD(0) and TD(λ), with pluggable learning rules and value models.
- class LearningRule
Bases: ABC
Abstract base class for TD learning rules.
A learning rule defines how TD errors are used to update value estimates.
- class SimpleTDRule
Bases: LearningRule
Simple TD(0) update rule: ΔV = α * δ_t
Standard TD learning update.
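The update ΔV = α * δ_t can be illustrated with a minimal, self-contained sketch. It is not the module's implementation: the function name `td0_update`, its arguments, and the use of a plain list of per-operator values are assumptions made for illustration. The TD error is taken in its standard form, δ_t = r + γ·V(next) − V(current).

```python
def td0_update(values, operator, reward, next_value, alpha=0.1, gamma=0.99):
    """Apply one TD(0) step in place and return the TD error (hypothetical helper)."""
    # Standard TD error: delta_t = r + gamma * V(next) - V(current)
    delta = reward + gamma * next_value - values[operator]
    # Update from the rule above: delta_V = alpha * delta_t
    values[operator] += alpha * delta
    return delta


values = [0.5, 0.5, 0.5]  # one estimate per operator
delta = td0_update(values, operator=1, reward=1.0, next_value=0.5)
```

Only the selected operator's estimate changes; the other entries are left untouched.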
- class DecayingTDRule(decay_type='exponential', decay_rate=0.9)
Bases: LearningRule
Decaying TD update rule with eligibility traces.
Allows different weighting schemes: constant, linear, or exponential decay.
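The three weighting schemes can be pictured as functions of how many steps ago a contribution was made. The sketch below is one possible reading, not the class's actual formulas: the helper `decay_weight`, its `age` argument, and in particular the linear formula are assumptions.

```python
def decay_weight(age, decay_type="exponential", decay_rate=0.9):
    """Weight for a contribution made `age` steps ago (hypothetical helper)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if decay_type == "constant":
        return 1.0  # no decay: every step counts equally
    if decay_type == "linear":
        # Assumed form: lose (1 - decay_rate) per step, floored at zero
        return max(0.0, 1.0 - (1.0 - decay_rate) * age)
    if decay_type == "exponential":
        return decay_rate ** age  # geometric decay
    raise ValueError(f"unknown decay_type: {decay_type!r}")


weights = [decay_weight(k, "exponential", 0.5) for k in range(3)]
```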
- class ConservativeTDRule(stability_weight=0.5)
Bases: LearningRule
Conservative TD update rule with value stability.
Includes magnitude thresholding and clipping to prevent wild swings.
- __init__(stability_weight=0.5)
- Parameters:
stability_weight (float) – How much to dampen updates (0=full update, 1=no update)
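A rough sketch of how damping, thresholding, and clipping might combine is shown here. Only `stability_weight` and its meaning (0 = full update, 1 = no update) come from the documentation above; the function name and the `min_magnitude` and `max_step` parameters are hypothetical, and the real rule may order or define these steps differently.

```python
def conservative_update(delta, alpha=0.1, stability_weight=0.5,
                        min_magnitude=1e-3, max_step=0.2):
    """Return a damped, thresholded, clipped value change (hypothetical helper)."""
    raw = alpha * delta
    # Magnitude threshold (assumed): ignore updates too small to matter
    if abs(raw) < min_magnitude:
        return 0.0
    # Damping: 0 keeps the full update, 1 suppresses it entirely
    damped = (1.0 - stability_weight) * raw
    # Clipping (assumed): cap the step size to prevent wild swings
    return max(-max_step, min(max_step, damped))


half_step = conservative_update(1.0)
clipped_step = conservative_update(10.0, stability_weight=0.0)
```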
- class AdaptiveTDRule(window_size=10)
Bases: LearningRule
Adaptive TD update rule with magnitude-dependent learning rate.
Scales learning rate based on recent TD error magnitude.
- __init__(window_size=10)
- Parameters:
window_size (int) – Number of recent TD errors to track for adaptation
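One way to make the learning rate depend on recent TD error magnitude is to keep a sliding window of |δ| and shrink the rate when errors have been large. The sketch below takes that approach; the class name, the `base_rate` parameter, and the scaling formula are assumptions, and the documentation does not say in which direction the actual rule scales the rate.

```python
from collections import deque


class AdaptiveRateSketch:
    """Hypothetical illustration of a windowed, magnitude-dependent rate."""

    def __init__(self, base_rate=0.1, window_size=10):
        self.base_rate = base_rate
        # deque with maxlen drops the oldest error automatically
        self.errors = deque(maxlen=window_size)

    def step(self, delta):
        """Record |delta| and return the scaled value change."""
        self.errors.append(abs(delta))
        mean_magnitude = sum(self.errors) / len(self.errors)
        # Assumed scaling: larger recent errors give a smaller rate
        rate = self.base_rate / (1.0 + mean_magnitude)
        return rate * delta


rule = AdaptiveRateSketch(window_size=2)
first = rule.step(1.0)
second = rule.step(3.0)
```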
- class ValueModel
Bases: ABC
Abstract base class for value function models.
A value model stores and updates value estimates for each operator.
- class LinearValueModel(n_operators, initial_value=0.5)
Bases: ValueModel
Simple linear value model: V(s, a) = w_a
One value per operator, no state dependence.
- class BoundedValueModel(n_operators, initial_value=0.5, min_bound=0.1, max_bound=5.0, adapt_bounds=True)
Bases: ValueModel
Value model with learnable bounds for stability.
Maintains per-operator lower/upper bounds on values.
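The effect of per-operator bounds can be sketched as clipping each updated value into its own [lower, upper] interval. The class and method names below are hypothetical, and the sketch keeps the bounds fixed: how `adapt_bounds` adjusts them is not described here, so that part is left out.

```python
class BoundedValuesSketch:
    """Hypothetical illustration of per-operator value clipping."""

    def __init__(self, n_operators, initial_value=0.5,
                 min_bound=0.1, max_bound=5.0):
        self.values = [initial_value] * n_operators
        # One lower and one upper bound per operator
        self.lower = [min_bound] * n_operators
        self.upper = [max_bound] * n_operators

    def update(self, operator, delta_v):
        """Add delta_v to one operator's value, then clip it into its bounds."""
        proposed = self.values[operator] + delta_v
        clipped = min(self.upper[operator], max(self.lower[operator], proposed))
        self.values[operator] = clipped
        return clipped


model = BoundedValuesSketch(n_operators=3)
high = model.update(0, 10.0)   # pushed above the upper bound
low = model.update(1, -1.0)    # pushed below the lower bound
inside = model.update(2, 0.25) # stays within bounds
```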
- class EligibilityTraceManager(n_operators, lambda_coeff=0.9, trace_decay=0.99)
Bases: object
Manages eligibility traces for TD(λ) learning.
Maintains traces that decay over time, enabling multistep credit assignment.
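Accumulating eligibility traces are commonly maintained by decaying every trace each step and then incrementing the trace of the operator just used. The sketch below follows that pattern. The class name, its methods, and the choice to multiply `lambda_coeff` by `trace_decay` to get the per-step factor are assumptions; the manager may combine the two parameters differently.

```python
class TraceSketch:
    """Hypothetical illustration of accumulating eligibility traces."""

    def __init__(self, n_operators, lambda_coeff=0.9, trace_decay=0.99):
        self.traces = [0.0] * n_operators
        # Assumed per-step decay factor
        self.factor = lambda_coeff * trace_decay

    def visit(self, operator):
        """Decay all traces, then bump the trace of the chosen operator."""
        self.traces = [self.factor * e for e in self.traces]
        self.traces[operator] += 1.0

    def reset(self):
        """Clear all traces, e.g. at the start of an episode."""
        self.traces = [0.0] * len(self.traces)


manager = TraceSketch(n_operators=2, lambda_coeff=0.5, trace_decay=1.0)
manager.visit(0)
manager.visit(1)
manager.visit(1)
```

An operator used several steps ago keeps a small, shrinking trace, which is what lets a later TD error assign it partial credit.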
- class TemporalDifferenceLearner(n_operators, learning_rate=0.1, gamma=0.99, lambda_coeff=0.0, learning_rule=None, value_model=None)
Bases: object
Temporal Difference learner with pluggable rules and value models.
Implements TD(0) and TD(λ) for operator value learning.
- __init__(n_operators, learning_rate=0.1, gamma=0.99, lambda_coeff=0.0, learning_rule=None, value_model=None)
- Parameters:
n_operators (int) – Number of operators
learning_rate (float) – Learning rate α
gamma (float) – Discount factor
lambda_coeff (float) – λ for trace decay (0.0 = TD(0), 0.9 = TD(0.9), 1.0 = Monte Carlo)
learning_rule (LearningRule, optional) – Learning rule to use (default: SimpleTDRule)
value_model (ValueModel, optional) – Value function model (default: LinearValueModel)
- set_lambda(lambda_coeff)
Update the λ coefficient. Setting it to 0.0 gives TD(0); any larger value enables TD(λ).
- Parameters:
lambda_coeff (float) – New λ value (0.0 to 1.0)
- begin_episode()
Reset traces and timestep for a new episode. Value estimates are preserved.
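To show how the pieces fit together, here is a compact stand-alone TD(λ) learner. It mirrors the documented constructor parameters, `set_lambda`, and `begin_episode`, but it is an illustrative sketch rather than the library's code: the class name `TDLambdaSketch`, the `update` method and its arguments, and the γλ trace decay are assumptions.

```python
class TDLambdaSketch:
    """Hypothetical minimal TD(lambda) learner over per-operator values."""

    def __init__(self, n_operators, learning_rate=0.1, gamma=0.99,
                 lambda_coeff=0.0, initial_value=0.5):
        self.values = [initial_value] * n_operators
        self.traces = [0.0] * n_operators
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.timestep = 0
        self.set_lambda(lambda_coeff)

    def set_lambda(self, lambda_coeff):
        if not 0.0 <= lambda_coeff <= 1.0:
            raise ValueError("lambda_coeff must lie in [0.0, 1.0]")
        self.lambda_coeff = lambda_coeff

    def begin_episode(self):
        # Traces and timestep are reset; value estimates are kept
        self.traces = [0.0] * len(self.values)
        self.timestep = 0

    def update(self, operator, reward, next_operator=None):
        """Apply one TD(lambda) step and return the TD error."""
        decay = self.gamma * self.lambda_coeff
        self.traces = [decay * e for e in self.traces]
        self.traces[operator] += 1.0
        # A missing next operator is treated as a terminal step (value 0)
        next_value = 0.0 if next_operator is None else self.values[next_operator]
        delta = reward + self.gamma * next_value - self.values[operator]
        step = self.learning_rate * delta
        # Every operator moves in proportion to its trace
        self.values = [v + step * e for v, e in zip(self.values, self.traces)]
        self.timestep += 1
        return delta


learner = TDLambdaSketch(2, learning_rate=0.5, gamma=1.0, lambda_coeff=1.0)
learner.begin_episode()
learner.update(0, reward=0.0, next_operator=1)
learner.update(1, reward=1.0)  # the reward also credits operator 0 via its trace
learner.begin_episode()        # traces cleared, learned values preserved
```

With `lambda_coeff=0.0` the traces of earlier operators vanish after one step, so only the current operator is updated and the sketch reduces to TD(0).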