nevo.core.td_learning

Temporal Difference (TD) Learning Algorithms

This module implements modular temporal-difference (TD) learning variants for adaptive operator selection. It supports TD(0) and TD(λ), with pluggable learning rules and value models.

class LearningRule

Bases: ABC

Abstract base class for TD learning rules.

A learning rule defines how TD errors are used to update value estimates.

abstractmethod compute_update(td_error, learning_rate, current_value, **kwargs)

Compute value update from TD error.

Parameters:

td_error (float) – Temporal difference error
learning_rate (float) – Learning rate α
current_value (float) – Current value estimate
**kwargs (dict) – Additional parameters specific to the rule

Returns:

update – Value increment (delta_V)

Return type:

float

abstractmethod reset()

Reset any internal state.
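As an illustration of the interface described above, the following standalone sketch mirrors the two documented abstract methods and plugs in a custom rule. `LearningRuleSketch` and `SignOnlyRule` are hypothetical names introduced only for this example; the sketch does not import or reproduce the library's own classes.

```python
from abc import ABC, abstractmethod


class LearningRuleSketch(ABC):
    """Stand-in mirroring the documented interface (NOT the library class)."""

    @abstractmethod
    def compute_update(self, td_error, learning_rate, current_value, **kwargs):
        """Return the value increment (delta_V) for one TD step."""

    @abstractmethod
    def reset(self):
        """Clear any internal state."""


class SignOnlyRule(LearningRuleSketch):
    """Hypothetical rule: take a fixed-size step in the direction of the TD error."""

    def compute_update(self, td_error, learning_rate, current_value, **kwargs):
        if td_error > 0:
            return learning_rate
        if td_error < 0:
            return -learning_rate
        return 0.0

    def reset(self):
        pass  # this rule keeps no state


rule = SignOnlyRule()
delta_v = rule.compute_update(td_error=-0.3, learning_rate=0.1, current_value=0.5)
```

A rule written this way only has to return an increment; storing and applying it is left to whatever value model is in use.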

class SimpleTDRule

Bases: LearningRule

Simple TD(0) update rule: ΔV = α * δ_t

Standard TD learning update.

compute_update(td_error, learning_rate, current_value, **kwargs)

Return an update directly proportional to the TD error.

Return type: float

reset()

No internal state to reset.
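The formula ΔV = α * δ_t can be written out as a one-line function. This is a minimal sketch of the documented formula, not the library's source; `simple_td_update` is a hypothetical helper name.

```python
def simple_td_update(td_error: float, learning_rate: float) -> float:
    """Value increment for a plain TD(0) step: delta_V = alpha * delta_t."""
    return learning_rate * td_error


# One step for a single operator whose current estimate is 0.5.
value = 0.5
td_error = 1.0 - value  # reward of 1.0 with no bootstrapped term
value += simple_td_update(td_error, learning_rate=0.1)
```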

class DecayingTDRule(decay_type='exponential', decay_rate=0.9)

Bases: LearningRule

Decaying TD update rule with eligibility traces.

Allows different weighting schemes: constant, linear, or exponential decay.

__init__(decay_type='exponential', decay_rate=0.9)

Parameters:

decay_type (str) – “constant” (no decay), “linear”, or “exponential”
decay_rate (float) – Decay parameter (for exponential/linear)

compute_update(td_error, learning_rate, current_value, timestep=0, **kwargs)

Compute the update with decay applied based on history depth.

Return type: float

reset()

Clear trace history.
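The page names three weighting schemes but does not give their formulas. The sketch below shows one plausible reading, purely as an assumption: the exponential weight is `decay_rate ** timestep`, and the linear weight drops by `1 - decay_rate` per step until it reaches zero. `decay_weight` and `decaying_update` are hypothetical helpers, and the sketch is stateless, so it does not model the trace history that `reset()` clears.

```python
def decay_weight(decay_type: str, decay_rate: float, timestep: int) -> float:
    """Weight applied to the update at a given history depth (assumed formulas)."""
    if decay_type == "constant":
        return 1.0  # no decay
    if decay_type == "linear":
        # ASSUMPTION: lose (1 - decay_rate) of the weight per step, floored at 0.
        return max(0.0, 1.0 - (1.0 - decay_rate) * timestep)
    if decay_type == "exponential":
        # ASSUMPTION: geometric decay in the timestep.
        return decay_rate ** timestep
    raise ValueError(f"unknown decay_type: {decay_type!r}")


def decaying_update(td_error, learning_rate, timestep=0,
                    decay_type="exponential", decay_rate=0.9):
    """Plain TD increment scaled by the decay weight."""
    return learning_rate * decay_weight(decay_type, decay_rate, timestep) * td_error


early = decaying_update(1.0, 0.1, timestep=0)
late = decaying_update(1.0, 0.1, timestep=5)
```

Under these assumptions, the same TD error produces a smaller increment at greater depth, which is the qualitative behavior the class description suggests.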

class ConservativeTDRule(stability_weight=0.5)

Bases: LearningRule

Conservative TD update rule with value stability.

Includes magnitude thresholding and clipping to prevent wild swings.

__init__(stability_weight=0.5)

Parameters:

stability_weight (float) – How much to dampen updates (0 = full update, 1 = no update)

compute_update(td_error, learning_rate, current_value, **kwargs)

Compute a damped update for stability.

Return type: float

reset()

No internal state to reset.
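The documented endpoints (0 = full update, 1 = no update) are consistent with scaling the plain TD increment by `1 - stability_weight`. The sketch below assumes that reading and adds a symmetric clip; `max_step` is a hypothetical parameter that is not part of the documented signature, and the magnitude thresholding mentioned above is not shown.

```python
def conservative_update(td_error, learning_rate, stability_weight=0.5, max_step=1.0):
    """Damped and clipped TD increment (assumed form, not the library's code)."""
    if not 0.0 <= stability_weight <= 1.0:
        raise ValueError("stability_weight must lie in [0, 1]")
    # ASSUMPTION: linear damping between a full update (0) and no update (1).
    raw = (1.0 - stability_weight) * learning_rate * td_error
    # ASSUMPTION: clip to [-max_step, max_step] to prevent large swings.
    return max(-max_step, min(max_step, raw))


full = conservative_update(2.0, 0.1, stability_weight=0.0)
frozen = conservative_update(2.0, 0.1, stability_weight=1.0)
clipped = conservative_update(100.0, 1.0, stability_weight=0.0, max_step=1.0)
```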

class AdaptiveTDRule(window_size=10)

Bases: LearningRule

Adaptive TD update rule with magnitude-dependent learning rate.

Scales learning rate based on recent TD error magnitude.

__init__(window_size=10)

Parameters:

window_size (int) – Number of recent TD errors to track for adaptation

compute_update(td_error, learning_rate, current_value, **kwargs)

Compute the update with an adaptive learning rate.

Return type: float

reset()

Clear error history.
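How the learning rate is scaled is not specified here. As one possible scheme, the sketch below keeps the last `window_size` absolute TD errors in a bounded deque and shrinks the step when recent errors are large. The scaling `1 / (1 + mean |δ|)` is an assumption chosen for illustration, and `AdaptiveRateSketch` is a hypothetical class.

```python
from collections import deque


class AdaptiveRateSketch:
    """Sliding-window learning-rate scaling (illustrative, not the library class)."""

    def __init__(self, window_size=10):
        # A deque with maxlen drops the oldest entry automatically.
        self.errors = deque(maxlen=window_size)

    def compute_update(self, td_error, learning_rate):
        self.errors.append(abs(td_error))
        mean_abs = sum(self.errors) / len(self.errors)
        # ASSUMPTION: larger recent errors lead to a smaller effective rate.
        scale = 1.0 / (1.0 + mean_abs)
        return learning_rate * scale * td_error

    def reset(self):
        self.errors.clear()


rule = AdaptiveRateSketch(window_size=2)
first = rule.compute_update(1.0, 0.1)  # window holds [1.0], so scale is 0.5
```

A real rule could just as well increase the rate when errors are large; the windowed bookkeeping would stay the same.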

class ValueModel

Bases: ABC

Abstract base class for value function models.

A value model stores and updates value estimates for each operator.

abstractmethod get_value(operator_idx)

Get the current value estimate for an operator.

Return type: float

abstractmethod set_value(operator_idx, value)

Set the value estimate for an operator.

abstractmethod update(operator_idx, delta)

Increment the value by delta.

abstractmethod reset()

Reset all values.

abstractmethod get_values_array()

Get all values as an array.

Return type: ndarray

class LinearValueModel(n_operators, initial_value=0.5)

Bases: ValueModel

Simple linear value model: V(s, a) = w_a

One value per operator, no state dependence.

__init__(n_operators, initial_value=0.5)

Parameters:

n_operators (int) – Number of operators
initial_value (float) – Initial value for all operators

get_value(operator_idx)

Get the value for an operator.

Return type: float

set_value(operator_idx, value)

Set the value for an operator.

update(operator_idx, delta)

Update the value by delta.

reset()

Reset all values to the initial value.

get_values_array()

Get a copy of the values array.

Return type: ndarray
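A model with one weight per operator and no state dependence amounts to a flat table. The sketch below illustrates that idea with a plain Python list so it runs without NumPy; the documented class returns an `ndarray`, and `TabularValuesSketch` is a hypothetical stand-in. It also shows why returning a copy matters: callers cannot corrupt the stored estimates through the returned object.

```python
class TabularValuesSketch:
    """One value per operator, V(s, a) = w_a (illustrative, not the library class)."""

    def __init__(self, n_operators, initial_value=0.5):
        self.initial_value = initial_value
        self.values = [initial_value] * n_operators

    def get_value(self, operator_idx):
        return self.values[operator_idx]

    def update(self, operator_idx, delta):
        self.values[operator_idx] += delta

    def reset(self):
        self.values = [self.initial_value] * len(self.values)

    def get_values_array(self):
        return list(self.values)  # copy, so callers cannot mutate internal state


model = TabularValuesSketch(n_operators=3)
model.update(1, 0.25)
snapshot = model.get_values_array()
snapshot[0] = 99.0  # changes only the copy
```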

class BoundedValueModel(n_operators, initial_value=0.5, min_bound=0.1, max_bound=5.0, adapt_bounds=True)

Bases: ValueModel

Value model with learnable bounds for stability.

Maintains per-operator lower/upper bounds on values.

__init__(n_operators, initial_value=0.5, min_bound=0.1, max_bound=5.0, adapt_bounds=True)

Parameters:

n_operators (int) – Number of operators
initial_value (float) – Initial value for all operators
min_bound (float) – Minimum value bound
max_bound (float) – Maximum value bound
adapt_bounds (bool) – Whether bounds adapt over time

get_value(operator_idx)

Get the value for an operator.

Return type: float

set_value(operator_idx, value)

Set the value for an operator.

update(operator_idx, delta)

Update the value by delta with bounds checking.

adapt_bounds(operator_idx, window_size=20)

Adapt bounds based on value history (called periodically).

reset()

Reset all values to the initial value.

get_values_array()

Get a copy of the values array.

Return type: ndarray
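Bounds checking on an update is commonly implemented by clamping the new value into the allowed interval. The sketch below assumes that behavior and uses the documented default bounds; `bounded_update` is a hypothetical helper, and the bound adaptation performed by `adapt_bounds` is not modeled.

```python
def bounded_update(value, delta, min_bound=0.1, max_bound=5.0):
    """Apply an increment, then clamp the result into [min_bound, max_bound]."""
    # ASSUMPTION: out-of-range results are clamped rather than rejected.
    return min(max_bound, max(min_bound, value + delta))


inside = bounded_update(1.0, 0.5)    # stays within the bounds
too_high = bounded_update(4.8, 1.0)  # clamped to max_bound
too_low = bounded_update(0.2, -1.0)  # clamped to min_bound
```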

class EligibilityTraceManager(n_operators, lambda_coeff=0.9, trace_decay=0.99)

Bases: object

Manages eligibility traces for TD(λ) learning.

Maintains traces that decay over time, enabling multistep credit assignment.

__init__(n_operators, lambda_coeff=0.9, trace_decay=0.99)

Parameters:

n_operators (int) – Number of operators
lambda_coeff (float) – λ coefficient for trace decay (0.0 = TD(0), 1.0 = Monte Carlo)
trace_decay (float) – Per-timestep decay of all traces

update_trace(operator_idx, increment=1.0)

Update trace for visited operator and decay all traces.

Parameters:

operator_idx (int) – Index of visited operator
increment (float) – Increment to add to trace

get_traces()

Get the current trace vector.

Return type: ndarray

reset()

Reset all traces.

set_lambda(lambda_coeff)

Adjust the λ coefficient dynamically.

Parameters:

lambda_coeff (float) – New λ value (0.0 to 1.0)
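The trace bookkeeping can be pictured as an accumulating trace: decay every trace, then add the increment to the visited operator. How `lambda_coeff` and `trace_decay` combine is not stated on this page; the sketch below assumes they multiply into a single per-step factor and that decay is applied before the increment. `TraceSketch` is a hypothetical class using plain lists instead of an `ndarray`.

```python
class TraceSketch:
    """Accumulating eligibility traces (illustrative, not the library class)."""

    def __init__(self, n_operators, lambda_coeff=0.9, trace_decay=0.99):
        self.lambda_coeff = lambda_coeff
        self.trace_decay = trace_decay
        self.traces = [0.0] * n_operators

    def update_trace(self, operator_idx, increment=1.0):
        # ASSUMPTION: one combined decay factor, applied before the increment.
        factor = self.lambda_coeff * self.trace_decay
        self.traces = [t * factor for t in self.traces]
        self.traces[operator_idx] += increment

    def reset(self):
        self.traces = [0.0] * len(self.traces)


manager = TraceSketch(n_operators=2, lambda_coeff=0.5, trace_decay=1.0)
manager.update_trace(0)
manager.update_trace(1)
```

With `lambda_coeff=0.0` the factor is zero, so only the most recently visited operator keeps a nonzero trace, which matches the documented TD(0) limit.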

class TemporalDifferenceLearner(n_operators, learning_rate=0.1, gamma=0.99, lambda_coeff=0.0, learning_rule=None, value_model=None)

Bases: object

Temporal Difference learner with pluggable rules and value models.

Implements TD(0) and TD(λ) for operator value learning.

__init__(n_operators, learning_rate=0.1, gamma=0.99, lambda_coeff=0.0, learning_rule=None, value_model=None)

Parameters:

n_operators (int) – Number of operators
learning_rate (float) – Learning rate α
gamma (float) – Discount factor
lambda_coeff (float) – λ for trace decay (0.0 = TD(0), 0.9 = TD(0.9), 1.0 = Monte Carlo)
learning_rule (LearningRule, optional) – Learning rule to use (default: SimpleTDRule)
value_model (ValueModel, optional) – Value function model (default: LinearValueModel)

set_learning_rate(learning_rate)

Update the learning rate.

set_lambda(lambda_coeff)

Update the λ coefficient (switches between TD(0) and TD(λ)).

Parameters:

lambda_coeff (float) – New λ value (0.0 to 1.0)

begin_episode()

Reset traces and timestep for a new episode. Value estimates are preserved.

update(operator_idx, reward, next_state_value=0.0, is_terminal=False)

Perform TD update for visited operator.

Parameters:

operator_idx (int) – Index of operator to update
reward (float) – Immediate reward signal
next_state_value (float) – Value of next state (for bootstrapping)
is_terminal (bool) – Whether this is a terminal state

Returns:

update_info – Information about the update (TD error, magnitude, etc.)

Return type:

Dict
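The parameters of `update` map onto the textbook TD error, δ = r + γ·V(next) − V(current), with the bootstrapped term dropped at a terminal state. The sketch below computes that quantity under the assumption that the learner follows the standard definition; `td_error` is a hypothetical helper, and the keys of the returned dictionary are not documented here, so none are shown.

```python
def td_error(reward, current_value, next_state_value=0.0,
             gamma=0.99, is_terminal=False):
    """Standard one-step TD error; the bootstrap term is zero when terminal."""
    bootstrap = 0.0 if is_terminal else gamma * next_state_value
    return reward + bootstrap - current_value


# Same transition, with and without bootstrapping from the next state.
bootstrapped = td_error(1.0, 0.5, next_state_value=2.0, gamma=0.5)
terminal = td_error(1.0, 0.5, next_state_value=2.0, gamma=0.5, is_terminal=True)
```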

get_values()

Get current value estimates for all operators.

Return type: ndarray

get_value(operator_idx)

Get the value for a specific operator.

Return type: float

reset_values()

Reset all value estimates.

get_statistics()

Get learning statistics.

Return type: Dict[str, Any]
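To show how learned values can drive adaptive operator selection, the following self-contained sketch runs an epsilon-greedy loop with a plain TD(0) update and no bootstrapping. It does not use the classes on this page. `run_selection`, the fixed per-operator rewards, and the epsilon-greedy policy are all assumptions made for illustration.

```python
import random


def run_selection(rewards_by_operator, steps=200, learning_rate=0.1,
                  epsilon=0.1, seed=0):
    """Learn one value per operator from fixed rewards, choosing epsilon-greedily."""
    rng = random.Random(seed)
    values = [0.5] * len(rewards_by_operator)
    for _ in range(steps):
        if rng.random() < epsilon:
            idx = rng.randrange(len(values))  # explore a random operator
        else:
            idx = max(range(len(values)), key=values.__getitem__)  # exploit
        delta = rewards_by_operator[idx] - values[idx]  # TD error, no bootstrap
        values[idx] += learning_rate * delta
    return values


values = run_selection([0.2, 0.9, 0.4])
best = max(range(len(values)), key=values.__getitem__)
```

In this toy setting the operator with the highest fixed reward ends up with the highest estimate, because its value only ever moves up from the shared starting point while the others move down.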