nevo.core.td_learning

Temporal Difference (TD) Learning Algorithms

This module implements modular TD learning variants for adaptive operator selection. It supports TD(0) and TD(λ), with pluggable learning rules and value models.

class LearningRule[source]

Bases: ABC

Abstract base class for TD learning rules.

A learning rule defines how TD errors are used to update value estimates.

abstractmethod compute_update(td_error, learning_rate, current_value, **kwargs)[source]

Compute value update from TD error.

Parameters:
  • td_error (float) – Temporal difference error

  • learning_rate (float) – Learning rate α

  • current_value (float) – Current value estimate

  • **kwargs (dict) – Additional parameters specific to the rule

Returns:

update – Value increment (delta_V)

Return type:

float

abstractmethod reset()[source]

Reset any internal state.
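As a rough illustration of the interface described above, the following sketch defines a stand-in abstract base with the two documented methods and one hypothetical subclass. `ClippedTDRule` and its `max_step` parameter are assumptions made up for this example and are not part of the module. The stand-in `LearningRule` only mirrors the documented signatures.

```python
from abc import ABC, abstractmethod


class LearningRule(ABC):
    """Stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def compute_update(self, td_error, learning_rate, current_value, **kwargs):
        """Return the value increment (delta_V) for one TD error."""

    @abstractmethod
    def reset(self):
        """Reset any internal state."""


class ClippedTDRule(LearningRule):
    """Hypothetical rule: proportional update, clipped to +/- max_step."""

    def __init__(self, max_step=0.05):
        self.max_step = max_step

    def compute_update(self, td_error, learning_rate, current_value, **kwargs):
        raw = learning_rate * td_error
        # Clamp the increment so a single large error cannot dominate.
        return max(-self.max_step, min(self.max_step, raw))

    def reset(self):
        pass  # this rule keeps no state


rule = ClippedTDRule(max_step=0.05)
small = rule.compute_update(td_error=0.2, learning_rate=0.1, current_value=0.5)
large = rule.compute_update(td_error=2.0, learning_rate=0.1, current_value=0.5)
```

Here `small` passes through unclipped, while `large` is limited to `max_step`.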

class SimpleTDRule[source]

Bases: LearningRule

Simple TD(0) update rule: ΔV = α * δ_t

Standard TD learning update.

compute_update(td_error, learning_rate, current_value, **kwargs)[source]

Return an update directly proportional to the TD error.

Return type:

float

reset()[source]

No internal state to reset.

class DecayingTDRule(decay_type='exponential', decay_rate=0.9)[source]

Bases: LearningRule

Decaying TD update rule with eligibility traces.

Allows different weighting schemes: constant, linear, or exponential decay.

__init__(decay_type='exponential', decay_rate=0.9)[source]
Parameters:
  • decay_type (str) – “constant” (no decay), “linear”, or “exponential”

  • decay_rate (float) – Decay parameter (for exponential/linear)

compute_update(td_error, learning_rate, current_value, timestep=0, **kwargs)[source]

Update with decay applied based on history depth.

Return type:

float

reset()[source]

Clear trace history.
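The reference above names three weighting schemes but does not give their formulas. The sketch below shows one plausible reading, purely as an assumption: `decay_weight` and `decayed_update` are hypothetical helpers, and the linear and exponential forms are guesses at what `decay_rate` might control.

```python
def decay_weight(decay_type, decay_rate, timestep):
    """Hypothetical weight for an update made `timestep` steps into an episode."""
    if decay_type == "constant":
        return 1.0  # no decay
    if decay_type == "linear":
        # Assumed form: lose (1 - decay_rate) of the weight per step, floored at 0.
        return max(0.0, 1.0 - (1.0 - decay_rate) * timestep)
    if decay_type == "exponential":
        # Assumed form: geometric decay in the timestep.
        return decay_rate ** timestep
    raise ValueError(f"unknown decay_type: {decay_type!r}")


def decayed_update(td_error, learning_rate, timestep,
                   decay_type="exponential", decay_rate=0.9):
    """Proportional TD update scaled by the assumed decay weight."""
    return learning_rate * td_error * decay_weight(decay_type, decay_rate, timestep)


weights = {
    kind: decay_weight(kind, 0.9, timestep=2)
    for kind in ("constant", "linear", "exponential")
}
```

Under these assumed forms, all three schemes give a weight of 1 at timestep 0 and diverge as the timestep grows.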

class ConservativeTDRule(stability_weight=0.5)[source]

Bases: LearningRule

Conservative TD update rule with value stability.

Includes magnitude thresholding and clipping to prevent wild swings.

__init__(stability_weight=0.5)[source]
Parameters:

stability_weight (float) – How much to dampen updates (0=full update, 1=no update)

compute_update(td_error, learning_rate, current_value, **kwargs)[source]

Compute a damped update, scaled down according to stability_weight.

Return type:

float

reset()[source]

No internal state to reset.
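The documented endpoints for `stability_weight` (0 = full update, 1 = no update) are consistent with a linear damping factor of `1 - stability_weight`. The sketch below assumes exactly that. The thresholding and clipping mentioned above are left out because the reference does not specify them, and `conservative_update` is a hypothetical helper.

```python
def conservative_update(td_error, learning_rate, stability_weight=0.5):
    """Assumed damping: scale the plain TD update by (1 - stability_weight)."""
    if not 0.0 <= stability_weight <= 1.0:
        raise ValueError("stability_weight must lie in [0, 1]")
    return (1.0 - stability_weight) * learning_rate * td_error


full = conservative_update(1.0, 0.1, stability_weight=0.0)    # undamped
half = conservative_update(1.0, 0.1, stability_weight=0.5)    # halved
frozen = conservative_update(1.0, 0.1, stability_weight=1.0)  # suppressed
```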

class AdaptiveTDRule(window_size=10)[source]

Bases: LearningRule

Adaptive TD update rule with magnitude-dependent learning rate.

Scales learning rate based on recent TD error magnitude.

__init__(window_size=10)[source]
Parameters:

window_size (int) – Number of recent TD errors to track for adaptation

compute_update(td_error, learning_rate, current_value, **kwargs)[source]

Update with adaptive learning rate.

Return type:

float

reset()[source]

Clear error history.
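One way to picture a magnitude-dependent learning rate is shown below. It is a sketch under stated assumptions rather than the module's implementation. The class `WindowedAdaptiveRule` is hypothetical, and shrinking the rate as `learning_rate / (1 + mean |δ|)` is an assumed choice; the actual rule may scale in a different way or direction.

```python
from collections import deque


class WindowedAdaptiveRule:
    """Hypothetical rule: shrink the step when recent TD errors are large."""

    def __init__(self, window_size=10):
        # deque(maxlen=...) silently drops the oldest entry once full.
        self.errors = deque(maxlen=window_size)

    def compute_update(self, td_error, learning_rate, current_value, **kwargs):
        self.errors.append(abs(td_error))
        mean_magnitude = sum(self.errors) / len(self.errors)
        effective_lr = learning_rate / (1.0 + mean_magnitude)  # assumed scaling
        return effective_lr * td_error

    def reset(self):
        self.errors.clear()


adaptive = WindowedAdaptiveRule(window_size=2)
first = adaptive.compute_update(td_error=1.0, learning_rate=0.1, current_value=0.0)
```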

class ValueModel[source]

Bases: ABC

Abstract base class for value function models.

A value model stores and updates value estimates for each operator.

abstractmethod get_value(operator_idx)[source]

Get current value estimate for operator.

Return type:

float

abstractmethod set_value(operator_idx, value)[source]

Set value estimate for operator.

abstractmethod update(operator_idx, delta)[source]

Increment value by delta.

abstractmethod reset()[source]

Reset all values.

abstractmethod get_values_array()[source]

Get all values as array.

Return type:

ndarray

class LinearValueModel(n_operators, initial_value=0.5)[source]

Bases: ValueModel

Simple linear value model: V(s, a) = w_a

One value per operator, no state dependence.

__init__(n_operators, initial_value=0.5)[source]
Parameters:
  • n_operators (int) – Number of operators

  • initial_value (float) – Initial value for all operators

get_value(operator_idx)[source]

Get value for operator.

Return type:

float

set_value(operator_idx, value)[source]

Set value for operator.

update(operator_idx, delta)[source]

Update value by delta.

reset()[source]

Reset all values to initial.

get_values_array()[source]

Get copy of values array.

Return type:

ndarray
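The state-independent model V(s, a) = w_a amounts to one stored number per operator. The sketch below is a hypothetical stand-in, `TabularValues`, that uses a plain Python list instead of an ndarray to stay dependency-free. It is meant only to illustrate the documented behaviour that `get_values_array()` returns a copy.

```python
class TabularValues:
    """Hypothetical stand-in: one scalar value per operator, no state input."""

    def __init__(self, n_operators, initial_value=0.5):
        self._initial = float(initial_value)
        self._values = [self._initial] * n_operators

    def get_value(self, operator_idx):
        return self._values[operator_idx]

    def set_value(self, operator_idx, value):
        self._values[operator_idx] = float(value)

    def update(self, operator_idx, delta):
        self._values[operator_idx] += delta

    def reset(self):
        self._values = [self._initial] * len(self._values)

    def get_values_array(self):
        return list(self._values)  # a copy, so callers cannot mutate the model


model = TabularValues(n_operators=3)
model.update(1, 0.25)
snapshot = model.get_values_array()
snapshot[0] = 99.0  # edits the copy only; the model keeps its own value
```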

class BoundedValueModel(n_operators, initial_value=0.5, min_bound=0.1, max_bound=5.0, adapt_bounds=True)[source]

Bases: ValueModel

Value model with learnable bounds for stability.

Maintains per-operator lower/upper bounds on values.

__init__(n_operators, initial_value=0.5, min_bound=0.1, max_bound=5.0, adapt_bounds=True)[source]
Parameters:
  • n_operators (int) – Number of operators

  • initial_value (float) – Initial value for all operators

  • min_bound (float) – Minimum value bound

  • max_bound (float) – Maximum value bound

  • adapt_bounds (bool) – Whether bounds adapt over time

get_value(operator_idx)[source]

Get value for operator.

Return type:

float

set_value(operator_idx, value)[source]

Set value for operator.

update(operator_idx, delta)[source]

Update value by delta with bounds checking.

adapt_bounds(operator_idx, window_size=20)[source]

Adapt bounds based on value history (called periodically).

reset()[source]

Reset all values to initial.

get_values_array()[source]

Get copy of values array.

Return type:

ndarray
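The "bounds checking" in `update` can be pictured as clamping the incremented value to the interval [min_bound, max_bound]. The helper below, `bounded_update`, is hypothetical and assumes simple clamping with the documented default bounds. How `adapt_bounds` moves the bounds is not specified in this reference and is not sketched here.

```python
def bounded_update(value, delta, min_bound=0.1, max_bound=5.0):
    """Apply an increment, then clamp the result to [min_bound, max_bound]."""
    return min(max_bound, max(min_bound, value + delta))


inside = bounded_update(0.5, 0.2)    # stays within the bounds
capped = bounded_update(4.9, 1.0)    # would overshoot, held at max_bound
floored = bounded_update(0.2, -1.0)  # would undershoot, held at min_bound
```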

class EligibilityTraceManager(n_operators, lambda_coeff=0.9, trace_decay=0.99)[source]

Bases: object

Manages eligibility traces for TD(λ) learning.

Maintains traces that decay over time, enabling multistep credit assignment.

__init__(n_operators, lambda_coeff=0.9, trace_decay=0.99)[source]
Parameters:
  • n_operators (int) – Number of operators

  • lambda_coeff (float) – λ coefficient for trace decay (0.0 = TD(0), 1.0 = Monte Carlo)

  • trace_decay (float) – Per-timestep decay of all traces

update_trace(operator_idx, increment=1.0)[source]

Update trace for visited operator and decay all traces.

Parameters:
  • operator_idx (int) – Index of visited operator

  • increment (float) – Increment to add to trace

get_traces()[source]

Get current trace vector.

Return type:

ndarray

reset()[source]

Reset all traces.

set_lambda(lambda_coeff)[source]

Adjust λ coefficient dynamically.

Parameters:

lambda_coeff (float) – New λ value (0.0 to 1.0)
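The trace mechanics can be sketched as accumulating traces: every trace shrinks on each step, and the visited operator's trace is then incremented. The class `TraceSketch` below is a hypothetical stand-in. Both the decay factor `lambda_coeff * trace_decay` and the decay-then-increment order are assumptions, since the reference does not say how the two parameters combine. Plain lists stand in for the ndarray returned by `get_traces()`.

```python
class TraceSketch:
    """Hypothetical accumulating-trace bookkeeping for n operators."""

    def __init__(self, n_operators, lambda_coeff=0.9, trace_decay=0.99):
        self.lambda_coeff = lambda_coeff
        self.trace_decay = trace_decay
        self.traces = [0.0] * n_operators

    def update_trace(self, operator_idx, increment=1.0):
        factor = self.lambda_coeff * self.trace_decay  # assumed combination
        self.traces = [t * factor for t in self.traces]  # decay every trace
        self.traces[operator_idx] += increment  # credit the visited operator

    def get_traces(self):
        return list(self.traces)

    def reset(self):
        self.traces = [0.0] * len(self.traces)


tm = TraceSketch(n_operators=3, lambda_coeff=0.9, trace_decay=1.0)
tm.update_trace(0)
tm.update_trace(1)
current = tm.get_traces()
```

With `lambda_coeff=0.0` the decay factor is zero, so only the most recently visited operator carries a non-zero trace, which matches the documented TD(0) limit.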

class TemporalDifferenceLearner(n_operators, learning_rate=0.1, gamma=0.99, lambda_coeff=0.0, learning_rule=None, value_model=None)[source]

Bases: object

Temporal Difference learner with pluggable rules and value models.

Implements TD(0) and TD(λ) for operator value learning.

__init__(n_operators, learning_rate=0.1, gamma=0.99, lambda_coeff=0.0, learning_rule=None, value_model=None)[source]
Parameters:
  • n_operators (int) – Number of operators

  • learning_rate (float) – Learning rate α

  • gamma (float) – Discount factor

  • lambda_coeff (float) – λ for trace decay (0.0 = TD(0), 0.9 = TD(0.9), 1.0 = Monte Carlo)

  • learning_rule (LearningRule, optional) – Learning rule to use (default: SimpleTDRule)

  • value_model (ValueModel, optional) – Value function model (default: LinearValueModel)

set_learning_rate(learning_rate)[source]

Update learning rate.

set_lambda(lambda_coeff)[source]

Update λ coefficient (switches between TD(0) and TD(λ)).

Parameters:

lambda_coeff (float) – New λ value (0.0 to 1.0)

begin_episode()[source]

Reset traces and timestep for a new episode. Value estimates are preserved.

update(operator_idx, reward, next_state_value=0.0, is_terminal=False)[source]

Perform TD update for visited operator.

Parameters:
  • operator_idx (int) – Index of operator to update

  • reward (float) – Immediate reward signal

  • next_state_value (float) – Value of next state (for bootstrapping)

  • is_terminal (bool) – Whether this is a terminal state

Returns:

update_info – Information about the update (TD error, magnitude, etc.)

Return type:

Dict
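For orientation, the standard one-step TD error is δ = r + γ·V(s') − V(s), with the bootstrap term dropped at a terminal state. The sketch below works through that arithmetic for one operator. The helper `td_step` and its dictionary keys are hypothetical; the actual contents of `update_info` are not listed in this reference, and the terminal handling shown is the textbook convention, assumed here.

```python
def td_step(value, reward, next_state_value=0.0, gamma=0.99,
            learning_rate=0.1, is_terminal=False):
    """One TD(0) step for a single value estimate (textbook form)."""
    # No bootstrapping from the next state once the episode has ended.
    bootstrap = 0.0 if is_terminal else gamma * next_state_value
    td_error = reward + bootstrap - value
    new_value = value + learning_rate * td_error  # simple rule: dV = alpha * delta
    return {"td_error": td_error, "new_value": new_value}


step = td_step(value=0.5, reward=1.0, next_state_value=0.5)
last = td_step(value=0.5, reward=1.0, next_state_value=0.5, is_terminal=True)
```

In the non-terminal call the error is 1.0 + 0.99 × 0.5 − 0.5 = 0.995. In the terminal call the bootstrap term vanishes and the error is 0.5.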

get_values()[source]

Get current value estimates for all operators.

Return type:

ndarray

get_value(operator_idx)[source]

Get value for specific operator.

Return type:

float

reset_values()[source]

Reset all value estimates.

get_statistics()[source]

Get learning statistics.

Return type:

Dict[str, Any]