Moss
MOSS: Minimax Optimal Strategy in the Stochastic case (Audibert & Bubeck 2009). A UCB variant whose confidence bonus is smaller than UCB1's and is clamped to zero once an arm has accumulated at least t / K samples. The score is mean + sqrt(max(0, ln(t / (K * n))) / n), where t is the global step count, K is the number of arms, and n is the arm's sample count.
Achieves minimax-optimal regret in the stochastic bandit setting: its worst-case bound, taken over all reward distributions the bandit could face, is tighter than UCB1's. The clamp removes the log(t) slack from an arm's bonus once that arm has been sampled enough.
Uses the anytime form (no fixed horizon argument). nbrArms is needed to compute the t / K denominator; pass the same arm count the containing MultiArmedBandit uses.
Reach for it when minimax regret matters more than asymptotic optimality: adversarially chosen (but still stochastic) reward distributions, or any setting where the worst case matters. For Bernoulli rewards specifically, KlUcb is asymptotically tighter.
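The score formula above can be sketched as a standalone function. This is a minimal illustration, not the library's implementation: the name mossScore, its parameter list, and the choice to return positive infinity for an unsampled arm are all assumptions made for the example.

```kotlin
import kotlin.math.ln
import kotlin.math.max
import kotlin.math.sqrt

/**
 * Hypothetical helper: anytime MOSS index for a single arm.
 * mean = empirical mean reward, n = samples of this arm,
 * t = global step count, k = number of arms.
 */
fun mossScore(mean: Double, n: Long, t: Long, k: Int): Double {
    // Assumption: an unsampled arm scores +infinity so it is tried first.
    if (n == 0L) return Double.POSITIVE_INFINITY
    val logTerm = ln(t.toDouble() / (k.toDouble() * n.toDouble()))
    // max(0, .) clamps the bonus to zero once n >= t / K.
    return mean + sqrt(max(0.0, logTerm) / n)
}
```

With K = 4 and t = 400 the clamp threshold is t / K = 100 samples. An arm with mean 0.5 and n = 10 gets a bonus of sqrt(ln(10) / 10), roughly 0.48; the same arm at n = 100 has ln(1) = 0, so its score is just the mean.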
Constructors
Properties
Functions
arm: Allocate a fresh per-arm accumulator from the arm spec. Default delegates to arm.createStat(); override only if the policy needs a non-standard variant.
Moss
addArm
Hook called when a new arm joins the population. Lets stateful policies fold the new arm's snapshot into their global counters (UCB's total-samples, UCB1Normal's arm count). Default no-op.
arm
evaluate
Score an arm given its current snapshot. Higher scores are preferred by the bandit. step is the global update count (for time-dependent exploration schedules); rng is the bandit's shared com.eignex.kumulant.bandit.Bandit.random (consumed by sampling policies).
nbrArms
removeArm
Hook called when an arm leaves the population. Inverse of addArm; lets stateful policies remove the departing arm's contribution from their global counters. Default no-op.
update
Fold an observed reward value (with optional weight) into the per-arm stat. Default applies arm.encode first; policies with global counters (UCB families) override to update their counter alongside the stat update.
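The way addArm, removeArm, and update keep a global counter in step with the per-arm stats can be sketched with a toy policy. Everything here is hypothetical: ToySnapshot and ToyCountingPolicy are stand-ins invented for the example and do not mirror the library's actual types or signatures, and the optional weight is left out for brevity.

```kotlin
/** Hypothetical stand-in for a per-arm snapshot; not the library's type. */
data class ToySnapshot(val count: Long, val sum: Double)

/** Toy policy tracking total samples across arms, as a UCB-style policy might. */
class ToyCountingPolicy {
    var totalSamples: Long = 0
        private set

    // A joining arm may already carry history; fold it into the global counter.
    fun addArm(snapshot: ToySnapshot) {
        totalSamples += snapshot.count
    }

    // Inverse of addArm: take the departing arm's samples back out.
    fun removeArm(snapshot: ToySnapshot) {
        totalSamples -= snapshot.count
    }

    // Update the per-arm stat and the global counter together.
    fun update(snapshot: ToySnapshot, value: Double): ToySnapshot {
        totalSamples += 1
        return snapshot.copy(count = snapshot.count + 1, sum = snapshot.sum + value)
    }
}
```

If removeArm did not subtract the departing arm's samples, the global counter would stay inflated and time-dependent scores for the remaining arms would drift.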