UCB1
Classical UCB1 (Auer, Cesa-Bianchi, Fischer 2002). Score is mean + alpha * sqrt(2 * ln(totalSamples) / armSamples); exploitation (running mean) plus a confidence bound that shrinks as the arm accumulates pulls. Unexplored arms get +infinity so they're tried at least once.
Designed for Bernoulli rewards but works on any [0, 1]-bounded reward. The exploration constant alpha scales the confidence width; the theoretical value is 1.0, lower values reduce exploration, higher increases it.
Pair this with a BernoulliArm beta prior; the prior alpha/beta seed the snapshot so the first few pulls aren't dominated by integer noise.
Constructors
Properties
Per-arm cumulator spec; determines the prior pseudo-counts, value encoding, and result shape that evaluate consumes.
Functions
Hook called when a new arm joins the population. Lets stateful policies fold the new arm's snapshot into their global counters (UCB's total-samples, UCB1Normal's arm count). Default no-op.
Allocate a fresh per-arm accumulator from the arm spec. Default delegates to arm.createStat(); override only if the policy needs a non-standard variant.
Score an arm given its current snapshot. Higher scores are preferred by the bandit. step is the global update count (for time-dependent exploration schedules); rng is the bandit's shared com.eignex.kumulant.bandit.Bandit.random (consumed by sampling policies).
Hook called when an arm leaves the population. Inverse of addArm; lets stateful policies remove the departing arm's contribution from their global counters. Default no-op.
UCB1
addArm
Hook called when a new arm joins the population. Lets stateful policies fold the new arm's snapshot into their global counters (UCB's total-samples, UCB1Normal's arm count). Default no-op.
alpha
arm
Per-arm cumulator spec; determines the prior pseudo-counts, value encoding, and result shape that evaluate consumes.
evaluate
Score an arm given its current snapshot. Higher scores are preferred by the bandit. step is the global update count (for time-dependent exploration schedules); rng is the bandit's shared com.eignex.kumulant.bandit.Bandit.random (consumed by sampling policies).
removeArm
Hook called when an arm leaves the population. Inverse of addArm; lets stateful policies remove the departing arm's contribution from their global counters. Default no-op.
update
Fold an observed reward value (with optional weight) into the per-arm stat. Default applies arm.encode first; policies with global counters (UCB families) override to update their counter alongside the stat update.