KlUcb
KL-UCB (Garivier & Cappé 2011). A UCB variant for Bernoulli arms that uses a KL-divergence confidence bound instead of the Hoeffding bound used by UCB1. The score is the largest q in [mean, 1] such that n * KL(mean, q) <= ln(t) + c * ln(ln(t)), where n is the arm's sample count and t is the global step; it is computed by binary search to within tolerance.
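The index definition above can be sketched as a standalone bisection. This is a minimal illustration, not the library's implementation: the function names, the clamping constant, the default c = 0.0, and the handling of unpulled arms and of t <= 1 are all assumptions made for the example.

```kotlin
import kotlin.math.ln

// Bernoulli KL divergence KL(p, q). Inputs are clamped away from 0 and 1
// (assumed constant 1e-15) so the logarithms stay finite.
fun bernoulliKl(p: Double, q: Double): Double {
    val eps = 1e-15
    val pc = p.coerceIn(eps, 1.0 - eps)
    val qc = q.coerceIn(eps, 1.0 - eps)
    return pc * ln(pc / qc) + (1.0 - pc) * ln((1.0 - pc) / (1.0 - qc))
}

// Largest q in [mean, 1] with n * KL(mean, q) <= ln(t) + c * ln(ln(t)),
// located by bisection until the bracket is narrower than `tolerance`.
fun klUcbIndex(
    mean: Double,
    n: Double,
    t: Double,
    c: Double = 0.0,          // illustrative default, not necessarily the library's
    tolerance: Double = 1e-6,
): Double {
    // Assumption: an arm with no samples gets the maximal score.
    if (n <= 0.0) return 1.0
    // ln(ln(t)) is undefined for t <= 1 and negative for t < e;
    // this sketch floors the exploration budget at zero.
    val budget = if (t > 1.0) maxOf(0.0, ln(t) + c * ln(ln(t))) else 0.0
    val threshold = budget / n

    var lo = mean.coerceIn(0.0, 1.0) // always satisfies the constraint
    var hi = 1.0
    while (hi - lo > tolerance) {
        val mid = (lo + hi) / 2.0
        // KL(mean, q) is non-decreasing in q for q >= mean, so bisection is valid.
        if (bernoulliKl(mean, mid) <= threshold) lo = mid else hi = mid
    }
    return lo
}
```

Returning the lower end of the bracket keeps the result on the feasible side of the constraint; returning the midpoint would be an equally reasonable choice at this tolerance.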
Asymptotically optimal for Bernoulli rewards: its regret matches the Lai-Robbins lower bound in the limit. Beats UCB1 in practice when rewards are genuinely Bernoulli; when rewards are bounded but not Bernoulli, its regret is similar to UCB1's.
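For reference, the Lai-Robbins bound mentioned above can be written as follows. The symbols are introduced here for illustration and do not appear in the API: R_T is the cumulative regret after T steps, \mu_a is the mean of arm a, \mu^* is the best mean, and N_a(T) is the number of pulls of arm a.

```latex
% Lower bound for any uniformly good policy on Bernoulli arms:
\liminf_{T \to \infty} \frac{\mathbb{E}[R_T]}{\ln T}
  \;\ge\; \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathrm{KL}(\mu_a, \mu^*)}

% KL-UCB meets it because each suboptimal arm satisfies:
\limsup_{T \to \infty} \frac{\mathbb{E}[N_a(T)]}{\ln T}
  \;\le\; \frac{1}{\mathrm{KL}(\mu_a, \mu^*)}
```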
Each evaluate call costs O(log(1/tolerance)) for the binary search; with the default tolerance = 1e-6 that is about 20 constant-time steps. Cheaper than full Thompson sampling but more expensive than UCB1.
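The step count quoted above follows from halving a bracket of width at most 1 until it is narrower than tolerance. A small sketch of that arithmetic, where `bisectionSteps` is a hypothetical helper and not part of the library:

```kotlin
import kotlin.math.ceil
import kotlin.math.log2

// Number of halvings needed to shrink a bracket of the given width
// to at most `tolerance`. The search interval [mean, 1] has width <= 1,
// so width = 1.0 is the worst case.
fun bisectionSteps(tolerance: Double, width: Double = 1.0): Int =
    if (width <= tolerance) 0 else ceil(log2(width / tolerance)).toInt()
```

With tolerance = 1e-6, log2(1e6) is roughly 19.93, which rounds up to 20 steps; tightening the tolerance by a factor of 1000 adds only about 10 more.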
Constructors
Properties
Per-arm cumulator spec; determines the prior pseudo-counts, value encoding, and result shape that evaluate consumes.
Functions
Hook called when a new arm joins the population. Lets stateful policies fold the new arm's snapshot into their global counters (UCB's total-samples, UCB1Normal's arm count). Default no-op.
Allocate a fresh per-arm accumulator from the arm spec. Default delegates to arm.createStat(); override only if the policy needs a non-standard variant.
Score an arm given its current snapshot. Higher scores are preferred by the bandit. step is the global update count (for time-dependent exploration schedules); rng is the bandit's shared com.eignex.kumulant.bandit.Bandit.random (consumed by sampling policies).
Hook called when an arm leaves the population. Inverse of addArm; lets stateful policies remove the departing arm's contribution from their global counters. Default no-op.
KlUcb
addArm
Hook called when a new arm joins the population. Lets stateful policies fold the new arm's snapshot into their global counters (UCB's total-samples, UCB1Normal's arm count). Default no-op.
arm
Per-arm cumulator spec; determines the prior pseudo-counts, value encoding, and result shape that evaluate consumes.
c
evaluate
Score an arm given its current snapshot. Higher scores are preferred by the bandit. step is the global update count (for time-dependent exploration schedules); rng is the bandit's shared com.eignex.kumulant.bandit.Bandit.random (consumed by sampling policies).
removeArm
Hook called when an arm leaves the population. Inverse of addArm; lets stateful policies remove the departing arm's contribution from their global counters. Default no-op.
tolerance
update
Fold an observed reward value (with optional weight) into the per-arm stat. Default applies arm.encode first; policies with global counters (UCB families) override to update their counter alongside the stat update.
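The interplay of addArm, removeArm, and update for a policy with a global counter can be illustrated with a toy model. Everything in this sketch is hypothetical: `ArmSnapshot` and `GlobalCounter` are stand-ins invented for the example, and their signatures are not the library's.

```kotlin
// Hypothetical stand-in for a per-arm snapshot; not the library's type.
data class ArmSnapshot(val pulls: Long)

// Toy policy state that keeps one global pull counter in sync
// with the arm population.
class GlobalCounter {
    var totalPulls: Long = 0
        private set

    // A joining arm contributes the pulls it has already accumulated.
    fun addArm(snapshot: ArmSnapshot) { totalPulls += snapshot.pulls }

    // A departing arm takes its contribution back out (inverse of addArm).
    fun removeArm(snapshot: ArmSnapshot) { totalPulls -= snapshot.pulls }

    // Each observed reward advances the global counter by one.
    fun update() { totalPulls += 1 }
}
```

A stateless policy needs none of this, which is consistent with the hooks defaulting to no-ops.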