com.eignex.kumulant/bandit/univariate/KlUcb

KlUcb

class KlUcb(val c: Double = 0.0, val tolerance: Double = 1.0E-6, priorAlpha: Double = 1.0, priorBeta: Double = 1.0) : BanditPolicy<BernoulliSumResult>

KL-UCB (Garivier & Cappé 2011). A UCB variant for Bernoulli arms that uses a KL-divergence confidence bound instead of the Hoeffding bound used by UCB1. The score is the largest q in [mean, 1] such that n * KL(mean, q) <= ln(t) + c * ln(ln(t)), where n is the arm's observation count and t is the overall round count; it is found by binary search to within tolerance.
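The score definition above can be sketched as a standalone bisection. This is a minimal illustration, not the library's implementation: the names bernoulliKl and klUcbIndex, the eps clamp, and the n >= 1, t >= 2 preconditions are assumptions made for the example.

```kotlin
import kotlin.math.ln

// Bernoulli KL divergence KL(p, q). Inputs are clamped away from 0 and 1
// (an assumption of this sketch) so the logarithms stay finite.
fun bernoulliKl(p: Double, q: Double, eps: Double = 1e-12): Double {
    val pc = p.coerceIn(eps, 1.0 - eps)
    val qc = q.coerceIn(eps, 1.0 - eps)
    return pc * ln(pc / qc) + (1.0 - pc) * ln((1.0 - pc) / (1.0 - qc))
}

// Largest q in [mean, 1] with n * KL(mean, q) <= ln(t) + c * ln(ln(t)).
// KL(mean, q) is non-decreasing in q for q >= mean, so bisection applies.
fun klUcbIndex(
    mean: Double,
    n: Long,
    t: Long,
    c: Double = 0.0,
    tolerance: Double = 1e-6,
): Double {
    require(n >= 1 && t >= 2) { "sketch assumes n >= 1 and t >= 2" }
    val logT = ln(t.toDouble())
    val budget = (logT + c * ln(logT)) / n
    var lo = mean
    var hi = 1.0
    while (hi - lo > tolerance) {
        val mid = (lo + hi) / 2.0
        // Keep the feasible side of the bracket in lo.
        if (bernoulliKl(mean, mid) <= budget) lo = mid else hi = mid
    }
    return lo
}
```

Returning the lower end of the bracket keeps the result on the feasible side of the constraint, so the reported score never exceeds the exact bound.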

Asymptotically optimal for Bernoulli rewards: its regret matches the Lai-Robbins lower bound in the limit. In practice it beats UCB1 when rewards are genuinely Bernoulli, and its regret stays comparable to UCB1's when rewards are bounded but not Bernoulli.
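One way to see why the KL bound is never looser than a Hoeffding-style bound is Pinsker's inequality. The derivation below is a sketch for the c = 0 case; the symbols \hat{\mu} (empirical mean) and q^{*} (the KL-UCB score) are notation introduced here, not names from the library.

```latex
% Pinsker's inequality for Bernoulli distributions:
\mathrm{KL}(p, q) \;\ge\; 2\,(p - q)^2 .
% Any q feasible for the KL-UCB constraint with c = 0 therefore satisfies
2n\,(q - \hat{\mu})^2 \;\le\; n\,\mathrm{KL}(\hat{\mu}, q) \;\le\; \ln t ,
% so the KL-UCB score is bounded by a Hoeffding-style index:
q^{*} \;\le\; \hat{\mu} + \sqrt{\frac{\ln t}{2n}}
      \;\le\; \hat{\mu} + \sqrt{\frac{2 \ln t}{n}} .
```

The last expression is the index UCB1 uses, so for the same mean, n, and t the KL-UCB score is at most the UCB1 score.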

Per-evaluate cost is O(log(1/tolerance)) for the binary search; with the default tolerance = 1e-6 that is ~20 steps (log2(10^6) ≈ 19.9), each constant-time. Cheaper than full Thompson sampling but more expensive than UCB1.
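The step count quoted above follows from halving a bracket of width at most 1 until it is narrower than tolerance. A small sketch of that arithmetic, assuming a plain bisection loop (the helper name bisectionSteps is hypothetical, and the library's actual loop may stop on a slightly different condition):

```kotlin
import kotlin.math.ceil
import kotlin.math.ln

// Number of halvings needed to shrink a bracket of the given width
// to at most `tolerance`. The search interval [mean, 1] has width <= 1.
fun bisectionSteps(tolerance: Double, width: Double = 1.0): Int =
    ceil(ln(width / tolerance) / ln(2.0)).toInt().coerceAtLeast(0)
```

With the default tolerance of 1e-6 and the widest possible bracket this gives 20 steps; arms whose mean is already close to 1 start from a narrower bracket and need fewer.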

Constructors

KlUcb

constructor(c: Double = 0.0, tolerance: Double = 1.0E-6, priorAlpha: Double = 1.0, priorBeta: Double = 1.0)

Types

Companion

object Companion

Bernoulli KL utilities used by KlUcb.

Properties

arm

open override val arm: BernoulliArm

Per-arm cumulator spec; determines the prior pseudo-counts, value encoding, and result shape that evaluate consumes.

c

val c: Double

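Confidence padding: the coefficient of the ln(ln(t)) term in the exploration budget ln(t) + c * ln(ln(t)). The default c = 0 drops that term, which is the form Garivier & Cappé recommend in practice; their regret analysis uses c = 3.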
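To make the effect of c concrete, the budget can be evaluated on its own. This is an illustrative sketch: explorationBudget is a hypothetical helper, and the t >= 2 guard is an assumption of the example (ln(ln(t)) is undefined at t = 1), not documented library behavior.

```kotlin
import kotlin.math.ln

// Exploration budget ln(t) + c * ln(ln(t)).
// Requires t >= 2 because ln(ln(1)) = ln(0) is not finite.
fun explorationBudget(t: Long, c: Double = 0.0): Double {
    require(t >= 2) { "sketch assumes t >= 2" }
    val logT = ln(t.toDouble())
    return logT + c * ln(logT)
}
```

At t = 1000 the budget is roughly 6.91 with c = 0 and roughly 12.71 with c = 3, so a larger c widens the confidence bound and increases exploration. For t = 2, ln(ln(t)) is negative, so a positive c briefly shrinks the budget instead.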

tolerance

val tolerance: Double

Precision to which the binary search resolves the upper confidence bound q.

Functions

addArm

open override fun addArm(snapshot: BernoulliSumResult)

Hook called when a new arm joins the population. Lets stateful policies fold the new arm's snapshot into their global counters (UCB's total-samples, UCB1Normal's arm count). Default no-op.

evaluate

open override fun evaluate(snapshot: BernoulliSumResult, step: Long, rng: Random): Double

Score an arm given its current snapshot. Higher scores are preferred by the bandit. step is the global update count (for time-dependent exploration schedules); rng is the bandit's shared com.eignex.kumulant.bandit.Bandit.random (consumed by sampling policies).

removeArm

open override fun removeArm(snapshot: BernoulliSumResult)

Hook called when an arm leaves the population. Inverse of addArm; lets stateful policies remove the departing arm's contribution from their global counters. Default no-op.
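The addArm and removeArm hooks are described as inverses that keep a policy's global counters in step with the arm population. A minimal sketch of that bookkeeping, using a hypothetical TotalSamplesCounter class that is not part of kumulant and takes a plain sample count in place of a BernoulliSumResult snapshot:

```kotlin
// Hypothetical stand-in for a policy's global state; not a kumulant class.
class TotalSamplesCounter {
    var totalSamples: Long = 0L
        private set

    // Mirrors addArm: fold the joining arm's sample count into the total.
    fun onArmAdded(armSamples: Long) {
        totalSamples += armSamples
    }

    // Mirrors removeArm: subtract the departing arm's contribution,
    // never letting the total drop below zero.
    fun onArmRemoved(armSamples: Long) {
        totalSamples = (totalSamples - armSamples).coerceAtLeast(0L)
    }
}
```

Adding an arm and then removing it with the same snapshot leaves the total unchanged, which is the property that makes the two hooks inverses.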

update

open override fun update(stat: SeriesStat<BernoulliSumResult>, value: Double, weight: Double = 1.0)

Fold an observed reward value (with optional weight) into the per-arm stat. Default applies arm.encode first; policies with global counters (UCB families) override to update their counter alongside the stat update.

createArm

open fun createArm(): SeriesStat<BernoulliSumResult>

Allocate a fresh per-arm accumulator from the arm spec. Default delegates to arm.createStat(); override only if the policy needs a non-standard variant.