kumulant

UcbV

class UcbV(val zeta: Double = 1.2, val c: Double = 1.0, priorMean: Double = 0.0, priorWeight: Double = 0.02) : BanditPolicy<MomentsResult>

UCB-V: a variance-aware UCB policy whose bound holds at finite sample sizes (Audibert, Munos, Szepesvári 2009). The score is mean + sqrt(2 * V * zeta * ln(t) / n) + 3 * c * zeta * ln(t) / n, where V is the running variance from the MomentsResult snapshot, n is the arm's sample count, and t is the total sample count across arms.

The third term, a bias correction scaled by c, is what distinguishes UCB-V from UCB1Tuned: the bound holds at finite sample sizes rather than only asymptotically. Reach for it when per-arm sample sizes stay small (early stopping, expensive arms) and the asymptotic tightness of UCB1Tuned / KlUcb doesn't materialise.

Audibert et al. recommend zeta in [1, 1.2]; the default of 1.2 sits at the upper end of that range.
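The score formula above can be sketched as a standalone function. This is an illustration only, not the library's implementation: the function name `ucbVScore` and its parameters are hypothetical, and the handling of an arm with no data is an assumption (the real policy presumably relies on priorMean and priorWeight instead).

```kotlin
import kotlin.math.ln
import kotlin.math.sqrt

/**
 * UCB-V score for a single arm, following the formula in the class description.
 * All names here are illustrative; none of them come from the library.
 */
fun ucbVScore(
    mean: Double,      // running mean reward of the arm
    variance: Double,  // running variance of the arm (V in the formula)
    n: Double,         // samples (or total weight) observed for this arm
    t: Double,         // total samples observed across all arms
    zeta: Double = 1.2,
    c: Double = 1.0
): Double {
    // Assumption: an arm without any data is treated as maximally attractive.
    if (n <= 0.0) return Double.POSITIVE_INFINITY

    // Shared exploration factor zeta * ln(t); clamp t so the logarithm is never negative.
    val exploration = zeta * ln(t.coerceAtLeast(1.0))

    // Second term: grows with the arm's observed variance.
    val varianceBonus = sqrt(2.0 * variance * exploration / n)

    // Third term: the finite-sample bias correction scaled by c.
    val biasCorrection = 3.0 * c * exploration / n

    return mean + varianceBonus + biasCorrection
}
```

With zero variance the second term vanishes and only the bias correction remains, which is why a low-variance arm with few samples still receives an exploration bonus under this formula.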

Constructors

constructor(zeta: Double = 1.2, c: Double = 1.0, priorMean: Double = 0.0, priorWeight: Double = 0.02)

Properties

open override val arm: MomentsArm

Per-arm cumulator spec; determines the prior pseudo-counts, value encoding, and result shape that evaluate consumes.

val c: Double

Bias-correction term scale. The default of 1.0 matches the original paper.

val zeta: Double

Variance-term scale. Audibert et al. recommend zeta in [1, 1.2].

Functions

open override fun addArm(snapshot: MomentsResult)

Hook called when a new arm joins the population. Lets stateful policies fold the new arm's snapshot into their global counters (UCB's total-samples, UCB1Normal's arm count). Default no-op.

Allocate a fresh per-arm accumulator from the arm spec. Default delegates to arm.createStat(); override only if the policy needs a non-standard variant.

open override fun evaluate(snapshot: MomentsResult, step: Long, rng: Random): Double

Score an arm given its current snapshot. Higher scores are preferred by the bandit. step is the global update count (for time-dependent exploration schedules); rng is the bandit's shared com.eignex.kumulant.bandit.Bandit.random (consumed by sampling policies).

open override fun removeArm(snapshot: MomentsResult)

Hook called when an arm leaves the population. Inverse of addArm; lets stateful policies remove the departing arm's contribution from their global counters. Default no-op.

open override fun update(stat: SeriesStat<MomentsResult>, value: Double, weight: Double = 1.0)

Fold an observed reward value (with optional weight) into the per-arm stat. Default applies arm.encode first; policies with global counters (UCB families) override to update their counter alongside the stat update.
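The addArm, removeArm, and update hooks described above exist so that a policy can keep a global counter in step with the arm population. The sketch below shows one way such a counter could be maintained. It is a hypothetical illustration: `ArmSnapshot` and `TotalSamplesCounter` are stand-in names and do not reflect the library's types or UcbV's actual internals.

```kotlin
/** Hypothetical stand-in for a snapshot; only the sample count matters here. */
data class ArmSnapshot(val count: Double)

/** Keeps a running total of samples across all arms currently in the population. */
class TotalSamplesCounter {
    var total: Double = 0.0
        private set

    // A joining arm contributes whatever it has already observed.
    fun addArm(snapshot: ArmSnapshot) {
        total += snapshot.count
    }

    // A departing arm takes its contribution with it (inverse of addArm).
    fun removeArm(snapshot: ArmSnapshot) {
        total -= snapshot.count
    }

    // Each observation grows the global total by its weight.
    fun update(weight: Double = 1.0) {
        total += weight
    }
}
```

Keeping the counter symmetric under addArm and removeArm means the total always reflects only the arms that are still live, so a time-dependent exploration term is not inflated by samples from arms that have left.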

UcbV

constructor(zeta: Double = 1.2, c: Double = 1.0, priorMean: Double = 0.0, priorWeight: Double = 0.02)

addArm

open override fun addArm(snapshot: MomentsResult)

Hook called when a new arm joins the population. Lets stateful policies fold the new arm's snapshot into their global counters (UCB's total-samples, UCB1Normal's arm count). Default no-op.

arm

open override val arm: MomentsArm

Per-arm cumulator spec; determines the prior pseudo-counts, value encoding, and result shape that evaluate consumes.

c

val c: Double

Bias-correction term scale. The default of 1.0 matches the original paper.

evaluate

open override fun evaluate(snapshot: MomentsResult, step: Long, rng: Random): Double

Score an arm given its current snapshot. Higher scores are preferred by the bandit. step is the global update count (for time-dependent exploration schedules); rng is the bandit's shared com.eignex.kumulant.bandit.Bandit.random (consumed by sampling policies).

removeArm

open override fun removeArm(snapshot: MomentsResult)

Hook called when an arm leaves the population. Inverse of addArm; lets stateful policies remove the departing arm's contribution from their global counters. Default no-op.

update

open override fun update(stat: SeriesStat<MomentsResult>, value: Double, weight: Double = 1.0)

Fold an observed reward value (with optional weight) into the per-arm stat. Default applies arm.encode first; policies with global counters (UCB families) override to update their counter alongside the stat update.

zeta

val zeta: Double

Variance-term scale. Audibert et al. recommend zeta in [1, 1.2].