BanditPolicy

interface BanditPolicy<R : Result>

Scoring strategy for a com.eignex.kumulant.bandit.univariate.MultiArmedBandit. Decides which arm to play given snapshots of each arm's sufficient statistic R. The bandit calls evaluate for every arm and picks the argmax; the policy is the entire exploration/exploitation knob.

The policy owns the per-arm cumulator lifecycle through its arm spec:

createArm returns a fresh SeriesStat, already seeded with the prior, from arm.createStat().
update folds an observation in, applying arm.encode first so the stat sees the encoded value (e.g. ln(value) for LogNormalArm).
evaluate reads the resulting snapshot.
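
The three lifecycle steps above can be sketched with stand-in types. This is a minimal illustration, not kumulant's implementation: MeanSnapshot, MeanStat, LogArmSpec and GreedyLogPolicy are hypothetical names, and the real Result, SeriesStat and Arm types may have different shapes.

```kotlin
import kotlin.math.E
import kotlin.math.ln

// Hypothetical stand-ins; the real Result, SeriesStat and Arm types may differ.
data class MeanSnapshot(val count: Double, val mean: Double)

class MeanStat(priorCount: Double, priorMean: Double) {
    private var count = priorCount
    private var sum = priorCount * priorMean

    fun update(value: Double, weight: Double = 1.0) {
        count += weight
        sum += weight * value
    }

    fun snapshot() = MeanSnapshot(count, if (count > 0.0) sum / count else 0.0)
}

// Arm spec: prior pseudo-counts plus a log encoding, as for a log-normal reward.
class LogArmSpec(private val priorCount: Double = 1.0, private val priorMean: Double = 0.0) {
    fun createStat() = MeanStat(priorCount, priorMean)
    fun encode(value: Double): Double = ln(value)
}

// Greedy scoring (assumed here for brevity): the score is just the running mean.
class GreedyLogPolicy(val arm: LogArmSpec) {
    fun createArm(): MeanStat = arm.createStat()

    fun update(stat: MeanStat, value: Double, weight: Double = 1.0) =
        stat.update(arm.encode(value), weight)

    fun evaluate(snapshot: MeanSnapshot): Double = snapshot.mean
}

fun lifecycleDemo(): Double {
    val policy = GreedyLogPolicy(LogArmSpec())
    val stat = policy.createArm()            // 1. prior-seeded stat
    policy.update(stat, E)                   // 2. the stat sees ln(e), not e
    return policy.evaluate(stat.snapshot())  // 3. score read from the snapshot
}
```

With one prior pseudo-observation at 0 and one encoded observation at ln(e), the snapshot mean in this sketch lies halfway between the two.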

Two flavours:

Sampling-based (ThompsonSampling): score each arm by a draw from its conjugate Posterior given the snapshot. Exploration is implicit in posterior variance: under-explored arms have wider posteriors and draw higher scores more often.
UCB-based (UCB1, UCB1Normal, UCB1Tuned, UcbV, KlUcb, Moss): the score is mean + alpha * confidence-bound, derived directly from the snapshot. Exploration is explicit in the confidence width.
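
As a rough sketch of how the two flavours differ, the functions below score the same summary both ways. ArmSummary, sampledScore and ucbScore are hypothetical names. The sampling variant assumes a Gaussian approximation to the posterior over the mean rather than the library's conjugate Posterior, and the UCB variant assumes the textbook UCB1 width sqrt(2 ln N / n); the exact formulas used by each policy in the library may differ.

```kotlin
import kotlin.math.PI
import kotlin.math.cos
import kotlin.math.ln
import kotlin.math.sqrt
import kotlin.random.Random

// Hypothetical snapshot shape: observation count, mean and variance of one arm.
data class ArmSummary(val count: Double, val mean: Double, val variance: Double)

// Sampling flavour: one draw from an approximate Gaussian posterior over the mean.
// Fewer observations give a wider posterior, so the draw varies more.
fun sampledScore(s: ArmSummary, rng: Random): Double {
    val u1 = 1.0 - rng.nextDouble()   // in (0, 1], keeps ln(u1) finite
    val u2 = rng.nextDouble()
    val z = sqrt(-2.0 * ln(u1)) * cos(2.0 * PI * u2)  // Box-Muller standard normal
    return s.mean + z * sqrt(s.variance / s.count)
}

// UCB flavour: mean plus alpha times a UCB1-style confidence width.
// totalCount is the policy's global sample count across all arms.
fun ucbScore(s: ArmSummary, totalCount: Double, alpha: Double = 1.0): Double =
    s.mean + alpha * sqrt(2.0 * ln(totalCount) / s.count)
```

In both cases an arm with few observations can outscore a well-explored arm with the same mean: by chance in the sampling variant, deterministically in the UCB variant.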

Per-policy global state (e.g. total samples for UCB) is updated through addArm / removeArm when the arm population changes mid-run, and as a side effect of update on each observation.

Properties

arm

abstract val arm: Arm<R>

Per-arm cumulator spec; determines the prior pseudo-counts, value encoding, and result shape that evaluate consumes.

Functions

addArm

open fun addArm(snapshot: R)

Hook called when a new arm joins the population. Lets stateful policies fold the new arm's snapshot into their global counters (UCB's total-samples, UCB1Normal's arm count). Default no-op.

createArm

open fun createArm(): SeriesStat<R>

Allocate a fresh per-arm accumulator from the arm spec. Default delegates to arm.createStat(); override only if the policy needs a non-standard variant.

evaluate

abstract fun evaluate(snapshot: R, step: Long, rng: Random): Double

Score an arm given its current snapshot. Higher scores are preferred by the bandit. step is the global update count (for time-dependent exploration schedules); rng is the bandit's shared com.eignex.kumulant.bandit.Bandit.random (consumed by sampling policies).
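
The way a bandit might consume evaluate can be sketched as a plain argmax over per-arm scores. selectArm is a hypothetical helper, not part of the library. The scoring function is passed as a lambda with the same parameter order as evaluate, and breaking ties in favour of the first arm is an assumption of this sketch.

```kotlin
import kotlin.random.Random

// Hypothetical helper: score every arm's snapshot and return the index of the best one.
fun <R> selectArm(
    snapshots: List<R>,
    step: Long,
    rng: Random,
    evaluate: (R, Long, Random) -> Double,
): Int {
    require(snapshots.isNotEmpty()) { "need at least one arm" }
    var best = 0
    var bestScore = Double.NEGATIVE_INFINITY
    for ((index, snapshot) in snapshots.withIndex()) {
        val score = evaluate(snapshot, step, rng)
        // Strict comparison: on ties the earlier arm is kept (an assumption of this sketch).
        if (score > bestScore) {
            best = index
            bestScore = score
        }
    }
    return best
}
```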

removeArm

open fun removeArm(snapshot: R)

Hook called when an arm leaves the population. Inverse of addArm; lets stateful policies remove the departing arm's contribution from their global counters. Default no-op.
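
A stateful policy could keep its global counters consistent with the two hooks roughly as follows. CountSnapshot and TotalSamplesTracker are hypothetical names, and the assumption that a snapshot exposes its observation count as a Double is for illustration only.

```kotlin
// Hypothetical snapshot shape carrying the arm's observation count.
data class CountSnapshot(val count: Double, val mean: Double)

// Sketch of the global counters a UCB-style policy might maintain.
class TotalSamplesTracker {
    var totalSamples: Double = 0.0
        private set
    var armCount: Int = 0
        private set

    // A joining arm contributes whatever it has already observed.
    fun addArm(snapshot: CountSnapshot) {
        totalSamples += snapshot.count
        armCount++
    }

    // A departing arm takes its contribution with it (inverse of addArm).
    fun removeArm(snapshot: CountSnapshot) {
        totalSamples -= snapshot.count
        armCount--
    }

    // Called once per observation, alongside the per-arm stat update.
    fun onObservation(weight: Double = 1.0) {
        totalSamples += weight
    }
}
```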

update

open fun update(stat: SeriesStat<R>, value: Double, weight: Double = 1.0)

Fold an observed reward value (with optional weight) into the per-arm stat. Default applies arm.encode first; policies with global counters (UCB families) override to update their counter alongside the stat update.
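
The override pattern described here can be illustrated with a small class hierarchy. RunningSum, BasePolicy and CountingPolicy are hypothetical stand-ins, and passing the encoding as a lambda is a simplification of the arm spec.

```kotlin
// Hypothetical per-arm accumulator standing in for SeriesStat.
class RunningSum {
    var weight = 0.0
        private set
    var sum = 0.0
        private set

    fun add(value: Double, w: Double) {
        weight += w
        sum += w * value
    }
}

open class BasePolicy(private val encode: (Double) -> Double) {
    // Default behaviour: encode the raw value, then fold it into the per-arm stat.
    open fun update(stat: RunningSum, value: Double, weight: Double = 1.0) {
        stat.add(encode(value), weight)
    }
}

class CountingPolicy(encode: (Double) -> Double) : BasePolicy(encode) {
    var totalWeight = 0.0
        private set

    // Kotlin overrides inherit default arguments and may not redeclare them.
    override fun update(stat: RunningSum, value: Double, weight: Double) {
        super.update(stat, value, weight)  // keep the encode-then-fold behaviour
        totalWeight += weight              // maintain the global counter alongside it
    }
}
```

Calling super.update first keeps the default encoding in one place, so the override only adds the bookkeeping that is specific to the policy.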