BanditPolicy
Scoring strategy for a com.eignex.kumulant.bandit.univariate.MultiArmedBandit. Decides which arm to play given snapshots of each arm's sufficient statistic R. The bandit calls evaluate for every arm and picks the argmax; the policy is the entire exploration/exploitation knob.
The policy owns the per-arm cumulator lifecycle through its arm spec:
createArm returns a freshly prior-seeded SeriesStat from arm.createStat().
update folds an observation in, applying arm.encode first so the stat sees the encoded value (e.g. ln(value) for LogNormalArm).
evaluate reads the resulting snapshot.
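The lifecycle above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: MeanStat, LogArm and SketchPolicy are hypothetical stand-ins, the signatures are assumed, and the toy stat carries no prior.

```kotlin
import kotlin.math.exp
import kotlin.math.ln

// Hypothetical stand-in for a per-arm accumulator (no prior, unlike the real stat).
class MeanStat(var n: Long = 0, var mean: Double = 0.0) {
    fun update(x: Double) {
        n += 1
        mean += (x - mean) / n   // incremental running mean
    }
}

// Hypothetical arm spec that encodes observations on the log scale.
class LogArm {
    fun createStat() = MeanStat()
    fun encode(value: Double) = ln(value)
}

// Hypothetical policy: delegates allocation to the arm and encodes before updating.
class SketchPolicy(val arm: LogArm) {
    fun createArm(): MeanStat = arm.createStat()
    fun update(stat: MeanStat, value: Double) = stat.update(arm.encode(value))
    fun evaluate(stat: MeanStat): Double = stat.mean
}

fun lifecycleDemo(): Double {
    val policy = SketchPolicy(LogArm())
    val stat = policy.createArm()
    policy.update(stat, 1.0)        // encoded as ln(1.0) = 0.0
    policy.update(stat, exp(2.0))   // encoded as roughly 2.0
    return policy.evaluate(stat)    // mean of the encoded values, roughly 1.0
}
```

The point of the sketch is the ordering: the stat never sees the raw value, only what the arm's encode returns.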
Two flavours:
Sampling-based (ThompsonSampling): score each arm by a draw from its conjugate Posterior given the snapshot. Exploration is implicit in posterior variance: under-explored arms have wider posteriors and draw higher scores more often.
UCB-based (UCB1, UCB1Normal, UCB1Tuned, UcbV, KlUcb, Moss): score is mean + alpha * confidence-bound, derived from the snapshot directly. Exploration is explicit in the confidence width.
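To make the contrast between the two flavours concrete, here is a hedged sketch. Snapshot, ucbScore and sampledScore are hypothetical names; the UCB width is the classic UCB1 radius, and the sampling variant uses a Gaussian approximation of the posterior rather than the library's conjugate Posterior.

```kotlin
import java.util.Random
import kotlin.math.ln
import kotlin.math.sqrt

// Hypothetical snapshot: pull count, sample mean, sample variance.
data class Snapshot(val n: Long, val mean: Double, val variance: Double)

// UCB-style: mean plus an explicit confidence width (classic UCB1 radius, scaled by alpha).
fun ucbScore(s: Snapshot, step: Long, alpha: Double = 1.0): Double {
    if (s.n == 0L) return Double.POSITIVE_INFINITY   // unplayed arms are tried first
    return s.mean + alpha * sqrt(2.0 * ln(step.toDouble()) / s.n)
}

// Sampling-style: one draw from a Gaussian approximation of the posterior over the mean.
// The spread shrinks as n grows, so exploration fades without an explicit bonus term.
fun sampledScore(s: Snapshot, rng: Random): Double {
    if (s.n == 0L) return Double.POSITIVE_INFINITY
    val posteriorStd = sqrt(s.variance / s.n)
    return s.mean + posteriorStd * rng.nextGaussian()
}
```

With equal means, ucbScore ranks the less-played arm higher deterministically, while sampledScore only does so more often on average.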
Per-policy global state (e.g. total samples for UCB) updates through addArm / removeArm when the arm population changes mid-run, and through update's side effects on each observation.
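One way such global state could be kept consistent is sketched below. TotalSamples and its method signatures are assumptions made for illustration; the actual addArm / removeArm parameters are not shown on this page.

```kotlin
// Hypothetical policy-level counter of total samples across all live arms.
class TotalSamples {
    var total: Long = 0
        private set

    // An arm joins, possibly carrying samples it has already seen.
    fun addArm(armSamples: Long) { total += armSamples }

    // An arm leaves; its samples no longer count toward the total.
    fun removeArm(armSamples: Long) { total -= armSamples }

    // Side effect of folding in one observation for any arm.
    fun update() { total += 1 }
}

fun totalSamplesDemo(): Long {
    val state = TotalSamples()
    state.addArm(3)      // arm arrives with 3 prior samples
    state.update()       // two new observations
    state.update()
    state.removeArm(3)   // the arm is dropped mid-run
    return state.total   // 2 samples remain attributed to live arms
}
```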
Inheritors
Properties
Functions
addArm
arm
createArm
Allocate a fresh per-arm accumulator from the arm spec. Default delegates to arm.createStat(); override only if the policy needs a non-standard variant.
evaluate
Score an arm given its current snapshot. Higher scores are preferred by the bandit. step is the global update count (for time-dependent exploration schedules); rng is the bandit's shared com.eignex.kumulant.bandit.Bandit.random (consumed by sampling policies).
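As a rough sketch of how a bandit might consume evaluate, the loop below scores every arm and keeps the argmax. The Evaluate type alias, chooseArm, and the use of a bare mean as the snapshot are hypothetical simplifications, and the first-wins tie-breaking is an assumption.

```kotlin
import java.util.Random

// Hypothetical scoring function: (snapshot mean, global step, shared rng) -> score.
typealias Evaluate = (Double, Long, Random) -> Double

// Score each arm once and return the index of the highest score (ties: first wins).
fun chooseArm(means: List<Double>, step: Long, rng: Random, evaluate: Evaluate): Int {
    require(means.isNotEmpty()) { "need at least one arm" }
    var best = 0
    var bestScore = Double.NEGATIVE_INFINITY
    for ((i, m) in means.withIndex()) {
        val score = evaluate(m, step, rng)
        if (score > bestScore) {
            best = i
            bestScore = score
        }
    }
    return best
}

// Usage: a purely greedy policy ignores step and rng and scores by the mean alone.
fun greedyDemo(): Int =
    chooseArm(listOf(0.1, 0.7, 0.4), 1L, Random(0)) { mean, _, _ -> mean }
```

Passing step and rng into every call keeps the policy itself free of hidden clocks or random sources, which is what lets one shared generator drive all sampling policies.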