bandit.univariate
Indexless multi-armed bandits. Each round, the bandit picks one of K arms via choose(), the caller observes a reward, and update(arm, value, weight) folds it back into that arm's accumulator. No per-round feature vector; for that, see com.eignex.kumulant.bandit.contextual.
Bandit shells
MultiArmedBandit is the workhorse. It carries a BanditPolicy and a list of Arms; choose delegates to the policy's scoring rule. Most named bandits in the literature (UCB1, Thompson, epsilon-greedy, etc.) are just a policy swap on this shell.
| Bandit | Selection rule |
|---|---|
| MultiArmedBandit | Argmax over per-arm policy scores, or whatever else the policy implements (joint sampling, etc.). |
| BoltzmannBandit | Softmax over per-arm means with a cooling temperature schedule. |
| Exp3Bandit | Adversarial bandit with exponential-weights updates and a regret bound under non-stationary reward distributions. |
| RouletteWheelBandit | Operator-selection roulette where arm probability is proportional to score. Used in meta-heuristics where each "arm" is a neighbourhood move and the score is the recent improvement rate. |
| TopTwoThompsonBandit | Top-two Thompson sampling: draw two samples, play the second-best with probability 1 - beta. Identifies the best arm faster than vanilla Thompson when many arms are competitive. |
Policies
BanditPolicy is the scoring strategy plugged into MultiArmedBandit. The library ships every commonly-cited one:
| Policy | Family |
|---|---|
| Greedy | Pure exploitation; argmax of point estimates. The baseline. |
| EpsilonGreedy, EpsilonDecreasing | Mix of exploitation and uniform-random exploration; epsilon either fixed or annealed. |
| UniformSelection | Pure exploration, used as a baseline. |
| UCB1, UCB1Normal, UCB1Tuned | Upper-confidence-bound family; variants differ in confidence-interval shape. |
| KlUcb | KL-UCB; tighter bound than UCB1 for Bernoulli arms. |
| Moss | MOSS bound; near-optimal regret across stationary settings. |
| UcbV | UCB-V; UCB with variance-aware confidence width. |
| ThompsonSampling | Posterior sampling: draw from each arm's posterior, play argmax. |
Each policy reads its per-arm state through a Posterior adapter that projects the arm's com.eignex.kumulant.core.Result to whatever the scoring rule needs (a mean, a Beta posterior, a normal-gamma posterior, etc.). The GammaScalePosterior is the canonical example used by BoltzmannBandit for variance-scaled softmax temperatures.
Arms
Arm is the per-arm state contract. Each arm pairs a stat with the com.eignex.kumulant.core.Concurrency and reset story it needs:
| Arm | Backing stat | Suits |
|---|---|---|
| BernoulliArm | com.eignex.kumulant.stat.summary.BernoulliSumStat + count | Binary rewards (click / no click, pass / fail). |
| MeanArm | com.eignex.kumulant.stat.summary.MeanStat | Continuous rewards where mean suffices. |
| NormalArm | com.eignex.kumulant.stat.summary.VarianceStat | Continuous rewards with normally-distributed noise; carries enough state for UCB-V and Thompson with normal-gamma. |
| LogNormalArm | Welford over log(value) | Multiplicative rewards (revenue, latency). |
| MomentsArm | com.eignex.kumulant.stat.summary.MomentsStat | Higher-order shape matters (skewness / kurtosis aware scoring). |
CompositeArm (and CompositeSubArm) model multi-component rewards like zero-inflated lognormal revenue, without writing a per-shape arm class. Routing and score combination travel as com.eignex.kumulant.schema.ScalarExpr expressions, so the whole composite round-trips on the wire alongside the rest of the bandit config.
Wire portability
UnivariateBanditSpec is the sealed root of wire-portable bandit configs:
MultiArmedSpec: bandit + policy + arm list.
RouletteWheelSpec: roulette-wheel variant.
Other family-specific specs co-located here.
Configurations and policies round-trip through skema-based JSON / CBOR just like the com.eignex.kumulant.schema.StatSpec family. The materializer in com.eignex.kumulant.bandit takes a spec and a Random and returns the live bandit; pass the same seed across replicas for reproducible exploration.
Interface hierarchy
See com.eignex.kumulant.bandit for the action/state interface split: which bandits expose com.eignex.kumulant.bandit.Scorable, which implement com.eignex.kumulant.bandit.PerArmBandit, and where joint- sampling bandits diverge from the per-arm-score path.
Types
Recipe for one bandit arm's cumulator side: how to build a freshly-seeded SeriesStat for that arm, and how to encode a raw observation before folding it into the stat. Posteriors and BanditPolicys pair with arm specs of the same R.
Scoring strategy for a com.eignex.kumulant.bandit.univariate.MultiArmedBandit. Decides which arm to play given snapshots of each arm's sufficient statistic R. The bandit calls evaluate for every arm and picks the argmax; the policy is the entire exploration/exploitation knob.
Wire-portable specification for a BanditPolicy.
Bernoulli arm. The reward is binary {0, 1} and the unknown is the success probability p. A Beta(priorAlpha, priorBeta) prior is conjugate to the Bernoulli likelihood; the posterior is Beta(priorAlpha + successes, priorBeta + failures).
Beta posterior over a Bernoulli rate. successes and trials-successes are the Beta parameters; both must be positive (i.e. snapshot must be prior-seeded).
Boltzmann exploration (a.k.a. softmax bandit): samples arm a with probability proportional to exp(mean[a] / tau(t)), where tau(t) is the temperature at round t and per-arm means are tracked by independent com.eignex.kumulant.stat.summary.MeanStat cells.
Spec for BoltzmannBandit.
Composite Arm built from N independent sub-arms. Each observation is routed through every sub-arm's CompositeSubArm.valueExpr / CompositeSubArm.weightExpr / CompositeSubArm.filter AST before being fed to the corresponding per-sub-arm accumulator. The composite result is a ResultList of the sub-snapshots; pair with CompositePosterior to combine sub-arm draws into a single score.
Composite Posterior over the sub-snapshots produced by a CompositeArm. Each sub-posterior draws independently; the resulting samples are packed as V(0)..V(N-1) and reduced to a single score by the combine AST.
One leg of a CompositeArm: which arm receives observations, with optional AST-driven transformation of value, weight, and a filter predicate.
Annealed epsilon-greedy. Effective exploration probability decreases as sample count accumulates: eps(t) = min(1, epsilon / totalSamples^decay).
Spec for EpsilonDecreasing.
Epsilon-greedy: with probability epsilon pick a uniformly random arm (explore), otherwise pick the arm with the highest mean (exploit). The simplest exploration scheme that actually works; no math machinery, tune one knob.
Spec for EpsilonGreedy.
Per-arm snapshot for Exp3Bandit: the exponential-weight cell for one arm.
EXP3 (Auer, Cesa-Bianchi, Freund, Schapire 2002); adversarial multi-armed bandit over a fixed pool of nbrArms. Each round: compute play distribution p[a] = (1 - gamma) · w[a]/Σw + gamma/K, sample a ~ p, then on reward r ∈ [0,1] update w[a] *= exp(eta · r / p[a]) using the importance-sampling-corrected gain.
Spec for Exp3Bandit. Pass null for eta / gamma to use the algorithm's defaults.
Gamma posterior over an Exponential rate.
Gamma posterior over the scale of a Gamma likelihood with fixed shape - the shape is a posterior parameter rather than something we infer from data. Not an object because of that parameter.
Beta posterior over a Geometric success probability.
Pure-exploitation policy: always picks the arm with the highest posterior mean. No exploration at all; converges fastest to the apparent best arm but can lock into a suboptimal arm forever if early rewards mislead it.
Spec for Greedy.
KL-UCB (Garivier & Cappé 2011). UCB variant for Bernoulli arms with a KL-divergence confidence bound instead of the Hoeffding bound UCB1 uses. Score is the largest q in [mean, 1] such that n * KL(mean, q) <= ln(t) + c * ln(ln(t)); computed by binary search with tolerance precision.
Spec for KlUcb.
Log-Normal-Gamma posterior: same draw as NormalGammaPosterior but exp-transformed back to the real scale. Intended for arms whose stat already accumulates log-rewards (see LogNormalArm's encode).
Single-moment arm; tracks the running mean but not variance. The right pick when the likelihood's sufficient statistic is one running sum (or equivalently a running mean × count):
Moments-tracking arm. Backs MomentsStat under the hood, which means the snapshot exposes the raw second moment m2 (in addition to mean and variance); required by the variance-aware UCB policies that need mean-of-squares directly:
MOSS; Minimax Optimal Strategy in the Stochastic case (Audibert & Bubeck 2009). UCB variant where the confidence bound shrinks faster than UCB1 once an arm has accumulated more than t / K samples. Score is mean + sqrt(max(0, ln(t / (K * n))) / n).
Spec for Moss.
Univariate bandit with a fixed number of independent arms, each backed by a kumulant SeriesStat; on every choose the bandit asks the policy to score a fresh snapshot per arm and picks the argmax.
Spec for MultiArmedBandit.
Gaussian arm with a Normal-Gamma prior (unknown mean and variance). Tracks both the running mean and the sum of squared deviations, which gives NormalGammaPosterior enough to draw (mean, variance) jointly and gives the variance-aware policies (Greedy, EpsilonGreedy) a reasonable variance estimate.
Normal-Gamma posterior over a normal mean/variance. Draws (variance, mean) jointly: variance ~ Inverse-Gamma(n/2, n*s^2/2), mean | variance ~ Normal(snapshot.mean, sigma^2/n).
Gamma posterior over a Poisson rate: Gamma(sum, totalWeights).
Per-arm state snapshot for RouletteWheelBandit. Exposes the current weight plus the running segment counters callers may want to inspect for debugging.
Adaptive operator-selection bandit in the Ropke-Pisinger 2006 ALNS scheme: each arm carries a weight, choose samples proportional to weights (roulette wheel), and weights re-balance in batches.
Spec for RouletteWheelBandit.
Thompson sampling: score each arm by a draw from its conjugate posterior given the snapshot. The bandit then picks the arm with the highest sample; no explicit exploration knob, the exploration falls out of posterior variance shrinking as data accumulates.
Spec for ThompsonSampling.
Top-Two Thompson Sampling (Russo 2020); pure-exploration variant of Thompson sampling for best-arm identification: sample every arm's posterior, take the argmax arm1, play it with probability beta, or else resample until the argmax differs from arm1 and play that runner-up.
Spec for TopTwoThompsonBandit.
Classical UCB1 (Auer, Cesa-Bianchi, Fischer 2002). Score is mean + alpha * sqrt(2 * ln(totalSamples) / armSamples); exploitation (running mean) plus a confidence bound that shrinks as the arm accumulates pulls. Unexplored arms get +infinity so they're tried at least once.
UCB1-Normal (Auer et al. 2002). Variance-aware UCB for Gaussian rewards; uses the sample variance derived from the MomentsResult snapshot to scale the confidence bound. Reach for it when rewards are roughly Gaussian and unbounded; UCB1's [0, 1] assumption doesn't hold.
Spec for UCB1Normal.
Spec for UCB1.
UCB1-Tuned (Auer et al. 2002). Same shape as UCB1 but the confidence bound multiplier uses an upper bound on the variance: min(0.25, v) where v is the sample variance plus a small padding term. Tighter bound than plain UCB1 when the empirical variance is well below 0.25; degrades gracefully to UCB1 when the variance is uninformative.
Spec for UCB1Tuned.
UCB-V; variance-aware UCB with finite-sample honesty (Audibert, Munos, Szepesvári 2009). Score is mean + sqrt(2 * V * zeta * ln(t) / n) + 3 * c * zeta * ln(t) / n, where V is the running variance from the MomentsResult snapshot.
Spec for UcbV.
Pure-exploration policy: every evaluate returns a fresh uniform draw, so the bandit picks arms uniformly at random regardless of observations. No exploitation at all; the opposite extreme of Greedy.
Spec for UniformSelection.
Wire-portable specification for a univariate bandit instance.
Functions
Thompson sampling over a Beta(priorAlpha, priorBeta) prior on a Bernoulli reward.
Thompson sampling over an exponential reward with a Gamma prior on the rate.
Thompson sampling over a Gamma reward with known shape and Gamma prior on the scale.
Thompson sampling over a geometric reward with a Beta prior on the success probability.
Thompson sampling over a log-normal reward via Normal-Gamma on log(value).
Thompson sampling over a Normal-Gamma prior; unknown mean and variance.
Thompson sampling over a Poisson reward with a Gamma prior on the rate.
Warm-started BernoulliArm from a global Bernoulli snapshot.
Warm-started LogNormalArm from a global weighted-variance snapshot on the log scale (caller is responsible for ensuring the snapshot is over ln(reward)).
Warm-started MeanArm from a global weighted-mean snapshot.
Warm-started MomentsArm from a global moments snapshot.
Warm-started NormalArm from a global weighted-variance snapshot. The arm's prior variance is preserved from the global; only the prior weight is shrunk.