Exp3Bandit
EXP3 (Auer, Cesa-Bianchi, Freund, Schapire 2002); adversarial multi-armed bandit over a fixed pool of nbrArms. Each round: compute play distribution p[a] = (1 - gamma) · w[a]/Σw + gamma/K, sample a ~ p, then on reward r ∈ [0,1] update w[a] *= exp(eta · r / p[a]) using the importance-sampling-corrected gain.
Regret bound is O(sqrt(T · K · ln K)) under default tunings. Univariate sibling to com.eignex.kumulant.bandit.contextual.Exp4Bandit; same machinery without the expert layer. Standalone class (not a BanditPolicy under MultiArmedBandit) because its sampling distribution is computed across arms, not by independent-per-arm score + argmax.
Rewards passed to update must lie in [0, 1] for the regret theory to apply; outside-bound rewards are accepted but may destabilise the exponential weight update.
Use cases: non-stationary or adversarial scalar-reward problems where the per-arm reward distribution may shift over time; settings where a regret bound is wanted without distributional assumptions.
Arms: indexless, nbrArms fixed at construction; per-arm state is one exponential weight.
Memory: O(nbrArms); one weight per arm plus a cached play distribution.
Choose: O(nbrArms); build the play distribution, inverse-CDF sample.
Update: O(nbrArms); rebuilds the play distribution to read p[arm], then multiplicative weight update on the played arm.
Randomness: every choose consumes one random.nextDouble(); reproducible under a fixed seed.
Concurrency: not thread-safe; weights and the cached play distribution are mutated without synchronisation. Serialise choose and update externally for multi-thread use.
Functions
Per-arm snapshot at armIndex. Default implementation reads from the full snapshot; implementations may override to avoid building the entire list when only one arm is needed.
Current per-arm weights, normalised to sum to 1.
Spawn a fresh bandit with the same tunables; weights reset.
Fold another replica's other state into this bandit. Most families merge exactly via the underlying stat's parallel-merge formula; SGD- based contextual bandits merge approximately. Each concrete bandit's KDoc documents its merge semantics.
Current play distribution: weight-normalised softmax blended with uniform gamma.
Materialise the current state as a serialisable snapshot. Reads are non-mutating; call as often as needed without affecting decisions. Same snapshot consistency rules as com.eignex.kumulant.core.Stat.read ; under com.eignex.kumulant.core.Concurrency.Relaxed coupled cells may drift by ULPs.
Exp3Bandit
armWeights
Current per-arm weights, normalised to sum to 1.
choose
create
Spawn a fresh bandit with the same tunables; weights reset.
eta
gamma
merge
Fold another replica's other state into this bandit. Most families merge exactly via the underlying stat's parallel-merge formula; SGD- based contextual bandits merge approximately. Each concrete bandit's KDoc documents its merge semantics.
nbrArms
playDistribution
Current play distribution: weight-normalised softmax blended with uniform gamma.
random
reset
snapshot
Materialise the current state as a serialisable snapshot. Reads are non-mutating; call as often as needed without affecting decisions. Same snapshot consistency rules as com.eignex.kumulant.core.Stat.read ; under com.eignex.kumulant.core.Concurrency.Relaxed coupled cells may drift by ULPs.