Exp4Bandit
EXP4 (Auer, Cesa-Bianchi, Freund, Schapire 2002): an adversarial contextual bandit over a fixed pool of experts. Each round, every expert returns a distribution over arms for the context; the bandit mixes those distributions using per-expert exponential weights, blends in uniform exploration at rate gamma, and samples an arm. On receiving a reward r ∈ [0,1], it folds the IPS-corrected (inverse-propensity-scored) gain back into the expert weights.
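In symbols, the round described above can be written as follows. This is the textbook EXP4 form, not a transcription of the library's code: the notation (w for expert weights, ξ for expert advice, p for the play distribution, a_t for the sampled arm, ĝ for the IPS-corrected gain) is introduced here for illustration, and the implementation may differ in constants or normalisation.

```latex
\begin{aligned}
p_t(a) &= (1-\gamma)\sum_{i=1}^{N} \frac{w_{t,i}}{\sum_{j=1}^{N} w_{t,j}}\,\xi_{t,i}(a) + \frac{\gamma}{K}, \\
\hat{g}_t(a) &= \frac{r_t\,\mathbb{1}[a = a_t]}{p_t(a_t)}, \\
w_{t+1,i} &= w_{t,i}\,\exp\!\Big(\eta \sum_{a=1}^{K} \xi_{t,i}(a)\,\hat{g}_t(a)\Big)
           = w_{t,i}\,\exp\!\Big(\eta\,\xi_{t,i}(a_t)\,\frac{r_t}{p_t(a_t)}\Big).
\end{aligned}
```

Because ĝ is zero for every arm except the one played, the sum over arms in the weight update collapses to a single term.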
Regret against the best expert is O(sqrt(T · K · ln N)) over T rounds under the default eta and gamma, which are derived from nbrArms (K) and experts.size (N). A larger pool gives the bandit more policies to compete with but costs a ln N factor in the bound, so exploration breadth (more experts) trades off against how fast the weights can learn. Rewards passed to update must lie in [0, 1] for the regret theory to apply; out-of-range rewards are accepted but may destabilise the weight updates.
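To get a feel for how the bound scales, the sketch below evaluates its leading term and the horizon-dependent exploration rate from Auer et al. (2002). Both function names are hypothetical, constants are dropped from the regret term, and the gamma shown is the textbook tuning (it needs the horizon T), so it is not necessarily what the library's defaults compute.

```kotlin
import kotlin.math.E
import kotlin.math.ln
import kotlin.math.min
import kotlin.math.sqrt

/** Leading term sqrt(T * K * ln N) of the regret bound, constants dropped. */
fun regretScale(rounds: Long, arms: Int, experts: Int): Double {
    require(rounds > 0 && arms > 0 && experts > 1) { "need T > 0, K > 0, N > 1" }
    return sqrt(rounds.toDouble() * arms * ln(experts.toDouble()))
}

/**
 * Textbook exploration rate: gamma = min(1, sqrt(K ln N / ((e - 1) T))).
 * Illustrative only; the library derives its defaults from K and N.
 */
fun textbookGamma(rounds: Long, arms: Int, experts: Int): Double {
    require(rounds > 0 && arms > 0 && experts > 1) { "need T > 0, K > 0, N > 1" }
    return min(1.0, sqrt(arms * ln(experts.toDouble()) / ((E - 1.0) * rounds)))
}
```

Growing the pool from 8 to 64 experts doubles ln N, so the leading term grows only by a factor of sqrt(2), whereas doubling the number of arms has the same effect with far less added coverage.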
State is per-expert (not per-arm) so it surfaces via Snapshotable<Exp4State> rather than the com.eignex.kumulant.bandit.PerArmBandit convenience used by sibling contextual bandits.
Use cases: non-stationary or adversarial contextual problems where a small set of policies (linear scorers, rule-based heuristics, pretrained models) can advise arm distributions; meta-learning over a finite pool of experts.
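As an illustration of the kinds of experts listed above, the sketch below wraps a hard rule and a linear scorer so that each returns a distribution over arms. The Expert interface, the DoubleArray context type, and both factory functions are assumptions made for this example; the library's actual expert type may look different.

```kotlin
import kotlin.math.exp

/** Hypothetical expert shape: maps a context to a distribution over arms. */
fun interface Expert {
    fun advise(x: DoubleArray): DoubleArray
}

/** Rule-based expert: puts all probability mass on the arm picked by a hard rule. */
fun ruleExpert(nbrArms: Int, rule: (DoubleArray) -> Int): Expert = Expert { x ->
    val arm = rule(x)
    require(arm in 0 until nbrArms) { "rule returned arm $arm" }
    DoubleArray(nbrArms) { if (it == arm) 1.0 else 0.0 }
}

/** Linear scorer: one weight row per arm, softmax over the dot products. */
fun linearSoftmaxExpert(rows: Array<DoubleArray>): Expert = Expert { x ->
    val scores = DoubleArray(rows.size) { a ->
        var s = 0.0
        for (j in x.indices) s += rows[a][j] * x[j]
        s
    }
    // Subtract the maximum score before exponentiating for numerical stability.
    val max = scores.maxOrNull() ?: 0.0
    val e = DoubleArray(scores.size) { exp(scores[it] - max) }
    val z = e.sum()
    DoubleArray(e.size) { e[it] / z }
}
```

A pretrained model can be adapted the same way, as long as its output is turned into a non-negative vector of length nbrArms that sums to 1.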
Arms: contextual, with a caller-defined feature dimension (every expert's advise returns a distribution of length nbrArms); nbrArms and experts.size are fixed at construction.
Memory: O(experts.size + experts.size · nbrArms); one weight per expert plus a cached last-advice matrix and play distribution.
Choose: O(experts.size · (advise + nbrArms)); query every expert and mix their distributions.
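The mixing step behind that cost can be sketched as below, assuming the advice has already been collected as one row per expert. The function name mixAdvice and its parameter layout are hypothetical; this is a standalone illustration, not the library's internal routine.

```kotlin
/**
 * Mixes expert advice into a play distribution:
 * p(a) = (1 - gamma) * sum_i (w_i / sum(w)) * advice[i][a] + gamma / K.
 */
fun mixAdvice(weights: DoubleArray, advice: Array<DoubleArray>, gamma: Double): DoubleArray {
    require(weights.isNotEmpty() && weights.size == advice.size) { "one advice row per expert" }
    require(gamma in 0.0..1.0) { "gamma must lie in [0, 1]" }
    val nbrArms = advice[0].size
    val total = weights.sum()
    val p = DoubleArray(nbrArms)
    for (i in weights.indices) {
        require(advice[i].size == nbrArms) { "expert $i returned wrong length" }
        val share = weights[i] / total            // normalised weight of expert i
        for (a in 0 until nbrArms) p[a] += share * advice[i][a]
    }
    // Blend in uniform exploration so every arm keeps probability >= gamma / K.
    for (a in 0 until nbrArms) p[a] = (1.0 - gamma) * p[a] + gamma / nbrArms
    return p
}
```

The double loop touches each of the experts.size · nbrArms advice entries once, which is where the nbrArms term in the cost comes from; the advise term is the cost of producing each row.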
Update: O(experts.size · (advise + nbrArms)); re-evaluates experts at x so the played arm's IPS gain is correct, then applies a multiplicative update to every expert weight.
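A minimal sketch of that update, assuming the advice rows and the play distribution for x have already been recomputed. The function name and signature are hypothetical, and the final rescaling step is an addition of this example to avoid overflow, not something the source states the library does.

```kotlin
import kotlin.math.exp

/**
 * One EXP4 weight update, in place.
 * The played arm's reward is divided by its play probability (IPS), and each
 * expert is credited in proportion to the mass it put on that arm.
 */
fun exp4Update(
    weights: DoubleArray,
    advice: Array<DoubleArray>,
    playDistribution: DoubleArray,
    arm: Int,
    reward: Double,
    eta: Double,
) {
    require(weights.size == advice.size) { "one advice row per expert" }
    require(arm in playDistribution.indices) { "arm out of range" }
    // Importance-weighted gain of the played arm; all other arms get zero.
    val ipsGain = reward / playDistribution[arm]
    for (i in weights.indices) {
        weights[i] *= exp(eta * advice[i][arm] * ipsGain)
    }
    // Assumption of this sketch: rescale so the largest weight is 1. Only the
    // ratios between weights matter, and this keeps long runs from overflowing.
    val max = weights.maxOrNull() ?: return
    for (i in weights.indices) weights[i] /= max
}
```

With gamma > 0 the play probability of every arm is at least gamma / nbrArms, so the division is safe; a small play probability produces a large gain, which is why out-of-range rewards can destabilise the weights.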
Randomness: every choose consumes one random.nextDouble(); reproducible under a fixed seed when expert advise is deterministic.
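One way to sample an arm with a single uniform draw is inverse-CDF sampling, sketched below. The function name sampleArm is hypothetical and kotlin.random.Random is assumed as the generator type; the source only states that one random.nextDouble() is consumed per choose.

```kotlin
import kotlin.random.Random

/** Inverse-CDF sampling: draws one uniform number and walks the cumulative sum. */
fun sampleArm(p: DoubleArray, random: Random): Int {
    require(p.isNotEmpty()) { "empty distribution" }
    val u = random.nextDouble()          // the single random draw for this round
    var cumulative = 0.0
    for (a in p.indices) {
        cumulative += p[a]
        if (u < cumulative) return a
    }
    return p.lastIndex                   // guards against rounding when sum(p) < 1
}
```

Two generators created with the same seed produce the same sequence of draws, so replaying the same contexts through deterministic experts yields the same sequence of arms.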
Concurrency: not thread-safe; expert weights, the cached advice matrix, and the cached play distribution are mutated without synchronisation. Serialise choose and update externally for multi-threaded use.
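External serialisation can be as simple as routing every call through one lock. The sketch below targets the JVM and uses a hypothetical ContextualBandit interface with assumed choose and update signatures; adapt it to the real method shapes of the class.

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

/** Hypothetical minimal bandit surface; the real signatures may differ. */
interface ContextualBandit<X> {
    fun choose(x: X): Int
    fun update(x: X, arm: Int, reward: Double)
}

/** Serialises every call to the wrapped bandit through a single lock. */
class LockedBandit<X>(private val delegate: ContextualBandit<X>) : ContextualBandit<X> {
    private val lock = ReentrantLock()

    override fun choose(x: X): Int = lock.withLock { delegate.choose(x) }

    override fun update(x: X, arm: Int, reward: Double) =
        lock.withLock { delegate.update(x, arm, reward) }
}
```

A single lock is the simplest choice here because choose and update both touch the same weights and caches; finer-grained locking would gain little.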
Functions
Exp4Bandit
choose
Build the round's play distribution and sample an arm.
create
Spawn a fresh bandit with the same experts and tunables; weights reset to uniform.
eta
expertWeights
Current per-expert weights, normalised to sum to 1.
experts
Fixed pool of experts; non-empty.
gamma
merge
nbrArms
playDistribution
Mean of expert distributions at x weighted by current weights, blended with uniform exploration via gamma.
random
reset
snapshot
Materialise the current state as a serialisable snapshot. Reads are non-mutating; call as often as needed without affecting decisions. Same snapshot consistency rules as com.eignex.kumulant.core.Stat.read; under com.eignex.kumulant.core.Concurrency.Relaxed coupled cells may drift by ULPs.