RouletteWheelBandit
Adaptive operator-selection bandit in the Ropke-Pisinger 2006 ALNS scheme: each arm carries a weight, choose samples proportional to weights (roulette wheel), and weights re-balance in batches.
After every segmentLength calls, weights re-balance via w_i = w_i · (1 - reactionFactor) + reactionFactor · avgScore_i, where avgScore_i is the mean reward per call of arm i over the segment, floored at minWeight so no arm is permanently extinguished. Batched re-balance (rather than per-observation) is useful when rewards are noisy and continuous updates would thrash. Only meaningful for reward-maximisation: the weight increase is asymmetric and has no clean "minimise" dual; callers wanting to minimise should negate the reward before update.
Implemented as a direct UnivariateBandit rather than a BanditPolicy plugged into MultiArmedBandit because the re-balance is a global cross-arm operation (each arm's new weight depends on its segment-mean).
Use cases: operator selection inside meta-heuristics (ALNS, LNS), where reward signals are noisy and selection breadth across many candidate operators matters more than fine-grained per-step tracking.
Arms: indexless, nbrArms fixed at construction; per-arm state is (weight, segment score sum, segment call count).
Memory: O(nbrArms); three parallel arrays plus a segment counter.
Choose: O(nbrArms); sum the weights, inverse-CDF sample.
Update: O(1) amortised; O(nbrArms) on the segment boundary where the re-balance sweeps every arm.
Randomness: every choose consumes one random.nextDouble() (or one nextInt when all weights collapse to zero); reproducible under a fixed seed.
Concurrency: not thread-safe; weights, segment scores, and the segment counter are mutated without synchronisation. Serialise choose and update externally for multi-thread use.
Constructors
Properties
Starting weight assigned to every arm.
Single source of randomness for UnivariateBandit.choose / ContextualBandit.choose and any policy-internal sampling. Callers pass a Random(seed) at construction for reproducible exploration; the bandit threads the same instance through every randomised decision.
Blend factor for the Ropke-Pisinger weight update; 0 = no learning, 1 = pure segment-mean.
Number of update calls between successive weight rebalances.
Functions
Per-arm snapshot at armIndex. Default implementation reads from the full snapshot; implementations may override to avoid building the entire list when only one arm is needed.
Pick an arm to play next. Uses Bandit.random for any sampling. The returned index is in [0, nbrArms). Repeated calls without intervening updates may return different arms (for randomised selection) or the same arm (for argmax-style policies once the leading arm is well-separated).
Spawn a fresh bandit with the same configuration; state resets to the prior seed. The random source is replaced; pass the source you want the new bandit to use for exploration (which is independent of merging in another snapshot's state).
Heuristic merge: weights are arithmetically averaged across replicas; scores and call counts are summed (treating the other replica's segment as additional unobserved data). Roulette-wheel adaptive selection has no canonical merge semantics - the segment-based rebalance is inherently sequential - so use this for "roughly combine two parallel runs" rather than for principled aggregation.
Clear all state back to the prior-seeded baseline. Equivalent to spawning a fresh bandit with the same configuration via Snapshotable.create, but in place; keeps the same arm count, policy, concurrency mode, and random instance.
Materialise the current state as a serialisable snapshot. Reads are non-mutating; call as often as needed without affecting decisions. Same snapshot consistency rules as com.eignex.kumulant.core.Stat.read ; under com.eignex.kumulant.core.Concurrency.Relaxed coupled cells may drift by ULPs.
RouletteWheelBandit
choose
Pick an arm to play next. Uses Bandit.random for any sampling. The returned index is in [0, nbrArms). Repeated calls without intervening updates may return different arms (for randomised selection) or the same arm (for argmax-style policies once the leading arm is well-separated).
create
Spawn a fresh bandit with the same configuration; state resets to the prior seed. The random source is replaced; pass the source you want the new bandit to use for exploration (which is independent of merging in another snapshot's state).
Useful when a worker accepts a stream of snapshots to apply sequentially: create(random).also { it.merge(snapshot) }.
evaluate
initialWeight
Starting weight assigned to every arm.
merge
Heuristic merge: weights are arithmetically averaged across replicas; scores and call counts are summed (treating the other replica's segment as additional unobserved data). Roulette-wheel adaptive selection has no canonical merge semantics - the segment-based rebalance is inherently sequential - so use this for "roughly combine two parallel runs" rather than for principled aggregation.
minWeight
nbrArms
random
Single source of randomness for UnivariateBandit.choose / ContextualBandit.choose and any policy-internal sampling. Callers pass a Random(seed) at construction for reproducible exploration; the bandit threads the same instance through every randomised decision.
reactionFactor
Blend factor for the Ropke-Pisinger weight update; 0 = no learning, 1 = pure segment-mean.
reset
Clear all state back to the prior-seeded baseline. Equivalent to spawning a fresh bandit with the same configuration via Snapshotable.create, but in place; keeps the same arm count, policy, concurrency mode, and random instance.
segmentLength
Number of update calls between successive weight rebalances.
snapshot
Materialise the current state as a serialisable snapshot. Reads are non-mutating; call as often as needed without affecting decisions. Same snapshot consistency rules as com.eignex.kumulant.core.Stat.read ; under com.eignex.kumulant.core.Concurrency.Relaxed coupled cells may drift by ULPs.
update
Fold a single observed reward value into the arm at armIndex with the given weight. Weight is the same observation-weight that runs through the rest of the library; typically 1.0, occasionally importance-weighted for off-policy correction.
Index out of range throws; some bandits also bound-check the value (e.g. Bernoulli arms require value in {0.0, 1.0}).