kumulant

stat.regression.tree

Online VFDT decision trees and random forests, plus the shared machinery they're built on. The package covers both regression (continuous y) and classification (y in [0, numClasses)) under one consistent shape.

Picking a tree stat

Stat | Output | Leaf state
DecisionTreeRegressionStat | Continuous y prediction | com.eignex.kumulant.stat.summary.WeightedVarianceResult
RandomForestRegressionStat | Bagged ensemble of regression trees | One DecisionTreeRegressionStat per tree
DecisionTreeClassifierStat | K-way class probabilities | ClassCountsResult
RandomForestClassifierStat | Bagged ensemble of classifier trees; predictions average per-class probabilities | One DecisionTreeClassifierStat per tree

Each takes a list of Split candidates (axis-aligned thresholds or arbitrary ExprSplit expressions) and a config: RegressionTreeConfig for the regression side, ClassificationTreeConfig for the classification side. A split fires when the best candidate beats the runner-up by the Hoeffding bound on the configured metric: VarianceReduction for regression, GiniReduction or InformationGain for classification.
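The split test can be sketched with the textbook VFDT rule. This is a minimal illustration, not the package's implementation: the function names and the range parameter are assumptions, and the role given to tau (tie-breaking once the bound is small) is the usual VFDT convention rather than something this page confirms.

```kotlin
import kotlin.math.ln
import kotlin.math.sqrt

// Hoeffding bound: with probability 1 - delta, the true mean of a variable
// with range `range` lies within epsilon of the mean observed over n samples.
fun hoeffdingBound(range: Double, delta: Double, n: Double): Double =
    sqrt(range * range * ln(1.0 / delta) / (2.0 * n))

// Textbook VFDT decision (assumed here): split when the best candidate leads
// the runner-up by more than epsilon, or when epsilon has shrunk below tau.
fun shouldSplit(
    top1: Double,
    top2: Double,
    range: Double,
    delta: Double,
    tau: Double,
    n: Double,
): Boolean {
    val epsilon = hoeffdingBound(range, delta, n)
    return top1 - top2 > epsilon || epsilon < tau
}
```

With delta = 0.05 and a metric range of 1, the bound is roughly 0.12 after 100 observations, so a lead of 0.4 splits immediately while a lead of 0.05 has to wait for more data.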

Split candidates

ThresholdSplit is the standard axis-aligned predicate x[i] <= t. ExprSplit takes any com.eignex.kumulant.schema.BoolExpr for non-axis-aligned splits; useful when you want a tree to split on a derived feature (x[0] + x[1] > 0, (x[0] - x[1]).abs() > 1) without materialising the feature in the input vector.
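The two kinds of predicate can be illustrated without the library. In this sketch RowSplit is a hypothetical stand-in for Split, and the lambdas stand in for BoolExpr trees; the real ExprSplit takes an expression AST, not a Kotlin lambda.

```kotlin
import kotlin.math.abs

// Hypothetical stand-in for the package's Split: a binary predicate over a row.
fun interface RowSplit {
    fun test(row: DoubleArray): Boolean
}

// Axis-aligned predicate x[i] <= t, inclusive on the "pos" side.
fun thresholdSplit(featureIndex: Int, threshold: Double): RowSplit =
    RowSplit { row -> row[featureIndex] <= threshold }

// Derived-feature predicates written directly against the raw vector, so the
// input never has to carry x[0] + x[1] or |x[0] - x[1]| as an extra column.
val sumPositive = RowSplit { row -> row[0] + row[1] > 0.0 }
val farApart = RowSplit { row -> abs(row[0] - row[1]) > 1.0 }
```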

Pass an empty splitCandidates list to disable growth and degenerate the stat into a single global accumulator. In the random forests, config.mtry defaults to ceil(sqrt(p)), Breiman-style.

Posteriors

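TreePosterior (and its forest counterpart ForestPosterior) turns a TreeRegressionResult / ForestRegressionResult into a scalar score. Three variants exist: MeanTreePosterior scores with the leaf's running mean, ThompsonTreePosterior draws a Thompson sample around it, and UcbTreePosterior adds an exploration bonus to it. Each has a forest counterpart.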

These plug into com.eignex.kumulant.bandit.contextual.RegressionContextualBandit for non-linear contextual bandits with the same Thompson / UCB / mean choice that the GLM side has via the com.eignex.kumulant.stat.regression.glm.LinearPosterior family.

Merge

All four merge approximately through tree.mergeSnapshot(), which combines matching leaf accumulators: weighted Welford per leaf for the regression trees, cell-wise class-count sums for the classifiers, per-tree for the forests. The combine assumes the tree structures line up (same split history); divergent trees merge only their shared leaves. For distributed training, prefer feeding one stat the ordered stream and treating merge as a roll-up.
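The per-leaf combine can be sketched as follows. LeafSummary and both function names are hypothetical; the sketch uses the standard pairwise (Chan et al.) update for weighted mean and variance and a cell-wise sum for class counts, which is what the description above suggests but not necessarily how mergeSnapshot() is written.

```kotlin
// Hypothetical leaf summary: total weight, weighted mean, and the weighted
// sum of squared deviations from the mean (M2).
data class LeafSummary(val weight: Double, val mean: Double, val m2: Double) {
    val variance: Double get() = if (weight > 0.0) m2 / weight else 0.0
}

// Pairwise combine: exact for two summaries built from disjoint data.
fun mergeLeaves(a: LeafSummary, b: LeafSummary): LeafSummary {
    val w = a.weight + b.weight
    if (w == 0.0) return LeafSummary(0.0, 0.0, 0.0)
    val d = b.mean - a.mean
    val mean = a.mean + d * b.weight / w
    val m2 = a.m2 + b.m2 + d * d * a.weight * b.weight / w
    return LeafSummary(w, mean, m2)
}

// Classifier leaves: cell-wise sum of the per-class weighted counts.
fun mergeCounts(a: DoubleArray, b: DoubleArray): DoubleArray {
    require(a.size == b.size) { "class-count vectors must have the same length" }
    return DoubleArray(a.size) { a[it] + b[it] }
}
```

The leaf-level arithmetic is exact; the approximation in a tree merge comes from the structures, since two trees trained on different shards rarely share the same split history.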

Concurrency

The hot update path touches exactly one accumulator: the leaf the observation routes to. Internal split nodes carry no live arm; subtree aggregates are derived by combining descendants at snapshot time. Each leaf arm honours the com.eignex.kumulant.core.Concurrency level passed in, so multiple threads landing in different leaves never contend. Split conversion takes a per-tree lock that is acquired only at split decisions. See RegressionTree / ClassificationTree for the full concurrency design.

Types

@Serializable
@SerialName(value = "ClassCountsResult")
data class ClassCountsResult(val numClasses: Int, val counts: DoubleArray) : Result

Snapshot of a K-bin weighted class-count vector. Used as the leaf aggregate for ClassificationTree: the per-leaf running tally of how many (weighted) observations of each class landed in this leaf. Exposes derived class probabilities and the argmax prediction.

class ClassCountsStat(val numClasses: Int, val concurrency: Concurrency = Concurrency.None) : SeriesStat<ClassCountsResult>

Series stat over discrete class-index inputs. update(value, weight) interprets value.toInt() as the class index; out-of-range values are dropped. Each class cell is an independent striped atomic adder so updates commute under any Concurrency level.
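The update and read-out behaviour described above can be sketched with a plain array. ClassTally is a hypothetical, single-threaded stand-in (no striped adders), and the uniform fallback for an empty tally is an assumption, not documented behaviour.

```kotlin
// Hypothetical K-bin weighted class tally mirroring the behaviour described above.
class ClassTally(val numClasses: Int) {
    val counts = DoubleArray(numClasses)

    // value.toInt() is the class index; out-of-range indices are dropped.
    fun update(value: Double, weight: Double = 1.0) {
        val k = value.toInt()
        if (k in 0 until numClasses) counts[k] += weight
    }

    // Normalised class probabilities; uniform when nothing has been seen
    // (the empty-tally convention is an assumption of this sketch).
    fun probabilities(): DoubleArray {
        val total = counts.sum()
        return if (total > 0.0) DoubleArray(numClasses) { counts[it] / total }
        else DoubleArray(numClasses) { 1.0 / numClasses }
    }

    // Index of the largest count; the first index wins ties.
    fun predict(): Int = counts.indices.maxByOrNull { counts[it] } ?: -1
}
```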

Audit leaf tracking per-candidate pos/neg class-count accumulators.

Leaf; owns a per-class count accumulator.

sealed interface ClassificationNode

Classification mirror of RegressionNode. Identical structure to the regression side except that leaf arms carry class-count snapshots rather than weighted-variance summaries.

@Serializable
sealed interface ClassificationSplitMetric

Classification analogue of SplitMetric: scores a candidate split using the leaf's per-class counts. Higher is better; returns 0 when the split has no signal (one side empty, or zero impurity reduction).

Split mirror; predicate routes to pos or neg; carryover absorbs orphan aggregates produced by mixed-structure merges or pre-split snapshots.

Frozen leaf; no further splits considered.

class ClassificationTree(numClasses: Int, splitCandidates: List<Split>, config: ClassificationTreeConfig = ClassificationTreeConfig(), concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<ClassCountsResult> = { ClassCountsStat(numClasses, concurrency) }, randomSeed: Int = 0)

Classification mirror of RegressionTree: online VFDT decision tree where each leaf carries a per-class count accumulator and audit leaves track class counts per candidate split. Splits fire when a candidate clears the Hoeffding bound on the configured ClassificationSplitMetric (Gini or information gain).

@Serializable
data class ClassificationTreeConfig(val delta: Double = 0.05, val deltaDecay: Double = 0.9, val tau: Double = 0.05, val minSamplesSplit: Double = 30.0, val minSamplesLeaf: Double = 5.0, val splitPeriod: Int = 10, val maxDepth: Int = 16, val maxNodes: Int = 1024, val metric: ClassificationSplitMetric = GiniReduction, val mtry: Int? = null)

Classification analogue of RegressionTreeConfig. Same tunables, but the split metric defaults to GiniReduction and the criterion is a ClassificationSplitMetric.

class DecisionTreeClassifierStat(val featureSize: Int, val numClasses: Int, val splitCandidates: List<Split>, val config: ClassificationTreeConfig = ClassificationTreeConfig(), val concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<ClassCountsResult> = { ClassCountsStat(numClasses, concurrency) }, randomSeed: Int = 0) : RegressionStat<TreeClassificationResult>

Online VFDT decision-tree classifier; the classification counterpart of DecisionTreeRegressionStat. Each leaf carries a ClassCountsResult (per-class weighted counts); splits fire when a candidate beats the runner-up by the Hoeffding bound on Gini reduction or information gain.

class DecisionTreeRegressionStat(val featureSize: Int, val splitCandidates: List<Split>, val config: RegressionTreeConfig = RegressionTreeConfig(), val concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<WeightedVarianceResult> = { VarianceStat(concurrency) }, randomSeed: Int = 0) : RegressionStat<TreeRegressionResult>

Online VFDT decision-tree regressor; a piecewise-constant predictor over the feature space, growing on the fly via the Hoeffding bound. Wraps a RegressionTree in the kumulant RegressionStat contract so it composes with everything that consumes regressors (the bandit family, schemas, op pipelines).

@Serializable
@SerialName(value = "ExprSplit")
data class ExprSplit(val expr: BoolExpr) : Split

Route by an arbitrary BoolExpr evaluated against the context vector. The expression sees the context's first coordinate as X, the second as Y, and the full vector via V(i), matching the existing kumulant AST conventions. Wire-portable through skema's polymorphism on BoolExpr.

@Serializable
@SerialName(value = "ForestClassificationResult")
data class ForestClassificationResult(val numClasses: Int, val trees: List<TreeClassificationResult>) : Result

Snapshot of a RandomForestClassifierStat: per-tree immutable snapshots plus ensemble-aware predict helpers.

TreePosterior family ported to forests: every leaf snapshot the query routes to is merged into a single weighted-variance result, then scored with the tree-posterior semantics. Same options, applied to the ensembled leaf.

@Serializable
@SerialName(value = "ForestRegressionResult")
data class ForestRegressionResult(val trees: List<TreeRegressionResult>) : Result

Snapshot of a RandomForestRegressionStat: per-tree immutable snapshots.

@Serializable
@SerialName(value = "GiniReduction")
data object GiniReduction : ClassificationSplitMetric

Weighted Gini-impurity reduction. The classic CART classification criterion.
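As a reference for the criterion, here is a minimal sketch of weighted Gini reduction over the pos/neg class-count vectors of one candidate. The function names are hypothetical, and the library's own metric may differ in signature and edge-case handling.

```kotlin
// Gini impurity of a weighted class-count vector: 1 - sum_k p_k^2.
fun gini(counts: DoubleArray): Double {
    val total = counts.sum()
    if (total <= 0.0) return 0.0
    return 1.0 - counts.sumOf { (it / total) * (it / total) }
}

// Parent impurity minus the weight-averaged impurity of the two children.
// Returns 0 when either side is empty, so "no signal" ranks last.
fun giniReduction(pos: DoubleArray, neg: DoubleArray): Double {
    val nPos = pos.sum()
    val nNeg = neg.sum()
    if (nPos <= 0.0 || nNeg <= 0.0) return 0.0
    val parent = DoubleArray(pos.size) { pos[it] + neg[it] }
    val n = nPos + nNeg
    return gini(parent) - (nPos / n) * gini(pos) - (nNeg / n) * gini(neg)
}
```

A candidate that separates two balanced classes perfectly scores 0.5; one that leaves both children with the parent's class mix scores 0.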

@Serializable
@SerialName(value = "InformationGain")
data object InformationGain : ClassificationSplitMetric

Information-gain split criterion: parent entropy minus weighted children entropy.
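The same pos/neg count vectors can be scored by information gain. This sketch measures entropy in bits; the function names are hypothetical and the log base used by the library is not stated on this page.

```kotlin
import kotlin.math.ln

// Shannon entropy, in bits, of a weighted class-count vector.
fun entropy(counts: DoubleArray): Double {
    val total = counts.sum()
    if (total <= 0.0) return 0.0
    return -counts.filter { it > 0.0 }.sumOf { (it / total) * ln(it / total) } / ln(2.0)
}

// Parent entropy minus the weight-averaged entropy of the two children.
// Returns 0 when either side is empty.
fun informationGain(pos: DoubleArray, neg: DoubleArray): Double {
    val nPos = pos.sum()
    val nNeg = neg.sum()
    if (nPos <= 0.0 || nNeg <= 0.0) return 0.0
    val parent = DoubleArray(pos.size) { pos[it] + neg[it] }
    val n = nPos + nNeg
    return entropy(parent) - (nPos / n) * entropy(pos) - (nNeg / n) * entropy(neg)
}
```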

Forest counterpart to MeanTreePosterior.

Score is the leaf's running mean; point estimate, no exploration.

class RandomForestClassifierStat(val featureSize: Int, val numClasses: Int, val splitCandidates: List<Split>, val nbrTrees: Int = 10, config: ClassificationTreeConfig = ClassificationTreeConfig(), val bagging: Boolean = true, val concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<ClassCountsResult> = { ClassCountsStat(numClasses, concurrency) }, randomSeed: Int = 0) : RegressionStat<ForestClassificationResult>

Online random-forest classifier; the classification counterpart of RandomForestRegressionStat. Same diversity tricks (Oza & Russell bagging, per-leaf mtry), but per-tree leaves are ClassCountsResult and ensemble predictions average per-class probabilities across trees.
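The two diversity tricks can be sketched independently of the forest classes. All names here are hypothetical. The Poisson(1) weight per tree is the standard Oza & Russell online-bagging scheme; treating p as the number of split candidates when computing the default mtry is an assumption.

```kotlin
import kotlin.math.ceil
import kotlin.math.exp
import kotlin.math.sqrt
import kotlin.random.Random

// Draw k ~ Poisson(lambda) with Knuth's multiplication method (fine for small lambda).
fun poisson(lambda: Double, rng: Random): Int {
    val limit = exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
        k++
        p *= rng.nextDouble()
    }
    return k
}

// Online bagging: each tree sees the observation with its own Poisson(1) weight,
// approximating a bootstrap resample without storing the stream.
fun baggingWeights(nbrTrees: Int, rng: Random): IntArray =
    IntArray(nbrTrees) { poisson(1.0, rng) }

// Random-subspace selection at leaf birth: keep ceil(sqrt(p)) of the p candidates.
fun candidateSubset(p: Int, rng: Random): List<Int> {
    val mtry = ceil(sqrt(p.toDouble())).toInt()
    return (0 until p).shuffled(rng).take(mtry)
}
```

A weight of 0 means the tree skips the observation; a weight of 2 means it counts twice, which is how the trees end up with different effective training sets from one shared stream.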

class RandomForestRegressionStat(val featureSize: Int, val splitCandidates: List<Split>, val nbrTrees: Int = 10, config: RegressionTreeConfig = RegressionTreeConfig(), val bagging: Boolean = true, val concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<WeightedVarianceResult> = { VarianceStat(concurrency) }, randomSeed: Int = 0) : RegressionStat<ForestRegressionResult>

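Online random-forest regressor: a population of RegressionTrees sharing the candidate-split pool. Diversity comes from Oza & Russell online bagging and per-leaf mtry candidate subsets, the same tricks the classifier forest uses.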

Leaf that tracks per-candidate pos/neg stats. When a candidate clears the Hoeffding-bound test, this leaf is replaced by a RegressionSplitNode. The candidate subset is per-leaf, picked at leaf birth, so mtry-style random subspace selection lives at the leaf level.

Leaf node; terminus of the tree walk for a given row, and the only node type that owns a live accumulator.

sealed interface RegressionNode

Internal tree node. The hot update path touches only the leaf an observation routes to; internal split nodes are never written by RegressionTree.update. Splits may carry an optional carryover arm: a one-shot snapshot of the pre-split aggregate captured at the moment a leaf converts into a split, plus any orphaned aggregates folded in by mixed-structure merges. Subtree aggregates include the carryover but the hot path never reads or writes it.

Routes by split to either pos (true) or neg (false). The optional carryover holds aggregates that don't structurally belong to either child: the pre-split data frozen at split time, or orphans absorbed from a mixed merge. Never written by the update hot path; never read by findLeaf or predict; included by subtreeAggregate.

Frozen leaf; no further splits will be considered.

class RegressionTree(splitCandidates: List<Split>, config: RegressionTreeConfig = RegressionTreeConfig(), concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<WeightedVarianceResult> = { VarianceStat(concurrency) }, randomSeed: Int = 0)

Online VFDT-style decision tree partitioning context vectors. Each leaf carries a weighted-variance accumulator; audit leaves additionally track pos/neg sub-arms per candidate split and, every RegressionTreeConfig.splitPeriod observations, evaluate them against the Hoeffding bound to decide whether to convert themselves into a RegressionSplitNode.

@Serializable
data class RegressionTreeConfig(val delta: Double = 0.05, val deltaDecay: Double = 0.9, val tau: Double = 0.05, val minSamplesSplit: Double = 30.0, val minSamplesLeaf: Double = 5.0, val splitPeriod: Int = 10, val maxDepth: Int = 16, val maxNodes: Int = 1024, val metric: SplitMetric = VarianceReduction, val mtry: Int? = null)

@Serializable
sealed interface Split

Binary predicate routing a context vector to "pos" (true) or "neg" (false).

data class SplitInfo(val top1: Double, val top2: Double, val bestIndex: Int)

Result of evaluating all candidate splits at a leaf: best score, runner-up, best index.

@Serializable
sealed interface SplitMetric

Scores a candidate split against a leaf's pre-split distribution. Higher is better. Returned score must satisfy value(total, total, empty) == 0 so that "no signal" is always last in the ranking.

data class ThompsonForestPosterior(val priorWeight: Double = 1.0, val priorVariance: Double = 1.0) : ForestPosterior

Forest counterpart to ThompsonTreePosterior.

data class ThompsonTreePosterior(val priorWeight: Double = 1.0, val priorVariance: Double = 1.0) : TreePosterior

Thompson sampling over the leaf's Normal-Gamma posterior. Given the leaf's pseudo-count n, sample mean m, and sample variance v, draws are mu ~ N(m, exploration * v / max(n, 1)), the posterior on the leaf mean assuming a Normal-Gamma conjugate with a weak prior. exploration = 0.0 collapses to the leaf mean.
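The draw formula can be written out directly. This is a sketch of the stated formula only: thompsonScore is a hypothetical name, exploration is taken as an explicit argument, and java.util.Random is used for its Gaussian sampler, which ties the sketch to the JVM.

```kotlin
import java.util.Random
import kotlin.math.max
import kotlin.math.sqrt

// One Thompson draw for a leaf: mu ~ N(m, exploration * v / max(n, 1)).
fun thompsonScore(
    n: Double,
    mean: Double,
    variance: Double,
    exploration: Double,
    rng: Random,
): Double {
    val sd = sqrt(exploration * variance / max(n, 1.0))
    return mean + sd * rng.nextGaussian()
}
```

The spread shrinks as 1 / sqrt(n), so well-visited leaves are sampled close to their mean while sparse leaves keep exploring.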

@Serializable
@SerialName(value = "ThresholdSplit")
data class ThresholdSplit(val featureIndex: Int, val threshold: Double) : Split

Route by row[featureIndex] <= threshold. Threshold is inclusive on the "pos" side.

@Serializable
@SerialName(value = "TreeClassificationLeafResult")
data class TreeClassificationLeafResult(val value: ClassCountsResult) : TreeClassificationNodeResult

Immutable classification leaf-node snapshot.

@Serializable
sealed interface TreeClassificationNodeResult

Snapshot of a single classification-tree node; split or leaf.

@Serializable
@SerialName(value = "TreeClassificationResult")
data class TreeClassificationResult(val root: TreeClassificationNodeResult) : Result

Classification mirror of TreeRegressionResult.

@Serializable
@SerialName(value = "TreeClassificationSplitResult")
data class TreeClassificationSplitResult(val split: Split, val pos: TreeClassificationNodeResult, val neg: TreeClassificationNodeResult, val value: ClassCountsResult) : TreeClassificationNodeResult

Immutable classification split-node snapshot.

@Serializable
@SerialName(value = "TreeLeafResult")
data class TreeLeafResult(val value: WeightedVarianceResult) : TreeNodeResult

Immutable leaf-node snapshot.

@Serializable
sealed interface TreeNodeResult

Snapshot of a single tree node; split or leaf.

RegressionTree-aware scorer: routes the query x to a leaf snapshot and turns its weighted-variance summary into a single Double. Parallels the linear-side com.eignex.kumulant.stat.regression.glm.LinearPosterior family for the tree regressor shape.

@Serializable
@SerialName(value = "TreeRegressionResult")
data class TreeRegressionResult(val root: TreeNodeResult) : Result

Immutable snapshot of a RegressionTree at read time. Carries the tree structure (split predicates + per-node weighted-variance aggregates) so callers can route a context vector to its leaf without reaching back into the live stat.
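Routing a context vector through such a snapshot is a plain tree walk. The sketch below uses hypothetical local types (Node, Leaf, Branch) with threshold predicates only; the real snapshot types carry a Split and a WeightedVarianceResult per node.

```kotlin
// Hypothetical, simplified snapshot shape: leaves hold a mean, branches hold
// an axis-aligned predicate row[featureIndex] <= threshold.
sealed interface Node
data class Leaf(val mean: Double) : Node
data class Branch(
    val featureIndex: Int,
    val threshold: Double,
    val pos: Node,
    val neg: Node,
) : Node

// Walk from the root to the leaf the row routes to: pos when the predicate
// holds, neg otherwise.
tailrec fun findLeaf(node: Node, row: DoubleArray): Leaf = when (node) {
    is Leaf -> node
    is Branch -> findLeaf(
        if (row[node.featureIndex] <= node.threshold) node.pos else node.neg,
        row,
    )
}
```

Because the snapshot is immutable, this walk needs no locking and can run on any thread while the live tree keeps updating.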

@Serializable
@SerialName(value = "TreeSplitResult")
data class TreeSplitResult(val split: Split, val pos: TreeNodeResult, val neg: TreeNodeResult, val value: WeightedVarianceResult) : TreeNodeResult

Immutable split-node snapshot.

data class UcbForestPosterior(val priorWeight: Double = 1.0, val priorVariance: Double = 1.0) : ForestPosterior

Forest counterpart to UcbTreePosterior.

data class UcbTreePosterior(val priorWeight: Double = 1.0, val priorVariance: Double = 1.0) : TreePosterior

UCB-style score: mean + exploration * sqrt(variance / (totalWeights + priorWeight)). The sqrt(.) term is the leaf's standard error of the mean; the prior-weight floor keeps the bound finite at empty leaves.
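The score formula translates directly into code. ucbScore is a hypothetical free function taking the leaf summary fields and the exploration multiplier as plain arguments; it is a sketch of the formula, not the class's own method.

```kotlin
import kotlin.math.sqrt

// mean + exploration * sqrt(variance / (totalWeights + priorWeight)).
// priorWeight > 0 keeps the denominator positive for an empty leaf.
fun ucbScore(
    mean: Double,
    variance: Double,
    totalWeights: Double,
    exploration: Double,
    priorWeight: Double = 1.0,
): Double = mean + exploration * sqrt(variance / (totalWeights + priorWeight))
```

For a leaf with mean 2, variance 4, and total weight 3, the default prior weight gives 2 + 1 * sqrt(4 / 4) = 3 at exploration 1.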

@Serializable
@SerialName(value = "VarianceReduction")
data object VarianceReduction : SplitMetric

Mean variance reduction. The classic CART regression criterion.
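A minimal sketch of the criterion, assuming it is the parent variance minus the weight-averaged variance of the two children. Moments and varianceReduction are hypothetical names, and the (total, pos, neg) argument order is a guess.

```kotlin
// Hypothetical summary of one side of a split.
data class Moments(val weight: Double, val variance: Double)

// Parent variance minus the weight-averaged child variances.
// Returns 0 when either side is empty, so value(total, total, empty) == 0.
fun varianceReduction(total: Moments, pos: Moments, neg: Moments): Double {
    if (pos.weight <= 0.0 || neg.weight <= 0.0) return 0.0
    val n = pos.weight + neg.weight
    return total.variance -
        (pos.weight / n) * pos.variance -
        (neg.weight / n) * neg.variance
}
```

For the values 1, 2, 3, 4 split into {1, 2} and {3, 4}, the population variances are 1.25 for the parent and 0.25 for each child, giving a reduction of 1.0.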

Functions

Score every candidate split and return the top-2 + index. Splits that don't meet minSamplesLeaf on both sides or minSamplesSplit in total are skipped.
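The top-2 scan can be sketched as a single pass. Top2 mirrors the shape of SplitInfo but is a local hypothetical type; representing skipped candidates with an eligibility mask, returning -1 when nothing is eligible, and defaulting the runner-up to 0 are all assumptions of this sketch.

```kotlin
// Local stand-in with the same shape as SplitInfo.
data class Top2(val top1: Double, val top2: Double, val bestIndex: Int)

// Single pass over the candidate scores, skipping ineligible candidates
// (for example those failing the minimum-sample checks).
fun topTwo(scores: DoubleArray, eligible: BooleanArray): Top2 {
    var top1 = 0.0
    var top2 = 0.0
    var best = -1
    for (i in scores.indices) {
        if (!eligible[i]) continue
        val s = scores[i]
        if (best == -1 || s > top1) {
            // New leader: the previous leader, if any, becomes the runner-up.
            if (best != -1) top2 = top1
            top1 = s
            best = i
        } else if (s > top2) {
            top2 = s
        }
    }
    return Top2(top1, top2, best)
}
```

The gap top1 - top2 is what a Hoeffding-style test compares against its bound.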

Freeze a live tree node into an immutable snapshot. Internal split aggregates are derived from the snapshotted children so the wire format stays stable even though live splits hold no arm.