stat.regression.tree
Online VFDT decision trees and random forests, plus the shared machinery they're built on. The package covers both regression (continuous y) and classification (y in [0, numClasses)) under one consistent shape.
Picking a tree stat
| Stat | Output | Leaf state |
|---|---|---|
| DecisionTreeRegressionStat | Continuous y prediction | com.eignex.kumulant.stat.summary.WeightedVarianceResult |
| RandomForestRegressionStat | Bagged ensemble of regression trees | One DecisionTreeRegressionStat per tree |
| DecisionTreeClassifierStat | K-way class probabilities | ClassCountsResult |
| RandomForestClassifierStat | Bagged ensemble of classifier trees; predictions average per-class probabilities | One DecisionTreeClassifierStat per tree |
Each takes a list of Split candidates (axis-aligned thresholds or arbitrary ExprSplit expressions) and a config: RegressionTreeConfig on the regression side, ClassificationTreeConfig on the classification side. A split fires when a candidate clears the Hoeffding bound on the configured metric: VarianceReduction for regression, GiniReduction or InformationGain for classification.
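To make the split rule concrete, here is a minimal sketch of a Hoeffding-bound test. It is not the package's implementation: the function names and the `range`, `delta`, and `n` parameters are assumptions chosen for illustration, and the comparison against the runner-up follows the description given for DecisionTreeClassifierStat.

```kotlin
import kotlin.math.ln
import kotlin.math.sqrt

// Hoeffding bound: with probability 1 - delta, the observed mean of n samples
// of a variable spanning `range` lies within epsilon of its true mean.
fun hoeffdingBound(range: Double, delta: Double, n: Double): Double {
    require(range > 0.0 && delta > 0.0 && delta < 1.0 && n > 0.0)
    return sqrt(range * range * ln(1.0 / delta) / (2.0 * n))
}

// Hypothetical split test: fire when the best candidate's score leads the
// runner-up's score by more than epsilon.
fun shouldSplit(best: Double, runnerUp: Double, range: Double, delta: Double, n: Double): Boolean =
    best - runnerUp > hoeffdingBound(range, delta, n)
```

Because epsilon shrinks as `1 / sqrt(n)`, a leaf that sees more observations needs a smaller lead before it commits to a split.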
Split candidates
ThresholdSplit is the standard axis-aligned predicate x[i] <= t. ExprSplit takes any com.eignex.kumulant.schema.BoolExpr for non-axis-aligned splits. This is useful when you want a tree to split on a derived feature (x[0] + x[1] > 0, (x[0] - x[1]).abs() > 1) without materialising the feature in the input vector.
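The two kinds of predicate can be sketched with plain lambdas. This is an illustration only: `SplitSketch` and the helpers below are hypothetical stand-ins, not the package's Split, ThresholdSplit, or BoolExpr types.

```kotlin
import kotlin.math.abs

// Hypothetical stand-in for a split predicate over a feature vector.
fun interface SplitSketch {
    fun test(x: DoubleArray): Boolean
}

// Axis-aligned: true ("pos") when x[featureIndex] <= threshold.
fun thresholdSplit(featureIndex: Int, threshold: Double): SplitSketch =
    SplitSketch { x -> x[featureIndex] <= threshold }

// Derived-feature predicates, evaluated on the raw vector at routing time,
// so the derived value never has to be stored in the input.
val sumPositive = SplitSketch { x -> x[0] + x[1] > 0.0 }
val farApart = SplitSketch { x -> abs(x[0] - x[1]) > 1.0 }
```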
Pass an empty splitCandidates list to disable growth; the stat then degenerates into a single global accumulator. The random forest's config.mtry defaults to ceil(sqrt(p)), Breiman-style.
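As a rough illustration of the default, the sketch below computes ceil(sqrt(p)) and draws a random subset of that size. It assumes p counts the items being subsampled; `defaultMtry` and `sampleCandidates` are hypothetical helper names, not part of the package.

```kotlin
import kotlin.math.ceil
import kotlin.math.sqrt
import kotlin.random.Random

// Default subset size: ceil(sqrt(p)); zero when there is nothing to sample.
fun defaultMtry(p: Int): Int =
    if (p <= 0) 0 else ceil(sqrt(p.toDouble())).toInt()

// Draw `mtry` distinct indices out of 0 until p (hypothetical helper).
fun sampleCandidates(p: Int, mtry: Int, rng: Random): List<Int> =
    (0 until p).shuffled(rng).take(mtry)
```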
Posteriors
TreePosterior (and its forest counterpart ForestPosterior) turn a TreeRegressionResult / ForestRegressionResult into a scalar score:
MeanTreePosterior / MeanForestPosterior: deterministic leaf mean.
ThompsonTreePosterior / ThompsonForestPosterior: Normal-Gamma draw from the leaf's (mean, variance, totalWeights) triplet.
UcbTreePosterior / UcbForestPosterior: UCB-style mean + alpha * sqrt(variance / n).
These plug into com.eignex.kumulant.bandit.contextual.RegressionContextualBandit for non-linear contextual bandits with the same Thompson / UCB / mean choice that the GLM side has via the com.eignex.kumulant.stat.regression.glm.LinearPosterior family.
Merge
All four stats merge approximately through tree.mergeSnapshot(), which combines matching leaf accumulators: a weighted Welford combine per leaf for the regression trees, cell-wise class-count sums for the classifiers, and a tree-by-tree merge for the forests. The combine assumes the tree structures line up (same split history); divergent trees merge only their shared leaves. For distributed training, prefer feeding one stat the ordered stream and treating merge as a roll-up.
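The per-leaf combine can be sketched as follows. This is an assumption-laden illustration, not the package's code: `WeightedVarianceSketch` is a hypothetical (weight, mean, M2) summary, and the formula is the standard pairwise combine for Welford accumulators.

```kotlin
// Hypothetical leaf summary: total weight, running mean, and the sum of
// squared deviations from the mean (M2).
data class WeightedVarianceSketch(val totalWeights: Double, val mean: Double, val m2: Double) {
    val variance: Double
        get() = if (totalWeights > 0.0) m2 / totalWeights else 0.0
}

// Pairwise combine of two Welford summaries for the same leaf.
fun mergeWelford(a: WeightedVarianceSketch, b: WeightedVarianceSketch): WeightedVarianceSketch {
    val w = a.totalWeights + b.totalWeights
    if (w <= 0.0) return WeightedVarianceSketch(0.0, 0.0, 0.0)
    val delta = b.mean - a.mean
    val mean = a.mean + delta * b.totalWeights / w
    val m2 = a.m2 + b.m2 + delta * delta * a.totalWeights * b.totalWeights / w
    return WeightedVarianceSketch(w, mean, m2)
}

// Classifier leaves: cell-wise sum of per-class weighted counts.
fun mergeClassCounts(a: DoubleArray, b: DoubleArray): DoubleArray {
    require(a.size == b.size) { "class-count vectors must have the same length" }
    return DoubleArray(a.size) { a[it] + b[it] }
}
```

Merging the summaries of {1, 2} and {3, 4, 5} this way gives the same weight, mean, and M2 as accumulating all five values in one pass.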
Concurrency
The hot update path touches exactly one accumulator: the leaf the observation routes to. Internal split nodes carry no live arm; subtree aggregates are derived by combining descendants at snapshot time. Each leaf arm honours the com.eignex.kumulant.core.Concurrency level passed in, so threads landing in different leaves never contend. Converting a leaf into a split takes a per-tree lock, acquired only when a split decision fires. See RegressionTree / ClassificationTree for the full concurrency design.
Types
Snapshot of a K-bin weighted class-count vector. Used as the leaf aggregate for ClassificationTree: the per-leaf running tally of how many (weighted) observations of each class landed in this leaf. Exposes derived class probabilities and the argmax prediction.
Series stat over discrete class-index inputs. update(value, weight) interprets value.toInt() as the class index; out-of-range values are dropped. Each class cell is an independent striped atomic adder so updates commute under any Concurrency level.
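A minimal sketch of such a class-count accumulator, assuming the JVM's `java.util.concurrent.atomic.DoubleAdder` as the striped adder. `ClassCountsSketch` and its method names are hypothetical; the package's own types may differ.

```kotlin
import java.util.concurrent.atomic.DoubleAdder

// Hypothetical K-bin weighted class-count accumulator.
class ClassCountsSketch(numClasses: Int) {
    // One independent adder per class, so concurrent updates commute.
    private val cells = Array(numClasses) { DoubleAdder() }

    // Interprets value.toInt() as the class index; out-of-range values are dropped.
    fun update(value: Double, weight: Double = 1.0) {
        val k = value.toInt()
        if (k < 0 || k >= cells.size) return
        cells[k].add(weight)
    }

    fun counts(): DoubleArray = DoubleArray(cells.size) { cells[it].sum() }

    // Derived class probabilities; all zeros while the accumulator is empty.
    fun probabilities(): DoubleArray {
        val c = counts()
        val total = c.sum()
        return if (total > 0.0) DoubleArray(c.size) { c[it] / total } else DoubleArray(c.size)
    }

    // Index of the heaviest class, or -1 when there are no classes.
    fun argmax(): Int {
        val c = counts()
        return c.indices.maxByOrNull { c[it] } ?: -1
    }
}
```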
Audit leaf tracking per-candidate pos/neg class-count accumulators.
Leaf; owns a per-class count accumulator.
Classification mirror of RegressionNode. Identical structure to the regression side except that leaf arms carry class-count snapshots rather than weighted-variance summaries.
Classification analogue of SplitMetric: scores a candidate split using the leaf's per-class counts. Higher is better; returns 0 when the split has no signal (one side empty, or zero impurity reduction).
Split mirror: the predicate routes to pos or neg, and carryover absorbs orphan aggregates produced by mixed-structure merges or pre-split snapshots.
Frozen leaf; no further splits considered.
Classification mirror of RegressionTree: online VFDT decision tree where each leaf carries a per-class count accumulator and audit leaves track class counts per candidate split. Splits fire when a candidate clears the Hoeffding bound on the configured ClassificationSplitMetric (Gini or information gain).
Classification analogue of RegressionTreeConfig. Same tunables, but the split metric defaults to GiniReduction and the criterion is a ClassificationSplitMetric.
Online VFDT decision-tree classifier; the classification counterpart of DecisionTreeRegressionStat. Each leaf carries a ClassCountsResult (per-class weighted counts); splits fire when a candidate beats the runner-up by the Hoeffding bound on Gini reduction or information gain.
Online VFDT decision-tree regressor; a piecewise-constant predictor over the feature space, growing on the fly via the Hoeffding bound. Wraps a RegressionTree in the kumulant RegressionStat contract so it composes with everything that consumes regressors (the bandit family, schemas, op pipelines).
Snapshot of a RandomForestClassifierStat: per-tree immutable snapshots plus ensemble-aware predict helpers.
TreePosterior family ported to forests: every leaf snapshot the query routes to is merged into a single weighted-variance result, then scored with the tree-posterior semantics. Same options, applied to the ensembled leaf.
Snapshot of a RandomForestRegressionStat: per-tree immutable snapshots.
Weighted Gini-impurity reduction. The classic CART classification criterion.
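For reference, the textbook form of this criterion can be sketched as below. The function names and the (pos, neg) count-vector signature are assumptions for illustration; the zero return for an empty side matches the "no signal" convention described for classification split metrics.

```kotlin
import kotlin.math.max

// Gini impurity of a weighted class-count vector: 1 - sum(p_k^2).
fun gini(counts: DoubleArray): Double {
    val n = counts.sum()
    if (n <= 0.0) return 0.0
    var sumSquares = 0.0
    for (c in counts) {
        val p = c / n
        sumSquares += p * p
    }
    return 1.0 - sumSquares
}

// Parent impurity minus the weight-averaged impurity of the two children.
fun giniReduction(pos: DoubleArray, neg: DoubleArray): Double {
    require(pos.size == neg.size) { "class-count vectors must have the same length" }
    val wPos = pos.sum()
    val wNeg = neg.sum()
    if (wPos <= 0.0 || wNeg <= 0.0) return 0.0 // one side empty: no signal
    val total = DoubleArray(pos.size) { pos[it] + neg[it] }
    val w = wPos + wNeg
    val gain = gini(total) - (wPos / w) * gini(pos) - (wNeg / w) * gini(neg)
    return max(0.0, gain) // clamp tiny negative rounding errors
}
```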
Information-gain split criterion: parent entropy minus weighted children entropy.
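A minimal sketch of the same idea in code, assuming entropy is measured in bits (base-2 logarithm); the source does not state the base, and the function names are hypothetical.

```kotlin
import kotlin.math.log2
import kotlin.math.max

// Shannon entropy (in bits) of a weighted class-count vector.
fun entropy(counts: DoubleArray): Double {
    val n = counts.sum()
    if (n <= 0.0) return 0.0
    var h = 0.0
    for (c in counts) {
        if (c > 0.0) {
            val p = c / n
            h -= p * log2(p)
        }
    }
    return h
}

// Parent entropy minus the weight-averaged entropy of the two children.
fun informationGain(pos: DoubleArray, neg: DoubleArray): Double {
    require(pos.size == neg.size) { "class-count vectors must have the same length" }
    val wPos = pos.sum()
    val wNeg = neg.sum()
    if (wPos <= 0.0 || wNeg <= 0.0) return 0.0 // one side empty: no signal
    val total = DoubleArray(pos.size) { pos[it] + neg[it] }
    val w = wPos + wNeg
    val gain = entropy(total) - (wPos / w) * entropy(pos) - (wNeg / w) * entropy(neg)
    return max(0.0, gain)
}
```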
Forest counterpart to MeanTreePosterior.
Score is the leaf's running mean; point estimate, no exploration.
Online random-forest classifier; the classification counterpart of RandomForestRegressionStat. Same diversity tricks (Oza & Russell bagging, per-leaf mtry), but per-tree leaves are ClassCountsResult and ensemble predictions average per-class probabilities across trees.
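Oza & Russell online bagging is usually described as giving each tree a Poisson(1)-distributed replication count for every incoming observation. The sketch below illustrates that idea under this assumption; the helper names are hypothetical and the package's sampler may differ.

```kotlin
import kotlin.math.exp
import kotlin.random.Random

// Knuth's multiplication method for a Poisson(lambda) draw; fine for small lambda.
fun poissonSketch(lambda: Double, rng: Random): Int {
    val limit = exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
        k++
        p *= rng.nextDouble()
    }
    return k
}

// One replication count per tree for a single observation. A tree with
// count 0 skips the observation; otherwise the count scales its weight.
fun bagWeights(numTrees: Int, rng: Random): IntArray =
    IntArray(numTrees) { poissonSketch(1.0, rng) }
```

On average each tree sees every observation once, but the per-tree counts differ, which is what decorrelates the ensemble.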
Online random-forest regressor; a population of RegressionTrees sharing the candidate-split pool. Diversity comes from:
Leaf that tracks per-candidate pos/neg stats. When a candidate clears the Hoeffding-bound test, this leaf is replaced by a RegressionSplitNode. The candidate subset is per-leaf, picked at leaf birth, so mtry-style random subspace selection lives at the leaf level.
Leaf node: the terminus of the tree walk for a given row, and the only node type that owns a live accumulator.
Internal tree node. The hot update path touches only the leaf an observation routes to; internal split nodes are never written by RegressionTree.update. Splits may carry an optional carryover arm: a one-shot snapshot of the pre-split aggregate captured at the moment a leaf converts into a split, plus any orphaned aggregates folded in by mixed-structure merges. Subtree aggregates include the carryover but the hot path never reads or writes it.
Routes by split to either pos (true) or neg (false). The optional carryover holds aggregates that don't structurally belong to either child: the pre-split data frozen at split time, or orphans absorbed from a mixed merge. It is never written by the update hot path and never read by findLeaf or predict, but it is included by subtreeAggregate.
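The division of labour between leaves, splits, and carryover can be sketched with a toy node hierarchy. Everything here is hypothetical (`Agg`, `NodeSketch`, and the function bodies); it uses a simple (weight, sum) aggregate, a threshold predicate, and no concurrency control, purely to show which paths touch which fields.

```kotlin
// Toy aggregate: total weight and weighted sum of y.
data class Agg(val weight: Double, val sum: Double) {
    operator fun plus(o: Agg) = Agg(weight + o.weight, sum + o.sum)
}

sealed class NodeSketch
class LeafSketch(var agg: Agg = Agg(0.0, 0.0)) : NodeSketch()
class SplitNodeSketch(
    val featureIndex: Int,
    val threshold: Double,
    val pos: NodeSketch,
    val neg: NodeSketch,
    val carryover: Agg? = null, // pre-split or orphaned aggregate
) : NodeSketch()

// Routing ignores carryover entirely.
fun findLeaf(node: NodeSketch, x: DoubleArray): LeafSketch = when (node) {
    is LeafSketch -> node
    is SplitNodeSketch ->
        findLeaf(if (x[node.featureIndex] <= node.threshold) node.pos else node.neg, x)
}

// The update path writes only the leaf the row routes to.
fun update(root: NodeSketch, x: DoubleArray, y: Double, weight: Double = 1.0) {
    val leaf = findLeaf(root, x)
    leaf.agg = leaf.agg + Agg(weight, weight * y)
}

// Subtree aggregates are derived on demand: children plus carryover.
fun subtreeAggregate(node: NodeSketch): Agg = when (node) {
    is LeafSketch -> node.agg
    is SplitNodeSketch -> {
        val children = subtreeAggregate(node.pos) + subtreeAggregate(node.neg)
        node.carryover?.let { children + it } ?: children
    }
}
```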
Frozen leaf; no further splits will be considered.
Online VFDT-style decision tree partitioning context vectors. Each leaf carries a weighted-variance accumulator; audit leaves additionally track pos/neg sub-arms per candidate split and, every RegressionTreeConfig.splitPeriod observations, evaluate them against the Hoeffding bound to decide whether to convert themselves into a RegressionSplitNode.
Tunables for RegressionTree growth, shared by DecisionTreeRegressionStat and RandomForestRegressionStat.
Scores a candidate split against a leaf's pre-split distribution. Higher is better. Returned score must satisfy value(total, total, empty) == 0 so that "no signal" is always last in the ranking.
Forest counterpart to ThompsonTreePosterior.
Thompson sampling over the leaf's Normal-Gamma posterior. Given the leaf's pseudo-count n, sample mean m, and sample variance v, draws are mu ~ N(m, exploration * v / max(n, 1)), i.e. the posterior on the leaf mean assuming a Normal-Gamma conjugate with a weak prior. exploration = 0.0 collapses to the leaf mean.
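The draw described above can be sketched in a few lines. The function name and parameter order are assumptions; the Gaussian sample comes from `java.util.Random.nextGaussian()`.

```kotlin
import java.util.Random
import kotlin.math.max
import kotlin.math.sqrt

// One draw mu ~ N(m, exploration * v / max(n, 1)) for a single leaf.
fun thompsonScore(mean: Double, variance: Double, n: Double, exploration: Double, rng: Random): Double {
    val sd = sqrt(exploration * variance / max(n, 1.0))
    return mean + sd * rng.nextGaussian()
}
```

With `exploration = 0.0` the standard deviation is zero, so the score is exactly the leaf mean; the `max(n, 1)` floor keeps the variance finite for leaves with fewer than one unit of weight.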
Route by row[featureIndex] <= threshold. Threshold is inclusive on the "pos" side.
Immutable classification leaf-node snapshot.
Snapshot of a single classification-tree node; split or leaf.
Classification mirror of TreeRegressionResult.
Immutable classification split-node snapshot.
Immutable leaf-node snapshot.
Snapshot of a single tree node; split or leaf.
RegressionTree-aware scorer: routes the query x to a leaf snapshot and turns its weighted-variance summary into a single Double. Parallels the linear-side com.eignex.kumulant.stat.regression.glm.LinearPosterior family for the tree regressor shape.
Immutable snapshot of a RegressionTree at read time. Carries the tree structure (split predicates + per-node weighted-variance aggregates) so callers can route a context vector to its leaf without reaching back into the live stat.
Immutable split-node snapshot.
Forest counterpart to UcbTreePosterior.
UCB-style score: mean + exploration * sqrt(variance / (totalWeights + priorWeight)). The sqrt(.) term is the leaf's standard error of the mean; the prior-weight floor keeps the bound finite at empty leaves.
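As a direct transcription of the formula above (the function name and parameter order are assumptions):

```kotlin
import kotlin.math.sqrt

// mean + exploration * sqrt(variance / (totalWeights + priorWeight))
fun ucbScore(
    mean: Double,
    variance: Double,
    totalWeights: Double,
    exploration: Double,
    priorWeight: Double,
): Double = mean + exploration * sqrt(variance / (totalWeights + priorWeight))
```

For example, with mean 1.0, variance 4.0, totalWeights 15.0, priorWeight 1.0, and exploration 2.0, the bonus is 2.0 * sqrt(4.0 / 16.0) = 1.0, giving a score of 2.0. A positive priorWeight keeps the denominator non-zero when totalWeights is 0.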
Mean variance reduction. The classic CART regression criterion.
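A sketch of the textbook criterion: parent variance minus the weight-averaged variance of the two children. `VarSummary` and `varianceReduction` are hypothetical names, and the (total, pos, neg) argument order is an assumption taken from the value(total, total, empty) == 0 contract stated for split metrics.

```kotlin
import kotlin.math.max

// Hypothetical leaf summary: weight, mean, and sum of squared deviations (M2).
data class VarSummary(val weight: Double, val mean: Double, val m2: Double) {
    val variance: Double
        get() = if (weight > 0.0) m2 / weight else 0.0
}

// Variance of the parent minus the weighted mean of the children's variances.
fun varianceReduction(total: VarSummary, pos: VarSummary, neg: VarSummary): Double {
    if (total.weight <= 0.0) return 0.0
    val children = (pos.weight / total.weight) * pos.variance +
        (neg.weight / total.weight) * neg.variance
    return max(0.0, total.variance - children) // clamp rounding noise
}
```

Passing the parent itself as one child and an empty summary as the other yields exactly 0, so a split that separates nothing ranks last.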
Functions
Score every candidate split and return the top-2 + index. Splits that don't meet minSamplesLeaf on both sides or minSamplesSplit in total are skipped.
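The selection step might look like the sketch below. The data classes, the use of weights as sample counts, and the choice to report a runner-up of 0.0 when only one candidate is eligible are all assumptions made for illustration.

```kotlin
// Hypothetical per-candidate summary: weight on each side plus its metric score.
data class CandidateStats(val posWeight: Double, val negWeight: Double, val score: Double)

// Hypothetical result: index and score of the winner, plus the runner-up score.
data class TopTwo(val bestIndex: Int, val best: Double, val runnerUp: Double)

fun topTwo(candidates: List<CandidateStats>, minSamplesLeaf: Double, minSamplesSplit: Double): TopTwo? {
    var bestIndex = -1
    var best = Double.NEGATIVE_INFINITY
    var second = Double.NEGATIVE_INFINITY
    for ((i, c) in candidates.withIndex()) {
        // Skip candidates that leave either side, or the leaf as a whole, too thin.
        if (c.posWeight < minSamplesLeaf || c.negWeight < minSamplesLeaf) continue
        if (c.posWeight + c.negWeight < minSamplesSplit) continue
        if (c.score > best) {
            second = best
            best = c.score
            bestIndex = i
        } else if (c.score > second) {
            second = c.score
        }
    }
    if (bestIndex < 0) return null // nothing eligible
    // Assumption: with a single eligible candidate, compare against "no signal" (0).
    val runnerUp = if (second == Double.NEGATIVE_INFINITY) 0.0 else second
    return TopTwo(bestIndex, best, runnerUp)
}
```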
Freeze a live tree node into an immutable snapshot. Internal split aggregates are derived from the snapshotted children so the wire format stays stable even though live splits hold no arm.