stat.sketch
Structural queries on a stream that aren't shaped like a cardinality or a quantile. Four members; each answers a different question.
Picking a sketch
| Stat | Result | Question |
|---|---|---|
| BloomFilterStat | BloomFilterResult | "Have I seen this key before?" Membership only, one-sided false positives, no per-key counts. |
| CountMinSketchStat | CountMinSketchResult | "How many times has this key appeared?" Approximate per-key counts; biased upward by collisions, exact for heavy items. |
| SpaceSavingStat | HeavyHittersResult | "What are the top K heaviest keys?" Long-tail problem: out of millions of distinct items, a handful might dominate volume. |
| MinHashStat | MinHashResult | "How similar are two sets?" Jaccard similarity sketch; estimates overlap without comparing the sets directly. |
All four take a stream of opaque Long keys through com.eignex.kumulant.core.DiscreteStat. As with com.eignex.kumulant.stat.cardinality, hash domain-specific keys through com.eignex.kumulant.math.hash64 first so the input carries uniform 64-bit entropy.
When to pick what
BloomFilter when you need yes/no membership at the absolute cheapest memory per key. A "yes" may be a false positive (rate controlled by configured bits + hashes); a "no" is always exact. Standard tool for deduplication of seen-already keys.
CountMinSketch when you need per-key counts but an exact map would blow memory. Counts are biased upward by collisions but exact for heavy items, which is fine for top-K-like questions and rate monitoring.
SpaceSaving when you only need the K heaviest keys: viral URLs, hot database rows, top users. Cheaper than a full map: keeps K counters and guarantees that any key whose true frequency exceeds the K-th largest is in the retained set.
MinHash for near-duplicate detection over streams. Standard Jaccard-similarity sketch; two sets' MinHash signatures collide on hash
jwith probability equal to their Jaccard index.
Merge
BloomFilterStat merges via bitwise OR. Exact for same-sized bitsets.
CountMinSketchStat merges via cell-wise sum. Same-shape sketches required.
SpaceSavingStat merges by re-priority over the union of counters.
MinHashStat merges via cell-wise min over signature arrays. Exact Jaccard reproduction.
All four families work cleanly in distributed pipelines: workers track slices, ship snapshots, the coordinator merges.
Concurrency
Each family decomposes updates into independent atomic operations on single cells (BloomFilter / CountMin / MinHash) or a small bounded set of cells (SpaceSaving). Lock-free and exact under every com.eignex.kumulant.core.Concurrency level.
Types
Bloom-filter snapshot. words is the bitset packed as bits / 64 longs; merging two snapshots requires identical bits and hashes.
Bloom filter - probabilistic set-membership test with no false negatives. bits bits are split across hashes positions per insert, derived from the hasher (default SplitMix64) via the Kirsch-Mitzenmacher double-hashing scheme.
CountStat-MinStat sketch snapshot. counters is the depth x width matrix of counters in row-major order. seed determines the per-row hash salts; merging two snapshots requires identical depth, width, and seed. totalSeen is the unweighted update count.
CountStat-MinStat sketch - a probabilistic frequency estimator over a depth x width matrix of counters. Each update hashes the value with depth independent salts (derived from seed) through the hasher (default SplitMix64) and increments one counter per row; the estimated count for any value is the minimum counter across rows.
Space-Saving heavy-hitters snapshot. keys, counts, errors are parallel arrays of length <= capacity; for each tracked key, counts is the (over)estimated weighted count and errors is the Space-Saving overestimate bound (the count is at most this much above the true count). totalSeen is the unweighted update count.
MinHash signature snapshot. signatures is the per-hash running minimum of splitmix64(value xor splitmix64(seed + i)) over all updates; merging two snapshots takes element-wise min and requires identical numHashes and seed.
MinHash signature - for each of numHashes independent hash functions (salted by splitmix64(seed + i)), maintain the running minimum hash over all inserted values. The Jaccard similarity between two sets is estimated by the fraction of slots whose signatures agree (see jaccard).
Heavy-hitters tracker; keeps the capacity most-frequent keys with one-sided overestimate guarantees on their counts.
Functions
True iff every bit set during an update(value) is still set in words. Re-derives the bit positions with the same LongHasher the filter used (resolved by name via Hashers), so a custom hasher must be registered before querying.
Estimated weighted count of value - the minimum across rows. Re-derives the per-row index with the same LongHasher the sketch used (resolved by name via Hashers), so a custom hasher must be registered before querying.
Estimated Jaccard similarity between the two underlying sets - the fraction of slots where signatures agree. Requires matching numHashes and seed.