Performance Baselines for Warehouse Robotics: Telemetry, OLAP, and Alerting
Practical ClickHouse-style telemetry and alert thresholds for warehouse robotics preprod tests—schema, queries, and gating strategies for 2026.
Why your preprod robotic fleet keeps surprising you
Environment drift, noisy telemetry, and vague alerting thresholds turn every preprod run into a guessing game. Teams ship robots that pass unit tests but fail under realistic warehouse load: late picks, navigation hiccups, network stalls, and battery depletion that only show up at scale. If that sounds familiar, this article gives you a concrete, ClickHouse-style OLAP telemetry schema and a set of practical alerting thresholds tailored for warehouse robotics preprod tests.
What you get in this article
- A production-ready ClickHouse schema for raw telemetry and pre-aggregates.
- Retention, partitioning, and downsampling rules that control cost in ephemeral preprod clusters.
- Concrete alerting thresholds (absolute, relative, and statistical) for common robotic failure modes.
- Queries and example materialized views you can copy-paste into your CI/CD preprod pipeline.
Why ClickHouse-style OLAP matters for robotics telemetry in 2026
Warehouse automation moved fast in 2025–2026. Vendors and integrators are assembling fleets of heterogeneous robots, converging on telemetry-driven operations. OLAP systems like ClickHouse—which saw major market momentum and funding interest into 2026—are now a pragmatic choice for high-cardinality, high-ingest telemetry because they let you:
- Store raw event streams at high throughput (hundreds of thousands of events/sec).
- Run low-latency percentile and histogram queries across fleet, zones, or test runs.
- Materialize aggregates for alerting and dashboarding without sacrificing raw-data fidelity for forensic analysis.
“By 2026, warehouse automation is no longer standalone; it’s data-driven. Use OLAP to turn telemetry into repeatable preprod gates.”
Design principles for preprod telemetry
Before we jump to schema and thresholds, adopt these principles for preprod telemetry:
- Raw first, aggregate later: Keep full-fidelity events for 7–30 days to support postmortems; downsample older data.
- Partition for test context: Tag every event with test_run_id, run_type (smoke/soak/stress/chaos), and warehouse_layout_id.
- Store both continuous and discrete telemetry: sensor streams (battery, odometry), event streams (task_started, task_completed), and health/diagnostics (CPU, comms).
- Keep cardinality manageable: LowCardinality(String) columns for firmware, model, and software version reduce storage and speed up queries.
- Control costs for ephemeral preprod clusters: TTLs, downsampled materialized views, and ephemeral ClickHouse clusters with autoscale.
ClickHouse-style telemetry schema (practical, copyable)
The schema below is optimized for high-ingest raw events, per-minute pre-aggregates, and compact storage for longer retention. Use MergeTree engines with sensible partitioning and TTLs.
1) Raw events table (high cardinality, short retention)
CREATE TABLE telemetry_raw (
ts DateTime64(3),
warehouse_id UInt32,
warehouse_layout_id UInt32,
test_run_id UUID,
run_type LowCardinality(String), -- smoke, soak, stress, chaos, canary
robot_id UInt64,
robot_model LowCardinality(String),
firmware_version LowCardinality(String),
event_type LowCardinality(String), -- e.g., sensor, navigation, task, diag
event_subtype LowCardinality(String), -- e.g., odometry, battery, localization
value Float64, -- primary numeric payload
detail String, -- JSON blob for raw payload if needed
tags Array(LowCardinality(String))
) ENGINE = MergeTree()
PARTITION BY (warehouse_id, toYYYYMMDD(ts))
ORDER BY (warehouse_id, robot_id, ts)
TTL ts + INTERVAL 30 DAY -- raw kept short for preprod
SETTINGS index_granularity = 8192;
Notes: keep raw retention short in preprod (30 days recommended). For long-term baselines, materialize aggregated tables (below).
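To make the schema concrete, here is a minimal sketch of what a producer on the robot side might emit for one telemetry_raw row. The field names mirror the table above; the model name, firmware version, and the battery_event helper are illustrative assumptions, not part of any real SDK.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class TelemetryEvent:
    """One row destined for telemetry_raw; field names mirror the schema above."""
    ts: str
    warehouse_id: int
    warehouse_layout_id: int
    test_run_id: str
    run_type: str          # smoke | soak | stress | chaos | canary
    robot_id: int
    robot_model: str
    firmware_version: str
    event_type: str        # sensor | navigation | task | diag
    event_subtype: str     # odometry | battery | localization | ...
    value: float
    detail: str = "{}"     # JSON blob for the raw payload, if needed
    tags: list = field(default_factory=list)

def battery_event(robot_id: int, pct: float, run_id: str) -> TelemetryEvent:
    """Hypothetical convenience constructor for a battery sample."""
    return TelemetryEvent(
        ts=datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        warehouse_id=1, warehouse_layout_id=1, test_run_id=run_id,
        run_type="soak", robot_id=robot_id, robot_model="amr-250",  # assumed model name
        firmware_version="4.2.1", event_type="sensor", event_subtype="battery",
        value=pct,
    )

ev = battery_event(42, 87.5, str(uuid.uuid4()))
row = json.dumps(asdict(ev))  # ship as JSONEachRow to your ClickHouse insert path
```

Serializing to JSONEachRow keeps the producer decoupled from the table engine; batching rows before insert is strongly advisable at these ingest rates.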
2) Per-robot per-minute aggregates (fast queries & alerts)
CREATE TABLE telemetry_agg_minute (
minute_ts DateTime,
warehouse_id UInt32,
test_run_id UUID,
robot_id UInt64,
run_type LowCardinality(String),
model LowCardinality(String),
battery_min Float32,
battery_avg Float32,
battery_max Float32,
nav_error_p50 Float32,
nav_error_p95 Float32,
task_duration_p50 Float32,
task_duration_p95 Float32,
task_failures UInt32,
msgs_sent UInt32,
msgs_dropped UInt32
) ENGINE = MergeTree() -- not SummingMergeTree: merges would sum min/max/quantile columns
PARTITION BY toYYYYMMDD(minute_ts)
ORDER BY (warehouse_id, minute_ts, robot_id)
TTL minute_ts + INTERVAL 90 DAY;
-- Materialized view to populate this table from telemetry_raw would compute quantiles using aggregate states
Use materialized views to stream aggregates from raw. Keep longer retention here (90 days) for trend baselines.
3) Fleet-level daily baselines
CREATE TABLE telemetry_baseline_daily (
day Date,
warehouse_id UInt32,
layout_id UInt32,
model LowCardinality(String),
run_type LowCardinality(String),
metric_name LowCardinality(String),
p50 Float32,
p90 Float32,
p95 Float32,
p99 Float32,
avg Float32,
stddev Float32
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(day)
ORDER BY (warehouse_id, day, model, metric_name)
TTL day + INTERVAL 365 DAY;
This table is the authoritative baseline store you consult while gating preprod releases.
Materialized view examples (streaming aggregates)
Populate telemetry_agg_minute from raw events. The example computes digest-based quantiles for latency-like metrics.
CREATE MATERIALIZED VIEW mv_agg_minute TO telemetry_agg_minute AS
SELECT
toStartOfMinute(ts) AS minute_ts,
warehouse_id,
test_run_id,
robot_id,
run_type,
robot_model AS model,
minIf(value, event_subtype = 'battery') AS battery_min,
avgIf(value, event_subtype = 'battery') AS battery_avg,
maxIf(value, event_subtype = 'battery') AS battery_max,
quantileTDigestIf(0.50)(value, event_subtype = 'nav_error') AS nav_error_p50,
quantileTDigestIf(0.95)(value, event_subtype = 'nav_error') AS nav_error_p95,
quantileTDigestIf(0.50)(value, event_subtype = 'task_duration') AS task_duration_p50,
quantileTDigestIf(0.95)(value, event_subtype = 'task_duration') AS task_duration_p95,
countIf(event_type = 'task' AND event_subtype = 'failure') AS task_failures,
sumIf(value, event_subtype = 'msg_sent') AS msgs_sent,
sumIf(value, event_subtype = 'msg_dropped') AS msgs_dropped
FROM telemetry_raw
GROUP BY minute_ts, warehouse_id, test_run_id, robot_id, robot_model, run_type;
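Before trusting a materialized view in a gating pipeline, it helps to cross-check its output against an independent computation. The sketch below re-implements the per-minute aggregation in plain Python (nearest-rank p95 instead of t-digest, which is itself approximate); it is a validation aid under those assumptions, not a replacement for the MV.

```python
import statistics
from collections import defaultdict

def aggregate_minute(events):
    """Group raw events by (minute, robot) and compute aggregates analogous
    to mv_agg_minute. `events` is an iterable of dicts with epoch-second
    'ts', plus 'robot_id', 'event_subtype', and 'value' keys."""
    buckets = defaultdict(list)
    for e in events:
        buckets[(e["ts"] // 60 * 60, e["robot_id"])].append(e)
    out = {}
    for key, evs in buckets.items():
        battery = [e["value"] for e in evs if e["event_subtype"] == "battery"]
        nav = sorted(e["value"] for e in evs if e["event_subtype"] == "nav_error")
        out[key] = {
            "battery_min": min(battery) if battery else None,
            "battery_avg": statistics.fmean(battery) if battery else None,
            "battery_max": max(battery) if battery else None,
            # nearest-rank p95; the MV's t-digest is approximate as well
            "nav_error_p95": nav[min(len(nav) - 1, int(0.95 * len(nav)))] if nav else None,
            "msgs_dropped": sum(e["value"] for e in evs if e["event_subtype"] == "msg_dropped"),
        }
    return out
```

Run it over a sampled window of raw events exported from telemetry_raw and diff against telemetry_agg_minute; large discrepancies usually mean a filter or grouping bug in the MV.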
How to compute baselines
Baselines should be derived from representative test runs. For preprod, use your last N soak/stress runs and compute percentiles per metric:
INSERT INTO telemetry_baseline_daily
SELECT
toDate(ts) AS day,
warehouse_id,
warehouse_layout_id AS layout_id,
robot_model AS model,
run_type,
'task_duration' AS metric_name,
quantileExact(0.50)(value) AS p50,
quantileExact(0.90)(value) AS p90,
quantileExact(0.95)(value) AS p95,
quantileExact(0.99)(value) AS p99,
avg(value) AS avg,
stddevPop(value) AS stddev
FROM telemetry_raw
WHERE event_subtype = 'task_duration'
AND run_type IN ('soak', 'stress')
GROUP BY day, warehouse_id, layout_id, model, run_type;
Compute this from telemetry_raw rather than the per-minute aggregates: exact quantiles need individual task durations (a percentile of per-minute p95s is a different statistic), and recent runs are still within the 30-day raw TTL.
Alerting: threshold types and example rules
Use three complementary threshold types in preprod gating:
- Absolute thresholds — obvious safety limits. Trigger immediately.
- Relative thresholds — compare to the daily baseline (percent delta).
- Statistical/anomaly thresholds — rolling z-score, EWMA or quantile exceedance to catch gradual regressions.
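The three threshold types above can be combined into a single evaluator. This is a minimal sketch; the default multipliers (1.20x relative, z > 3) follow the starting points in this article and should be tuned per metric.

```python
import statistics

def evaluate(observed: float, baseline: float, history: list,
             abs_limit: float, rel_factor: float = 1.20, z_limit: float = 3.0) -> str:
    """Combine absolute, relative, and statistical thresholds into one verdict.
    Returns 'critical', 'warning', or 'ok'. Defaults are illustrative."""
    if observed > abs_limit:                            # absolute safety limit
        return "critical"
    if baseline and observed > baseline * rel_factor:   # relative to daily baseline
        return "warning"
    if len(history) >= 2:                               # rolling z-score on recent samples
        mu = statistics.fmean(history)
        sigma = statistics.pstdev(history)
        if sigma > 0 and (observed - mu) / sigma > z_limit:
            return "warning"
    return "ok"
```

Absolute limits short-circuit to critical because they encode safety, not performance; the relative and statistical checks only ever escalate to warning here, leaving sustained-breach logic to decide when a warning becomes blocking.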
Common metrics and recommended thresholds
Below are practical thresholds you can use as starting points. Tailor them to your robot class, layout, and SLA.
- Task completion time (per-task duration):
- Alert (warning): 95th percentile > baseline_95 * 1.20 for 10 minutes.
- Alert (critical): 95th percentile > baseline_95 * 1.50 OR absolute 95th > 2x SLA (e.g., 60s).
- Task failure rate:
- Warning: failure_rate > baseline + 3 percentage points for a 1-hour window.
- Critical: failure_rate > 5% sustained for 10 minutes.
- Navigation/localization error (distance off-path):
- Warning: p95 > baseline_p95 + 0.5m AND p95 > 1.5x baseline_p95.
- Critical: p95 > 2m or robot marked stuck > 5 incidents in 15 minutes.
- Battery depletion rate:
- Warning: avg discharge rate > baseline_rate * 1.25 for 30 minutes.
- Critical: drop > 20% battery within 10 minutes or unexpected shutdown.
- Comms / message drop (network health):
- Warning: msgs_dropped / msgs_sent > 0.02 (2%) aggregated across fleet for 5 minutes.
- Critical: single robot msgs_dropped / msgs_sent > 0.10 OR RTT > 200ms p95.
- CPU / thermal anomalies:
- Warning: CPU p95 > 85% for 10 minutes.
- Critical: CPU > 95% OR temp > manufacturer thermal limit.
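Most of the rules above include a duration clause ("for 10 minutes", "sustained"). A small sketch of that persistence logic, assuming one evaluated sample per minute:

```python
from collections import deque

class SustainedBreach:
    """Fire only after the condition holds for `minutes` consecutive samples,
    which is what 'p95 > X for 10 minutes' means in the rules above."""
    def __init__(self, minutes: int):
        self.window = deque(maxlen=minutes)

    def update(self, breached: bool) -> bool:
        self.window.append(breached)
        # Alarm only when the window is full and every sample breached
        return len(self.window) == self.window.maxlen and all(self.window)
```

A single clean sample resets the streak, which is exactly the debouncing you want against noisy per-minute percentiles.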
Implementing these alerts in practice
Compute the relevant aggregate (per-minute or per-5-minute) in ClickHouse and forward the breaching keys to your alerting system. ClickHouse does not support correlated subqueries, so join the baseline table instead. Example query to evaluate the 95th-percentile rule:
SELECT
toStartOfMinute(a.minute_ts) AS m,
a.warehouse_id,
max(a.task_duration_p95) AS fleet_p95, -- worst per-minute p95 in the window
any(b.p95) AS baseline_p95
FROM telemetry_agg_minute AS a
INNER JOIN telemetry_baseline_daily AS b ON b.warehouse_id = a.warehouse_id
WHERE a.minute_ts > now() - INTERVAL 15 MINUTE
AND b.day = yesterday()
AND b.metric_name = 'task_duration'
-- add model/run_type predicates to pin a single baseline row
GROUP BY m, a.warehouse_id
HAVING fleet_p95 > baseline_p95 * 1.20;
Feed the results to Alertmanager / Opsgenie or your CI gating system. In CI/CD, block merge if a preprod canary crosses critical thresholds for N minutes.
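For the Alertmanager path, the query results map naturally onto its v2 alert format: a JSON array of objects with labels and annotations, POSTed to /api/v2/alerts. A minimal payload builder, assuming our own label naming convention:

```python
import json
from datetime import datetime, timezone

def to_alertmanager(warehouse_id: int, fleet_p95: float, baseline_p95: float) -> str:
    """Build an Alertmanager v2 payload for one breaching warehouse.
    Label names are our own convention; POST the result to /api/v2/alerts."""
    alert = {
        "labels": {
            "alertname": "TaskDurationP95Regression",
            "severity": "warning",
            "warehouse_id": str(warehouse_id),  # label values must be strings
        },
        "annotations": {
            "summary": f"fleet p95 {fleet_p95:.1f}s > 1.2x baseline {baseline_p95:.1f}s",
        },
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps([alert])  # the endpoint accepts a JSON array of alerts
```

Route on the severity label in Alertmanager so the same pipeline can emit both warning-level pages and CI-blocking critical alerts.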
Preprod test architecture and gating strategies
Design preprod tests to be repeatable and meaningful:
- Smoke: Basic nav & task flow with 1–3 robots to catch immediate regressions.
- Soak: Longer runs (4–24 hours) at nominal load to detect resource leaks and battery degradation.
- Stress / scale: Run at 1.5–2x nominal load to measure queuing and scheduler behavior.
- Chaos: Inject network partitions, sensor faults, and abrupt reboots to validate resilience.
Gate merges with tiered policies:
- Developer merge: pass smoke tests + no critical alerts.
- Release candidate: pass soak + stress tests with all alerts at warning or lower.
- Production promotion: pass soak, stress, and chaos tests with no critical alerts for 24 hours.
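The tiered policies above are simple enough to encode as data, which keeps the gate auditable in CI. A sketch, with the tier names and the hold-hours field as our own convention:

```python
# Gate policies from the tiers above. 'hold_hours' is how long the run must
# stay free of critical alerts before the tier passes.
POLICIES = {
    "developer_merge":      {"runs": {"smoke"},                   "hold_hours": 0},
    "release_candidate":    {"runs": {"soak", "stress"},          "hold_hours": 0},
    "production_promotion": {"runs": {"soak", "stress", "chaos"}, "hold_hours": 24},
}

def gate(tier: str, passed_runs, had_critical: bool, hours_without_critical: float) -> bool:
    """True if the change may proceed past `tier`."""
    p = POLICIES[tier]
    return (p["runs"] <= set(passed_runs)          # all required run types passed
            and not had_critical                   # no critical alerts at all
            and hours_without_critical >= p["hold_hours"])
```

Warnings intentionally do not block here; per the release-candidate tier, they are tolerated as long as nothing escalates to critical.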
Cost controls for telemetry in ephemeral preprod clusters
Preprod environments must be cheap. Use these tactics:
- Short raw retention: 7–30 days depending on your postmortem window.
- Downsample and compress: Keep per-minute aggregates for 90 days and daily baselines for 1 year.
- Ephemeral OLAP clusters: Spin up ClickHouse nodes for heavy test windows, scale down or snapshot after runs.
- Sampling for low-value telemetry: Use SAMPLE BY for noisy channels (e.g., high-frequency IMU) and preserve critical channels (events, task outcomes) in full.
- Use TTL for auto-cleanup: Tables above demonstrate TTL usage—enforce retention by policy, not manual ops.
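A quick back-of-envelope helps size the raw tier before picking a TTL. The sketch below assumes a compression ratio of around 8x, which is a plausible range for ClickHouse column compression on telemetry but very much workload-dependent:

```python
def raw_storage_gib(events_per_sec: float, bytes_per_row: float,
                    retention_days: int, compression_ratio: float = 8.0) -> float:
    """Back-of-envelope on-disk size for telemetry_raw.
    compression_ratio is an assumption; measure yours before budgeting."""
    raw_bytes = events_per_sec * 86_400 * retention_days * bytes_per_row
    return raw_bytes / compression_ratio / 2**30

# e.g. 50 robots x 200 events/sec, ~120 bytes/row, 30-day TTL:
# raw_storage_gib(10_000, 120, 30)  -> roughly 362 GiB on disk
```

Rerun the estimate whenever you change sampling rates or add channels; raw-tier cost scales linearly with all three inputs.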
Anomaly detection and advanced strategies
Percentile baselines catch many regressions, but combine them with:
- Rolling z-score / EWMA: Detect slow drift in task durations or battery drain.
- CUSUM for step changes: Good for detecting sudden regressions after a deployment.
- Seasonal decomposition: Account for time-of-day traffic patterns in fulfillment centers.
- Model-backed baselines: Use small regression models (features: warehouse_layout, robot_model, load_percentage) to predict expected p95 and compare to observed.
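Two of these detectors fit in a few lines each. A minimal sketch of EWMA drift tracking and one-sided CUSUM; the smoothing factor alpha and the CUSUM slack k and alarm level h are tuning assumptions, typically set relative to the metric's baseline stddev:

```python
class EwmaDrift:
    """Exponentially weighted moving average for slow drift
    (e.g., battery drain creeping up over a soak run)."""
    def __init__(self, alpha: float = 0.1):
        self.alpha, self.mean = alpha, None

    def update(self, x: float) -> float:
        self.mean = x if self.mean is None else self.alpha * x + (1 - self.alpha) * self.mean
        return self.mean

class Cusum:
    """One-sided CUSUM for step changes after a deployment: accumulate
    positive deviations beyond slack k; alarm once the sum exceeds h."""
    def __init__(self, target: float, k: float = 0.5, h: float = 5.0):
        self.target, self.k, self.h, self.s = target, k, h, 0.0

    def update(self, x: float) -> bool:
        self.s = max(0.0, self.s + (x - self.target - self.k))
        return self.s > self.h
```

CUSUM fires fast on a sustained step (a regression right after a deploy) while staying quiet on single spikes; EWMA is the complement for gradual drift that never trips a step detector.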
ClickHouse can precompute features at ingest; export feature windows to lightweight anomaly engines or run simple calculations inside ClickHouse using array and window functions.
Case example: One-week soak run baseline
Scenario: 50 robots in a single layout, one-week soak. Steps to produce baselines:
- Run the soak with a representative SKU mix and traffic; ensure every event carries a test_run_id tag.
- Ingest raw telemetry into telemetry_raw with ts and the relevant tags.
- Materialize minute aggregates into telemetry_agg_minute.
- Compute daily baselines into telemetry_baseline_daily for the 7 days.
- Set alerts based on the p95 thresholds and run a 48-hour hold to ensure stability.
Outcome: After the soak run, your team will have a stable p95 baseline for task durations, a reliable battery-discharge profile for the robot model, and a fleet-level message-drop rate to compare against future deployments.
Practical checklist before gating production
- A reproducible test-run harness tags all telemetry with test_run_id and run_type.
- Raw events are retained for N days; per-minute aggregates are computed.
- Baselines exist for the last 7–30 representative runs.
- Alerts use a combination of absolute and relative thresholds and are wired into CI gating.
- Cost controls (TTL, downsampling) are enforced for preprod clusters.
Future predictions — what to prepare for in 2026 and beyond
Expect these trends through 2026:
- OLAP-native anomaly detection: OLAP vendors will add built-in anomaly detection primitives (quantile-change detectors) to support streaming baselines.
- Federated baselines: Teams will compare baselines across layouts and vendors to build transfer-learning models for new warehouses.
- Edge-first telemetry aggregation: More pre-aggregation will happen on the robot/edge to reduce bandwidth, sending summarized deltas to OLAP.
Actionable takeaways
- Adopt an OLAP-first approach for fleet telemetry in preprod to run fast percentile and histogram queries.
- Keep raw data short-lived in preprod; rely on aggregated materialized views for baseline retention.
- Use multi-tier alerts: absolute + relative + statistical to avoid false positives and detect regressions early.
- Define gating rules that block merges when critical thresholds are breached during preprod tests.
Closing: Turn telemetry into repeatable preprod confidence
Warehouse automation teams in 2026 are judged by how reliably their fleets operate in production—long before that first production deployment. A ClickHouse-style OLAP telemetry pipeline gives you the tools to define reproducible baselines, enforce meaningful alerts, and control telemetry costs in ephemeral preprod environments. The schema and thresholds above are a starting point—adapt them to your robot classes, warehouse layouts, and business SLAs.
Call to action
Ready to bootstrap your preprod telemetry? Download the ClickHouse schema, materialized view scripts, and alerting templates from our repository and run them in a disposable preprod cluster. If you want help translating these baselines into CI/CD gates or need a cost-audited retention plan for your telemetry, contact our preprod engineering team to schedule a workshop.