GPU infrastructure reliability

The reliability layer
for production
GPU clusters.

Harbor correlates signals across your entire stack — hardware, network, and ML framework — to diagnose training failures and inference degradation in seconds instead of hours.

70%
of GPU "failures"
are framework issues
<60s
NCCL timeout
root-caused
silent
inference degradation
made visible
DCGM · GPU temperature nominal · all nodes
NCCL collective timeout — cross-layer root cause in progress
NVLink bandwidth within baseline
PyTorch hang — traced to InfiniBand port flap on rank 7
vLLM TTFT degrading · KV cache pressure detected
eBPF kernel hooks active · 0 anomalies
GPU 3 thermal throttle — correlated with tokens/sec drop
Inference P99 latency baseline drift · +340ms vs 24h avg
NVML ECC memory errors · 0 · all nodes
KV cache eviction rate elevated · batch size mismatch suspected

Three layers.
No one watching all of them.

Whether a training job crashes or inference quietly degrades, your team opens three different dashboards — GPU metrics, network logs, framework traces — and manually connects the dots. The real culprit is almost always a cross-layer interaction that nobody can see.

  • 01 · Misattributed training failures
    70–80% of apparent GPU hardware failures are actually framework or network issues. Teams replace hardware that isn't broken and restart runs that were minutes from finishing.
  • 02 · Silent inference degradation
    Tokens per second quietly drops while GPU utilization looks normal. KV cache exhaustion, thermal throttling, and suboptimal batching hide behind healthy-looking dashboards.
  • 03 · Hours to diagnose NCCL timeouts
    The most common distributed training failure takes 2–6 hours to root-cause without cross-stack visibility. Harbor does it in under 60 seconds.

Every layer.
One correlation engine.

L3
ML Framework (PYTORCH · VLLM · NEMO · CUDA GRAPHS)
NCCL Flight Recorder · vLLM KV cache · PyTorch hooks
Signals we instrument
NCCL collective operations + timeout events
vLLM KV cache hit rate, eviction rate, exhaustion
Time to first token (TTFT) baseline drift
Inter-token latency + P99 degradation
Tokens/sec quiet drops vs 24h rolling baseline
PyTorch distributed rank hangs + gradient anomalies
Core wedge. An NCCL collective timeout looks like a training crash. A TTFT spike looks like a capacity problem. Without cross-layer correlation, teams spend hours guessing. Harbor traces both to their root cause — hardware event, network flap, or framework misconfiguration — in under 60 seconds.
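Here is the shape of that correlation in code: a minimal Python sketch, not Harbor's implementation. The Event type, the 30-second lookback window, and the sample timeline are all illustrative.

```python
from dataclasses import dataclass

@dataclass
class Event:
    ts: float        # unix timestamp, seconds
    layer: str       # "hardware" | "network" | "framework"
    detail: str

def root_cause_candidates(events, failure, window_s=30.0):
    """Return lower-layer events that precede `failure` within `window_s`,
    most recent first: the candidate causal chain."""
    candidates = [
        e for e in events
        if e.layer != "framework"
        and failure.ts - window_s <= e.ts <= failure.ts
    ]
    return sorted(candidates, key=lambda e: e.ts, reverse=True)

# Toy timeline: an IB port flap precedes an NCCL timeout by a few seconds.
events = [
    Event(100.0, "hardware",  "GPU3 thermal throttle cleared"),
    Event(182.4, "network",   "mlx5_0 port 1 link_downed +1 (flap)"),
    Event(185.9, "framework", "NCCL all-reduce timeout on rank 7"),
]
failure = events[-1]
for e in root_cause_candidates(events, failure):
    print(f"{failure.ts - e.ts:5.1f}s before failure: [{e.layer}] {e.detail}")
```

The real engine has to weigh many candidates at once; the point of the sketch is only that a failure plus a synchronized event timeline turns hours of log-surfing into a window query.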
L2
Network Fabric (NCCL · INFINIBAND · NVLINK · EFA)
IB sysfs counters · NVLink bandwidth · RDMA errors
Signals we instrument
InfiniBand port flaps + link error bursts
NVLink bandwidth utilization per lane
RDMA send/receive error rates
NCCL ring buffer stalls per collective op
EFA packet retransmit rates
All-reduce / all-gather latency deltas
Why this matters. InfiniBand port flaps are transient — they last milliseconds and vanish from logs before anyone opens a terminal. Harbor's ring-buffer model captures the flap, timestamps it precisely, and correlates it forward to the NCCL timeout or TTFT spike it caused — in training and inference alike.
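A minimal sketch of that push-on-fail idea, assuming the usual /sys/class/infiniband counter layout (device and port names vary per node, and the flush target here is hypothetical):

```python
import time
from collections import deque
from pathlib import Path

# Typical sysfs location for the IB link-down counter; adjust per node.
COUNTER = Path("/sys/class/infiniband/mlx5_0/ports/1/counters/link_downed")

ring = deque(maxlen=600)  # ~60s of history at 100ms sampling

def sample():
    ring.append((time.time(), int(COUNTER.read_text())))

def flush_on_fail():
    """On a failure signal (e.g. an NCCL timeout), push the buffered window
    so a millisecond-scale flap is preserved after it vanishes from logs."""
    for ts, downs in ring:
        print(ts, downs)   # a real agent would ship these to the correlator

while True:
    sample()
    if ring[-1][1] > ring[0][1]:   # link_downed incremented inside window
        flush_on_fail()
        break
    time.sleep(0.1)
```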
L1
GPU Hardware (DCGM · NVML · EBPF · KERNEL DRIVERS)
DCGM metrics · eBPF hooks · thermal sensors
Signals we instrument
GPU temperature + thermal throttle events
NVML ECC error rates (single + double bit)
SM utilization vs memory utilization skew
PCIe bandwidth saturation
eBPF kernel-level I/O and scheduling events
Power draw anomalies per GPU
Deployment model. A Rust DaemonSet at the OS/kernel layer. Ring-buffer push-on-fail keeps overhead under 1% CPU and 50MB RAM per node. Raw telemetry never leaves the node. Fully air-gapped deployment available. Ships as an executable + Helm/Terraform charts.
Cross-stack correlation engine
Hardware → Network → Framework. One causal timeline. No manual log-surfing. No 10-person Zoom call.
<60s
Harbor
2–6h
industry avg
Inference reliability

Your cluster looks healthy.
Your users disagree.

Training failures are loud — jobs crash, teams get paged. Inference failures are silent. Throughput erodes. Latency climbs. GPU utilization stays green while your model quietly serves users at half speed. Harbor makes the invisible visible.

KV cache visibility
KV cache exhaustion is one of the most common sources of inference degradation — and one of the hardest to diagnose without framework-layer instrumentation. Harbor tracks hit rates, eviction pressure, and sizing mismatches against your actual traffic shape.
KV hit rate · eviction events · cache pressure
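For a sense of what framework-layer instrumentation looks like, here is an illustrative sketch that polls a vLLM server's Prometheus endpoint. Metric names vary by vLLM version, so treat these as examples rather than a fixed contract:

```python
import requests

# vLLM exposes Prometheus metrics on its serving endpoint; the names
# below are illustrative and version-dependent.
WATCHED = ("vllm:gpu_cache_usage_perc", "vllm:num_requests_waiting")

def kv_cache_pressure(endpoint="http://localhost:8000/metrics"):
    out = {}
    for line in requests.get(endpoint, timeout=5).text.splitlines():
        if line.startswith(WATCHED):
            name, value = line.rsplit(" ", 1)
            out[name] = float(value)
    return out

print(kv_cache_pressure())  # e.g. {'vllm:gpu_cache_usage_perc': 0.93, ...}
```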
TTFT + token throughput baselines
Harbor maintains rolling baselines for time-to-first-token, inter-token latency, and tokens per second — and alerts on statistically significant drift before your users notice. P99 degradation caught at the source, not in a support ticket.
TTFT drift · inter-token latency · tokens/sec
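The drift logic itself is simple to sketch. A toy rolling-baseline detector with a z-score alarm (window size and threshold are illustrative, not Harbor's tuning):

```python
import math
from collections import deque

class BaselineDrift:
    """Rolling baseline with a z-score alarm: flag statistically
    significant drift in TTFT or tokens/sec before users notice."""
    def __init__(self, window=1440, z_alert=4.0):   # e.g. 24h of 1-min samples
        self.samples = deque(maxlen=window)
        self.z_alert = z_alert

    def observe(self, value):
        alarmed = False
        if len(self.samples) >= 30:                 # need a stable baseline first
            mean = sum(self.samples) / len(self.samples)
            var = sum((x - mean) ** 2 for x in self.samples) / len(self.samples)
            std = math.sqrt(var) or 1e-9
            alarmed = abs(value - mean) / std > self.z_alert
        self.samples.append(value)
        return alarmed

ttft = BaselineDrift()
for v in [0.21, 0.20, 0.22] * 20 + [0.55]:          # a sudden +340ms-style spike
    if ttft.observe(v):
        print("TTFT drift alert:", v)
```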
Thermal throttle → latency correlation
A GPU thermal throttle during inference doesn't crash anything. It just silently reduces compute throughput, showing up as a latency bump two minutes later. Harbor correlates hardware events to serving latency with sub-second precision.
thermal events · latency correlation · per-GPU impact
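As an illustration of the hardware side of that correlation, a short pynvml sketch that timestamps thermal-throttle events on a single GPU. The GPU index and sampling rate are arbitrary, and it assumes NVML access on the node:

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(3)       # the "GPU 3" from above

THERMAL = (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
           | pynvml.nvmlClocksThrottleReasonHwThermalSlowdown)

events = []
for _ in range(600):                                # one-minute capture at 100ms
    reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
    if reasons & THERMAL:
        events.append(time.time())                  # timestamps for correlation
    time.sleep(0.1)

# A correlator would join `events` against the serving-latency series and
# attribute any bump that begins shortly after a throttle window.
print(f"{len(events)} thermal-throttle samples captured")
```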
Batching + utilization efficiency
Overprovisioning for a single request type while other jobs queue. Bad batching that leaves GPU memory half-empty. Harbor surfaces utilization skew across SM, memory, and PCIe — so you get the throughput you're paying for.
SM utilization · memory skew · batch efficiency
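A rough pynvml sketch of the skew signal itself; the interpretation in the comment is illustrative, not a fixed rule:

```python
import pynvml

pynvml.nvmlInit()

for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(h)  # .gpu (SM) and .memory (bandwidth), in %
    mem = pynvml.nvmlDeviceGetMemoryInfo(h)
    skew = util.gpu - util.memory
    print(f"GPU{i}: SM {util.gpu}% · mem-bw {util.memory}% · "
          f"VRAM {mem.used / mem.total:.0%} · skew {skew:+d}")
    # Large positive skew with half-empty VRAM suggests batches are too
    # small: compute stays busy while memory capacity sits stranded.
```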

Prevent. Diagnose. Remediate.

01 / PREVENT

Pre-deployment validation

Before a training run or inference deployment starts, Harbor runs cross-stack health checks — topology verification, bandwidth baselines, memory error sweeps, KV cache sizing validation. Catch misconfigurations before they cost you a 6-hour run or a degraded rollout.
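One of those sweeps is easy to picture. An illustrative pynvml sketch of a pre-run ECC check, not Harbor's actual validation suite:

```python
import pynvml

def ecc_sweep():
    """Fail fast if any GPU reports uncorrected (double-bit) ECC errors
    before a run starts; one flaky HBM stack can cost a 6-hour job."""
    pynvml.nvmlInit()
    bad = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            errs = pynvml.nvmlDeviceGetTotalEccErrors(
                h,
                pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                pynvml.NVML_VOLATILE_ECC,
            )
        except pynvml.NVMLError:    # ECC not supported/enabled on this GPU
            continue
        if errs:
            bad.append((i, errs))
    return bad

if failures := ecc_sweep():
    raise SystemExit(f"ECC sweep failed: {failures}")
```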

02 / DIAGNOSE

Live cross-stack monitoring

During training and inference, Harbor correlates GPU hardware, network fabric, and ML framework events in real time. Training crash or silent inference degradation — you get a causal chain, not a wall of disconnected logs.

03 / REMEDIATE

Actionable remediation

Harbor doesn't just tell you what failed — it tells you what to do. Targeted, operator-level guidance that a software-focused ML engineer can act on without a hardware background. Automated remediation on the roadmap.

Who it's for

Built for teams
who own their infra.

Harbor is purpose-built for ML infrastructure engineers at AI-native companies running dedicated GPU clusters — bare metal, neo-cloud, or self-managed Kubernetes. If your team gets paged when a training job fails, Harbor is for you.

You're running dedicated GPU infrastructure

The qualifying question: When a training job fails or inference latency spikes, does your team get paged and dig into the logs — or does the platform handle it? If it's the former, Harbor is built for you.

Your cluster deserves a
reliability layer.

We're working with early design partners running dedicated GPU infrastructure. If that's you, let's talk. We'll start with your specific failure modes — training, inference, or both.