A growing neocloud GPU infrastructure provider needed to scale operational automation beyond a single-site tool to every data center across North America. By deploying Itential, they replaced a siloed ops tool with unified, event-driven workflows – reducing diagnostic collection time from hours to minutes and enabling more teams to execute safely without engineering escalation.
Neocloud GPU providers are building the infrastructure layer for the AI era. They operate high-density GPU data centers, expand rapidly across geographies, and deliver compute services where reliability and speed directly affect customer experience.
But the operational model that works at one site rarely scales cleanly to many.
That was the challenge for one rapidly growing GPU infrastructure provider operating multiple data centers across North America. Their footprint included thousands of GPU nodes and a modern stack built on best-of-breed tools: containerized deployments, GitOps, secrets management, NetBox as a source of truth, and deep observability. At this scale, even a 1% daily incident rate means tens of nodes need attention every day, which becomes a constant operational load.
They also had strong automation. In fact, at one site they relied on an internal operations tool that gave their data center teams the ability to execute common tasks and close tickets without escalating to infrastructure engineering. Standardized self-service workflows commonly reduce escalation volume by 30-50% by removing engineering from routine triage.
But that tool was owned externally and locked to a single site. As the company expanded, they needed a way to replicate its operational capabilities across every data center and enable more teams to execute safely without depending on a handful of engineers.
At the same time, incident response was becoming a growing pain point. During GPU health and thermal events, collecting diagnostics from out-of-band controllers and assembling the right evidence was taking hours. That time cost wasn’t just operational. It affected customer experience and slowed resolution.
Teams that automate diagnostics and remediation typically cut time-to-evidence from hours to minutes and triage incidents ten times faster.
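To make the pattern concrete, here is a minimal Python sketch of an event-driven flow that takes an alert, gathers diagnostics, and opens a ticket with the evidence attached. It is an illustration only, not Itential's API or the customer's implementation: the `Alert` and `Ticket` types, the collector functions, and the alert kinds are all hypothetical, and the collectors return stub strings where a real workflow would query an out-of-band controller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    site: str   # data center identifier (hypothetical naming)
    node: str   # affected GPU node
    kind: str   # alert type, e.g. "gpu_thermal"


@dataclass
class Ticket:
    title: str
    evidence: Dict[str, str] = field(default_factory=dict)


# Hypothetical collectors. Real ones would call the node's
# out-of-band controller; these return placeholder text.
def collect_bmc_sensors(alert: Alert) -> str:
    return f"sensor readings for {alert.node} (stub)"


def collect_event_log(alert: Alert) -> str:
    return f"event log for {alert.node} (stub)"


# Which diagnostics to gather for each alert kind (assumed mapping).
COLLECTORS: Dict[str, List[Callable[[Alert], str]]] = {
    "gpu_thermal": [collect_bmc_sensors, collect_event_log],
    "gpu_health": [collect_event_log],
}


def handle_alert(alert: Alert) -> Ticket:
    """Run every collector registered for the alert kind and
    attach the results to a new ticket."""
    evidence: Dict[str, str] = {}
    for collector in COLLECTORS.get(alert.kind, []):
        try:
            evidence[collector.__name__] = collector(alert)
        except Exception as exc:
            # A failed collector is recorded, not fatal, so the
            # ticket still carries whatever evidence was gathered.
            evidence[collector.__name__] = f"collection failed: {exc}"
    title = f"[{alert.site}] {alert.kind} on {alert.node}"
    return Ticket(title=title, evidence=evidence)


if __name__ == "__main__":
    ticket = handle_alert(Alert("site-a", "gpu-node-017", "gpu_thermal"))
    print(ticket.title)
    for name, data in ticket.evidence.items():
        print(f"  {name}: {data}")
```

Because the site is just a field on the alert, the same handler can serve every data center, which is the property that a single-site tool lacks.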
Their network fabric was built on open APIs and modern automation practices, but they wanted an OS-agnostic orchestration layer to protect their automation investments and preserve flexibility as vendor strategies evolve.
The team evaluated several options, including lightweight job runners. But they needed more than a way to execute scripts. They needed orchestration across systems, sites, and teams.
Itential stood out for a few key reasons:
The initial focus was clear: build event-driven workflows that could reduce operational toil immediately. The team prioritized:
The result was a scalable operating model: standardized workflows that could be replicated across sites, used by more teams, and governed consistently, without increasing engineering overhead.
With foundational workflows in production, the team is focused on scaling what works and pushing further into autonomous operations. A few key priorities are driving their roadmap:
The broader goal hasn’t changed: empower operations teams to close more tickets while increasing their capacity to build value, and keep that ratio improving as infrastructure scales.