Manual diagnostics, site-specific automation, and engineering bottlenecks could not keep pace with rapid expansion and uptime demands. GPU thermal alerts required hours of manual triage and evidence collection across multiple systems.
Itential provided standardized operational workflows and vendor-agnostic orchestration, enabling automated diagnostics, self-service execution, and repeatable operations across every site. Alerts now trigger workflows that collect evidence, enrich tickets, notify teams, and support remediation.
Itential was chosen for enterprise-grade orchestration, low-code accessibility, integration flexibility, and the ability to productize existing automation with governance and scale. It enabled an event-to-action model that reduced time-to-triage and supports future closed-loop operations.
A rapidly growing GPU cloud provider operates large-scale data centers across North America, delivering high-performance AI compute to customers through a substantial GPU infrastructure footprint.
As demand for AI compute surged, the company expanded into new data center builds, scaled interconnect capacity, and increased customer provisioning volume. The infrastructure team embraced modern practices. Their environment included containerized deployments, GitOps, Kubernetes, secrets management, source of truth systems, and observability platforms.
But growth created a compounding operational reality: every new site introduced more devices, more dependencies, and more operational load. The company needed to scale its operating model as fast as it was scaling physical infrastructure.
GPU thermal and health events were a recurring operational challenge. When alerts fired, teams needed to quickly determine whether the issue was transient workload behavior, an airflow or power concern, a chassis-level problem, or a degrading GPU that required intervention.
The problem was not detection. The problem was repeatable response at scale.
Collecting diagnostics from out-of-band controllers and assembling the right evidence for triage required manual effort across multiple interfaces and systems. Too many people were pulled into the process, and response quality varied depending on who was available.
The company had invested in internal tooling and automation to improve operations at one data center location. The tooling provided valuable automation capabilities for operations and customer experience teams, but it had a fundamental constraint: it could not scale across sites and was not fully owned or productized internally.
As the organization expanded, the team needed a platform that could standardize those operational workflows and extend them across every location, without rebuilding everything from scratch each time a new facility came online. They also needed a consistent alert-driven model for triage and action, so operational response did not depend on ad hoc coordination.
The goal was not just automation. It was repeatable execution with governance and visibility, usable by operations and customer experience teams.
The company’s network fabric was built on open standards and modern APIs, but the team knew vendor decisions would evolve as the footprint grew. They wanted to avoid being locked into a single network platform and ensure that day-to-day operations and provisioning workflows remained portable.
That meant shifting from vendor-specific automation toward OS-agnostic workflows and normalized data structures.
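As a rough illustration of what "normalized data structures" can mean in practice, the sketch below maps two vendor-specific interface payloads onto one vendor-neutral record. Everything in it is an assumption made for illustration: the `InterfaceState` schema, the `vendor_a` and `vendor_b` payload shapes, and the function names are hypothetical and are not taken from Itential or from the company's environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceState:
    """Vendor-neutral view of one interface (hypothetical schema)."""
    name: str
    admin_up: bool
    oper_up: bool
    mtu: int


def from_vendor_a(raw: dict) -> InterfaceState:
    # Hypothetical vendor A shape: string statuses, MTU as a string.
    return InterfaceState(
        name=raw["ifName"],
        admin_up=raw["adminStatus"] == "up",
        oper_up=raw["operStatus"] == "up",
        mtu=int(raw["mtu"]),
    )


def from_vendor_b(raw: dict) -> InterfaceState:
    # Hypothetical vendor B shape: boolean "enabled", link state as text.
    return InterfaceState(
        name=raw["interface"],
        admin_up=bool(raw["enabled"]),
        oper_up=raw["link"] == "connected",
        mtu=int(raw["mtu"]),
    )


# Registry of per-vendor adapters; supporting a new vendor means adding one entry.
NORMALIZERS = {"vendor_a": from_vendor_a, "vendor_b": from_vendor_b}


def normalize(vendor: str, raw: dict) -> InterfaceState:
    """Convert a vendor payload into the shared InterfaceState record."""
    if vendor not in NORMALIZERS:
        raise ValueError(f"no normalizer registered for {vendor!r}")
    return NORMALIZERS[vendor](raw)


def needs_attention(state: InterfaceState) -> bool:
    """Workflow logic written once against the normalized record."""
    return state.admin_up and not state.oper_up
```

The point of the design is that checks such as `needs_attention` depend only on the normalized record, so a vendor change touches one adapter rather than every workflow.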
The infrastructure engineering team had deep expertise and had built meaningful automation with scripts and playbooks. But as demand increased, the overhead of maintaining custom integrations and updating code for external dependency changes became a major constraint.
Instead of focusing on business value, the team spent time on automation upkeep, integration maintenance, and correcting platform drift.
As the organization evaluated how to scale operational execution across multiple data centers, they were clear about what they needed and what they wanted to avoid.
They did not want another point tool or a scripting framework that increased engineering burden. They needed orchestration that could productize automation with governance, reuse, and scale built in. They also needed an event-driven operational model where alerts could trigger diagnostics, notifications, and automated actions.
From scaling beyond a single site to enabling event-driven diagnostics, six requirements drove the selection of Itential as the orchestration foundation for AI data center operations.
The team needed workflows that could be repeated across current and future data centers without cloning and forking automation per location. One platform, every site – without rebuilding from scratch each time a new facility came online.
Itential enabled technical teams to build complex workflows while making them accessible to broader teams through a low-code model, expanding who could safely execute tasks without sacrificing the technical sophistication engineering required.
The platform could orchestrate existing Python and Ansible automation rather than requiring a rewrite. This preserved prior investments while enabling modernization and turning existing assets into callable building blocks instead of liabilities.
The ability to abstract network operations through normalized data models reduced lock-in risk and ensured long-term flexibility as vendor strategies evolved. Build once, execute anywhere, whatever the fabric looks like today or tomorrow.
A SaaS deployment model reduced the burden of managing yet another platform while still supporting on-premises connectivity through gateway deployment where needed – keeping the team focused on operations, not infrastructure for the orchestrator itself.
To reduce time-to-triage and enable faster resolution, the team prioritized workflows that automatically collect evidence, enrich tickets, notify the right teams, and prepare for remediation actions based on severity – the foundation for closed-loop and agentic operations.
The architectural shift came from standardizing operational execution into reusable workflows that could integrate across the company’s ecosystem – workflows that trigger automatically from events, collect and normalize diagnostic evidence, update systems of record, create or enrich tickets, guide execution for operations teams, and preserve a complete audit trail.
This created a foundation for closed-loop operations where detection, diagnostics, escalation, and remediation could be coordinated in a repeatable way.
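The event-to-action flow described above can be sketched in a few lines. This is a minimal, platform-independent illustration, not Itential's workflow engine or API: the `Alert` and `Ticket` types, the severity levels, and the `handle_alert` function are all hypothetical names chosen for the example, and the evidence collector is passed in as a plain function so the sketch runs without any external system.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Alert:
    host: str
    kind: str
    severity: str  # assumed levels for this sketch: "warning" or "critical"


@dataclass
class Ticket:
    alert: Alert
    evidence: dict = field(default_factory=dict)
    notified: list = field(default_factory=list)
    proposed_action: str = "none"
    audit: list = field(default_factory=list)  # ordered record of each step


def handle_alert(
    alert: Alert,
    collect: Callable[[Alert], dict],
    notify_targets: dict,
    actions: dict,
) -> Ticket:
    """Run the same steps for every alert and record each one in the audit trail."""
    ticket = Ticket(alert=alert)
    ticket.audit.append("alert received")

    # 1. Collect diagnostic evidence and attach it to the ticket.
    ticket.evidence = collect(alert)
    ticket.audit.append("evidence collected")

    # 2. Decide who is notified, based on severity.
    ticket.notified = list(notify_targets.get(alert.severity, []))
    ticket.audit.append(f"notified {len(ticket.notified)} team(s)")

    # 3. Propose (not execute) a remediation; unknown severities go to a human.
    ticket.proposed_action = actions.get(alert.severity, "manual_review")
    ticket.audit.append(f"proposed action: {ticket.proposed_action}")
    return ticket
```

A call might look like `handle_alert(Alert("gpu-node-17", "thermal", "critical"), collect=my_collector, notify_targets={"critical": ["dc-ops"]}, actions={"critical": "drain_node"})`, where `my_collector` and the action names are placeholders. Keeping remediation as a proposal rather than an automatic step reflects the "prepare for remediation" wording above; a closed-loop variant would execute the action behind an approval gate.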
With Itential as the orchestration foundation, the organization prioritized several high-impact workflows that span operations, diagnostics, and customer delivery.
When thermal alerts occur, workflows can automatically collect diagnostic evidence through hardware APIs, enrich tickets, and initiate operational response without manual coordination.
Workflows can programmatically populate and synchronize inventory and host-level data using APIs, enabling more accurate infrastructure context for operations and automation.
Approved workflows can be executed by operations and customer experience teams, reducing escalations and improving ticket close rates over time.
Network workflows support configuration backups, validation, and repeatable changes across a multi-site fabric, while preserving flexibility for future vendor shifts.
End-to-end activation workflows can coordinate across network, security, and interconnect providers, reducing time-to-provision and improving delivery consistency.
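For the thermal-alert workflow listed first above, the evidence-collection step might reduce to something like the following. The source says only that evidence is gathered "through hardware APIs"; this sketch assumes, purely for illustration, that the out-of-band controller returns a Redfish-style Thermal payload with a `Temperatures` list whose entries carry `Name`, `ReadingCelsius`, `UpperThresholdNonCritical`, and `UpperThresholdCritical`. The `summarize_thermal` function and its output shape are hypothetical, and the payload is passed in as a dictionary so no network call is made.

```python
def summarize_thermal(payload: dict) -> dict:
    """Reduce a Redfish-style Thermal payload to ticket-ready evidence."""
    rank = {"ok": 0, "warning": 1, "critical": 2}
    worst = "ok"
    readings = []

    for sensor in payload.get("Temperatures", []):
        reading = sensor.get("ReadingCelsius")
        if reading is None:
            continue  # sensor absent or not reporting; nothing to classify

        critical = sensor.get("UpperThresholdCritical")
        warning = sensor.get("UpperThresholdNonCritical")
        if critical is not None and reading >= critical:
            status = "critical"
        elif warning is not None and reading >= warning:
            status = "warning"
        else:
            status = "ok"

        readings.append(
            {"sensor": sensor.get("Name", "unknown"), "celsius": reading, "status": status}
        )
        if rank[status] > rank[worst]:
            worst = status

    return {"worst": worst, "readings": readings}
```

The summary's `worst` field gives a workflow one value to branch on, while the per-sensor readings can be attached to the ticket so whoever picks it up does not need to log in to the controller to see what fired.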
Moving from manual processes and site-specific tooling to orchestrated workflows produced outcomes that were both immediate and structural.