Case Study

How a GPU Cloud Provider Is Scaling AI Data Center Operations with Itential

Scaling Self-Service Operations, Event-Driven Diagnostics, Automated Actions, & Vendor-Agnostic Network Orchestration Across a Growing Footprint.

Challenge

Manual diagnostics, site-specific automation, and engineering bottlenecks could not keep pace with rapid expansion and uptime demands. GPU thermal alerts required hours of manual triage and evidence collection across multiple systems.

Solution

Itential standardized operational workflows and vendor-agnostic orchestration, enabling automated diagnostics, self-service execution, and repeatable operations across every site. Alerts now trigger workflows that collect evidence, enrich tickets, notify teams, and support remediation.

Why Itential

The company chose Itential for its enterprise-grade orchestration, low-code accessibility, integration flexibility, and ability to productize existing automation with governance at scale. Itential enabled an event-to-action model that reduced time-to-triage and supports future closed-loop operations.

The Challenge

When AI Infrastructure Growth Outpaces Operational Capacity

A rapidly growing GPU cloud provider operates large-scale data centers across North America, delivering high-performance AI compute to customers through a substantial GPU infrastructure footprint.

As demand for AI compute surged, the company expanded into new data center builds, scaled interconnect capacity, and increased customer provisioning volume. The infrastructure team had already embraced modern practices: its environment included containerized deployments, GitOps, Kubernetes, secrets management, source-of-truth systems, and observability platforms.

But growth created a compounding operational reality: every new site introduced more devices, more dependencies, and more operational load. The company needed to scale its operating model as fast as it was scaling physical infrastructure.

“We’re trying to enable our operations teams to increase ticket close rates and efficiency without escalating to engineering.”

Infrastructure Operations Leader

GPU Alerts Were Frequent, But the Response Was Manual

GPU thermal and health events were a recurring operational challenge. When alerts fired, teams needed to quickly determine whether the issue was transient workload behavior, an airflow or power concern, a chassis-level problem, or a degrading GPU that required intervention.

The problem was not detection. The problem was repeatable response at scale.

Collecting diagnostics from out-of-band controllers and assembling the right evidence for triage required manual effort across multiple interfaces and systems. Too many people were pulled into the process, and response quality varied depending on who was available.

“Even getting the diagnostics and thermal information from GPUs, it takes hours.”

Infrastructure Operations Leader

Custom Automation Worked, But Couldn’t Scale Beyond a Single Site

The company had invested in internal tooling and automation to improve operations at one data center location. The tooling provided valuable automation capabilities for operations and customer experience teams, but it had a fundamental constraint: it could not scale across sites and was not fully owned or productized internally.

As the organization expanded, the team needed a platform that could standardize those operational workflows and extend them across every location, without rebuilding everything from scratch each time a new facility came online. They also needed a consistent alert-driven model for triage and action, so operational response did not depend on ad hoc coordination.

The goal was not just automation. It was repeatable execution with governance and visibility, usable by operations and customer experience teams.

Vendor Abstraction Was a Strategic Priority

The company’s network fabric was built on open standards and modern APIs, but the team knew vendor decisions would evolve as the footprint grew. They wanted to avoid being locked into a single network platform and ensure that day-to-day operations and provisioning workflows remained portable.

That meant shifting from vendor-specific automation toward OS-agnostic workflows and normalized data structures.

“The biggest interest is OS-agnostic data structures for provisioning the fabric… portability as we migrate away from our current platform.”

Lead Infrastructure Architect
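To make the idea of an OS-agnostic data structure concrete, the following is a minimal Python sketch, not the company's or Itential's actual model. The `FabricInterface` schema, the two renderers, and the "vendor_a" and "vendor_b" output syntaxes are all invented for illustration. The point it shows is that the intent is recorded once in a neutral structure, and only the rendering step changes when the platform changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricInterface:
    """Vendor-neutral description of one fabric interface (hypothetical schema)."""
    name: str
    description: str
    vlan_id: int
    mtu: int = 9000


def render_vendor_a(intf: FabricInterface) -> str:
    """Render to an invented CLI-style syntax (illustration only)."""
    return "\n".join([
        f"interface {intf.name}",
        f"  description {intf.description}",
        f"  mtu {intf.mtu}",
        f"  switchport access vlan {intf.vlan_id}",
    ])


def render_vendor_b(intf: FabricInterface) -> dict:
    """Render to an invented API-style payload (illustration only)."""
    return {
        "interface": intf.name,
        "attributes": {
            "descr": intf.description,
            "mtu": intf.mtu,
            "untagged_vlan": intf.vlan_id,
        },
    }


# Adding a platform means adding a renderer; the intent data stays the same.
RENDERERS = {"vendor_a": render_vendor_a, "vendor_b": render_vendor_b}


def render(intf: FabricInterface, platform: str):
    """Look up the renderer registered for a platform and apply it."""
    try:
        renderer = RENDERERS[platform]
    except KeyError:
        raise ValueError(f"no renderer registered for platform {platform!r}") from None
    return renderer(intf)


uplink = FabricInterface(name="Ethernet1/1", description="gpu-rack-12 uplink", vlan_id=210)
print(render(uplink, "vendor_a"))
print(render(uplink, "vendor_b"))
```

In a sketch like this, a migration away from the current platform would touch only the renderer registry, while provisioning workflows keep consuming the same neutral records.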

High-Code Automation Reached Its Practical Limits

The infrastructure engineering team had deep expertise and had built meaningful automation with scripts and playbooks. But as demand increased, the overhead of maintaining custom integrations and updating code for external dependency changes became a major constraint.

Instead of focusing on business value, the team spent time on automation upkeep, integration maintenance, and platform drift.

“It becomes a full-time job… updates are not really anything of value, they’re simply things that have to happen because external dependencies changed.”

Lead Infrastructure Architect

Why Itential

Why They Chose Itential

As the organization evaluated how to scale operational execution across multiple data centers, they were clear about what they needed and what they wanted to avoid.

They did not want another point tool or a scripting framework that increased engineering burden. They needed orchestration that could productize automation with governance, reuse, and scale built in. They also needed an event-driven operational model where alerts could trigger diagnostics, notifications, and automated actions.

Several Criteria Shaped the Decision

From scaling beyond a single site to enabling event-driven diagnostics, six requirements drove the selection of Itential as the orchestration foundation for AI data center operations.

Scale Beyond a Single Site

The team needed workflows that could be repeated across current and future data centers without cloning and forking automation per location. One platform, every site – without rebuilding from scratch each time a new facility came online.

Low-Code Accessibility Without Sacrificing Technical Depth

Itential enabled technical teams to build complex workflows while making them accessible to broader teams through a low-code model, expanding who could safely execute tasks without sacrificing the technical sophistication engineering required.

Leverage Existing Automation Investments

The platform could orchestrate existing Python and Ansible automation rather than requiring a rewrite. This preserved prior investments while enabling modernization and turning existing assets into callable building blocks instead of liabilities.

Vendor-Agnostic Orchestration Layer

The ability to abstract network operations through normalized data models reduced lock-in risk and ensured long-term flexibility as vendor strategies evolved. Build once, execute anywhere across whatever the fabric looks like today or tomorrow.

Low Operational Overhead

A SaaS deployment model reduced the burden of managing yet another platform while still supporting on-premises connectivity through gateway deployment where needed – keeping the team focused on operations, not infrastructure for the orchestrator itself.

Event-Driven Diagnostics & Automated Response

To reduce time-to-triage and enable faster resolution, the team prioritized workflows that automatically collect evidence, enrich tickets, notify the right teams, and prepare for remediation actions based on severity – the foundation for closed-loop and agentic operations.

Together, these capabilities allowed the organization to shift from one-off automation to a standardized orchestration operating model that could scale with both infrastructure growth and operational demand.

The Solution

Standardizing Operational Execution Across the AI Infrastructure Stack

The architectural shift came from standardizing operational execution into reusable workflows that could integrate across the company’s ecosystem – workflows that trigger automatically from events, collect and normalize diagnostic evidence, update systems of record, create or enrich tickets, guide execution for operations teams, and preserve a complete audit trail.

This created a foundation for closed-loop operations where detection, diagnostics, escalation, and remediation could be coordinated in a repeatable way.

“We’re trying to save time and focus it more on the DC technicians… empower them with more access and testability.”

Infrastructure Operations Leader

Orchestrating Multi-Site Operations, Diagnostics, & Provisioning at Scale

With Itential as the orchestration foundation, the organization prioritized several high-impact workflows that span operations, diagnostics, and customer delivery.

Automated GPU Thermal Diagnostics & Response

When thermal alerts occur, workflows can automatically collect diagnostic evidence through hardware APIs, enrich tickets, and initiate operational response without manual coordination.
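The event-to-action sequence described here (collect evidence, enrich a ticket, notify teams, prepare a severity-based action) can be sketched in plain Python. This is an illustrative outline under stated assumptions, not Itential workflow code: the temperature thresholds, team names, proposed actions, and the `collect` and `notify` callables are all hypothetical placeholders for real hardware APIs, ticketing, and messaging integrations.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ThermalAlert:
    host: str
    gpu_index: int
    temperature_c: float


@dataclass
class Ticket:
    summary: str
    severity: str
    evidence: dict = field(default_factory=dict)
    notified: list = field(default_factory=list)
    proposed_action: str = "none"


# Illustrative values only; real thresholds depend on the hardware in use.
WARN_AT_C = 85.0
CRITICAL_AT_C = 95.0
PROPOSED_ACTIONS = {
    "info": "none",
    "warning": "schedule airflow inspection",
    "critical": "drain workload and hold for technician approval",
}
TEAMS_BY_SEVERITY = {
    "info": ["dc-operations"],
    "warning": ["dc-operations"],
    "critical": ["dc-operations", "infrastructure-engineering"],
}


def classify_severity(temperature_c: float) -> str:
    """Map a temperature reading to a severity label."""
    if temperature_c >= CRITICAL_AT_C:
        return "critical"
    if temperature_c >= WARN_AT_C:
        return "warning"
    return "info"


def handle_thermal_alert(
    alert: ThermalAlert,
    collect: Callable[[ThermalAlert], dict],
    notify: Callable[[str, Ticket], None],
) -> Ticket:
    """Run the alert through evidence collection, ticketing, and notification."""
    evidence = collect(alert)  # stand-in for out-of-band controller queries
    severity = classify_severity(alert.temperature_c)
    ticket = Ticket(
        summary=f"GPU {alert.gpu_index} thermal alert on {alert.host}",
        severity=severity,
        evidence=evidence,
    )
    for team in TEAMS_BY_SEVERITY[severity]:
        notify(team, ticket)
        ticket.notified.append(team)
    # The action is only proposed here; execution would stay behind approval.
    ticket.proposed_action = PROPOSED_ACTIONS[severity]
    return ticket


sent = []
ticket = handle_thermal_alert(
    ThermalAlert(host="gpu-node-01", gpu_index=3, temperature_c=97.0),
    collect=lambda a: {"gpu_temp_c": a.temperature_c, "source": "placeholder collector"},
    notify=lambda team, t: sent.append((team, t.summary)),
)
print(ticket.severity, ticket.notified, ticket.proposed_action)
```

Passing the collector and notifier in as callables keeps the sequence itself identical across sites, which is the repeatability the case study emphasizes; only the integrations behind those callables would differ.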

Source of Truth Synchronization

Workflows can programmatically populate and synchronize inventory and host-level data using APIs, enabling more accurate infrastructure context for operations and automation.
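One common way to structure this kind of synchronization is to compute a plan of differences before writing anything. The sketch below assumes hosts are keyed by serial number and represented as plain dictionaries; the field names and sample records are hypothetical, and a real workflow would read both sides from the relevant APIs rather than from literals.

```python
def plan_sync(discovered: dict, source_of_truth: dict) -> dict:
    """Compare discovered hosts with recorded hosts, keyed by serial number."""
    create = sorted(discovered.keys() - source_of_truth.keys())
    stale = sorted(source_of_truth.keys() - discovered.keys())
    update = {}
    for serial in sorted(discovered.keys() & source_of_truth.keys()):
        # Keep only the fields whose discovered value differs from the record.
        changed = {
            key: value
            for key, value in discovered[serial].items()
            if source_of_truth[serial].get(key) != value
        }
        if changed:
            update[serial] = changed
    return {"create": create, "update": update, "stale": stale}


# Hypothetical sample data for illustration.
discovered = {
    "SN100": {"hostname": "gpu-node-01", "gpu_count": 8},
    "SN101": {"hostname": "gpu-node-02", "gpu_count": 8},
}
recorded = {
    "SN100": {"hostname": "gpu-node-01", "gpu_count": 4},
    "SN099": {"hostname": "gpu-node-00", "gpu_count": 8},
}

plan = plan_sync(discovered, recorded)
print(plan)
```

Separating the plan from its application lets the differences be reviewed or attached to a ticket before the source of truth is changed.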

Self-Service Operational Execution

Approved workflows can be executed by operations and customer experience teams, reducing escalations and improving ticket close rates over time.

Fabric Deployment & Configuration Management

Network workflows support configuration backups, validation, and repeatable changes across a multi-site fabric, while preserving flexibility for future vendor shifts.
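As a rough illustration of the backup and validation steps, the Python sketch below stores a configuration snapshot only when its content has changed and reports which required lines are missing. The in-memory `store`, the device name, and the sample configuration lines are assumptions made for the example; they do not reflect any particular vendor syntax or the platform's own backup mechanism.

```python
import hashlib
from datetime import datetime, timezone


def backup_config(device: str, config_text: str, store: dict) -> bool:
    """Append a snapshot for the device; return False if nothing changed."""
    digest = hashlib.sha256(config_text.encode("utf-8")).hexdigest()
    history = store.setdefault(device, [])
    if history and history[-1]["sha256"] == digest:
        return False  # identical to the latest snapshot, so skip it
    history.append({
        "sha256": digest,
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "config": config_text,
    })
    return True


def validate_config(config_text: str, required_lines: list) -> list:
    """Return the required lines that do not appear in the configuration."""
    present = {line.strip() for line in config_text.splitlines()}
    return [line for line in required_lines if line not in present]


# Hypothetical device name and configuration lines.
store = {}
running = "hostname leaf-01\nntp server 10.0.0.1\n"
print(backup_config("leaf-01", running, store))
print(backup_config("leaf-01", running, store))
print(validate_config(running, ["ntp server 10.0.0.1", "logging host 10.0.0.2"]))
```

Because the checks operate on text and a list of expectations rather than on a vendor-specific parser, the same pattern could be reused if the fabric platform changes, although real validation would usually be stricter than exact line matching.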

Customer Provisioning Orchestration

End-to-end activation workflows can coordinate across network, security, and interconnect providers, reducing time-to-provision and improving delivery consistency.

The Outcome

Measurable Results Across Operations & Infrastructure Delivery

Moving from manual processes and site-specific tooling to orchestrated workflows produced outcomes that were both immediate and structural.

Faster Diagnostics & Reduced Escalation Volume

By automating evidence collection and standardizing response execution, the organization reduced time spent on diagnostic tasks and improved operational throughput.

Improved Internal Ticket Close Rates

Self-service execution allowed operations and CX teams to close more tickets independently, reducing dependency on infrastructure engineering for routine tasks.

Greater Flexibility & Reduced Vendor Risk

OS-agnostic workflows and vendor abstraction reduced the long-term cost and risk of vendor transitions across the network fabric.

Reduced Engineering Maintenance Burden

Engineering regained capacity by shifting from maintaining brittle custom integrations to building scalable, reusable workflows with governance.

Foundation for Agentic Operations

Standardized workflows, normalized data, and event-driven execution provide the governed foundation that FlowAI agents need to act safely on AI data center infrastructure.