Platform Operations Engineer (Site Reliability Engineer)

Vertiv•Westerville, OH

Onsite

About The Position

Vertiv is seeking a skilled Platform Operations Engineer (Site Reliability Engineer) to own cross-platform observability, incident management, and operational reliability within Vertiv’s Digital organization. This individual contributor role is responsible for designing, implementing, and continuously improving monitoring and alerting solutions across Vertiv’s digital platform ecosystem (including Compass AI, Writer AI, Site Scope, UiPath, Workato, Cursor, and other approved enterprise tools) while owning incident response processes, SLA management, and operational governance.

The Platform Operations Engineer / SRE will operate within the Digital organization and play a central role in advancing Vertiv’s Operational Excellence strategic priority by ensuring the availability, performance, and resilience of the platforms that power critical digital workflows and business functions. Although this is an individual contributor position, it carries lead responsibilities: proactive reliability engineering, applying SRE principles such as SLOs, error budgets, and blameless post-mortems, and embedding secure coding and operational governance practices across the Digital organization.

The Platform Operations Engineer / SRE will define and enforce observability standards, lead incident response and root cause analysis, manage platform-level SLAs, and partner with engineering, security, and business stakeholders to ensure that all digital platforms meet agreed availability and performance targets. This position partners closely with IT Security, NPDI, Digital delivery teams, and business operations, and is based on site at Vertiv’s Westerville, OH headquarters.

Requirements

Bachelor’s degree in Computer Science, Information Systems, Engineering, or a related field; equivalent practical experience considered.
5+ years of professional experience in platform operations, site reliability engineering, DevOps, or a related software/infrastructure engineering discipline.
3+ years of hands-on experience with enterprise monitoring and observability platforms (e.g., Datadog, Grafana, Prometheus, Azure Monitor, Splunk, or equivalent) in a multi-platform environment.
Demonstrated experience owning and managing incident response processes, post-mortem facilitation, and SLA/SLO governance.
Experience implementing secure coding practices, DevSecOps standards, or operational governance frameworks in an enterprise software delivery environment.
Proficiency with monitoring and observability tools (Datadog, Grafana, Prometheus, Azure Monitor, Splunk, or equivalent) for cross-platform health and performance tracking.
Strong knowledge of SRE principles, including SLOs, SLIs, blameless post-mortems, and toil reduction practices.
Hands-on experience with cloud platforms (AWS preferred) and familiarity with containerized environments (Docker, Kubernetes) and infrastructure-as-code tooling (Terraform, Ansible, or equivalent).
Proficiency in multiple programming languages (Python, Ruby, PowerShell, Java, JavaScript, C#, etc.) for automation and runbook development.
Experience with CI/CD platforms (GitLab, Jenkins, GitHub Actions, Azure DevOps, or equivalent) and deployment reliability practices including progressive rollout, feature flags, and automated health checks.

Nice To Haves

Google SRE certification, AWS DevOps Professional, Azure certifications, or equivalent SRE/cloud operations certification.
Experience with AIOps tooling or AI-assisted anomaly detection and automated remediation capabilities.
Familiarity with the Vertiv digital platform ecosystem: Workato, UiPath, Power Automate, Compass AI, Writer AI, or Cursor.
Experience applying DevSecOps practices, including SAST/DAST scanning, secrets management, and compliance-as-code in enterprise environments.
Experience working in Agile/Scrum delivery environments; familiarity with ITIL incident and change management frameworks.

Responsibilities

Own Cross-Platform Monitoring & Observability: Design, implement, and maintain end-to-end monitoring, alerting, and observability solutions across Vertiv’s digital platform ecosystem — including AI platforms, automation tools, and internal applications — ensuring real-time visibility into system health, performance, and availability.
Lead Incident Response & Management: Serve as the primary escalation point and incident commander for P1/P2 incidents across Digital platforms; lead root cause analysis (RCA), blameless post-mortems, and corrective action tracking to prevent recurrence and reduce mean time to resolution (MTTR).
Manage Platform SLAs & Reliability Targets: Define, instrument, and enforce service level objectives (SLOs), service level indicators (SLIs), and error budgets across Digital platforms; produce regular SLA performance reports for leadership and drive platform improvements to meet or exceed agreed availability and performance targets.
Drive Secure Coding & Operational Governance: Champion secure coding practices and DevSecOps standards within Digital delivery teams; conduct operational readiness reviews for new platform deployments, enforce configuration management and change control processes, and partner with IT Security and NPDI to ensure all platforms meet Vertiv’s security and compliance requirements.
Automate Operations & Reduce Toil: Identify and eliminate manual operational toil through automation, including automated remediation runbooks and anomaly detection built with scripting, IaC tools, and approved automation platforms.
Capacity Planning & Performance Engineering: Analyze platform utilization trends and conduct capacity planning across Digital environments; proactively identify performance bottlenecks and recommend architectural improvements to ensure platforms scale reliably with business demand.
CI/CD Pipeline Reliability & Deployment Support: Partner with Digital delivery teams to ensure CI/CD pipelines are instrumented for reliability, deployment risk is managed through progressive rollout strategies, and production deployments are supported with appropriate rollback and health-check capabilities.
Evaluate & Advance Observability Tooling: Stay current on advancements in observability, AIOps, and SRE tooling; evaluate and recommend new tools and practices that enhance Vertiv’s platform operations maturity, and drive adoption of modern reliability engineering standards across the Digital organization.