About The Position

Inovalon was founded in 1998 on the belief that technology, and data specifically, would empower the transformation of the entire healthcare ecosystem for the better, improving both outcomes and economics. At Inovalon, we believe that when our customers are successful in their missions, healthcare improves. Therefore, we focus on empowering them with data-driven solutions. And the momentum is building. Together, as ONE Inovalon, we are a united force delivering solutions that address healthcare’s greatest needs. Through our mission-based culture of inclusion and innovation, our organization brings value not just to our customers, but to the millions of patients and members they serve.

Overview

The Senior Director, Customer Reliability & Technical Operations is accountable for the reliability, availability, performance, and operational health of Inovalon customer pharmacies using the ScriptMed SaaS platform. This leader directs a global operations organization (United States and India) responsible for ITIL-aligned Incident, Problem, and Change Management, as well as the technical functions that keep the platform stable and scalable, including Cloud Infrastructure Engineering, Database Administration, DevOps, and Site Reliability Engineering (SRE). The role partners closely with Product, Engineering, Security, and Customer Success to proactively detect and remediate issues using DataDog observability and ServiceNow ITSM workflows, ensuring customers experience dependable service and predictable outcomes.

Scope and Impact

  • Owns day-to-day and sustained operational performance for ScriptMed, including uptime, performance, incident response, and service restoration across customer pharmacies and tenant environments.
  • Leads a blended onshore/offshore operating model, ensuring 24x7 coverage, clear escalation paths, and consistent execution of operational processes.
  • Establishes and matures a Network Operations Center (NOC) and evolves it into an AI-enabled Intelligent Operations Management Center, improving detection, triage, and automation.
  • Drives operational discipline across Incident, Problem, and Change Management, reducing customer-impacting events, shortening MTTR, and preventing recurrence.
  • Provides executive-level visibility into platform health and risk, enabling informed decisions on investment, capacity, and reliability improvements.

Requirements

  • Bachelor’s degree in Computer Science, Information Systems, Engineering, or a related field (or equivalent experience).
  • 10+ years of experience in technical operations, reliability engineering, platform operations, or production support for SaaS platforms.
  • 5+ years of experience leading managers and multi-team organizations, including distributed/onshore-offshore teams.
  • Demonstrated experience running ITIL-aligned processes: Incident, Problem, and Change Management.
  • Proven track record improving production stability and reliability metrics (e.g., availability, MTTR, MTTD, change failure rate).
  • Working knowledge of cloud infrastructure and operational practices, plus strong stakeholder and executive communication skills.

Nice To Haves

  • Experience supporting healthcare, pharmacy, specialty pharmacy, or other regulated SaaS environments.
  • Familiarity with GCP, Oracle, Kafka, and .NET in a production SaaS environment.
  • Experience implementing or maturing a NOC and evolving toward AI-enabled operations (event correlation, automation, AIOps concepts).
  • Strong background in SRE practices (SLOs/SLIs, error budgets, toil reduction, automation/self-healing).
  • Experience with DataDog observability and ServiceNow ITSM design and operationalization.
  • Familiarity with security, compliance, and audit readiness requirements in enterprise SaaS operations.
  • Knowledge of the ScriptMed platform and specialty pharmacy workflows supported by ScriptMed.

Responsibilities

  • Define and execute the customer reliability and technical operations strategy aligned to ScriptMed business objectives, SLAs, and customer expectations.
  • Build and lead high-performing teams across the U.S. and India, including staffing, performance management, coaching, and career development.
  • Establish clear on-call and escalation models, operational playbooks, and governance routines (daily ops review, incident review, weekly change review, reliability council).
  • Partner with Engineering, Product, Security, and Customer Success to align priorities, manage operational risk, and drive continuous improvement.
  • Own the incident management lifecycle, including detection, triage, escalation, customer-impact assessment, communications, and service restoration.
  • Ensure strong runbooks, incident roles, and standards for severity classification, timelines, and stakeholder updates.
  • Use DataDog monitoring and alerting to proactively identify issues and reduce customer impact through early detection and fast response.
  • Lead post-incident reviews, ensuring corrective actions are assigned, tracked, and validated.
  • Establish and mature a problem management program that drives root cause analysis, corrective and preventive actions, and measurable reduction in repeat incidents.
  • Create a consistent approach for trend analysis, known error management, and prevention backlog creation.
  • Partner with Engineering and Architecture to prioritize reliability improvements and reduce technical debt that drives operational instability.
  • Own change management processes to ensure reliable deployments, infrastructure changes, and operational updates with appropriate approvals and controls.
  • Define change classification standards, risk scoring, blackout windows, validation steps, and rollback plans.
  • Partner with DevOps and Engineering to implement change automation and quality gates that reduce change-related incidents.
  • Stand up and operationalize a Network Operations Center responsible for real-time monitoring, initial triage, and coordinated response.
  • Mature NOC capabilities into an AI-enabled Intelligent Operations Management Center that uses automation, correlation, noise reduction, predictive insights, and self-healing where appropriate.
  • Define and track operational KPIs (availability, MTTR, MTTD, incident volume by cause, change success rate, alert noise, customer-impact minutes).
  • Lead teams responsible for cloud infrastructure reliability, capacity planning, scaling, patching, cost optimization, and resiliency improvements within Google Cloud Platform (GCP).
  • Ensure platform availability and disaster recovery posture through tested backups, failover processes, RPO/RTO alignment, and resilience engineering.
  • Drive standardization of infrastructure as code, configuration management, and secure-by-design controls in partnership with Security.
  • Lead database administration teams responsible for availability, performance, backup/recovery, patching, access controls, and operational support of Oracle databases.
  • Establish database operational standards and performance tuning practices to ensure stable throughput and predictable customer experience.
  • Partner with Engineering to improve data lifecycle management, replication, and operational patterns that reduce risk.
  • Lead DevOps teams responsible for CI/CD operational practices, deployment tooling, environment management, and release enablement in partnership with Engineering.
  • Lead SRE teams responsible for defining and managing SLOs/SLIs, error budgets, reliability roadmaps, and operational automation.
  • Improve observability standards (logs/metrics/traces), alert tuning, and runbook maturity to reduce manual effort and increase speed of remediation.
  • Own the operational use of DataDog and ServiceNow, ensuring monitoring, alerting, incident workflows, change workflows, and problem workflows are consistently implemented and improved.
  • Improve proactive detection and remediation through alert correlation, automation, standardized dashboards, and service ownership models.

Benefits

  • Inovalon offers a competitive salary and benefits package.
  • In addition to the base compensation, this position may be eligible for performance-based incentives.
  • Inovalon invests in associates to help them stay healthy, save for long-term financial goals, and manage the demands of work and personal commitments. That’s why Inovalon offers a valuable benefits package with a wide range of choices to meet associate needs, which may include health insurance, life insurance, company-paid disability, 401(k), 18+ days of paid time off, and more.