Operations Analyst

Hippocratic AI · Palo Alto, CA
5d · Onsite

About The Position

We are seeking a highly reliable and detail-oriented Operations Analyst to ensure the continuous, 24×7 operation of Hippocratic AI’s production systems, integrations, and customer/partner environments. This role is critical to minimizing customer and partner downtime, maintaining trust, and ensuring our AI agents and supporting systems operate smoothly at all times. As an Operations Analyst, you will be responsible for monitoring system alerts, integrations, and operational reports; performing proactive maintenance; resolving common operational issues; and triaging advanced issues to the appropriate engineering, platform, or partner teams. You will play a central role in detecting issues early, coordinating incident response, and maintaining operational excellence across all customer and partner deployments. You will work closely with engineering, infrastructure, security, customer support, and partner teams, and will help build the operational tooling, reporting, and automation needed to scale Hippocratic AI safely and reliably.

Requirements

  • Bachelor’s degree in Computer Science, Information Systems, Health Informatics, Operations, Engineering, or a related field (or equivalent practical experience).
  • 3+ years of experience in operations, site reliability, NOC, technical support, or production monitoring roles.
  • Hands-on experience monitoring production systems, integrations, APIs, or data pipelines in a 24×7 environment.
  • Familiarity with alerting and monitoring tools (e.g., Datadog, New Relic, CloudWatch, Prometheus, Grafana, PagerDuty, Opsgenie, or similar).
  • Ability to troubleshoot common system, integration, and data-flow issues using logs, metrics, and dashboards.
  • Experience writing scripts or automation using tools/languages such as Python, Bash, SQL, or similar.
  • Strong understanding of incident management processes, escalation procedures, and SLA-driven operations.
  • Excellent organizational skills with the ability to manage multiple alerts, issues, and priorities simultaneously.
  • Clear written and verbal communication skills, especially during high-pressure incidents.
  • Strong sense of ownership, reliability, and attention to detail.

Nice To Haves

  • Experience supporting cloud-based platforms (AWS, Azure, or Google Cloud).
  • Familiarity with REST APIs, webhooks, message queues, or integration workflows.
  • Experience in healthcare, regulated environments, or HIPAA-compliant systems.
  • Exposure to CI/CD pipelines, deployment monitoring, or change management processes.
  • Experience creating customer-facing operational or SLA reports.
  • Background in Site Reliability Engineering (SRE), DevOps, or production support for SaaS platforms.
  • Experience supporting AI/ML platforms, data pipelines, or real-time systems.

Responsibilities

  • Integration Management & Development
      • Own the full integration lifecycle for major customers, from gathering requirements through design, development, testing, deployment, and ongoing support to deliver seamless connectivity between Hippocratic AI and client systems.
  • Operations Monitoring & Incident Response
      • Monitor all production systems, integrations, and automated alerts to ensure 24×7 continuous operations across customers and partners.
      • Serve as a first-line responder for operational alerts, diagnosing and resolving standard issues within defined SLAs.
      • Triage complex or advanced issues and page/engage the appropriate on-call engineers, platform teams, or partner contacts.
      • Coordinate incident response activities, track progress to resolution, and ensure clear internal handoffs during escalations.
      • Validate system recovery and perform post-incident checks to ensure full service restoration.
  • Proactive Maintenance & Reliability
      • Perform proactive system health checks, integration validations, and routine maintenance to prevent outages and degradation.
      • Identify trends in alerts, incidents, and performance metrics to recommend preventative actions and long-term fixes.
      • Help define and refine operational runbooks, escalation paths, and standard operating procedures (SOPs).
      • Participate in on-call rotations and support after-hours and weekend coverage as needed to maintain 24×7 availability.
  • Reporting, Automation & Tooling
      • Create and maintain operational reports and dashboards for internal teams, customers, and partners.
      • Build and maintain scripts and automation to monitor system health, validate integrations, and generate customer- or partner-specific reports.
      • Customize operational reporting for each customer/partner to meet contractual, SLA, and compliance requirements.
      • Continuously improve monitoring, alerting, and observability tooling to reduce noise and increase signal quality.
  • Cross-Functional Collaboration
      • Work closely with engineering, infrastructure, security, and customer support teams to resolve incidents and improve system resilience.
      • Support customer-facing teams by providing operational insights, incident summaries, and root-cause analysis.
      • Assist with onboarding new customers and partners by validating integrations, monitoring readiness, and ensuring operational coverage.
      • Contribute to post-incident reviews and continuous improvement initiatives to strengthen overall platform reliability.