About The Position

You will own infrastructure reliability, observability, and cost optimization for a production platform serving multiple customers under a 99.5% uptime SLA. This role focuses on building resilient, secure, and cost-efficient cloud infrastructure while leading incident response, monitoring, and compliance readiness initiatives.

Requirements

  • Strong experience as a Site Reliability Engineer, DevOps Engineer, or Platform Engineer.
  • Deep expertise in AWS cloud architecture (ECS, EKS, RDS, Lambda, S3, CloudFront).
  • Proven experience with Infrastructure as Code using Terraform or CloudFormation.
  • Hands-on production experience with Kubernetes and container orchestration.
  • Strong knowledge of observability and monitoring tools (Datadog, New Relic, Prometheus, Grafana).
  • Experience managing on-call rotations, incident response, and post-incident reviews.
  • Solid understanding of security practices including SIEM, vulnerability scanning, and SOC 2 compliance.
  • Demonstrated experience in cloud cost optimization and FinOps practices.
  • Ability to operate independently and prioritize reliability in high-availability environments.

Nice To Haves

  • Experience supporting SOC 2 Type II audits.
  • Background working in regulated or compliance-heavy environments that handle PHI or PII.
  • Experience implementing DLP and document scanning solutions.
  • Familiarity with AI/ML workload cost optimization.
  • Experience supporting SaaS platforms with customer-isolated environments.

Responsibilities

  • Meet the 99.5% uptime SLA across all production services and customer environments.
  • Design and maintain multi-region deployments to support geographic redundancy.
  • Implement automated failover mechanisms for databases, load balancers, and critical services.
  • Build and manage disaster recovery strategies, including automated backups and point-in-time recovery.
  • Lead incident detection, response, and postmortems, meeting defined SLAs for P0 issues.
  • Develop real-time observability dashboards for uptime, latency, error rates, and system health.
  • Monitor application and infrastructure performance metrics across customers.
  • Implement alerting, on-call rotations, escalation policies, and PagerDuty integrations.
  • Manage log aggregation and retention using SIEM platforms such as Splunk or Sumo Logic.
  • Support SOC 2 Type II preparation through security controls, monitoring, and documentation.
  • Implement vulnerability scanning and DLP controls, and coordinate penetration testing.
  • Optimize cloud infrastructure costs through right-sizing, auto-scaling, and storage lifecycle policies.
  • Track and report infrastructure and API costs per customer, driving FinOps best practices.
  • Build automated runbooks and self-healing workflows for common incidents.