SRE/DevOps Engineer

Versana, New York, NY

About The Position

Versana is seeking a motivated SRE/DevOps Engineer with strong observability experience to join our growing Platform Engineering squad. The squad’s goal is to manage public cloud, improve DevOps practices, and monitor Versana’s real-time syndicated loan data platform. The ideal candidate will have a deep understanding of cloud-native applications, distributed computing, CI/CD implementation, and observability tools and practices.

Requirements

  • 5+ years of experience as a Site Reliability Engineer or similar role.
  • 3+ years of work experience with public cloud (Azure, AWS, or GCP).
  • 3+ years of direct experience with observability tools such as Datadog, Elasticsearch, or Grafana Labs.
  • 3+ years of experience with containerization and orchestration technologies like Docker and Kubernetes.
  • 2+ years of experience in development and management of CI/CD pipelines (e.g., Azure DevOps, GitLab CI/CD, GitHub Actions, Jenkins).
  • 2+ years of experience with infrastructure-as-code tools such as Terraform, Azure Bicep, or CloudFormation.
  • 1+ years of experience with site reliability tools like Gremlin, Chaos Mesh, or similar.
  • Proven track record leveraging core observability concepts, end-user monitoring, and infrastructure monitoring with SaaS solutions.
  • Experience with messaging services like Kafka or Azure Event Hubs.
  • Good understanding of the Linux operating system.

Nice To Haves

  • Experience in at least one programming language such as Java, JavaScript, Python, Go, or .NET.
  • Certifications in cloud technologies.
  • Experience with Azure cloud or Azure DevOps.
  • Experience with Datadog or similar modern observability tools.

Responsibilities

  • Design, implement, and enhance system observability and monitoring tools.
  • Monitor system performance, create incident response plans, and implement observability practices to gain insights into system behavior.
  • Implement and monitor service-level objectives (SLOs) and indicators.
  • Improve system reliability and resiliency.
  • Conduct post-incident reviews and implement necessary changes to prevent system failures.
  • Assist teams in implementing observability tools and leveraging available telemetry data to troubleshoot and resolve incidents and problems.
  • Leverage observability and event management to improve key incident management metrics, such as mean time to detect and mean time to restore services.
  • Continually optimize systems and workflows by improving architecture, infrastructure, automation, CI/CD, and observability.
  • Collaborate with developers to ensure applications are designed with DevOps best practices in mind.
  • Participate in a rotating on-call schedule for weekend releases, and be available to respond to production issues outside of regular working hours, including weekends and holidays.