Senior Site Reliability Engineer

Veterinary Emergency Group (VEG), White Plains, NY
$170,000 - $200,000 | Remote

About The Position

We are looking for a Senior Site Reliability Engineer who understands that at VEG, "reliability" is a medical necessity: if our proprietary platform, DogByte, goes down, a pet's life could be at risk. You will be the primary lead for our platform's resilience, transforming our infrastructure into a self-healing system that empowers our medical teams to provide 24/7/365 life-saving care. You will bridge the gap between high-level architectural strategy and hands-on technical "surgery," ensuring our engineering teams can build at pace while the foundation remains rock-solid. You will evolve and strengthen an existing system to meet the demands of VEG’s hospital expansion, ensuring our infrastructure never limits our ability to open new hospitals or provide medical care. You will own the ongoing stability of DogByte, scaling it from its current state into a robust enterprise platform in which each hospital's traffic is isolated, so that load at one hospital does not affect another's experience. This role can be based at our VQ in White Plains or may be open to remote work.

Requirements

  • Bachelor’s degree or equivalent experience preferred
  • 5+ years in SRE/DevOps roles, with expertise in high-concurrency environments
  • Deep understanding of the AWS ecosystem, including managing it entirely through Infrastructure as Code
  • Expertise in traffic management, including load balancing techniques, Nginx configuration, and autoscaling to handle volatile traffic patterns
  • Technical leadership in observability, establishing the tracing frameworks and monitoring required to diagnose latency issues and ensure high availability across the entire request lifecycle
  • Direct experience with technologies relevant to our current technical stack: AWS ECS, Terraform, Nginx, PostgreSQL (RDS), and Python

Responsibilities

  • Formulate short- and long-term strategies to ensure DogByte withstands year-over-year volume increases, specifically solving for hospital-to-hospital traffic isolation
  • Work with engineers to ensure data flows (from client to API to database) are configured for high concurrency and maximum reliability
  • Build automated processes to handle traffic spikes and automatically remediate common system errors
  • Set up monitoring and alerting to identify latency throughout the stack and resolve issues before they impact hospital operations
  • Establish and meet SLOs for high availability, ensuring our engineers can build products without worrying if the system can support them

Benefits

  • Competitive compensation ($170,000 - $200,000) plus bonus and benefits
  • Comprehensive health and wellness benefits that start on day one, and access to free therapy or counseling
  • Paid parental leave of up to 10 weeks at 100% of regular salary, plus inclusive fertility and family-building care for all types of families
  • Unlimited PTO to use for vacation or sick days, however you need it!
  • Generous employee referral program, so our awesome people can bring in more awesome people.
  • And the little (big) things, like casual office attire, the ability to bring your fur baby to work, cool VEG swag, food in the fridge for when you’re hungry, and free lunches twice a week!
  • Company laptop and a monthly cell phone reimbursement