About The Position

Lovelace AI is seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our growing team. As an SRE at Lovelace AI, you will play a critical role in ensuring the availability, scalability, and performance of our cutting-edge AI-powered applications and infrastructure. You will bridge the gap between software development and operations, applying sound engineering principles and automation to maintain and improve our systems.

Requirements

  • 5+ years of experience in site reliability engineering, DevOps, systems administration, or related roles.
  • Proven track record of managing complex infrastructure, troubleshooting production issues, and optimizing system performance in high-scale environments.
  • Strong experience with Linux/Unix administration and proficiency in scripting languages (e.g., Python, Bash, Go).
  • Deep understanding of cloud platforms (AWS, GCP, Azure) and related services (e.g., EC2, S3, Lambda, Kubernetes).
  • Experience with containerization and orchestration technologies like Docker and Kubernetes.
  • Proficiency with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Dynatrace, ELK Stack).
  • Strong understanding of networking fundamentals (DNS, HTTP, TCP/IP), load balancing, and CDNs.
  • Experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) and infrastructure automation.
  • Familiarity with distributed systems and microservices architecture.
  • Excellent problem-solving and troubleshooting skills.
  • Strong analytical skills with the ability to identify Service Level Indicators (SLIs) and align efforts to meet availability and latency objectives.
  • Ability to balance development and support responsibilities effectively.
  • Strong interpersonal and communication skills, with the ability to collaborate effectively across teams.
  • Experience working on projects that involve business segments.
  • Must be a US Citizen.

Responsibilities

  • Design, implement, and maintain robust monitoring, alerting, and observability solutions to proactively detect and resolve issues before they impact end-users.
  • Lead troubleshooting efforts for complex production issues, providing detailed root cause analysis (RCA) and implementing preventative measures.
  • Develop and maintain automation scripts, build systems (Bazel), and infrastructure as code (IaC) using tools like Terraform, Ansible, or CloudFormation to eliminate manual tasks and improve system reliability and efficiency.
  • Collaborate closely with software engineering teams to influence the design of new services and applications, ensuring they are scalable, reliable, and resilient from the outset.
  • Participate in on-call rotations to respond to platform emergencies, alerts, and escalations, ensuring high service uptime.
  • Analyze system performance and recommend optimizations for scalability, reliability, and efficiency.
  • Implement and enforce best practices in deployment, monitoring, and incident management to continuously improve overall system reliability and reduce downtime.
  • Develop and maintain internal tools that streamline complex operations, track bugs, manage CI/CD pipelines, and facilitate cross-team communication.
  • Conduct post-incident reviews, documenting software problems and solutions in a shared knowledge base to prevent similar issues in the future.
  • Assist with vulnerability management, system patching, and implementing security measures to protect the integrity and availability of services.

Benefits

  • Competitive compensation packages
  • Comprehensive benefits
  • Supportive and inclusive work environment