Software Engineer - Site Reliability Engineer (SRE)

Lovelace AI • Pittsburgh, PA

Posted 131 days ago

About The Position

Lovelace AI is seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our growing team. As an SRE at Lovelace AI, you will play a critical role in ensuring the availability, scalability, and performance of our cutting-edge AI-powered applications and infrastructure. You will bridge the gap between software development and operations, applying sound engineering principles and automation to maintain and improve our systems.

Requirements

5+ years of experience in site reliability engineering, DevOps, systems administration, or related roles.
Proven track record of managing complex infrastructure, troubleshooting production issues, and optimizing system performance in high-scale environments.
Strong experience with Linux/Unix administration and proficiency in scripting languages (e.g., Python, Bash, Go).
Deep understanding of cloud platforms (AWS, GCP, Azure) and related services (e.g., EC2, S3, Lambda, Kubernetes).
Experience with containerization and orchestration technologies like Docker and Kubernetes.
Proficiency with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Dynatrace, ELK Stack).
Strong understanding of networking fundamentals (DNS, HTTP, TCP/IP), load balancing, and CDNs.
Experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) and infrastructure automation.
Familiarity with distributed systems and microservices architecture.
Excellent problem-solving and troubleshooting skills.
Strong analytical skills with the ability to identify Service Level Indicators (SLIs) and align efforts to meet availability and latency objectives.
Ability to balance development and support responsibilities effectively.
Strong interpersonal and communication skills, with the ability to collaborate effectively across teams.
Experience working on projects that involve business segments.
Must be a U.S. citizen.

Responsibilities

Design, implement, and maintain robust monitoring, alerting, and observability solutions to proactively detect and resolve issues before they impact end-users.
Lead troubleshooting efforts for complex production issues, providing detailed root cause analysis (RCA) and implementing preventative measures.
Develop and maintain automation scripts, build systems (Bazel), and infrastructure as code (IaC), using tools such as Terraform, Ansible, or CloudFormation, to eliminate manual tasks and improve system reliability and efficiency.
Collaborate closely with software engineering teams to influence the design of new services and applications, ensuring they are scalable, reliable, and resilient from the outset.
Participate in on-call rotations to respond to platform emergencies, alerts, and escalations, ensuring high service uptime.
Analyze system performance and recommend optimizations for scalability, reliability, and efficiency.
Implement and enforce best practices in deployment, monitoring, and incident management to continuously improve overall system reliability and reduce downtime.
Develop and maintain internal tools that streamline complex operations, track bugs, manage CI/CD pipelines, and facilitate cross-team communication.
Conduct post-incident reviews, documenting software problems and solutions in a shared knowledge base to prevent similar issues in the future.
Assist with vulnerability management, system patching, and implementing security measures to protect the integrity and availability of services.

Benefits

Competitive compensation packages
Comprehensive benefits
Supportive and inclusive work environment

What This Job Offers

Job Type

Full-time

Career Level

Mid Level
