Software Engineer, Site Reliability

Fireworks
San Mateo, CA

About The Position

As a Site Reliability Engineer (SRE) at Fireworks AI, you will play a critical role in making our world-scale virtual AI cloud reliable, performant, and efficient. You will apply your expertise in large-scale distributed systems, cloud infrastructure, and operational excellence. You will partner closely with world-class software engineers and AI experts to scale cutting-edge AI platforms to meet fast-growing demand and evolving application paradigms. This role is for someone passionate about operating highly robust, observable, and automated systems and enabling customer success.

Requirements

  • Bachelor's degree in Computer Science or a related technical field, or equivalent practical experience.
  • 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems.
  • Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems.
  • Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services.
  • Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes).
  • Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools such as Prometheus, Grafana, and the ELK stack, along with distributed tracing.
  • Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development.
  • In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging.
  • Proven ability to troubleshoot complex issues across the entire stack.
  • Excellent communication, collaboration, and problem-solving skills.
  • Willingness to participate in on-call rotations.

Nice To Haves

  • Experience managing data-center-grade GPU clusters, including monitoring, troubleshooting, and repairing GPUs and related components such as HBM and RDMA-enabled networking.
  • Experience with machine learning infrastructure, model serving, or distributed AI frameworks.
  • Hands-on experience in security and data protection.

Responsibilities

  • Ensuring System Reliability: Ensure systems are designed and implemented with high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure.
  • Incident Management & Response: Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability.
  • Observability & Monitoring: Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance.
  • Automation & Toil Reduction: Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management.
  • Capacity Planning & Performance Tuning: Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization.
  • Reliability Best Practices: Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence.
  • On-call Rotation: Participate in a periodic on-call rotation to support our production environment and respond to critical alerts.

Benefits

  • Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure, from low-latency inference to scalable model serving.
  • Build What's Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally.
  • Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI: no bureaucracy, just results.
  • Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation.