Responsibilities:
Design and implement solutions that improve the reliability and scalability of AI/ML platforms and applications to meet fast-growing demand.
Partner with product engineering teams to ensure the AI/ML systems are reliable and high-performing.
Develop tools and orchestration for observability, security, automation, and FinOps.
Provide strategic technology leadership by defining and evaluating standards and architecture for reliability, observability, and automation frameworks.
Build strong cross-functional relationships that foster engagement across the organization and deliver solutions to user problems.
Debug and resolve issues in production environments, identifying root causes and remediating them.
Participate in on-call rotations, incident management, and escalation workflows.
Take full ownership of problems, develop solutions, and acquire new knowledge as needed to complete the task.
Mentor and guide junior engineers.
Job Type
Full-time
Career Level
Mid Level
Number of Employees
5,001-10,000 employees