Maintenance Engineer

Luminai
Hybrid

About The Position

As a Maintenance Engineer at Luminai, you will ensure the reliability, performance, and resilience of the systems that power mission-critical AI workflows. You’ll operate at the core of our production infrastructure, maintaining, monitoring, and continuously improving the systems that healthcare, finance, and telecommunications organizations depend on every day. This is a high-ownership role for someone who thrives on operational excellence, prevents issues before they arise, and takes pride in keeping complex systems running smoothly in high-stakes environments. You’ll work closely with Engineering, Product, and Forward Deployed teams to ensure our deployments are stable, secure, and scalable.

This is a hybrid position. Our team is in-office 3 days a week (Mon, Tue, Thu) in San Mateo, California.

Requirements

  • 3+ years of experience in support engineering, site reliability engineering, or infrastructure maintenance
  • Strong proficiency in Python or another scripting language
  • Experience managing cloud infrastructure (AWS, GCP, or Azure)
  • Strong problem-solving skills and a proactive, preventative mindset
  • Clear communication skills and ability to collaborate across engineering and customer-facing teams
  • High ownership and accountability in high-reliability environments

Responsibilities

  • Monitor, maintain, and improve the reliability of production AI systems and workflow infrastructure
  • Proactively identify, diagnose, and resolve system issues across application, integration, and cloud infrastructure layers
  • Own incident response processes, including root cause analysis and long-term remediation
  • Implement monitoring, alerting, and observability tooling to ensure system health and uptime
  • Collaborate with Engineering to harden deployments and improve system architecture for resilience and scalability
  • Support customer-facing teams by troubleshooting and resolving technical issues in live environments
  • Document system configurations, operational procedures, and recovery protocols
  • Continuously improve reliability standards, deployment practices, and operational safeguards