Senior Site Reliability Engineer / API Platform Engineer (AI-First)

ELSA

3d•Remote

About The Position

Join the AI Infrastructure & Platform team to build, operate, and scale the production systems that power ELSA’s APIs, platform services, and AI-enabled applications. This Senior Site Reliability Engineer / API Platform Engineer role bridges software engineering, cloud infrastructure, and operational excellence, requiring a pragmatic, highly productive individual who can use modern AI tools and automation to accelerate delivery and improve reliability. You will collaborate closely with engineering, AI, and product teams to ensure our services are secure, scalable, observable, and resilient in real-world production environments. This is not an AI Engineer role; rather, it is an infrastructure and reliability role for someone who works in an AI-first way and uses AI as a force multiplier in execution, automation, and systems operations.

Requirements

Strong experience in Site Reliability Engineering, DevOps, Platform Engineering, or Infrastructure Software Engineering, with a track record of operating production systems at scale.
Solid experience writing and maintaining production-grade software for live systems and internal platform tooling.
Deep expertise in cloud infrastructure and distributed systems, particularly on AWS, including EKS, EC2, IAM, VPC, CloudWatch, and related services.
Hands-on experience running Kubernetes-based services in production environments.
Strong experience operating APIs and microservices in production, including release workflows, failure recovery, and service hardening.
Hands-on experience with observability and monitoring tools such as Prometheus, Grafana, SigNoz, Sentry, OpenTelemetry, or similar systems.
Strong understanding of CI/CD practices, incident management, production monitoring, and service reliability engineering.
Experience with infrastructure-as-code and automation tooling.
Experience using AI tools and automation as a core part of your engineering workflow to increase productivity, reduce toil, and improve execution quality.
Strong judgment, ownership, and follow-through. You take on hard operational problems and drive them to resolution.

Nice To Haves

Experience supporting AI-powered products, inference services, or ML-adjacent systems in production.
Familiarity with GPU-based workloads and performance optimization for compute-intensive services.
Experience with performance tuning, benchmarking, capacity planning, and load testing.
Experience building internal developer platforms, self-service infrastructure, or reliability tooling.
Familiarity with AI-assisted incident response, automated remediation, or intelligent operational runbooks.
Experience working cross-functionally with AI, product, and engineering teams in fast-moving environments.
Good software engineering fundamentals, including distributed systems, APIs, containerization, and cloud-native deployment.

Responsibilities

Design, build, and operate reliable, scalable infrastructure for APIs, platform services, and AI-enabled applications on AWS and Kubernetes.
Own and enhance CI/CD pipelines, deployment workflows, and operational tooling to enable safe and fast software delivery.
Build and maintain robust observability systems across metrics, logging, tracing, alerting, and service health.
Lead incident response, root cause analysis, postmortems, and remediation efforts to continuously improve production reliability.
Automate repetitive operational work through software, infrastructure-as-code, and AI-assisted workflows.
Use AI-native engineering tools, including copilots, intelligent automation, and agentic operational tooling, to improve debugging, response time, analysis, and team productivity.
Partner with backend, platform, and AI engineering teams to productionize new services and ensure they meet reliability, security, and scalability standards.
Optimize infrastructure and runtime performance across latency, throughput, availability, and cost.
Define and enforce engineering standards for reliability, security, observability, and operational excellence across services.
Contribute production-grade software and internal tools that reduce toil and improve platform leverage across the organization.

Benefits

Flexible work setup: Remote-first for Indonesia, Malaysia, Thailand, and Taiwan; hybrid model for Vietnam.
Comprehensive employee well-being benefits.
Free ELSA Premium courses to polish your language skills.
Collaborative, international team culture.
Opportunity to contribute to a fast-growing, well-funded Silicon Valley startup with global impact.
