Senior/Staff Site Reliability Engineer

Gatik AI • Mountain View, CA

Onsite

About The Position

We are seeking an experienced Senior/Staff Site Reliability Engineer to support the operation, monitoring, and scaling of our growing fleet of autonomous vehicles. In this role, you will work closely with our infrastructure and platform teams to manage rollouts of both on-premises and cloud infrastructure in support of expansions to new customer sites. You will be directly involved in the setup and monitoring of our data offload systems, remote supervision stations, and on-prem continuous integration (CI) environments, ensuring our infrastructure is highly reliable, secure, and optimized for performance. This position plays a critical role in keeping our autonomy operations running smoothly while supporting the rapid growth of our fleet and customer base. This role is onsite 5 days a week at our Mountain View, CA office!

Requirements

5+ years of experience in a related role such as Site Reliability Engineer, DevOps Engineer, or Infrastructure Engineer.
Strong knowledge of networking fundamentals, including protocols, troubleshooting, and optimization.
Hands-on experience with Docker and related ecosystem tools (e.g., Docker Compose, Kaniko).
Expertise in Kubernetes deployments and package management via Helm.
Proficiency with relational and time-series databases (e.g., Postgres, TimescaleDB, InfluxDB).
Familiarity with workflow orchestration tools such as Argo and Airflow.
Proven experience managing upgrades and rollbacks for customer-facing SaaS environments.
Scripting experience in Python and Bash for automation and tooling.
Experience building and maintaining dashboards with tools like Grafana.

Responsibilities

Upgrade and maintain both physical and cloud infrastructure used for offloading data from our autonomous vehicle fleet.
Partner with the infrastructure and platform engineering teams to monitor, maintain, and troubleshoot our on-premises data offload and CI systems.
Design, develop, and maintain business intelligence (BI) dashboards and ETL (extract, transform, load) pipelines to provide actionable insights into our infrastructure performance and health.
Architect and deploy test environments to validate internal and customer-facing infrastructure solutions.
Automate deployment, scaling, and upgrading of our remote monitoring software to ensure operational efficiency.
Perform ongoing analysis of infrastructure performance, identifying opportunities for optimization in latency, throughput, and reliability.
