Site Reliability and Infrastructure Engineer

Treeswift Inc
New York, NY
Hybrid

About The Position

Help us scale and harden the platform that schedules our pipelines, runs machine learning training, and hosts our web app. We run Apache Airflow on Astronomer with DAGs that orchestrate high-volume processing across AWS and Kubernetes, including machine learning inference inside pipeline tasks. You will build the observability and reliability foundations that let us run this system confidently as customer data volume grows: monitoring, alerting, performance/cost visibility, and clear operational practices.

We stay curious, collaborative, and cross-functional while taking ownership of problems. We translate complex, real-world requirements from a critical industry into high-quality data products, so understanding the business holistically is key. We take pride in managing complexity and providing high-fidelity data that our customers can use to make better-informed decisions.

You’ll be our first full-time SRE/infrastructure engineer, so we’ll look to you for leadership on how to improve and scale our infrastructure to support each part of the platform. Our data pipeline, machine learning training platform, and web app would all benefit from further productionization.

Requirements

  • You are an experienced software engineer with 7-10 years of experience, a significant share of it spent on observability, systems/infrastructure engineering, SRE, or DevOps (ideally in a cloud environment).
  • Ability to reason about architecture end-to-end and articulate your thoughts with product impact in mind (data movement, execution, failure handling, and operational visibility).
  • Hands-on experience with infrastructure-as-code (Terraform or similar) and a track record of using it to deliver reliable environments.
  • Experience with container orchestration and debugging in practice (Kubernetes and/or ECS/container-based deployments).
  • Strong Linux debugging skills and demonstrated ability to investigate production issues with logs/metrics and clear hypotheses.
  • Empathy and communication: you can collaborate effectively with engineers across teams (especially the data platform team) and explain tradeoffs clearly.

Nice To Haves

  • Experience working in early-stage or fast-moving environments where ownership and processes evolve quickly.
  • Experience with Apache Airflow and/or Astronomer.
  • Experience with AWS, though experience with other cloud providers also counts. (DuploCloud experience is a plus.)
  • Experience with geospatial/imagery/lidar/point-cloud style domains.
  • ML Ops skills (model deployment/inference reliability, packaging, CI/CD for model artifacts, and operational observability for inference pipelines).

Responsibilities

  • Partner with the data platform and engineering teams to understand how changes propagate across pipeline execution (Astronomer-hosted Airflow DAGs), containerized workers (Kubernetes), and AWS services (S3, SQS, Lambda, Step Functions, ECS).
  • Design and implement reliability and observability for high-volume pipeline operations (for a flavor of this work, see the sketches after this list), including:
      ◦ actionable monitoring/alerting for DAG/task failures and reruns
      ◦ visibility into operational workflows like flight orchestration (including DLQ/failed-message alerting and notification pathways)
      ◦ dashboards and SLO/SLI definitions focused on correctness, throughput, and pipeline health
  • Own CI/CD guardrails for production changes: build/deploy validation and safe rollout mechanics for Astronomer deployments (image builds pushed to ECR, and Airflow configuration updates via Astronomer CLI variable updates)
  • Make machine learning inference operations more reliable and observable (see the final sketch below), including:
      ◦ instrumenting inference runs executed inside pipeline runners (model checkpoint resolution, S3 sync behavior, thresholds and fallback behavior, and output correctness)
      ◦ adding operational visibility for inference outcomes (e.g., unknown classification rates, fallback usage, and failure modes)
  • Create operational tooling and continuously improve systems (‘leave it better than you found it’), including:
      ◦ runbooks, incident learnings, and engineering standards for debugging at scale
      ◦ automating away toil in deployment and operations workflows as we learn what hurts most
  • On-call / incident response: there is not currently an established on-call rotation for this platform, and the pipelines do not require real-time processing. That said, you’ll still help lead reliability improvements and operational readiness, so the team has faster diagnosis, better alerts, and safer releases when issues do occur.
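To give a concrete flavor of the DAG/task-failure alerting mentioned above, here is a minimal sketch of an Airflow failure callback. The DAG name, the placeholder task, and the SLACK_WEBHOOK_URL environment variable are illustrative assumptions, not our actual setup:

```python
import json
import os
import urllib.request
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Assumed to be injected as a deployment secret; purely illustrative.
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]


def notify_failure(context):
    """Post a minimal task-failure alert with a link back to the task logs."""
    ti = context["task_instance"]
    payload = {
        "text": f"Task failed: {ti.dag_id}.{ti.task_id} "
                f"(try {ti.try_number}) | logs: {ti.log_url}"
    }
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)


with DAG(
    dag_id="example_pipeline",  # hypothetical DAG
    start_date=datetime(2024, 1, 1),
    schedule=None,
    # Fire the callback whenever any task in this DAG fails.
    default_args={"on_failure_callback": notify_failure},
):
    EmptyOperator(task_id="placeholder_task")
```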
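Similarly, DLQ/failed-message alerting might start as simply as a CloudWatch alarm on queue depth. The queue name and SNS topic ARN below are hypothetical, and in practice this would more naturally live in Terraform; it is shown via boto3 only to keep the sketch in one language:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm as soon as any message lands in the (assumed) flight-orchestration DLQ.
cloudwatch.put_metric_alarm(
    AlarmName="flight-orchestration-dlq-nonempty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "flight-orchestration-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder ARN
    # An empty DLQ reports no data; treat that as healthy so the alarm doesn't flap.
    TreatMissingData="notBreaching",
)
```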
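And for inference-outcome visibility, a pipeline runner could publish custom metrics such as unknown-classification rate and fallback usage. The namespace, dimensions, and helper function here are illustrative only, not an existing interface:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def report_inference_outcomes(total, unknown, fallbacks, model_version):
    """Publish per-run inference health metrics under a custom namespace."""
    unknown_rate = unknown / total if total else 0.0
    cloudwatch.put_metric_data(
        Namespace="Pipeline/Inference",  # assumed custom namespace
        MetricData=[
            {
                "MetricName": "UnknownClassificationRate",
                "Dimensions": [{"Name": "ModelVersion", "Value": model_version}],
                "Value": unknown_rate,
                "Unit": "None",
            },
            {
                "MetricName": "FallbackInvocations",
                "Dimensions": [{"Name": "ModelVersion", "Value": model_version}],
                "Value": float(fallbacks),
                "Unit": "Count",
            },
        ],
    )
```

Dimensioning by model version makes it straightforward to alarm on a regression introduced by a new checkpoint rather than on aggregate noise.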