Platform Engineer II/III

Zone 5 Technologies

About The Position

At Zone 5 Technologies, we're redefining what's possible in unmanned aircraft systems (UAS). Our team of engineers and innovators is developing cutting-edge autonomous solutions that push the boundaries of UAS technology and solve complex challenges that matter. We're building the future of UAS capabilities, and we're looking for exceptional talent to join us. If you're driven by hard problems, energized by rapid innovation, and ready to make an impact on next-generation flight systems, you belong here.

We are seeking a Platform Engineer to architect and operate scalable compute infrastructure that powers our autonomous vehicle simulation and testing framework. You will build elastic compute systems across AWS and on-premises clusters, enabling engineering teams to iterate rapidly on autonomy algorithms through massively parallel simulation workloads.

Requirements

  • Bachelor's degree in Computer Science, Software Engineering, or a related technical field; equivalent industry experience is also welcome
  • 2-5+ years of experience in platform engineering, DevOps, SRE, or cloud infrastructure roles
  • Strong hands-on experience with Kubernetes for container orchestration and workload management
  • Experience with cloud computing platforms and services (compute, storage, networking)
  • Deep understanding of Linux system administration and troubleshooting
  • Strong networking fundamentals including TCP/IP, routing, DNS, VPNs, and security
  • Understanding of infrastructure as code principles and configuration management
  • Proficiency in scripting and automation (Python, Bash, or similar)
  • Experience building and maintaining CI/CD pipelines
  • Solid grasp of distributed systems concepts, job scheduling, and resource management
  • Ability to design infrastructure from first principles and make architectural decisions

Nice To Haves

  • Experience building infrastructure for simulation, robotics, or autonomous systems workloads
  • Understanding of GPU computing and accelerated workload management
  • Knowledge of job scheduling systems for batch and parallel workloads
  • Experience managing on-premises clusters and hybrid cloud architectures
  • Familiarity with robotics middleware (ROS/ROS2) or simulation platforms
  • Understanding of cost optimization for compute-intensive workloads
  • Experience with monitoring, logging, and observability systems
  • Knowledge of containerization technologies and image management
  • Background in data engineering, MLOps, or machine learning infrastructure
  • Experience with network performance analysis and troubleshooting
  • Understanding of software-defined networking and network automation
  • Familiarity with security compliance requirements in aerospace/defense environments

Responsibilities

  • Design and implement auto-scaling compute infrastructure for simulation workloads using cloud platforms
  • Build and maintain on-premises GPU and CPU clusters for simulation and machine learning training
  • Architect hybrid cloud solutions that optimize cost and performance across cloud and local compute resources
  • Implement job scheduling and orchestration systems using Kubernetes for thousands of concurrent simulations
  • Design storage solutions for large-scale simulation data, logs, and artifacts using cloud and local storage systems
  • Deploy and maintain robotics simulation environments at scale
  • Build CI/CD pipelines for automated simulation testing of autonomy software
  • Create infrastructure for distributed parameter sweeps, Monte Carlo testing, and regression suites
  • Develop monitoring and observability systems for simulation fleet health and resource utilization
  • Implement data pipelines for simulation results ingestion, analysis, and visualization
  • Write and maintain infrastructure as code for reproducible infrastructure deployment
  • Build automation tools and CLI utilities to simplify developer access to compute resources
  • Implement GitOps workflows for infrastructure changes and configuration management
  • Create self-service interfaces for engineers to launch and manage simulation jobs
  • Develop cost monitoring and optimization strategies for cloud and on-prem resources
  • Monitor and optimize infrastructure performance, reliability, and cost efficiency
  • Troubleshoot complex distributed systems issues across networking, storage, and compute layers
  • Implement backup, disaster recovery, and business continuity strategies
  • Maintain security best practices including IAM, secrets management, and network isolation
  • Collaborate with autonomy, ML, and robotics teams to understand compute requirements and optimize workflows
  • Design and implement network architectures for distributed simulation workloads across AWS and on-premises environments
  • Configure VPCs, subnets, security groups, and routing for secure, high-performance compute clusters
  • Establish hybrid cloud connectivity (VPN, Direct Connect, site-to-site tunnels) between on-premises and cloud resources
  • Optimize network performance for large data transfers, multi-node communication, and distributed workloads
  • Support internal infrastructure network design and provide technical guidance to engineering programs
  • Troubleshoot network issues including latency, packet loss, and connectivity problems across distributed systems

Benefits

  • Competitive total compensation package
  • Comprehensive benefits package options, including medical, dental, vision, life, and more
  • 401(k) with company match
  • 4 weeks of paid time off each year
  • 12 annual company holidays