Sr Engineer Site Reliability

Optimum
Town of Oyster Bay, NY
Posted 2d ago
$100,246 - $143,208

About The Position

As a Site Reliability Engineer III, you will be a primary driver in the long-term management and stabilization of our Hybrid Cloud infrastructure. We maintain a permanent dual-hosting strategy, operating both Google Cloud Platform (GCP) and a mission-critical On-Premises Unix/Linux footprint. You will bridge the gap between physical hardware and modern cloud-native operations, applying software engineering principles to ensure our systems are scalable, secure, and predictable across all platforms.

The Mission: Hybrid Reliability & Stabilization

Your mission is to unify our GCP and On-Premises environments into a single, reliable platform. Your first 12 months will focus on Stabilization and Observability. You will lead the transition away from "toil" (manual, repetitive operations) toward high-leverage automation, aggressively addressing on-prem technical debt while implementing modern SRE practices across our global data centers and cloud projects.

Requirements

  • OS Internals: Deep proficiency in Linux (RHEL/Ubuntu) and Unix (Solaris/AIX) administration and kernel tuning
  • Cloud Proficiency: Hands-on experience with GCP (IAM, VPC, Compute Engine) or equivalent public cloud providers
  • Infrastructure as Code: Proven ability to manage complex environments using Terraform and Ansible
  • Storage Protocols: Proficiency in Fibre Channel, iSCSI, and NFS. Experience with enterprise arrays (NetApp, Dell/EMC, or Pure Storage) is highly preferred
  • Software Engineering: Strong scripting ability in Python or Go to build internal tools and automation
  • Security: Strong understanding of CVE lifecycles and cryptographic standards (AES-256)
  • Bachelor’s degree in Telecommunications, Computer Engineering, or related discipline
  • 6+ years of experience in IP networking and infrastructure support, with at least 4 years in reliability-focused roles

Responsibilities

  • Hybrid Platform Standardization: Audit, harden, and standardize Unix (Solaris/AIX) and Linux (RHEL/Ubuntu) environments across both GCP Compute Engine and physical bare-metal servers.
  • Infrastructure Stewardship (DC Support): Serve as the engineering lead for our Eastern U.S. data centers; ensure hardware health, power redundancy, and physical security standards are enforced through code and automated checks.
  • Storage Engineering (Specialization): Architect and manage enterprise-grade SAN/NAS environments alongside GCP Cloud Storage/Persistent Disk. Optimize for low latency and high IOPS while ensuring all data at rest complies with our Annual Encryption Strategy.
  • Automation of Toil: Design and maintain robust automation pipelines (Ansible, Terraform, Python) to ensure configuration parity and eliminate drift between cloud and on-premises environments.
  • Vulnerability Management: Transition the fleet from a "vulnerable" state to a "reliable" one by establishing a sustainable, automated monthly patching cadence.
  • Unified Observability: Implement and scale a "single pane of glass" monitoring stack (Prometheus, Grafana, Loki) to provide real-time health metrics for the entire hybrid estate.
  • Incident Response & Post-Mortems: Participate in a sustainable on-call rotation. Lead Blameless Post-Mortems for incidents involving cross-platform dependencies to ensure we "fix the system, not the person."