Site Reliability Eng, Storage (70010039)

Optimum
Town of Oyster Bay, NY

About The Position

As a Site Reliability Engineer II, you will be a primary driver in the long-term management and stabilization of our Hybrid Cloud infrastructure. We maintain a permanent dual-hosting strategy, operating both Google Cloud Platform (GCP) and a mission-critical On-Premises Unix/Linux footprint. You will bridge the gap between physical hardware and modern cloud-native operations, applying software engineering principles to ensure our systems are scalable, secure, and predictable across all platforms.

The Mission: Hybrid Reliability & Stabilization

Your mission is to unify our GCP and On-Premises environments into a single, reliable platform. Your first 12 months will focus on Stabilization and Observability. You will lead the transition away from "toil" (manual, repetitive operations) toward high-leverage automation, aggressively addressing on-prem technical debt while implementing modern SRE practices across our global data centers and cloud projects.

Requirements

  • OS Internals: Deep proficiency in Linux (RHEL/Ubuntu) and Unix (Solaris/AIX) administration and kernel tuning.
  • Cloud Proficiency: Hands-on experience with GCP (IAM, VPC, Compute Engine) or equivalent public cloud providers.
  • Infrastructure as Code: Proven ability to manage complex environments using Terraform and Ansible.
  • Storage Protocols: Proficiency in Fibre Channel, iSCSI, and NFS. Experience with enterprise arrays (NetApp, Dell/EMC, or Pure Storage) is highly preferred.
  • Software Engineering: Strong scripting ability in Python or Go to build internal tools and automation.
  • Security: Strong understanding of CVE lifecycles and cryptographic standards (e.g., AES-256).

Nice To Haves

  • Bachelor's degree in Telecommunications, Computer Engineering, or related technical field.
  • 2–4 years of experience in mobile network operations or systems engineering roles.

Responsibilities

  • Storage Engineering (Specialization): Architect and manage enterprise-grade SAN/NAS environments alongside GCP Cloud Storage/Persistent Disk. Optimize for low latency and high IOPS while ensuring all data at rest complies with our Annual Encryption Strategy.
  • Hybrid Platform Standardization: Audit, harden, and standardize Unix (Solaris/AIX) and Linux (RHEL/Ubuntu) environments across both GCP Compute Engine and physical bare-metal servers.
  • Infrastructure Stewardship (DC Support): Serve as the engineering lead for our Eastern U.S. data centers; ensure hardware health, power redundancy, and physical security standards are enforced through code and automated checks.
  • Automation of Toil: Design and maintain robust automation pipelines (Ansible, Terraform, Python) to ensure configuration parity and eliminate drift between cloud and on-premises environments.
  • Vulnerability Management: Transition the fleet from a "vulnerable" state to a "reliable" one by establishing a sustainable, automated monthly patching cadence.
  • Unified Observability: Implement and scale a "single pane of glass" monitoring stack (Prometheus, Grafana, Loki) to provide real-time health metrics for the entire hybrid estate.
  • Incident Response & Post-Mortems: Participate in a sustainable on-call rotation. Lead Blameless Post-Mortems for incidents involving cross-platform dependencies to ensure we "fix the system, not the person."