Assoc Eng SRE (70010043)

Optimum
Town of Oyster Bay, NY
Onsite

About The Position

As a Site Reliability Engineer I, you are the frontline engine of our hybrid platform. This role is focused on service continuity and active incident response. You will work shifts to provide support coverage, perform real-time debugging, and keep our GCP and On-Premises Unix/Linux systems running at all times.

The Mission: Real-Time Reliability

Your mission is to maintain 100% platform visibility. You will be the primary responder to our observability stack, moving beyond simple monitoring to active debugging and remediation. You will handle the "heavy lift" of shift-based support calls and system health checks, ensuring that technical debt is addressed and service disruptions are mitigated before they impact the business.

Requirements

  • Bachelor's degree in Telecommunications, Computer Engineering, or related technical field.
  • 0-2 years of experience in mobile network operations or systems engineering roles.
  • OS Internals: Foundational command-line proficiency in Linux (RHEL/Ubuntu) and Unix (Solaris/AIX). Ability to troubleshoot CPU/Memory/Disk bottlenecks.
  • Debugging Skills: Familiarity with log analysis tools (Loki) and the ability to correlate metrics (Prometheus) to find root causes.
  • Cloud & Containers: Basic understanding of GCP (Compute Engine, GKE) and Kubernetes (restarting pods, viewing logs, checking ingress).
  • Kafka Awareness: Basic understanding of Kafka topics and the ability to monitor consumer group health.
  • Automation Exposure: Ability to run and verify Ansible playbooks and Terraform plans.
  • Communication: Excellent verbal communication for handling support calls and providing clear updates during high-pressure incidents.

Responsibilities

  • Shift-Based Support & Triage: Act as the primary technical point of contact during your shift. Manage the support queue, answer urgent infrastructure calls, and provide initial triage for all system anomalies.
  • Active Debugging: Investigate and resolve service issues across the stack. This includes debugging Kubernetes pod failures, resolving Kafka consumer lag, and troubleshooting Unix/Linux system errors using logs (Loki) and traces (Tempo).
  • Hybrid Platform Maintenance: Execute routine standardization tasks and health audits for Unix (Solaris/AIX) and Linux (RHEL/Ubuntu) environments to prevent environment drift.
  • Infrastructure Stewardship (DC Support): Perform on-site "Smart Hands" support in our Bethpage data center, including hardware reboots, component swaps, and verifying physical power/network redundancy.
  • Unified Observability: Maintain the "single pane of glass" (Prometheus/Grafana). Create and tune alerts to ensure the engineering team is notified of critical issues while minimizing "alert fatigue."
  • Escalation & Post-Mortems: Follow strict escalation paths to SRE2/SRE3 leads and assist in complex outage mitigation. Contribute detailed timelines and log data to Blameless Post-Mortems.