SITE RELIABILITY ENGINEER

United States Cold Storage Inc / USCS
Camden, NJ
Hybrid

About The Position

The Site Reliability Engineer (SRE) will be a founding member of US Cold’s SRE practice, which aims to move the organization from reactive operations to engineered reliability. The role focuses on studying critical system failures, particularly in the Phenix WMS and facility automation interfaces, and on designing controls, automation, and observability to reduce incidents. Success will be measured by fewer false alerts, faster recovery, less manual intervention, and more self-healing systems. The SRE will collaborate with application, infrastructure, and operations teams and will participate in on-call rotations and incident response. This is a hands-on role in which improvements directly affect daily warehouse operations.

Requirements

  • 3+ years of experience in SRE, DevOps, Systems Engineering, or related roles
  • Strong Linux and Windows systems administration and troubleshooting skills
  • Hands‑on experience with automation and scripting
  • Experience designing and operating monitoring, alerting, and observability solutions
  • Practical experience working in Azure environments
  • Strong analytical skills and a bias toward eliminating root causes, not symptoms
  • Ability to collaborate across application, infrastructure, and operations teams
  • Experience supporting warehouse management systems or industrial automation platforms
  • Exposure to Kubernetes, microservices, or container orchestration
  • Hands‑on experience with infrastructure‑as‑code tools such as Terraform or Ansible
  • Understanding of distributed systems and high‑availability design
  • Experience with SRE practices such as SLO‑based operations, runbook automation, or chaos testing

Responsibilities

  • Reliability of the Phenix WMS and its integration with facility automation systems (robotics, conveyors, and control interfaces)
  • Definition and implementation of SLIs and SLOs that measure meaningful system health, not just availability
  • Observability across the full stack, correlating cloud services, APIs, and on‑premise facility operations
  • Automation to eliminate operational toil, including patching, data corrections, restarts, and recovery tasks
  • Development of self‑healing behaviors for common failure modes
  • Participation in on‑call rotations and leadership of blameless post‑incident reviews
  • Design and execution of disaster recovery tests across SaaS, cloud, and on‑premise environments