About The Position

The Managed Services SRE is responsible for deploying, operating, and maintaining customer applications across Linux bare metal servers and Red Hat OpenShift (OCP) containerized platforms. This role focuses on application deployment, release management, reliability, and operational support in a live production environment. The SRE will participate in on-call rotations and support night-time deployments, ensuring systems meet SLA requirements while continuously improving reliability and automation practices.

Requirements

  • Minimum 5 years of Linux system administration experience
  • Minimum 5 years of Kubernetes (K8s) experience
  • Exposure to Red Hat OpenShift
  • Experience with application servers such as JBoss or WebLogic
  • Experience with monitoring tools: Zabbix, Prometheus, Grafana
  • Experience with logging pipelines: Elasticsearch, Logstash, Kibana (ELK)
  • Exposure to web servers: Apache, Nginx
  • Experience with Ansible
  • Basic networking skills
  • Basic SQL skills
  • Strong troubleshooting skills and ability to operate in a live production environment
  • Willingness to carry a pager for on-call rotations (typically a week at a time)
  • Willingness to support night-time deployments
  • Willingness to work off-hours, including weekends and holidays, in emergencies
  • Willingness to learn on the fly and develop new skills
  • Strong problem-solving skills
  • Excellent communication and teamwork abilities
  • Self-driven, proactive, and willing to take ownership
  • Ability to operate effectively in a fast-paced, SLA-driven environment

Responsibilities

  • Deploy, manage, and maintain applications on Linux bare metal servers and OpenShift/Kubernetes clusters
  • Execute CI/CD pipelines and ensure reliable, repeatable releases across hybrid environments
  • Build and maintain observability for deployed applications using Prometheus, Grafana, Zabbix
  • Implement and maintain centralized logging solutions using Grafana Loki, OpenSearch/Elasticsearch, Fluentd/Fluent Bit
  • Develop automation scripts to streamline deployments and reduce operational toil (Bash, Python, JavaScript)
  • Participate in incident response and troubleshoot application or platform issues in a live production environment
  • Support night-time deployments and carry a pager on rotation; respond to emergencies, including on weekends and holidays
  • Collaborate with internal teams to continuously improve deployment reliability and efficiency
  • Learn new technologies, take direction, and develop skills as needed