Engineer

TATA Consulting Services • Irving, TX

49d • $90,000 - $110,000

About The Position

We are seeking an expert-level OpenShift Site Reliability Engineer (SRE) to join our OpenShift platform team. This senior role is responsible for the reliability, scalability, and performance of our enterprise-wide Red Hat OpenShift container platform. As an SRE, you will be the final escalation point for complex technical challenges and will combine software engineering and systems administration expertise to build and run our large-scale, distributed, fault-tolerant systems. You will drive the automation, observability, and strategic evolution of the platform to support mission-critical applications.

Requirements

8+ years of overall experience in roles such as Site Reliability Engineering, DevOps, or Linux Systems Engineering.
5+ years of hands-on, intensive experience administering, automating, and troubleshooting Red Hat OpenShift (OCP 4.x preferred) in large-scale production environments.
Proven experience in a senior or lead engineering role, demonstrating ownership of complex projects and mentorship of others.
Expert-Level OpenShift: Deep, authoritative knowledge of OCP installation (IPI/UPI), upgrades, cluster administration, node management, and disaster recovery.
Kubernetes Mastery: A fundamental and deep understanding of Kubernetes architecture and components (etcd, kube-apiserver, scheduler, etc.) and Operators (OLM).
Infrastructure as Code (IaC): Strong proficiency with Ansible and Terraform for automating infrastructure provisioning and configuration management.
Programming/Scripting: Advanced scripting and software development skills in Python or Go, as well as Bash.
Observability: Hands-on experience building and managing monitoring and logging solutions (e.g., Prometheus, Grafana, Thanos, Alertmanager, ELK Stack, Splunk, Fluentd/Vector/OTEL).
CI/CD & GitOps: Expertise with CI/CD tooling (e.g., Tekton, Jenkins, GitLab CI, ArgoCD, GitHub Actions).
Core Infrastructure: Strong proficiency in Linux/RHEL administration, networking (SDN, OVS, routing, firewalls, load balancers), and storage (Ceph, NFS, block storage, object storage).
Analytical Mindset: Exceptional problem-solving skills with the ability to diagnose complex technical issues across multiple platform layers.
Ownership and Accountability: A strong sense of ownership and the drive to see issues through to resolution.
Communication: Excellent communication and interpersonal skills, capable of explaining complex topics to both technical and non-technical audiences.
Composure: Ability to remain calm and effective under pressure during critical incidents.
Willingness to participate in a 24x7 on-call rotation to handle critical platform incidents.

Nice To Haves

Red Hat Certified Specialist in OpenShift Administration (EX280) or Red Hat Certified Specialist in OpenShift Reliability (EX380/EX480). Certified Kubernetes Administrator (CKA) is a strong plus.

Responsibilities

Define and Uphold Reliability Standards: Establish and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for the OpenShift platform and its core services.
Automate Everything: Design, build, and maintain robust automation to handle the full lifecycle of OpenShift clusters, including provisioning, upgrades, patching, scaling, and disaster recovery.
Reduce Toil: Proactively identify and eliminate manual, repetitive operational work by developing and maintaining automation scripts (Python, Go, Bash) and Ansible playbooks.
Incident Response and Root Cause Analysis: Lead high-severity incident response and conduct deep, blameless post-mortems to identify and implement permanent solutions to prevent recurrence.
Proactive Health Management: Develop and implement automated health checks and self-healing capabilities to ensure cluster and application resilience.
Subject Matter Expertise: Serve as the top-tier technical authority for OpenShift Container Platform architecture, networking (OVN-Kubernetes, SDN), load balancing, cross-cluster management, storage (OpenShift Data Foundation/Ceph), and security.
Observability: Architect and manage a comprehensive observability stack (e.g., Prometheus, Grafana, ELK/Fluentd) to provide deep insights into platform and application performance.
CI/CD and GitOps: Engineer and optimize CI/CD pipelines for both platform components and tenant applications, championing GitOps principles for declarative configuration management.
Capacity and Performance: Conduct advanced performance tuning, load testing, and capacity planning to ensure the platform can meet future demand.
