Site Reliability Engineer

InfraCloud Technologies

About The Position

We are seeking a highly skilled Site Reliability Engineer (SRE) to design, build, automate, and operate scalable, secure, and highly available cloud-native platforms. The ideal candidate will have strong expertise in Kubernetes ecosystem technologies, Google Cloud Platform (GCP), Infrastructure as Code (Terraform), GitOps, Observability, Service Mesh, and Secrets Management. The SRE will work closely with Development, Platform Engineering, Security, and DevOps teams to ensure reliability, performance, scalability, and operational excellence across production environments.

Requirements

Container & Kubernetes Ecosystem: Kubernetes (Production-grade administration), Cilium, Istio Service Mesh, Kubernetes Ingress Controllers, Container Networking, Cluster Security and RBAC
Cloud Platforms: Google Cloud Platform (GCP), GKE, Cloud Networking, IAM and Security Controls
Infrastructure as Code: Terraform, Infrastructure Automation, Configuration Management Concepts
Deployment & GitOps: ArgoCD, GitOps Methodologies, GitLab CI/CD Pipelines
Secrets & Service Networking: HashiCorp Vault, Consul
Monitoring & Observability: Prometheus, Prometheus Operator, Grafana, Loki, Tempo, Alloy, Mimir, Pyroscope
Operating Systems & Networking: Linux Administration, TCP/IP, DNS, Load Balancing, SSL/TLS, Network Troubleshooting
5–10+ years of overall infrastructure/platform engineering experience.
3–5+ years of hands-on Kubernetes production experience.
Strong experience in cloud-native platforms, observability, automation, and GitOps-driven operations.

Nice To Haves

Experience managing large-scale Kubernetes platforms.
Experience supporting mission-critical production systems.
Strong understanding of distributed systems concepts.
Knowledge of cloud security best practices.
Experience implementing SRE principles such as SLIs/SLOs/error budgets, capacity planning, incident management, and reliability engineering.
Experience with multi-cluster Kubernetes environments.
Relevant certifications such as Certified Kubernetes Administrator (CKA), Certified Kubernetes Security Specialist (CKS), Google Cloud Professional certifications, and HashiCorp Terraform Associate.

Responsibilities

Design, deploy, and manage large-scale Kubernetes clusters in production environments.
Administer and optimize Kubernetes networking using Cilium, Istio Service Mesh, and Kubernetes Ingress Controllers.
Build highly available and resilient container platforms.
Implement cluster lifecycle management, upgrades, scaling, and capacity planning.
Troubleshoot complex Kubernetes infrastructure and application issues.
Design and operate cloud-native infrastructure on Google Cloud Platform.
Manage services such as GKE (Google Kubernetes Engine), VPC Networking, IAM, Cloud Load Balancers, Cloud Storage, and Monitoring and Logging services.
Ensure security, scalability, and cost optimization of cloud environments.
Implement multi-environment and multi-region deployment strategies.
Develop and maintain reusable Terraform modules.
Automate provisioning and management of cloud infrastructure.
Implement infrastructure standards and governance.
Maintain version-controlled infrastructure repositories.
Ensure repeatable, auditable, and scalable infrastructure deployments.
Create and maintain Helm charts for platform and application deployments.
Standardize deployment practices across teams.
Manage Helm repositories and release strategies.
Support blue-green, canary, and rolling deployment methodologies.
Build and maintain GitOps workflows using ArgoCD.
Automate application deployment pipelines.
Implement environment promotion strategies.
Maintain deployment compliance and auditability.
Drive CI/CD best practices across engineering teams.
Manage secrets, certificates, and application credentials using Vault.
Implement secure secret injection patterns for Kubernetes workloads.
Configure and maintain Consul for service discovery and service networking.
Establish access control and security policies for sensitive workloads.
Build comprehensive observability solutions using Prometheus, Prometheus Operator, Grafana, Loki, Tempo, Alloy, Mimir, and Pyroscope.
Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Create dashboards, alerts, and operational runbooks.
Conduct root cause analysis (RCA) and postmortems.
Improve system reliability, performance, and operational visibility.
Participate in on-call rotations.
Lead incident management during production outages.
Troubleshoot infrastructure, networking, application, and platform issues.
Develop automation to reduce operational toil.
Create disaster recovery and business continuity procedures.
Develop automation scripts and operational tooling.
Improve platform self-service capabilities.
Drive reliability engineering best practices.
Eliminate manual operational processes through automation.