Site Reliability Engineer

InfraCloud Technologies

About The Position

We are seeking a highly skilled Site Reliability Engineer (SRE) to design, build, automate, and operate scalable, secure, and highly available cloud-native platforms. The ideal candidate will have strong expertise in Kubernetes ecosystem technologies, Google Cloud Platform (GCP), Infrastructure as Code (Terraform), GitOps, Observability, Service Mesh, and Secrets Management. The SRE will work closely with Development, Platform Engineering, Security, and DevOps teams to ensure reliability, performance, scalability, and operational excellence across production environments.

Requirements

  • Container & Kubernetes Ecosystem: Kubernetes (Production-grade administration), Cilium, Istio Service Mesh, Kubernetes Ingress Controllers, Container Networking, Cluster Security and RBAC
  • Cloud Platforms: Google Cloud Platform (GCP), GKE, Cloud Networking, IAM and Security Controls
  • Infrastructure as Code: Terraform, Infrastructure Automation, Configuration Management Concepts
  • Deployment & GitOps: ArgoCD, GitOps Methodologies, GitLab CI/CD Pipelines
  • Secrets & Service Networking: HashiCorp Vault, Consul
  • Monitoring & Observability: Prometheus, Prometheus Operator, Grafana, Loki, Tempo, Alloy, Mimir, Pyroscope
  • Operating Systems & Networking: Linux Administration, TCP/IP, DNS, Load Balancing, SSL/TLS, Network Troubleshooting
  • 5–10+ years of overall infrastructure/platform engineering experience.
  • 3–5+ years of hands-on Kubernetes production experience.
  • Strong experience in cloud-native platforms, observability, automation, and GitOps-driven operations.

Nice To Haves

  • Experience managing large-scale Kubernetes platforms.
  • Experience supporting mission-critical production systems.
  • Strong understanding of distributed systems concepts.
  • Knowledge of cloud security best practices.
  • Experience implementing SRE principles such as: SLI/SLO/Error Budgets, Capacity Planning, Incident Management, and Reliability Engineering
  • Experience with multi-cluster Kubernetes environments.
  • Relevant certifications such as: Certified Kubernetes Administrator (CKA), Certified Kubernetes Security Specialist (CKS), Google Cloud Professional Certifications, and HashiCorp Terraform Associate

Responsibilities

  • Design, deploy, and manage large-scale Kubernetes clusters in production environments.
  • Administer and optimize Kubernetes networking using: Cilium, Istio Service Mesh, and Kubernetes Ingress Controllers
  • Build highly available and resilient container platforms.
  • Implement cluster lifecycle management, upgrades, scaling, and capacity planning.
  • Troubleshoot complex Kubernetes infrastructure and application issues.
  • Design and operate cloud-native infrastructure on Google Cloud Platform.
  • Manage services such as: GKE (Google Kubernetes Engine), VPC Networking, IAM, Cloud Load Balancers, Cloud Storage, and Monitoring and Logging services
  • Ensure security, scalability, and cost optimization of cloud environments.
  • Implement multi-environment and multi-region deployment strategies.
  • Develop and maintain reusable Terraform modules.
  • Automate provisioning and management of cloud infrastructure.
  • Implement infrastructure standards and governance.
  • Maintain version-controlled infrastructure repositories.
  • Ensure repeatable, auditable, and scalable infrastructure deployments.
  • Create and maintain Helm charts for platform and application deployments.
  • Standardize deployment practices across teams.
  • Manage Helm repositories and release strategies.
  • Support blue-green, canary, and rolling deployment methodologies.
  • Build and maintain GitOps workflows using ArgoCD.
  • Automate application deployment pipelines.
  • Implement environment promotion strategies.
  • Maintain deployment compliance and auditability.
  • Drive CI/CD best practices across engineering teams.
  • Manage secrets, certificates, and application credentials using Vault.
  • Implement secure secret injection patterns for Kubernetes workloads.
  • Configure and maintain Consul for service discovery and service networking.
  • Establish access control and security policies for sensitive workloads.
  • Build comprehensive observability solutions using: Prometheus, Prometheus Operator, Grafana, Loki, Tempo, Alloy, Mimir, and Pyroscope
  • Define and implement: Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets
  • Create dashboards, alerts, and operational runbooks.
  • Conduct root cause analysis (RCA) and postmortems.
  • Improve system reliability, performance, and operational visibility.
  • Participate in on-call rotations.
  • Lead incident management during production outages.
  • Troubleshoot infrastructure, networking, application, and platform issues.
  • Develop automation to reduce operational toil.
  • Create disaster recovery and business continuity procedures.
  • Develop automation scripts and operational tooling.
  • Improve platform self-service capabilities.
  • Drive reliability engineering best practices.
  • Eliminate manual operational processes through automation.