Site Reliability Engineer | Growth and Transformation

Red Ventures | Charlotte, NC
$100,000 - $145,000 | Hybrid

About The Position

The Red Platform - Platform Engineering (RPPE) team at Red Ventures is seeking a Site Reliability Engineer to ensure our platforms and applications are resilient, scalable, and performant. This is a strategic role focused on engineering reliability from first principles: designing observability, automation, and operational practices that prevent failures rather than just responding to them. You'll work in a small, high-impact team managing enterprise-scale systems across AWS, GCP, and Kubernetes environments with strict uptime requirements. This role emphasizes building reliability guardrails, comprehensive monitoring, and automation that enable the organization to operate with confidence and velocity.

Requirements

  • 3–5 years of experience in SRE, DevOps, or cloud infrastructure engineering roles
  • Experience leveraging AI/ML tools to enhance observability, including anomaly detection, alert noise reduction, and predictive incident identification
  • Strong hands-on experience with AWS and GCP cloud platforms
  • Deep Kubernetes expertise (EKS, GKE), including security, networking, and operational best practices
  • Proficiency with infrastructure-as-code using Terraform
  • Experience building and maintaining observability systems (New Relic, Grafana, Prometheus, OpenTelemetry, or similar)
  • Solid understanding of CI/CD pipelines and automated deployment strategies (Harness, Jenkins, GitLab CI, or similar)
  • Strong scripting and automation skills (Python, Bash, Go, or similar languages)
  • Proven track record of maintaining high-availability systems (99.9%+ uptime)
  • Deep understanding of distributed systems, microservices architectures, and scalability patterns
  • Experience with incident management, troubleshooting complex systems, and learning from failures
  • Strong first-principles thinking: the ability to reason from fundamentals rather than relying solely on existing patterns
  • Excellent written and verbal communication skills with the ability to explain complex technical concepts clearly

Nice To Haves

  • Cloud certifications (AWS Solutions Architect, GCP Professional Cloud Architect, or equivalent)
  • Experience with data platform infrastructure (Databricks, Snowflake, or similar)
  • Familiarity with security scanning and remediation tools (Wiz, Aqua, Prisma Cloud, or similar)
  • Knowledge of compliance frameworks (SOC 2, PCI-DSS, HIPAA) and their operational implications
  • Experience with chaos engineering, resilience testing, or systematic failure injection
  • Database performance tuning and optimization expertise (PostgreSQL, MySQL, etc.)
  • Experience with log aggregation and analytics platforms (ELK Stack, Splunk, or similar)
  • Understanding of cloud security, network architecture, and multi-region deployment patterns
  • Familiarity with DLP (Data Loss Prevention) solutions (Netskope, Zscaler, or similar)
  • Background working with regulated industries or highly available consumer-facing applications

Responsibilities

  • Ensure system reliability and performance across multi-cloud, multi-region platforms using first-principles thinking
  • Build and maintain comprehensive observability solutions (OpenTelemetry, New Relic, Grafana, Prometheus) that provide actionable insights into system health and performance
  • Automate infrastructure provisioning and deployments using Terraform and infrastructure-as-code practices
  • Define, implement, and monitor SLOs/SLIs that align with business-critical SLAs and drive accountability for reliability
  • Manage and optimize Kubernetes clusters (EKS, GKE) with a focus on security hardening, performance, and operational excellence
  • Lead incident response efforts, troubleshoot complex system issues, restore service quickly, and conduct thorough root cause analysis
  • Implement preventive measures and reliability improvements based on lessons learned from incidents and system behavior patterns
  • Partner with platform engineers and developers to embed reliability best practices into system architecture and delivery pipelines
  • Proactively scale infrastructure capacity based on growth forecasts and traffic patterns
  • Contribute to architecture reviews with a deep focus on reliability, performance, and operational sustainability
  • Foster a culture of continuous improvement, systematic problem-solving, and operational excellence

Benefits

  • Health Insurance Coverage (medical, dental, and vision)
  • Life Insurance
  • Short and Long-Term Disability Insurance
  • Flexible Spending Accounts
  • Holiday Pay
  • 401(k) with match
  • Employee Assistance Program
  • Paid Parental Bonding Benefit Program
  • Flexible Paid Time Off (PTO): We believe time to rest and recharge is essential. That’s why we offer a generous and flexible PTO policy. Full-time employees accrue 20 days of PTO per full calendar year, increasing to 25 days after five years of service.