Site Reliability Engineer | Growth and Transformation

Red Ventures•Charlotte, NC

$100,000 - $145,000•Hybrid

About The Position

The Red Platform - Platform Engineering (RPPE) team at Red Ventures is seeking a Site Reliability Engineer to ensure our platforms and applications are resilient, scalable, and performant. This is a strategic role focused on engineering reliability from first principles: designing observability, automation, and operational practices that prevent failures rather than just responding to them. You'll work in a small, high-impact team managing enterprise-scale systems across AWS, GCP, and Kubernetes environments with strict uptime requirements. This role emphasizes building reliability guardrails, comprehensive monitoring, and automation that enable the organization to operate with confidence and velocity.

Requirements

3–5 years of experience in SRE, DevOps, or cloud infrastructure engineering roles
Experience leveraging AI/ML tools to enhance observability, including anomaly detection, alert noise reduction, and predictive incident identification
Strong hands-on experience with AWS and GCP cloud platforms
Deep Kubernetes expertise (EKS, GKE), including security, networking, and operational best practices
Proficiency with infrastructure-as-code using Terraform
Experience building and maintaining observability systems (New Relic, Grafana, Prometheus, OpenTelemetry, or similar)
Solid understanding of CI/CD pipelines and automated deployment strategies (Harness, Jenkins, GitLab CI, or similar)
Strong scripting and automation skills (Python, Bash, Go, or similar languages)
Proven track record of maintaining high-availability systems (99.9%+ uptime)
Deep understanding of distributed systems, microservices architectures, and scalability patterns
Experience with incident management, troubleshooting complex systems, and learning from failures
Strong first-principles thinking, with the ability to reason from fundamentals rather than relying solely on existing patterns
Excellent written and verbal communication skills with the ability to explain complex technical concepts clearly

Nice To Haves

Cloud certifications (AWS Solutions Architect, GCP Professional Cloud Architect, or equivalent)
Experience with data platform infrastructure (Databricks, Snowflake, or similar)
Familiarity with security scanning and remediation tools (Wiz, Aqua, Prisma Cloud, or similar)
Knowledge of compliance frameworks (SOC 2, PCI-DSS, HIPAA) and their operational implications
Experience with chaos engineering, resilience testing, or systematic failure injection
Database performance tuning and optimization expertise (PostgreSQL, MySQL, etc.)
Experience with log aggregation and analytics platforms (ELK Stack, Splunk, or similar)
Understanding of cloud security, network architecture, and multi-region deployment patterns
Familiarity with DLP (Data Loss Prevention) solutions (Netskope, Zscaler, or similar)
Background working with regulated industries or highly available consumer-facing applications

Responsibilities

Ensure system reliability and performance across multi-cloud, multi-region platforms using first-principles thinking
Build and maintain comprehensive observability solutions (OpenTelemetry, New Relic, Grafana, Prometheus) that provide actionable insights into system health and performance
Automate infrastructure provisioning and deployments using Terraform and infrastructure-as-code practices
Define, implement, and monitor SLOs/SLIs that align with business-critical SLAs and drive accountability for reliability
Manage and optimize Kubernetes clusters (EKS, GKE) with a focus on security hardening, performance, and operational excellence
Lead incident response efforts, troubleshoot complex system issues, restore service quickly, and conduct thorough root cause analysis
Implement preventive measures and reliability improvements based on lessons learned from incidents and system behavior patterns
Partner with platform engineers and developers to embed reliability best practices into system architecture and delivery pipelines
Proactively scale infrastructure capacity based on growth forecasts and traffic patterns
Contribute to architecture reviews with a deep focus on reliability, performance, and operational sustainability
Foster a culture of continuous improvement, systematic problem-solving, and operational excellence

Benefits

Health Insurance Coverage (medical, dental, and vision)
Life Insurance
Short and Long-Term Disability Insurance
Flexible Spending Accounts
Holiday Pay
401(k) with match
Employee Assistance Program
Paid Parental Bonding Benefit Program
Flexible Paid Time Off (PTO): We believe time to rest and recharge is essential. That’s why we offer a generous and flexible PTO policy. Full-time employees accrue 20 days of PTO per full calendar year, increasing to 25 days after five years of service.
