Customer Reliability Engineer

Cloudflare • San Francisco, CA

7d • Hybrid

About The Position

Cloudflare is seeking a Customer Reliability Engineer (CRE) to join its customer-facing engineering organization. This role is crucial for ensuring the reliability of systems that customers depend on, acting as a bridge between traditional support and engineering functions. CREs are responsible for both rapid response to high-severity customer issues and proactive engineering to prevent future crises. The role involves deep debugging across the full stack, driving fixes with Product Engineering, and contributing to product capabilities that identify and resolve customer issues before they escalate. Cloudflare is building CRE as an AI-native function, working with and developing AI agents and tooling to assist in diagnostics and problem-solving. This position offers the opportunity to directly impact product development and hold a high standard of reliability across Cloudflare's customer base, particularly for critical infrastructure users like banks, governments, and media companies.

Requirements

Minimum 5 years of hands-on experience in site reliability engineering, escalation engineering, systems engineering, or a comparable deeply technical support / operations role, with at least 2 years in customer-facing environments.
Strong foundation in networking and security, including TCP/IP fundamentals (OSI model, IPv4/IPv6 addressing, subnetting, routing, switching); core protocols (DNS, HTTP/S, TLS/SSL, SMTP, SNMP, NTP); routing protocols (BGP, OSPF, including path selection and route propagation); firewall concepts (stateful/stateless inspection, rule sets, NAT, ACLs); VPN and encryption (IPsec, SSL/TLS tunnels, GRE); and Zero Trust architecture, network segmentation, and modern security models.
Proficiency with observability and diagnostic tooling: packet capture and analysis (Wireshark, tcpdump), log aggregation (Kibana, Elasticsearch), metrics dashboards (Grafana), distributed tracing.
Strong scripting and automation skills (Bash, Python) with a track record of shipping tooling that improves reliability and reduces toil.
Experience with incident management, postmortem culture, and SLO/SLI-based reliability practices.
Excellent written and verbal communication. Able to convey complex technical information clearly to engineers, leadership, and customers.
Comfort owning ambiguous, cross-layer problems.
Composure under pressure during high-severity incidents.

Nice To Haves

SRE, DevOps, or platform engineering experience with direct customer-facing accountability.
Deep expertise at both L3/L4 (network infrastructure) and L7 (application protocols, DNS, HTTP, WebSocket).
Expert-level proficiency with Linux command-line tools: curl, dig, git, traceroute, mtr, strace, ss.
Data-at-scale analysis using SQL, PromQL, or equivalent.
Familiarity with CI/CD pipelines, infrastructure-as-code (Terraform, Pulumi), and container orchestration (Kubernetes, Docker).
Track record of building internal tooling or diagnostic utilities that measurably improved team efficiency.
Demonstrated technical leadership: mentoring engineers, driving cross-team initiatives, influencing outcomes without direct authority.
Experience applying AI/ML to production engineering or operational workflows.
Comfort engaging directly with enterprise customer engineering teams, including on calls during incidents.
Active Cloudflare user who understands the platform as a practitioner.
Hands-on experience with Workers, Pages, R2, D1, or other developer platform services.
Cloud networking and security experience across AWS, Azure, or GCP.
Web programming (HTML, JavaScript) and regular expressions.
Chaos engineering or formal reliability frameworks (e.g., Google SRE principles).
Managing or configuring non-HTTP services: email, DNS authoritative/recursive, FTP, SSH.

Responsibilities

Rapid incident response and root cause analysis: Own complex, high-severity customer issues end-to-end, from first signal through confirmed resolution. Lead deep-dive debugging across the full stack: edge, network, DNS, transport, APIs, application, customer-side configuration. Reproduce defects, validate fixes with Engineering, and confirm customer-side resolution. Produce postmortems other engineers rely on. Serve on-call for high-severity incidents as part of a global rotation that includes weekends.
Proactive reliability engineering: Analyze support and telemetry signals across the customer base to find systemic risks before they become incidents. Contribute monitoring, detection, and diagnostic capability to the core product and the engineering systems that give Customer Support early visibility into customer-affecting issues. Define customer-facing reliability metrics (error rates, resolution times, repeat-contact rates) and drive measurable improvement. Write automation that reduces mean-time-to-detect and mean-time-to-resolve.
Cross-functional partnership: Manage the technical escalation lifecycle with clear ownership and timely communication. Partner with Product Engineering to drive fixes, workarounds, and configuration changes that address underlying gaps. Represent the customer reliability perspective in engineering syncs, incident reviews, and postmortem processes.
Technical leadership and enablement: Raise the technical floor of Customer Support through pair-debugging, structured knowledge transfer, and shared tooling. Document diagnostic procedures and resolution patterns in runbooks, internal knowledge bases, and AI skills. Share insights from customer-facing incidents to improve product documentation and operational readiness.
Product and platform depth: Maintain deep, current expertise across Cloudflare's product portfolio: edge networking, DNS, CDN, WAF, DDoS mitigation, Zero Trust, Workers, and the broader developer platform. Anticipate customer impact from new releases and architecture changes. Serve as a go-to subject-matter expert in one or more domains.

Benefits

Medical/Rx Insurance
Dental Insurance
Vision Insurance
Flexible Spending Accounts
Commuter Spending Accounts
Fertility & Family Forming Benefits
On-demand mental health support and Employee Assistance Program
Global Travel Medical Insurance
Short and Long Term Disability Insurance
Life & Accident Insurance
401(k) Retirement Savings Plan
Employee Stock Participation Plan
Flexible paid time off covering vacation and sick leave
Leave programs, including parental, pregnancy health, medical, and bereavement leave
Equity