Network Reliability Engineer

Cloudflare•San Francisco, CA

9d•Hybrid

About The Position

At Cloudflare, we are on a mission to help build a better Internet. Today the company runs one of the world’s largest networks, which powers millions of websites and other Internet properties for customers ranging from individual bloggers to SMBs to Fortune 500 companies. Cloudflare protects and accelerates any Internet application online without adding hardware, installing software, or changing a line of code. Internet properties powered by Cloudflare all have web traffic routed through its intelligent global network, which gets smarter with every request. As a result, they see significant improvement in performance and a decrease in spam and other attacks. Cloudflare was named to Entrepreneur Magazine’s Top Company Cultures list and ranked among the World’s Most Innovative Companies by Fast Company.

At Cloudflare, we’re not looking for people who wait for a polished roadmap; we’re looking for the builders who see the cracks in the Internet that everyone else has simply learned to live with. We value candidates who have the instinct to spot a "normalized" problem and the AI-native curiosity to create a solution using the latest tools. Our culture is built on iteration: we leverage AI to ship faster today and make it better tomorrow, while ensuring that every improvement, no matter how small, is shared across the team to lift everyone up. If you’re the type of person who values curiosity over bureaucracy and sees AI as a partner in solving tough problems to keep the Internet moving forward, you’ll fit right in.

Available Locations: Austin, Atlanta, Denver, Seattle, Washington D.C. (Hybrid)

About the Role

Cloudflare operates a large global network with data centers in hundreds of cities. You will join a team of talented network engineers who are building software solutions to improve network resilience and reduce operational toil.
This position is responsible for the technical operation and engineering of Cloudflare's core data center network, including the planning, installation, and management of its hardware and software as well as the day-to-day operations of the network. The core network supports our critical internal needs such as databases, high-volume logging, and internal application clusters. This is an opportunity to be part of the team that is building a high-performance network that is accessible to any web property online. You will build tools to automate operational tasks, streamline deployment processes, and provide a platform for other engineering teams to build upon. You will nurture a passion for an “automate everything” approach that makes systems failure-resistant and ready to scale. You will also play a key role in system design and be expected to bring an idea from design all the way to production.

Requirements

3 years of relevant Network/Site Reliability Engineering experience
BA/BS in Computer Science or equivalent experience
Solid foundation in configuration management frameworks: SaltStack, Ansible, Chef
Experience with NX-OS, JUNOS, EOS, Cumulus, or SONiC network operating systems
AI-native: able to leverage LLMs to build agentic deployment and troubleshooting tools on top of the Cloudflare stack, automate configurations (SaltStack + Temporal), parse complex log files, and streamline documentation
Solid Linux systems administration experience
Linux networking: iproute2, Traffic Control (tc), devlink, etc.
Strong software development skills in Go and Python

Nice To Haves

Deep knowledge of BGP and other routing protocols
Workflow management (Airflow, Temporal)
Open source routing daemons (FRR, BIRD, GoBGP)
Experience with bare metal switching
Experience with network programming in C, C++, or Rust
Experience with the Linux kernel and Linux software packaging
Strong tooling and automation development experience
Time series databases (Prometheus, Grafana, Thanos, ClickHouse)
Other Tools - Kubernetes, Docker, Prometheus, Consul

Responsibilities

Technical operation and engineering of Cloudflare's core data center network
Planning, installation, and management of the hardware and software, as well as the day-to-day operations of the network
Build tools to automate operational tasks
Streamline deployment processes
Provide a platform for other engineering teams to build upon
Play a key role in system design
Bring ideas from design all the way to production