Senior Network Reliability Engineer

Group 1001•Zionsville, IN

55d•Remote

About The Position

The Platform Engineering Services team at Group 1001 is building a Site Reliability Engineering practice with a network scope. We're hiring a Senior Network Reliability Engineer who embodies Innovation and Excellence and will apply SRE principles (code as source of truth, SLOs and error budgets, alerting on symptoms rather than causes, failure-mode-first design, and the elimination of toil) to the firm's network platform, from carrier edge through cloud fabric to the Kubernetes pod boundary. This is not a "keep the lights on" role. You will systematically engineer the lights-on work out of existence, build the abstractions that let other engineering teams express network intent in code, and treat the network as a single engineered system rather than a collection of vendor consoles. You will operate inside a DevSecOps practice spanning multi-cloud, multi-region environments, and you will partner closely with Cloud and Data Platforms, the NOC/SOC, and Cyber Security to extend reliability practice across the firm.

Requirements

Deep understanding of TCP/IP, BGP, OSPF, VPNs, and SD-WAN architecture.
Proven experience with Terraform (state management, modules) and Ansible (playbooks, roles), or similar tools, in a production environment.
Proficiency in Python, or a similar language, for automation and API interaction.
Hands-on experience with Cloudflare, Zscaler, and/or enterprise firewalls.
Experience configuring monitoring tools (e.g., Datadog, Prometheus, Grafana) to create meaningful alerts and dashboards.
A strong belief that a job isn't done until the documentation is written.
A mindset that actively seeks to automate repetitive tasks.
Willingness to handle physical hardware tasks when required while maintaining a software-centric engineering mindset.

Nice To Haves

Service mesh experience (Istio, Linkerd, Consul Connect, Cilium).
eBPF-based observability (Hubble, Pixie).
Experience with AWS multi-account landing zone tooling (AFT, Control Tower, or equivalent).
Policy as Code experience (OPA/Rego, Sentinel, Cilium NetworkPolicy).

Responsibilities

Treat reliability as an engineered property. Define SLOs and error budgets for the network platform — DNS resolution, edge availability, mesh ingress success, cross-region path health — and use them to gate changes, not just to color dashboards.
Lead postmortems with a focus on permanent remediation, not pattern-recognition.
Alert on symptoms users feel, not on causes that may or may not produce impact.
Move network state into code. Use Terraform (or Pulumi), Ansible, and Python to replace CLI-driven configuration with declarative, version-controlled, peer-reviewed change running through Infra CI/CD. This applies equally to the edge tier (Cloudflare), security platforms (Zscaler ZIA/ZPA, ZTNA policies, next-gen firewalls), the cloud network fabric (Transit Gateway, Cloud WAN, VPCs, Route53, IPAM), and increasingly the Kubernetes and service-mesh layer.
Build network policy as intent, not rule lists. Express what flows are permitted, what segments are isolated, what egress is inspected, what zones share DNS — and engineer the compilers that turn that intent into per-vendor configuration. Use Policy as Code (OPA/Rego, Sentinel, Cilium NetworkPolicy) to catch invariant violations at plan time, not apply time.
Engineer the cloud network platform. Operate and extend our multi-account AWS Landing Zone — Cloud WAN segmentation, Transit Gateway peering, IPAM-driven CIDR allocation, shared private DNS, cross-account telemetry pipelines. Build the platform abstractions that make a new account or service land correctly with policy and connectivity composed from declarative inputs.
Extend platform thinking into the container tier. This covers Kubernetes networking, service mesh (Istio, Linkerd, Consul Connect), eBPF-based observability and policy (Cilium, Hubble), and the integration points where mesh-level authz meets cloud-tier identity. Recognize that an "internal" service is one logical hop on a chain of policy enforcement points and engineer for that explicitly.
Improve telemetry and observability with intent. Build alerts as structured payloads with runbook links, suspected blast radius, and dependency-aware suppression. Author both system-health dashboards for operators and end-user monitoring dashboards that reflect actual user experience. Use Grafana, Elastic, and OpenTelemetry where each fits.
Mentor and grow the team. Provide technical guidance to junior engineers, foster a culture of learning, and work out loud across Platform Engineering so the patterns you build cross-pollinate to adjacent domains.
Handle hardware when required. Provide maintenance and configuration support for routers, switches, and firewalls at data centers and offices when needed — bringing code-first practices to physical hardware where possible (templating, change validation, zero-touch provisioning) and direct hands-on competence where it isn't.
Serve as an escalation point for network issues, some complex and some basic but not yet covered by runbooks. Troubleshoot with a focus on root cause analysis and permanent remediation, and with a documentation-first mindset.
Reduce toil and hand off cleanly. Treat repetitive operational tasks as scoped engineering problems with measurable payoff. Author runbooks and SOPs that the NOC can execute confidently; package routine work for L1/L2 handoff so the engineering interrupt load drops over time.
Coordinate across Cloud and Data Platforms, NOC/SOC, and Cyber Security so reliability practices spread instead of staying siloed.