Senior Network Operations Engineer

Together AI • Posted 3 months ago

$160,000 - $230,000/Yr

Full-time • Senior

San Francisco, CA

101-250 employees

As a Senior Network Operations Engineer at Together AI, you are our front-line responder for break/fix incidents: you own alert triage, collaborate with SRE and MLOps teams, and drive rapid resolution to keep our global network and platform running smoothly. You combine strong operational discipline with hands-on troubleshooting and a bias for automation. Beyond traditional networking, you'll work directly with Kubernetes and Slurm to diagnose issues that span infrastructure, container networking, and HPC job fabrics. You're fluent in routing/switching and network security fundamentals, comfortable on Linux, and thrive in fast-moving environments where clear communication and crisp execution matter. You'll improve monitoring, runbooks, and recovery playbooks to reduce MTTA/MTTR and prevent repeat incidents. Strong problem-solving skills and a solid grasp of fundamental network theory are also critical to your success.

Serve as first responder for network alerts and incidents: assess impact, prioritize, mitigate, and escalate as needed to SRE/MLOps/Network Engineering.
Own end-to-end incident lifecycle: detection, triage, containment, remediation, comms, and post-incident reviews with clear timelines and action items.
Monitor network health and capacity across routing/switching, firewalls, and data center fabrics; tune alert thresholds and dashboards to reduce noise.
Troubleshoot L2-L4 issues (ARP, VLAN/VXLAN/EVPN, routing protocols, ACLs/NAT, DNS, TLS termination, QoS) using packet capture and flow/telemetry tools.
Execute standard changes (MOPs) and emergency changes with rigorous change control and validation; document outcomes and update runbooks.
Operate multi-cluster add-ons (e.g., MetalLB/Traefik/NGINX), observe health via Prometheus/Grafana/Loki, and tune alerts to reduce noise.
Debug CNI/data plane (e.g., VXLAN/EVPN, iptables/nftables, network policies), kube-proxy/iptables mode, CoreDNS, Services (ClusterIP/NodePort/LoadBalancer), and Ingress/Egress.
Maintain accurate network documentation: diagrams, inventories, IPAM, device configs, and topology state.
Improve operational excellence: automate repetitive tasks, enhance self-service tooling, and contribute to SLOs, error budgets, and reliability roadmaps.
Participate in a shared on-call rotation providing 24×7 coverage for critical services.

3+ years in a NOC/Network Operations or Network Support role for large-scale data center or service provider-style environments (hybrid/on-prem + cloud).
Solid understanding of TCP/IP and core protocols: BGP, OSPF/IS-IS, VLAN, VXLAN, EVPN, ACLs/NAT, DHCP, DNS, and QoS.
Proficiency with troubleshooting tools: Wireshark/tcpdump, mtr/traceroute, nmap, curl, iperf; comfortable on Linux for diagnostics and log analysis.
Experience operating multi-vendor networks (e.g., Arista, Cisco, Juniper, NVIDIA/Mellanox) and load balancers/firewalls.
Familiarity with AWS/GCP/Azure networking concepts (VPC/VNet, IGW/NATGW, peering, PrivateLink, routing, security groups).
Strong scripting/automation fundamentals (e.g., Bash/Python), and comfort with Git-based workflows for config versioning and change reviews.
Clear, concise communicator, able to write incident timelines, RCAs, and user-facing updates under time pressure.

Knowledge of RoCE and InfiniBand protocols is a plus.
Hands-on Kubernetes troubleshooting experience is a huge plus: CNI fundamentals (policies, encapsulation), Services/Ingress, DNS (CoreDNS), kube-proxy, and container runtime basics.
Understanding of AI training workloads and the demands they place on networks is a plus.

Competitive compensation
Startup equity
Health insurance
Other competitive benefits
