Together AI • Posted 3 months ago
$160,000 - $230,000/Yr
Full-time • Senior
San Francisco, CA
101-250 employees

As a Senior Network Operations Engineer at Together AI, you are our front-line responder for break/fix incidents: you own alert triage, collaborate with SRE and MLOps teams, and drive rapid resolution to keep our global network and platform running smoothly. You combine strong operational discipline with hands-on troubleshooting and a bias for automation. Beyond traditional networking, you'll work with Kubernetes and Slurm to diagnose issues that span infrastructure, container networking, and HPC job fabrics. You're fluent in routing/switching and network security fundamentals, comfortable on Linux, and thrive in fast-moving environments where clear communication and crisp execution matter. You'll improve monitoring, runbooks, and recovery playbooks to reduce MTTA/MTTR and prevent repeat incidents. Outstanding problem-solving abilities and a solid understanding of fundamental network theory are also critical to your success.

  • Serve as first responder for network alerts and incidents: assess impact, prioritize, mitigate, and escalate as needed to SRE/MLOps/Network Engineering.
  • Own end-to-end incident lifecycle: detection, triage, containment, remediation, comms, and post-incident reviews with clear timelines and action items.
  • Monitor network health and capacity across routing/switching, firewalls, and data center fabrics; tune alert thresholds and dashboards to reduce noise.
  • Troubleshoot L2-L4 issues (ARP, VLAN/VXLAN/EVPN, routing protocols, ACLs/NAT, DNS, TLS termination, QoS) using packet capture and flow/telemetry tools.
  • Execute standard changes (MOPs) and emergency changes with rigorous change control and validation; document outcomes and update runbooks.
  • Operate multi-cluster add-ons (e.g., MetalLB/Traefik/NGINX), observe health via Prometheus/Grafana/Loki, and tune alerts to reduce noise.
  • Debug CNI/data plane (e.g., VXLAN/EVPN, iptables/nftables, network policies), kube-proxy (iptables mode), CoreDNS, Services (ClusterIP/NodePort/LoadBalancer), and Ingress/Egress.
  • Maintain accurate network documentation: diagrams, inventories, IPAM, device configs, and topology state.
  • Improve operational excellence: automate repetitive tasks, enhance self-service tooling, and contribute to SLOs, error budgets, and reliability roadmaps.
  • Participate in a shared on-call rotation providing 24×7 coverage for critical services.
  • 3+ years in a NOC/Network Operations or Network Support role for large-scale data center or service provider-style environments (hybrid/on-prem + cloud).
  • Solid understanding of TCP/IP and core protocols: BGP, OSPF/IS-IS, VLAN, VXLAN, EVPN, ACLs/NAT, DHCP, DNS, and QoS.
  • Proficiency with troubleshooting tools: Wireshark/tcpdump, mtr/traceroute, nmap, curl, iperf; comfortable on Linux for diagnostics and log analysis.
  • Experience operating multi-vendor networks (e.g., Arista, Cisco, Juniper, NVIDIA/Mellanox) and load balancers/firewalls.
  • Familiarity with AWS/GCP/Azure networking concepts (VPC/VNet, IGW/NATGW, peering, PrivateLink, routing, security groups).
  • Strong scripting/automation fundamentals (e.g., Bash/Python), and comfort with Git-based workflows for config versioning and change reviews.
  • Clear, concise communicator, able to write incident timelines, RCAs, and user-facing updates under time pressure.
  • Knowledge of RoCE and InfiniBand protocols is a plus.
  • Hands-on Kubernetes troubleshooting experience is a huge plus: CNI fundamentals (policies, encapsulation), Services/Ingress, DNS (CoreDNS), kube-proxy, and container runtime basics.
  • Understanding of AI training workloads and the demands they exert on networks a plus.
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other competitive benefits