AI Infrastructure Benchmarking and Network Validation Engineer

Cisco · San Jose, CA
$199,700 - $292,800

About The Position

As a key contributor to Cisco’s AI/ML infrastructure initiatives, you will plan, execute, and analyze comprehensive benchmarks on Cisco switches, focusing on throughput, latency, congestion, incast, failover, path diversity, and workload performance to ensure optimal AI/ML network operations. You will guide AI/ML workload deployments from initial scoping and test planning through execution and benchmark analysis, ensuring success criteria are met. You will also develop AI-driven automation workflows to streamline network development, operations, and implementations.

You will define rigorous benchmark methodologies, test plans, KPIs, pass/fail criteria, and reporting structures for AI RoCE Ethernet fabrics, benchmarking fabric performance across critical metrics including latency, throughput, path diversity, ECMP and link utilization, congestion behavior, packet drops, retransmissions, queue occupancy, and recovery behavior. You will run and analyze performance tests using industry-standard tools such as NCCL, RCCL, ib_write_bw, ib_read_bw, ib_send_bw, ib_write_lat, netperf, iperf, MPI, OSU benchmarks, and microburst test methods. You will validate switch ASIC features including buffers, schedulers, QoS/queuing, ECMP behavior, telemetry, hashing, traffic distribution, and congestion visibility.

You will own switch OS configuration and automation, using SONiC, NX-OS, Ansible, Python, Bash, Git, and related tooling to implement and validate advanced features such as SRv6, segment routing, uSID, Adj-SID, and policy-based pathing as required. You will document PoC architecture, benchmark methodologies, topology diagrams, configurations, results, findings, and recommendations.

This role empowers you to shape the future of AI infrastructure networking by delivering scalable, high-performance, and resilient network fabrics that meet the stringent demands of AI/ML workloads, driving innovation and customer success at Cisco.

Requirements

  • Bachelor's degree + 7 years of related experience, or Master's degree + 4 years of related experience.
  • Experience using Python for automation.
  • Experience with L2/L3 network protocols such as BGP, OSPF, EVPN, VXLAN, IPv6, or similar.
  • Experience with traffic-generation tools such as Spirent, Ixia, or similar.
  • Docker or Kubernetes experience.
  • Experience with network testing and validation.

Nice To Haves

  • Clear written and verbal communication skills as well as documentation skills.
  • Experience with SONiC, NX-OS, Linux, or other open-source network operating systems.
  • Deep understanding of leaf-spine fabrics and how to troubleshoot them.
  • Experience with Cisco Nexus Dashboard and related automation tools for provisioning, managing, and troubleshooting the fabric.
  • Experience handling complex network segmentation, security policies, and multi-site fabric designs.
  • Experience with RDMA, RoCEv2, PFC, ECN, congestion control, QoS, buffer behavior, and lossless Ethernet concepts.

Responsibilities

  • Plan, execute, and analyze comprehensive benchmarks on Cisco switches, focusing on throughput, latency, congestion, incast, failover, path diversity, and workload performance to ensure optimal AI/ML network operations.
  • Guide AI/ML workload deployments from initial scoping and test planning through execution and benchmark analysis, ensuring success criteria are met.
  • Develop AI-driven automation workflows to streamline network development, operations, and implementations.
  • Define rigorous benchmark methodologies, test plans, KPIs, pass/fail criteria, and reporting structures for AI RoCE Ethernet fabrics, benchmarking fabric performance across critical metrics including latency, throughput, path diversity, ECMP and link utilization, congestion behavior, packet drops, retransmissions, queue occupancy, and recovery behavior.
  • Run and analyze performance tests using industry-standard tools such as NCCL, RCCL, ib_write_bw, ib_read_bw, ib_send_bw, ib_write_lat, netperf, iperf, MPI, OSU benchmarks, and microburst test methods.
  • Validate switch ASIC features including buffers, schedulers, QoS/queuing, ECMP behavior, telemetry, hashing, traffic distribution, and congestion visibility.
  • Own switch OS configuration and automation, utilizing SONiC, NX-OS, Ansible, Python, Bash, Git, and related tooling to implement and validate advanced features such as SRv6, segment routing, uSID, Adj-SID, and policy-based pathing as required.
  • Document PoC architecture, benchmark methodologies, topology diagrams, configurations, results, findings, and recommendations.

Benefits

  • medical, dental and vision insurance
  • a 401(k) plan with a Cisco matching contribution
  • paid parental leave
  • short- and long-term disability coverage
  • basic life insurance
  • grants of Cisco restricted stock units
  • 10 paid holidays per full calendar year
  • 1 floating holiday for non-exempt employees
  • 1 paid day off for employee’s birthday
  • paid year-end holiday shutdown
  • 4 paid days off for personal wellness
  • 16 days of paid vacation time per full calendar year (non-exempt employees)
  • flexible vacation time off program (exempt employees)
  • 80 hours of sick time off provided on hire date and each January 1st thereafter
  • up to 80 hours of unused sick time carried forward from one calendar year to the next
  • additional paid time away, which may be requested to deal with critical or emergency issues for family members
  • optional 10 paid days per full calendar year to volunteer
  • annual bonuses (for non-sales roles)
  • performance-based incentive pay (for sales roles)