Staff Engineer, Network Observability

CoreWeave
Sunnyvale, CA
$207,000 - $275,000
Onsite

About The Position

The Network Observability team is responsible for how CoreWeave observes, understands, and operates its network at scale. As a Staff Engineer for Network Observability, you will define and evolve the technical direction for network observability, partnering across Network Engineering, SRE, Platform, and adjacent infrastructure teams to build resilient telemetry systems, raise engineering standards, and turn observability into a strategic advantage for the business. Your mission: build and scale a network observability platform that gives CoreWeave fast, trustworthy insight into network behavior, enables proactive risk detection, improves how engineering teams make decisions during both normal operations and incidents, and supports closed-loop automation workflows.

Requirements

  • Deep expertise in building flexible network observability solutions, including familiarity with multiple implementation options for collection, distribution, processing, persistence, alerting, analytics, and visualization
  • Experience as a Network Engineer, SRE, Software Engineer, or Systems Engineer in large-scale environments, with a strong track record of building and operating observability or infrastructure platforms that support multiple teams
  • Demonstrated ability to lead through ambiguity, shape technical direction, and make sound architectural and operational tradeoffs that balance immediate needs with long-term maintainability
  • Strong systems thinking and practical experience designing resilient, scalable solutions that improve visibility, incident response, and engineering efficiency
  • Proven ability to work across teams and functions, influence without formal authority, and build trust with both technical and non-technical stakeholders
  • Proficient with Python, Go, and Bash, plus familiarity with configuration management and templating tools such as Ansible and Jinja2
  • Comfortable containerizing and operating solutions in Kubernetes, including designing, building, and deploying container-based workloads efficiently
  • Strong knowledge of Linux systems and IP networking concepts, with hands-on experience in routing, switching, and network troubleshooting
  • Practical experience with networking platforms such as SONiC, HPE Junos, NVIDIA Cumulus Linux, Nokia SR OS, and SR Linux
  • A strong mentorship mindset and a history of helping other engineers grow through coaching, design feedback, documentation, and technical leadership

Nice To Haves

  • Bachelor's degree in Computer Science or a related field
  • Hands-on experience applying machine learning techniques or tools to proactively detect performance or security anomalies in network traffic
  • Experience with OpenTelemetry, Jaeger, Zipkin, or similar tooling for end-to-end tracing across distributed systems and infrastructure components
  • Experience shaping technical roadmaps, setting standards, or leading platform investments that materially improved reliability or scalability across multiple teams
  • Network certifications such as CCNA, CCNP, or similar

Responsibilities

  • Set technical direction for network observability across multiple teams, ensuring the platform, data models, and telemetry strategy align with long-term engineering and business goals
  • Lead the design and evolution of scalable observability solutions using diverse collector technologies (e.g., gNMI, SNMP, Prometheus scraping, OTEL, logs, flows), persistence databases (e.g., Prometheus-like, Loki, ClickHouse), and visualization and alerting tools (e.g., Grafana, Alertmanager), with a strong focus on reliability, usability, and future scale
  • Drive cross-team initiatives to standardize observability patterns, improve signal quality, and create a consistent approach to logs, metrics, events, flows, and related diagnostics across the network stack
  • Partner closely with engineering leadership and technical stakeholders to prioritize investments, navigate ambiguity, and make high-leverage technical tradeoffs that improve resilience, scalability, and operator efficiency
  • Act as a go-to technical expert for critical observability challenges, especially when incidents, architectural complexity, or unclear ownership require strong judgment and coordination
  • Mentor junior and senior engineers through technical reviews, design guidance, and hands-on problem solving, raising the bar for engineering quality and multiplying the impact of the broader team
  • Participate in design discussions, RFCs, and architectural decisions across the broader infrastructure organization, helping teams converge on scalable, maintainable solutions
  • Join a rotating on-call schedule as a senior escalation point for observability-related issues, helping teams quickly isolate failures, improve incident response, and drive durable follow-through after outages

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption