Senior Software Engineer, Observability

CoreWeave · Sunnyvale, CA
45d · Hybrid

About The Position

We are seeking senior engineers who specialize in all pillars of Observability to play a pivotal role in CoreWeave's ability to bring the best-performing systems to the market. Whether you focus on metrics, logging and tracing, or pipelines and visualization, you'll have an outsize opportunity to enable both CoreWeave and its customers to understand, troubleshoot, and optimize complex systems at the forefront of Artificial Intelligence.

Some of what you'll work on:

  • Modernize logging platforms at cloud scale.
  • Design and execute migrations that are transparent to platform consumers.
  • Build governance mechanisms that empower CoreWeavers to effectively manage the telemetry their services produce and adopt best practices.
  • Develop and enforce best practices regarding the health of telemetry ETL pipelines.
  • Improve the performance, security, reliability, and scalability of observability services while participating in the team's on-call rotation.

Requirements

  • Six or more years of experience in software or infrastructure engineering, with a proven track record of designing, building, and operating large-scale distributed systems in production.
  • Proficiency in Go (our primary language) or Python, with a strong ability to write clean, resilient, and testable code for production-grade software.
  • Hands-on production Kubernetes experience (non-negotiable), including familiarity with containerization and microservices architectures and an understanding of their observability challenges.
  • Proven track record of designing, building, and delivering robust and scalable production systems. A commitment to operational excellence, writing high-quality code and implementing best practices for system reliability, including effective testing and progressive release strategies.
  • Ability to analyze and decompose complex problems in elastic architectures into manageable tasks.
  • Comfortable with Helm and YAML configuration for deploying and managing services, including templating, automation, and infrastructure-as-code practices.
  • A customer-obsessed mindset, eager to provide infrastructure as a service and apply a product lens when evaluating platform scale problems.
  • Experience participating in an on-call rotation for critical production systems.

Nice To Haves

  • Direct, hands-on experience designing, operating, or scaling logging, tracing, and/or metrics platforms (e.g., Loki, ClickHouse, Elasticsearch, Prometheus, VictoriaMetrics, Grafana, Thanos).
  • Familiarity with data streaming systems (e.g., Kafka, Kafka Connect) for observability pipelines.
  • Experience automating and provisioning infrastructure as part of the software development lifecycle, using tools like Terraform.
  • Knowledge of Linux systems, shell scripting, and the Linux storage and networking stacks.
  • Experience with OpenTelemetry for unified telemetry collection.
  • Interest in contributing to open source projects.

Responsibilities

  • Design, build, and own core observability infrastructure, including highly scalable and reliable logging, metrics, and tracing platforms.
  • Develop and implement scalable, high-throughput telemetry pipelines that ingest, transform, and expose observability data, ensuring high reliability, security, and transparent data migrations for platform consumers.
  • Establish and build governance mechanisms and best practices to empower CoreWeave engineers to effectively manage the telemetry their services produce, fostering effective usage patterns and a self-service model.
  • Continuously improve the performance, security, reliability, and scalability of observability services through software enhancements and new feature development.
  • Participate in the team's on-call rotation to support critical production systems, focusing on root cause analysis and building durable solutions to prevent future incidents.
  • Collaborate closely with internal engineering teams, applying a platform-as-a-product mindset to understand their needs and embed observability best practices and custom tooling into their systems.
  • Contribute to the overall observability strategy, influencing the direction of our platform.

Benefits

  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption

What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Industry: Professional, Scientific, and Technical Services
  • Education Level: No Education Listed
  • Number of Employees: 501-1,000 employees
