Staff Observability Engineer

FanDuel•New York, NY

Hybrid

About The Position

FanDuel is looking for a Staff Observability Engineer to design, build, and mature the observability ecosystem that underpins its platform and services. The role involves delivering deep visibility into system behavior by combining system telemetry with user signals to provide a holistic view of performance, reliability, and user experience. The engineer will also explore how AI and machine learning can enhance observability, from intelligent alerting and anomaly detection to accelerating root cause analysis. This is a hands-on role where the engineer will partner closely with engineering and product teams to deliver scalable observability capabilities, serve as a subject matter expert in monitoring, alerting, and incident management, and equip teams with self-service insights and tooling. By connecting system behavior to real user impact and leveraging AI-assisted workflows to surface issues faster, the engineer will drive improvements in reliability, performance, and data-informed decision-making across the organization. Employees may be required to perform other such duties as assigned by the Company to ensure operational flexibility and meet evolving business needs.

Requirements

Significant hands-on experience in observability engineering, SRE, platform engineering, or related roles, with a track record of driving impact beyond individual teams.
Strong expertise in monitoring and observability, with significant hands-on experience in Datadog.
Experience defining and driving observability or reliability strategy across teams or domains.
Proficiency with Kubernetes, cloud infrastructure (AWS), and infrastructure-as-code tools (Terraform).
Proven ability to influence technical direction and decision-making across multiple teams and stakeholders.
Deep understanding of distributed systems principles (e.g. consistency, availability, partition tolerance) and their real-world trade-offs.
Experience defining and implementing SLOs, SLIs, and alerting strategies, including user-centric and business-aligned metrics.
Strong software engineering fundamentals, with proficiency in at least one modern programming language (e.g. Go, Java, Python, or TypeScript), and the ability to design scalable systems, build tooling and automation, and operate effectively within large, complex codebases.
Experience driving large-scale improvements through automation, reducing organizational toil, and eliminating classes of recurring issues.
Strong analytical skills, with the ability to translate technical signals into business and customer impact.
Excellent communication and stakeholder management skills, with the ability to influence both technical and non-technical audiences.
A mindset of ownership, with a focus on long-term impact, scalability, and continuous improvement.

Nice To Haves

Don’t check all the boxes? That’s okay! We encourage you to still apply if you feel like you possess an adjacent skill set and are interested in learning more about this position.

Responsibilities

Contribute to defining and driving the observability strategy and roadmap across multiple teams, aligning with business priorities and engineering goals.
Design and improve scalable observability capabilities that provide actionable insights into system health, performance, and user experience.
Establish and standardize best practices for monitoring, alerting, incident management, and postmortems across the organization.
Drive operational excellence by evolving incident management, on-call practices, and post-incident learning, ensuring systemic improvements over local fixes.
Lead cross-team initiatives to improve end-to-end reliability, identifying systemic risks and driving their resolution.
Leverage automation and AI-assisted workflows to accelerate root cause analysis and reduce operational toil at scale.
Partner with engineering and product leadership to translate observability insights into strategic roadmap decisions.
Identify trends across system and user signals to proactively detect, prevent, and mitigate large-scale issues.
Optimize observability platforms for cost, scalability, and long-term sustainability.
Mentor engineers and raise the reliability and observability maturity across the organization.

Benefits

Array of health plans to choose from (some as low as $0 per paycheck) that include programs for fertility and family planning, mental health support, and fitness benefits.
Generous paid time off (PTO & sick leave)
Annual bonus and long-term incentive opportunities (based on performance)
401(k) with up to a 5% match
Commuter benefits
Pet insurance
Medical, vision, and dental insurance
Life insurance
Disability insurance
Short-term or long-term incentive compensation, including, but not limited to, cash bonuses and stock program participation.
Paid personal time off
14 paid company holidays
Paid sick time in accordance with all applicable state and federal laws.