Sr. Manager, Observability Platform Engineering

Databricks•Mountain View, CA

About The Position

At Databricks, we are passionate about enabling data teams to solve the world's toughest problems, from making the next mode of transportation a reality to accelerating the development of medical breakthroughs. We do this by building and running the world's best data and AI infrastructure platform so our customers can use deep data insights to improve their business. As the Manager of the Observability Platform team, you will lead the engineers responsible for building and scaling the next generation of Databricks’ global observability systems. Your team enables every Databricks engineer, and our customers, to monitor, diagnose, and improve the reliability of our platform at massive scale. You will guide the strategy, architecture, and execution of systems that handle billions of active time series and process petabytes of logs daily, ensuring world-class visibility into the health and performance of our products.

Requirements

5+ years of experience in the performance analysis discipline. Ability to identify performance issues, root-cause problems, and propose potential solutions.
Ability to build strong working relationships with developers and field engineers to facilitate triaging and mitigation of performance problems.
At least 3 years of experience managing top-tier engineering teams.
BS in Computer Science (Master's or higher level of education preferred).
Expertise in attracting, hiring, and coaching engineers who meet the Databricks hiring standards. Experience up-leveling teams by hiring top-notch talent and growing existing team members.

Responsibilities

Lead the design and development of the next-generation observability platforms that support billions of active time series and process petabytes of logs every day.
Oversee infrastructure deployed across nearly a hundred cloud regions, empowering internal engineers and customers to effectively monitor the reliability and performance of Databricks.
Drive the creation of advanced troubleshooting workflows that accelerate incident diagnosis, enabling engineers to rapidly derive insights from logs, metrics, and other telemetry.
Leverage Databricks’ own data intelligence platform to push the boundaries of observability, setting new standards for industry-leading incident analysis and reliability practices.
Establish and uplevel monitoring and reliability best practices across Databricks engineering by developing opinionated tools and standards for structured logs, metrics, alerts, dashboards, and on-call operations.
Mentor, grow, and inspire engineers, fostering a culture of technical excellence and strengthening the broader observability community within Databricks.
