Principal Operations Engineer

Salesforce • New York, NY

1d • $197,300 - $344,700

About The Position

Salesforce's Digital Enterprise Technology (DET) organization is establishing a new, engineering-first operations function. This function aims to transition the entire organization from reactive, manual processes to automated, intelligent, and proactive operations at scale. As a Principal Operations Engineer focused on Operational Excellence, you will be a foundational technical leader in this team. Your role will involve defining how DET detects, responds to, and prevents issues, while simultaneously eliminating toil and enhancing the reliability of critical, customer-facing systems. This is a high-visibility, high-impact position for an individual eager to influence not only what is built but also how an organization operates.

Requirements

12+ years of experience in engineering, operations engineering, SRE, or related roles.
Proven track record of automating complex operational workflows and improving reliability and operational maturity at scale.
Deep expertise in incident management systems, observability (metrics, logging, tracing), and distributed systems and microservices.
Strong experience with automation frameworks, scripting, Infrastructure as Code, and modern DevOps practices.
Experience operating high-availability, customer-facing systems in enterprise environments.
Strong written and verbal communication skills with the ability to influence senior engineering leaders and drive outcomes across teams without formal authority.
A related technical degree is required.

Nice To Haves

Experience building self-service or platform-based operational tooling.
Background in automation-driven operations or platform engineering.
Experience leading large-scale incident management transformations.
Familiarity with AI/ML-driven operations (AIOps).
Experience in SaaS/PaaS enterprise environments.
Salesforce ecosystem experience (Apex, LWC, APIs, etc.).

Responsibilities

Lead the design and implementation of automation-first operations, eliminating manual workflows across incident management, alerting, escalation, runbooks, and day-to-day operational processes.
Build and scale alert-to-incident automation pipelines to accelerate detection and response times.
Identify and prioritize high-impact toil reduction opportunities across the ecosystem.
Drive adoption of self-healing systems and automated remediation patterns.
Provide Tier 2+ advanced application support for complex production issues and lead deep-dive investigations into system failures.
Drive a culture of automation-first thinking, ownership, accountability, and continuous improvement.
Lead the evolution from reactive incident response to proactive reliability engineering, improving MTTD, MTTR, and the percentage of incidents detected automatically.
Serve as a key technical leader in incident management, escalation strategy, and post-incident analysis.
Establish and enforce SLI, SLA, and SLO frameworks across critical Tier-1 services.
Drive deep understanding of system dependencies and failure modes.
Architect operational strategies with a focus on customer intent, experience, and outcomes.
Identify and prioritize critical user journeys, ensuring they are observable, reliable, and performant.
Align operational priorities with business impact.
Partner with stakeholders to define and execute quarterly and annual operational roadmaps (OKRs).
Translate business needs into scalable operational capabilities, balancing reliability, speed, and cost efficiency.