About The Position

About Salesforce

Salesforce is the #1 AI CRM, where humans with agents drive customer success together. Here, ambition meets action. Tech meets trust. And innovation isn’t a buzzword; it’s a way of life. The world of work as we know it is changing, and we're looking for Trailblazers who are passionate about bettering business and the world through AI, driving innovation, and keeping Salesforce's core values at the heart of it all. Ready to level up your career at the company leading workforce transformation in the agentic era? You’re in the right place! Agentforce is the future of AI, and you are the future of Salesforce.

The Availability Standards team is part of the overall Salesforce technology organization. We manage the high-level frameworks used to measure platform uptime and performance, bridging the gap between centralized reporting and the individual engineering teams that own specific services. We follow a consultative engineering approach in which our experts partner with service owners to build a deep understanding of service health, telemetry, and automated testing. This expertise allows our team to advocate for the customer and influence the product roadmap by ensuring that every service team has the visibility it needs to maintain world-class availability.

Role Description

The Engineering Availability Standards position is a critical role designed for a seasoned engineering veteran with experience managing, leading, or coordinating with high-scale cloud services. Your mission is to transform how we calculate, visualize, and act on platform health data. You will serve as the technical bridge between our global availability standards and the distributed engineering teams that power our infrastructure. You will be responsible for shifting our monitoring strategy from simple reporting to active, high-fidelity signals that engineering teams use for real-time alerting and incident response.

This role requires the ability to influence technical roadmaps across different product families and to automate the integration of reliability testing and observability into standard software development lifecycles.

Requirements

  • A related technical degree is required.
  • 5+ years of proven experience in production environments (this could include previous experience as a software engineer, systems engineer, service owner, or lead developer).
  • Fluency in Java or a similar object-oriented language (Python, C++, etc.) to provide input on platform requirements and automation.
  • Deep understanding of telemetry systems and experience building or managing production monitoring and alerting frameworks.
  • Experience using Linux environments and the ability to navigate complex, distributed system architectures.
  • Familiarity with core web technologies: HTTP, JSON, REST, and XML.

Nice To Haves

  • Previous experience in a Service Owner or Technical Lead role within a high-scale, multi-tenant cloud environment.
  • Strong background in Site Reliability Engineering (SRE) principles and industry-standard availability best practices.
  • Experience with automated testing frameworks and practices (e.g., Selenium, integration testing, or chaos engineering).
  • Log parsing and data analysis experience using platforms such as Splunk or ELK.
  • Experience with SQL and relational databases (PostgreSQL, Oracle, etc.).
  • Ability to influence technical change across a large, matrixed organization without direct authority.

Responsibilities

  • Utilize software engineering skills and production experience to provide input into long-range platform requirements and operational guidelines, with a focus on making health data actionable for service owners.
  • Analyze and understand how service teams manage their telemetry, and help drive continuous improvement of health signals based on knowledge of specific service architectures.
  • Partner with internal engineering teams to integrate global availability standards into their existing monitoring pipelines, dashboards, and automated alerting flows.
  • Identify and mitigate friction in the onboarding process by leveraging existing automated test suites to create high-quality, streamlined reliability signals with minimal manual effort.
  • Serve as a technical subject matter expert to ensure that centralized infrastructure services (logging, monitoring, and data platforms) are optimized to support the needs of individual service owners.
  • Quarterback the integration of failure signals into standard engineering workflows, ensuring that detected issues result in automated work items and proactive investigations.
  • Deliver presentations highlighting availability metrics, reliability trends, and success stories to diverse engineering and leadership audiences.

Benefits

  • Salesforce offers a variety of benefits to help you live well, including time off programs, medical, dental, vision, mental health support, paid parental leave, life and disability insurance, 401(k), and an employee stock purchasing program.
  • More details about company benefits can be found at the following link: https://www.salesforcebenefits.com.