About The Position

Join our team and contribute to the operational excellence of the Salesforce GovCloud! Are you passionate about ensuring the reliability and performance of mission-critical cloud services? Salesforce is seeking a talented Site Reliability Engineer to join our dynamic team in our Denver, CO, location, supporting our GovCloud environment. As a key member of our Site Reliability organization, you'll play a vital role in maintaining 99.99% uptime for customer-facing services, proactively addressing issues, and ensuring the security of our data. We foster a collaborative and innovative culture, where you’ll work alongside skilled engineers to solve complex problems and drive continuous improvement.

Please Note: This position requires a successful background investigation and the ability to obtain and maintain a specific level of U.S. government background clearance. Details will be provided during the interview process.

Shift Requirements: This role involves shift work, including night shifts, as part of a 24/7 support team. We provide a rotating schedule and ensure adequate compensation for shift differentials.

About the Role: The Site Reliability team at Salesforce is the backbone of our cloud operations, working around the clock to keep our services available and our customers protected. You will be a crucial part of the GovCloud Incident Response (GIR) team, which maintains the current infrastructure through day-to-day alert response, smart hands support, and comprehensive incident management, including retrospectives and long-term remediation.

Requirements

  • Citizenship: U.S. citizen (U.S. born or naturalized) who does not hold dual citizenship.
  • You agree to complete a Minimum Background Investigation (MBI) for a Moderate Public Trust position with the U.S. federal government or other clearances as deemed appropriate for the role.
  • Education: Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field.
  • Experience: Systems engineering experience in an enterprise-scale internet service engineering or support role.
  • Technical Skills: Expertise in TCP/IP related technologies (networking protocols, network programming, etc.).
  • Expertise in CLI enterprise support of Unix variants (Linux/Solaris/BSD), with significant exposure to Red Hat Enterprise Linux and Solaris.
  • Strong understanding of monitoring, security systems, and administration.
  • Experience provisioning, operating, and running AWS/C2S based infrastructure and systems.
  • Proficiency in scripting with Python, Go, or other languages.
  • Communication: Strong written and oral communication skills.
  • Incident Management: Past experience in Incident Management and a good understanding of ITIL service operations.
  • Availability: Ability to participate in a 24/7 on-call rotation supporting large data center operations and be available for shift work.

Nice To Haves

  • Prior experience with Chef/Puppet or automated deployment. (This helps streamline our infrastructure management.)
  • Prior experience with Jenkins/Bamboo/Spinnaker pipeline execution. (This aids in our continuous integration and deployment processes.)
  • Experience supporting and maintaining monitoring and alert systems. (Ensures proactive issue detection.)
  • Experience supporting and maintaining Java applications. (Supports our application stack.)
  • Hands-on experience configuring and running AWS (Amazon Web Services) using the CLI/SDKs. (Essential for our cloud infrastructure.)
  • Certifications such as Linux+, Red Hat, and AWS. (Validates technical expertise.)
  • Experience supporting and leading Kubernetes-based applications and services. (Supports our containerized environment.)
  • Familiarity with Agile processes and DevOps practices. (Enables efficient workflow and collaboration.)
  • Experience participating in blameless retrospectives, learning from incidents, and conducting post-incident investigations, with an interest in how AI can assist in root cause analysis and pattern identification. (Promotes a culture of continuous improvement.)
  • Working knowledge of and interest in resilience engineering, including concepts such as Safety II and proactive problem prevention, leveraging AI for proactive risk identification and system optimization. (Enhances system reliability.)
  • Experience with AI/ML concepts and tools for operational insights, predictive maintenance, or intelligent automation.
  • Familiarity with data analysis and visualization tools to interpret AI-generated insights.

Responsibilities

  • Ensure 99.99% uptime for customer-facing services by proactively monitoring and maintaining the health of supporting systems, contributing directly to customer satisfaction and trust.
  • Act in key support roles during major incidents (e.g., Sev0, Sev1) and participate in technical incident reviews for problem management.
  • Contribute to Problem Management by populating and participating in Root Cause Analyses (RCAs) and handing them off to the Global Solutions team.
  • Ensure all work carried out by the Site Reliability team aligns with the company’s internal compliance policies and directives.
  • Collaborate with technical staff to solve complex technical issues and customer concerns.
  • Lead and mentor other team members in staying abreast of industry innovations and technologies, and assist in team development and growth.
  • Thrive in a fast-paced environment, solving sophisticated issues quickly and successfully balancing multiple priorities.
  • Automate the detection and resolution of recurring issues in the production environment.
  • Help create and improve current processes to reduce operational and engineering toil, including the implementation of AI-driven automation for routine tasks.

Benefits

  • Time off programs
  • Medical, dental, vision, and mental health support
  • Paid parental leave
  • Life and disability insurance
  • 401(k)
  • Employee stock purchasing program