Senior Site Reliability Engineer

Walmart, Sunnyvale, CA
$117,000 - $234,000, Onsite

About The Position

Position: Senior Site Reliability Engineer
Job Location: 1375 Crossman Avenue, Sunnyvale, CA 94089

Salary Range: $117,000 to $234,000. Additional compensation includes annual or quarterly performance incentives.

Wal-Mart is an Equal Opportunity Employer.

Walmart and its subsidiaries are committed to maintaining a drug-free workplace and have a no-tolerance policy regarding the use of illegal drugs and alcohol on the job. This policy applies to all employees and aims to create a safe and productive work environment.

About Walmart: Fifty years ago, Sam Walton started a single mom-and-pop shop and transformed it into the world's biggest retailer. Since those founding days, one thing has remained consistent: our commitment to helping our customers save money so they can live better. Today, we're reinventing the shopping experience and our associates are at the heart of it. You'll play a crucial role in shaping the future of retail, improving millions of lives around the world. This is that place where your passions meet purpose. Join our family and create a career you're proud of.

Requirements

  • Master’s degree or equivalent in computer science, computer engineering, computer information systems, software engineering, or related area and 1 year of experience in site reliability engineering, site and system administration, infrastructure management, or related area; OR Bachelor's degree or equivalent in computer science, computer engineering, computer information systems, software engineering, or related area and 3 years of experience in site reliability engineering, site and system administration, infrastructure management, or related area.
  • Experience designing and implementing performance test strategies for complex web, mobile, API, and backend systems for Jira and Confluence Data Center instances.
  • Experience building and maintaining automated performance test scripts using tools including JMeter, Gatling, LoadRunner, and k6.
  • Experience performing root cause analysis of performance issues in production and test environments for Jira and Confluence Data Center instances, identifying CPU, memory, database, thread, and network bottlenecks.
  • Experience monitoring system health, performance, and usage using tools including Grafana, Splunk, and Dynatrace, and ensuring compliance with internal SLAs.
  • Experience designing and implementing observability (monitoring, logging, alerting) and ensuring SLAs and SLOs are met.
  • Experience designing, implementing, and supporting large-scale Jira Software, Jira Service Management, and Confluence instances.
  • Experience performing upgrades, patching, plugin management, and performance tuning for Atlassian platforms.
  • Experience integrating enterprise platforms with CI/CD pipelines and observability tools to automate workflows, improve incident response, and enhance system reliability.
  • Experience managing infrastructure components including Linux servers, databases, and storage supporting Atlassian tools in both on-prem and cloud environments.
  • Experience working with scripting languages including Groovy, Bash, and PowerShell to automate tasks on Linux and Windows.
  • Experience implementing and maintaining backup, recovery, and disaster recovery plans for Atlassian tools.
  • Employer will accept any amount of experience with the required skills.

Responsibilities

  • Detect and document defects, bugs, and errors for the assigned component/module, and conduct analysis to determine their sources, under guidance.
  • Troubleshoot performance and availability bottlenecks for the assigned application under guidance.
  • Use established criteria (for example, probability of failure and frequency of failure) to measure site reliability.
  • Monitor site reliability conditions and new reliability requirements.
  • Assist in the design and development of a reliability program plan for a specific site environment.
  • Apply appropriate tools, services, or applications for reliability prediction and other site improvements.
  • Research and assess various reliability models for different site environments.
  • Assist in creating a simple, modular, extensible, and functional design for the product/solution in adherence to the requirements.
  • Evaluate tradeoffs while designing across multiple components in a system/product based on the business requirements.
  • Convert the high-level design (HLD) into a detailed design for specific modules/components of a product/system, using mock screens, pseudocode, and detailed functional logic of the modules.
  • Understand the nuances of designing for disaster recovery.
  • Undertake infrastructure coding and automation.
  • Design and create MVP to clarify requirements and design and uncover risks.
  • Independently refine the MVP design for early defects and revised customer requirements.
  • Adhere to all relevant coding guidelines.
  • Create and configure minimal, low-complexity, highly robust, and high-quality code for a component/module under guidance.
  • Maintain records by documenting program development and revisions.
  • Stay updated on the prevalent coding languages and frameworks in the industry outside the immediate scope of delivery.
  • Identify repetitive and routine tasks in Continuous Integration/Continuous Delivery (CI/CD), testing, or any other process that can be automated.
  • Implement telemetry features as required under guidance.
  • Apply security policy requirements to the component/module during code development/configuration.
  • Work with business partners to identify and document critical applications.
  • Interpret and follow procedures in contingency plans.
  • Explain the contingency and disaster recovery plans for the assigned environment.
  • Execute established procedures necessary to continue operations in an emergency.
  • Participate in the design of a minimum operating environment for a computer-based facility.
  • Suggest metrics to monitor software or system performance.
  • Monitor current performance data to ensure compliance with defined SLOs for multiple applications/systems.
  • Determine thresholds for monitoring metrics and trigger alerts based on those thresholds.
  • Supervise specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software.
  • Make recommendations regarding situational awareness and alerting.
  • Make recommendations regarding instrumentation gaps and alerting logic across a variety of operating systems, hardware, and software.

Benefits

  • At Walmart, we offer competitive pay as well as performance-based incentive awards and other great benefits for a happier mind, body, and wallet.
  • Health benefits include medical, vision and dental coverage.
  • Financial benefits include 401(k), stock purchase and company-paid life insurance.
  • Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty and voting.
  • Other benefits include short-term and long-term disability, education assistance with 100% company-paid college degrees, company discounts, military service pay, adoption expense reimbursement, and more.
  • Eligibility requirements apply to some benefits and may depend on your job classification and length of employment.
  • Benefits are subject to change and may be subject to specific plan or program terms.
  • For information about benefits and eligibility, see One.Walmart.com.