Alibaba Cloud - Site Reliability Engineer (Apsara Lab) - Bellevue

Alibaba Group
Bellevue, WA
67d
$133,200 - $219,600

About The Position

We are the Apsara Lab at Alibaba Cloud Intelligence Group, committed to delivering a cutting-edge MaaS platform and application development toolkits through technological innovation and engineering practices. Our team focuses on fundamental R&D in model services and provides full-stack development, from architecture design to model applications. Our goal is to build the industry's largest model service platform, combining cost-efficiency, high performance, and enterprise-level reliability, so that enterprise clients can develop model applications faster.

We are seeking a passionate and technically skilled Site Reliability Engineer (SRE) to join our team. You will play a critical role in building and maintaining a highly available, high-performance model service platform.

Requirements

  • 3+ years of experience in SRE, DevOps, or backend development, with expertise in distributed system operations.
  • Experience with cloud computing, AI infrastructure, or Alibaba Cloud is a plus.
  • Programming experience in at least one modern language such as Python, Golang, Java, or C++.
  • Strong ability to work under pressure, manage critical incidents, and participate in an on-call rotation.
  • Fluency in both Chinese and English for daily communication.

Nice To Haves

  • Familiarity with MaaS or related technologies.
  • Deep knowledge of Linux systems, network protocols (TCP/HTTP), and databases, plus a solid understanding of cloud-native architecture design.
  • Experience operating and maintaining large-scale container deployments and Kubernetes clusters, with strong knowledge of cloud-native components (e.g., Prometheus, Istio, Calico).
  • Extensive experience in building large-scale monitoring systems and utilizing them for in-depth analysis and operations.

Responsibilities

  • Oversee the deployment, operation, maintenance, and continuous improvement of the standalone website and platform, including its initial construction and subsequent operational changes.
  • Oversee monitoring and alerting for our platform and system applications, rapidly diagnosing and resolving network, service, and hardware-level failures to meet SLA targets.
  • Design and optimize monitoring metrics, log collection, and alerting strategies to enhance system observability.
  • Participate in the emergency response and handling of online incidents, conduct root cause analysis (RCA), and drive long-term solutions to prevent recurrence.
  • Investigate and resolve customer-reported issues with API service QoS (e.g., latency, performance, optimization), collaborating with development teams to identify flaws in application clusters, edge networks, or infrastructure.
  • Develop tools and scripts (Python/Go) to automate deployment, scaling, fault recovery, and other operational workflows.
  • Build automated diagnostic toolchains to accelerate issue resolution and improve customer satisfaction.

Benefits

  • Medical, dental, and vision insurance.
  • 401(k) plan and basic life insurance.
  • Wellbeing benefits like FSA.
  • Up to 12 paid holidays.
  • Accrue up to 15 paid vacation days.
  • Receive up to 72 hours paid sick time (front-loaded) per calendar year.

What This Job Offers

  • Job Type: Full-time
  • Industry: Sporting Goods, Hobby, Musical Instrument, Book, and Miscellaneous Retailers
  • Number of Employees: 5,001-10,000 employees
