Site Reliability Engineer - Trae USDS

TikTok•San Jose, CA

Hybrid

About The Position

TRAE (The Real AI Engineer) is an intelligent engineering product that understands requirements, orchestrates tools, and independently completes development tasks, giving users end-to-end software generation capabilities. As one of the most popular AI programming products and the world's first end-to-end AI software development agent, TRAE covers the full spectrum of development scenarios, from simple to highly complex. We are looking for passionate and creative engineers to join us in reshaping the development paradigm and defining the future of AI-driven software engineering.

To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.

Requirements

Bachelor's degree or higher in Computer Science or related fields from accredited and reputable institutions;
Relevant work experience, with solid knowledge of computer software fundamentals and an understanding of the Linux operating system, storage, network I/O, and other related principles;
Familiar with one or more programming languages, such as Python/Go/Java/Shell;
Capable of solving problems systematically, with excellent communication skills and a sense of responsibility;
Prior experience with related computing, distributed, or big data systems, such as Kubernetes, Docker, Spark, or Flink, is preferred.

Nice To Haves

Master's degree or higher in Computer Science or related fields from accredited and reputable institutions;
Strong algorithmic thinking and solid skills in data structures and system design;
3 years of relevant experience at top tech companies.

Responsibilities

Develop and maintain automation procedures to maximize system efficiency and minimize human intervention.
Work closely with software engineering teams to design, deploy, and operate system components so that systems are functionally robust.
Ensure system scalability to handle growth in user traffic and data.
Implement monitoring tools and set up metrics to keep track of system health and performance.
Participate in on-call rotations, assist with incident management, and diagnose, resolve, and prevent production issues.
Conduct performance tests to find and address system bottlenecks.
Collaborate with teams across the organization to define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
