This job is closed

TikTok - Seattle, WA

posted 2 months ago

Full-time - Mid Level
Seattle, WA
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

About the position

The Site Reliability Engineer (SRE) at TikTok plays a crucial role in ensuring the reliability, scalability, and performance of AI applications. This position combines software engineering and systems engineering expertise to maintain and optimize the infrastructure that supports AI and machine learning technologies. The SRE will work within a dynamic team focused on empowering content interaction and creation through advanced speech and audio technologies, contributing to TikTok's mission of inspiring creativity and bringing joy to users globally.

Responsibilities

  • Develop and implement monitoring solutions to track the performance and reliability of AI systems.
  • Respond to incidents, diagnose issues, and implement fixes to minimize downtime.
  • Automate repetitive tasks, streamline deployments, and create tools to improve the efficiency and reliability of AI operations.
  • Analyze and optimize the performance of AI applications and the underlying infrastructure, including tuning algorithms and resource management.
  • Forecast infrastructure needs and ensure that the AI applications have the necessary resources to handle future workloads.
  • Implement and maintain security best practices to protect data and applications, ensuring compliance with relevant regulations.
  • Create and maintain detailed documentation of infrastructure, processes, and procedures to ensure knowledge sharing and continuity.
  • Identify opportunities for process improvements and implement solutions to enhance the reliability and performance of AI systems.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • 3+ years of experience in site reliability engineering, DevOps, or a related role.
  • Proven experience managing and optimizing AI and machine learning infrastructure.

Nice-to-haves

  • Proficiency with cloud platforms such as AWS, Google Cloud, or Azure.
  • Strong programming skills in languages such as Python, Go, or Java.
  • Experience with containerization and orchestration tools like Docker and Kubernetes.
  • Familiarity with CI/CD pipelines and tools such as Jenkins, GitLab CI, or CircleCI.
  • Knowledge of monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Datadog.
  • Understanding of networking, security principles, and best practices in cloud environments.
  • Strong problem-solving capabilities, with a detail-oriented and user-focused approach.
  • Strong communication and interpersonal skills, capable of engaging effectively with both technical and non-technical stakeholders.

Benefits

  • 100% premium coverage for employee medical insurance, approximately 75% premium coverage for dependents, and a Health Savings Account (HSA) with a company match.
  • Dental, vision, short- and long-term disability, basic life, voluntary life, and AD&D insurance plans.
  • 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO) and 10 paid sick days per year.
  • 12 weeks of paid parental leave and 8 weeks of paid supplemental disability.
  • Mental and emotional health benefits through EAP and Lyra.
  • 401(k) company match, plus gym and cellphone service reimbursements.