Sr. Member of Technical Staff

Cerebras Systems
Sunnyvale, CA (Remote)

About The Position

Cerebras Systems builds the world's largest AI chip, 56 times larger than the largest GPU. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to run large-scale ML applications effortlessly, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of compute capacity, transforming key workloads with ultra-high-speed inference.

Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest generative AI inference solution in the world, more than 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence through additional agentic computation.

Cerebras Systems Inc. has multiple openings for Sr. Member of Technical Staff.

Requirements

  • Master’s degree or foreign equivalent degree in Computer Science or a related field, and 18 months of experience as an Information Security Analyst, Software Engineer, Sr. Member of Technical Staff, IT Senior Applications Engineer, or in a related occupation, is required.
  • 18 months of experience with Infrastructure-as-Code and deployment automation: Terraform, AWS CloudFormation, AWS CDK, and Ansible
  • 18 months of experience with containerization and orchestration: Docker, Kubernetes, AWS EKS, AWS Elastic Container Service (ECS), AWS Fargate, and Helm
  • 18 months of experience with compute and serverless services: AWS EC2, AWS Lambda functions, and Auto Scaling Groups
  • 18 months of experience with monitoring, logging, and distributed tracing: AWS CloudWatch, AWS X-Ray, ELK (Elasticsearch, Logstash, Kibana), Prometheus, and Grafana
  • 18 months of experience with programming languages and frameworks: Python, Node.js, JavaScript, and Flask
  • 18 months of experience with data storage and caching: PostgreSQL, Redis, and NFS
  • 18 months of experience with CI/CD and version control: Jenkins and Git

Responsibilities

  • Design and develop software features that support system resiliency and high availability, including automated recovery mechanisms and fault-tolerant architecture across distributed environments.
  • Develop and maintain cloud-based deployment workflows for AI inference software using AWS tools and services to support low-latency and scalable system performance.
  • Develop Python-based scripts and APIs to streamline data preprocessing, inference execution, and post-processing for real-time inference tasks.
  • Use parallel programming techniques (e.g., multi-threading, asynchronous processing) to maximize resource efficiency on AWS compute instances.
  • Develop software components to support visualization and analysis of system performance metrics, enhancing the monitoring and usability of inference services.
  • Develop inference software in Docker containers and define Kubernetes orchestration strategies that ensure software reliability and efficient scaling.
  • Develop automated scripts to detect and mitigate common failure modes, improving software system reliability.
  • Debug issues related to model deployment, container orchestration, and networking configurations, documenting steps to reproduce and root-cause defects.
  • Triage and resolve defects in the software service by analyzing logs, metrics, and distributed traces using tools like AWS CloudWatch, Grafana, or custom Python scripts.
  • Work with product management and user experience teams to define requirements for inference service interfaces, including configuration, monitoring, and event logging.
  • Author detailed technical documentation for infrastructure configurations, inference workflows, and APIs, ensuring clarity for internal teams and external customers.
  • Document and track defects, enhancements, and release notes using tools like Jira and Git, ensuring version control and traceability.