Site Reliability Engineer, Metal

Tenstorrent•Toronto, ON

75d•$100,000 - $500,000•Hybrid

About The Position

Tenstorrent is leading the industry in cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists has developed a high-performance RISC-V CPU from scratch and shares a passion for AI and a deep desire to build the best AI platform possible. We value collaboration, curiosity, and a commitment to solving hard problems. We are growing our team and looking for contributors of all seniorities.

Tenstorrent is building large-scale AI systems across internal clusters and customer deployments. This role sits at the intersection of site reliability, infrastructure operations, and customer engineering, ensuring our systems are reliable, observable, and production-ready.

This role is hybrid, based out of Toronto, ON; Austin, TX; or Santa Clara, CA. We welcome candidates at various experience levels. During the interview process, candidates will be assessed for the appropriate level, and offers will align with that level, which may differ from the one in this posting.

Requirements

Experienced in site reliability, infrastructure, or systems engineering in distributed environments.
Strong Linux systems knowledge with the ability to troubleshoot complex multi-layer issues.
Proficient with observability tools such as Prometheus, Grafana, and alerting systems.
Comfortable with scripting and automation using Python, Go, or similar languages.
Solid understanding of networking fundamentals and how systems behave at scale.

Responsibilities

Ensure reliability and operational health of Tenstorrent systems across internal and customer environments.
Troubleshoot complex issues across compute, networking, and software layers.
Partner with engineering teams and customers to resolve production incidents.
Design and improve monitoring, observability, and alerting systems.
Build automation to reduce operational toil and improve system reliability.
