Sciforium is seeking a highly skilled Distributed Training Engineer to build, optimize, and maintain the critical software stack that powers our large-scale AI training workloads. In this role, you will work across the entire machine learning infrastructure, from low-level CUDA/ROCm runtimes to high-level frameworks like JAX and PyTorch, to ensure our distributed training systems are fast, scalable, stable, and efficient. This position is ideal for someone who loves deep systems engineering, debugging complex hardware–software interactions, and optimizing performance at every layer of the ML stack. You will play a pivotal role in enabling the training and deployment of next-generation LLMs and generative AI models.
Job Type
Full-time
Career Level
Mid Level