Mistral AI • Posted about 2 months ago
Full-time • Mid Level
Hybrid • Amsterdam, NY
251-500 employees
Publishing Industries

We are building one of Europe's largest AI infrastructure offerings, which will provide our customers with a private and integrated stack in every form factor they may need, from bare-metal servers to fully managed PaaS. As a Software/DevOps Engineer in our Compute team, you will join a fast-growing team to help build, scale, and automate our computing management stack. Your primary responsibility will be to engineer fault-tolerant, reliable infrastructure that supports both our internal operations and our customer-facing platforms.

  • Design, build, and operate a scalable Kubernetes-based platform to host large-scale AI and HPC workloads, ensuring high performance, reliability, and security.
  • Own the full lifecycle of cluster management, from bootstrapping and provisioning to global operations, by integrating and developing the necessary software components, including automation, monitoring, and orchestration tools.
  • Drive infrastructure innovation by designing workflows, tooling (scripts, APIs, dashboards), and CI/CD pipelines to optimize system reliability, availability, and observability.
  • Champion a zero-trust security model, strengthening IAM, networking (VPC), and access controls to safeguard the platform.
  • Develop user-centric features that simplify operations for both sysadmins and end customers, reducing friction in daily workflows.
  • Lead incident resolution with rigorous root-cause analysis to prevent recurrence and improve system resilience.

Requirements
  • Successful experience in an Infrastructure Engineering role (SWE, DevOps, SRE, Platform...)
  • Strong proficiency in software development (preferably Golang) and knowledge of software development best practices
  • Deep understanding of Kubernetes internals and hands-on experience with containerization and orchestration tools (Docker, Kubernetes, OpenStack...)
  • Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
  • Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK, Datadog...)
  • Exposure to highly available distributed systems and site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)
  • Experience working against reliability KPIs (observability, alerting, SLAs)
  • Excellent problem-solving and communication skills
  • Self-motivation and ability to thrive in a fast-paced startup environment

Nice to have
  • Experience with HPC workload managers (Slurm) and distributed storage systems (Lustre, Ceph)
  • Demonstrated history of contributing to open-source projects (e.g., code, documentation, bug fixes, feature development, or community support).

Benefits
  • Competitive salary and equity
  • Health insurance
  • Transportation allowance
  • Sport allowance
  • Meal vouchers
  • Private pension plan
  • Generous parental leave policy
  • Visa sponsorship