Senior Manager, Platform, Lifecycle, & Troubleshooting

Vultr
$120,000 - $140,000
Remote

About The Position

Vultr is seeking a highly skilled and experienced Platform & Lifecycle Team Manager to drive deep technical troubleshooting and lifecycle excellence across its expanding server fleet. The ideal candidate is a technical leader with strong Linux/platform expertise and a passion for solving complex issues in high-performance cloud environments (GPU, storage, RDMA, etc.). This is a highly visible role in a high-growth technology company that requires both hands-on engineering depth and team leadership, and an opportunity to join a fast-growing team and leave a mark on Vultr and the future of cloud infrastructure. The role leads the team responsible for keeping thousands of production servers running reliably, owning complex platform troubleshooting, large-scale migrations (including OS/distribution changes), and post-onboarding lifecycle management. This work contributes directly to Vultr’s uptime, its performance leadership in GPUs and bare metal, and its ability to support demanding AI and enterprise workloads.

Requirements

  • 8+ years of experience in Linux systems administration, platform engineering, or SRE-style operations in cloud or large-scale infrastructure environments.
  • Deep expertise in troubleshooting GPU, storage, RDMA, and high-performance networking issues.
  • Proven track record leading technical teams, including on-call rotations and complex migrations.
  • Strong scripting/automation skills (Python, Bash, Ansible, etc.) and experience with monitoring tools.
  • Excellent problem-solving, documentation, and cross-team communication abilities.
  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

Responsibilities

  • Lead the Platform, Lifecycle & Troubleshooting team in resolving complex incidents and platform issues.
  • Own server repurposing, migrations (e.g., OS/distribution upgrades), and deeper lifecycle management.
  • Perform and guide advanced troubleshooting for RDMA links, GPU, storage, and server-side networking.
  • Validate firmware choices and handle complex/ongoing firmware updates.
  • Provide 24/7 on-call leadership and drive incident response improvements.
  • Develop runbooks, automation, and self-healing processes to reduce toil and improve MTTR.
  • Collaborate closely with Hardware and Onboarding teams on handoffs and mixed tickets.
  • Partner with Engineering, Networking, and Solutions teams on technical escalations and improvements.
  • Mentor senior engineers and build a high-performing team focused on root-cause analysis.
  • Track key metrics (uptime, incident trends, migration success) and drive operational maturity.

Benefits

  • 100% company-paid insurance premiums for employee medical, dental, and vision plans
  • 401(k) plan with a 100% match up to 4% and immediate vesting
  • Professional Development Reimbursement of $2,500 each year
  • 11 Holidays + Paid Time Off Accrual + Rollover Plan
  • Increased PTO at 3-year and 10-year anniversaries
  • 1 month paid sabbatical every 5 years
  • Anniversary Bonus each year
  • $500 stipend for remote office setup in first year + $400 each following year
  • Internet reimbursement up to $75 per month
  • Gym membership reimbursement up to $50 per month
  • Company-paid Wellable subscription