Senior Manager, Platform, Lifecycle, & Troubleshooting

Vultr

$120,000 - $140,000 • Remote

About The Position

Vultr is seeking a highly skilled and experienced Platform & Lifecycle Team Manager to drive deep technical troubleshooting and lifecycle excellence across its expanding server fleet. The ideal candidate is a technical leader with strong Linux and platform expertise and a passion for solving complex issues in high-performance cloud environments (GPU, storage, RDMA, etc.). This is a highly visible role in a high-growth technology company that requires both hands-on engineering depth and team leadership, and it is an opportunity to join a fast-growing team and leave a mark on Vultr and the future of cloud infrastructure. The role leads the team responsible for keeping thousands of production servers running reliably. That team owns complex platform troubleshooting, large-scale migrations (including OS/distribution changes), and post-onboarding lifecycle management, and it contributes directly to Vultr’s uptime, its performance leadership in GPUs and bare metal, and its ability to support demanding AI and enterprise workloads.

Requirements

8+ years of experience in Linux systems administration, platform engineering, or SRE-style operations in cloud or large-scale infrastructure environments.
Deep expertise in troubleshooting GPU, storage, RDMA, and high-performance networking issues.
Proven track record leading technical teams, including on-call rotations and complex migrations.
Strong scripting/automation skills (Python, Bash, Ansible, etc.) and experience with monitoring tools.
Excellent problem-solving, documentation, and cross-team communication abilities.
Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

Responsibilities

Lead the Platform, Lifecycle & Troubleshooting team in resolving complex incidents and platform issues.
Own server repurposing, migrations (e.g., OS/distribution upgrades), and deeper lifecycle management.
Perform and guide advanced troubleshooting for RDMA links, GPU, storage, and server-side networking.
Validate firmware choices and handle complex/ongoing firmware updates.
Provide 24/7 on-call leadership and drive incident response improvements.
Develop runbooks, automation, and self-healing processes to reduce toil and improve MTTR.
Collaborate closely with Hardware and Onboarding teams on handoffs and mixed tickets.
Partner with Engineering, Networking, and Solutions teams on technical escalations and improvements.
Mentor senior engineers and build a high-performing team focused on root-cause analysis.
Track key metrics (uptime, incident trends, migration success) and drive operational maturity.

Benefits

100% company-paid insurance premiums for employee medical, dental, and vision plans
401(k) plan that matches 100% up to 4%, with immediate vesting
Professional Development Reimbursement of $2,500 each year
11 Holidays + Paid Time Off Accrual + Rollover Plan
Increased PTO at 3-year and 10-year anniversaries
1-month paid sabbatical every 5 years
Anniversary Bonus each year
$500 stipend for remote office setup in first year + $400 each following year
Internet reimbursement up to $75 per month
Gym membership reimbursement up to $50 per month
Company-paid Wellable subscription