We are looking for a change agent, not a caretaker. This role demands an automation-first mindset, strong technical depth across Windows and Linux environments, and the leadership presence to raise the bar for how our infrastructure team operates. You will own the health, reliability, and evolution of the firm's compute and M365 environment while developing a high-performing team.

Essential Functions and Responsibilities:

Compute & M365 Stability and Availability
Ensure the reliability, performance, and availability of all compute infrastructure and M365 services: on-prem servers, virtual machines, cloud instances, Exchange Online, Teams, SharePoint, and remote access services
Own regular maintenance windows for patching, upgrades, and housekeeping across Windows, Linux, AWS compute, and M365 workloads
Establish and enforce operational standards, runbooks, and procedures that create consistency and reduce dependency on tribal knowledge
Monitor compute and M365 environment health continuously; act on signals before they become incidents
Manage compute capacity across on-prem and AWS, ensuring right-sized, available resources that meet demand without unnecessary cost

M365 Administration
Own firm-wide M365 administration including Exchange Online, Teams, SharePoint, OneDrive, Entra ID, and associated services
Manage tenant configuration, licensing, service health, and policy governance across the M365 platform
Partner with the business to understand collaboration needs and translate them into well-governed M365 solutions
Maintain a clear operational model for M365 covering administration, support escalation, and change management
Stay current on the M365 roadmap; evaluate new capabilities and drive adoption where they deliver business value

Incident Management
Lead incident response for P1/P2 compute and M365 events; coordinate across teams and drive to resolution with urgency and clarity
Establish and maintain incident runbooks, post-mortems, and lessons-learned processes
Track incident trends and use data to drive systemic fixes and reduce repeat events
Ensure clear on-call coverage, escalation paths, and communication protocols are in place and understood by the team

Vulnerability & Security Operations
Own the vulnerability management lifecycle across compute: scanning, prioritization, remediation tracking, and reporting
Partner with the Infosec team to align operational processes with security policies, ensuring consistent enforcement across compute and M365
Serve as the operational bridge between infrastructure and security, translating policy requirements into executable team workflows
Ensure timely closure of critical and high findings, with clear escalation paths for exceptions and risk acceptance

Observability Platform
Define, deliver, and own the firm's observability platform for compute and M365, spanning open source and commercial tooling
Architect a unified view of environment health across on-prem, AWS, and M365 (metrics, logs, traces, and events)
Establish proactive alerting, dashboards, and runbooks to reduce MTTR
Drive adoption of the platform across Windows and Linux teams, ensuring consistent coverage and actionable signal

Automation & Tooling
Champion an automation-first culture; eliminate manual, repetitive operational tasks through scripting and orchestration (Ansible, PowerShell, Bash, Terraform)
Drive infrastructure-as-code adoption for compute provisioning and configuration across on-prem and AWS
Leverage M365 automation capabilities (Power Automate, Graph API, and PowerShell) to streamline administration and reduce manual effort
Identify and implement tooling to reduce toil and accelerate delivery

Technology Evolution & Technical Debt
Maintain a living inventory of technical debt across all compute and M365 ownership areas: server platforms, operating systems, virtualization, messaging, collaboration, and remote access
Develop and own multi-horizon technology roadmaps that balance operational stability with modernization
Make the case for investment: translate technical debt and risk into business impact for leadership
Establish a cadence of review and retirement, ensuring aging technologies are actively replaced, not just maintained
Champion forward-looking decisions on platform lifecycle, vendor strategy, and architectural direction

Cross-Functional Partnership & Technology Strategy
Partner with the AppDev team to ensure prompt, reliable delivery of compute services, with clear cost accountability and service-level expectations on both sides
Collaborate with FinOps to drive compute and M365 cost optimization, ensuring spend is visible, justified, and continuously improved
Partner with Architecture to create, update, and execute against technology roadmaps that align compute and M365 direction with firm-wide strategy
Continuously evaluate emerging technologies to reduce risk, lower costs, and improve the reliability, scalability, maintainability, and security of the environment
Represent compute and M365 capabilities and constraints in cross-functional planning forums, ensuring operational realities inform strategic decisions

Team Leadership & Workload Management
Lead and mentor a team of Windows and Linux admins; bridge the gap between both disciplines
Manage operational queue: balance incident response, project work, and proactive improvements
Drive accountability through sprint planning and ticket hygiene with clear escalation paths
Conduct regular 1:1s, performance reviews, and career development conversations

Project & Change Accountability
Own end-to-end delivery of compute and M365 projects: on scope, on time, with documented outcomes
Manage change control processes; reduce risk through peer review and staged rollouts
Communicate status, risks, and blockers to leadership proactively

Job Type
Full-time
Career Level
Mid Level