GCP) - VP

Morgan Stanley • New York, NY

60d

About The Position

We are seeking a Senior Cloud Engineer / Site Reliability Engineer (SRE) to design, build, and operate secure, scalable cloud platforms across AWS, Azure, and GCP. This role is responsible for configuring, deploying, and maintaining virtual machines and containerized applications, using Terraform to automate infrastructure provisioning and lifecycle management. You will provide specialized support for high-stakes production deployments, lead incident response for technical escalations, and apply SRE principles (SLIs/SLOs, error budgets, automation, and reliability engineering) to improve availability, performance, and operational excellence in a multi-cloud environment.

Requirements

10+ years in cloud engineering, platform engineering, DevOps, or SRE roles with significant production ownership.
Strong hands-on experience across AWS and Azure, plus practical experience in GCP (production exposure preferred).
Expert-level Terraform (modules, state, CI integration, scalable environment patterns).
Strong Kubernetes operations experience (EKS/AKS/GKE), including upgrades, scaling, and workload reliability.
Experience implementing SRE practices: SLIs/SLOs, alerting strategies, incident response, postmortems, and automation/toil reduction.
Strong Linux and scripting (Bash/Python) and ability to debug systems from symptoms to root cause.
Strong security fundamentals: IAM/RBAC, encryption, secrets, and auditability in cloud environments.
Proven ability to lead technical escalations and coordinate resolution across teams.

Responsibilities

Architect, implement, and maintain cloud infrastructure across AWS, Azure, and GCP using Terraform (IaC).
Design and implement cloud landing zones aligned with best practices: account/subscription/project structure, environment separation, and identity boundaries; baseline guardrails and policy enforcement (Azure Policy, AWS Organizations/SCPs, GCP Org Policies); and centralized audit logging, monitoring, and cost allocation standards.
Build and operate cloud-native virtual network constructs (cloud-focused only): Azure (VNets, subnets, NSGs, route tables, Private Endpoints, hub/spoke patterns); AWS (VPCs, subnets, security groups, NACLs, route tables, VPC endpoints/PrivateLink, multi-account connectivity patterns); GCP (VPC networks, subnets, firewall rules, routes, Private Service Connect, Shared VPC patterns).
Implement private-by-default service access patterns (private endpoints, controlled egress, service-to-service access controls).
Configure, deploy, and maintain virtual machines and scalable compute patterns: AWS EC2 (Launch Templates, Auto Scaling Groups); Azure Virtual Machines / VM Scale Sets; GCP Compute Engine / Managed Instance Groups.
Own OS hardening, baseline configuration, patching strategies, and instance bootstrapping (cloud-init, image pipelines).
Deploy and operate containerized workloads using Kubernetes: EKS / AKS / GKE (cluster design, upgrades, node pools, RBAC, scaling), plus container registries (ECR / ACR / Artifact Registry) and artifact promotion strategies.
Implement workload delivery patterns (Helm/Kustomize), rollout strategies (blue/green, canary), and safe rollbacks.
Build reusable, versioned Terraform modules with standards for naming, tagging/labels, and secure defaults.
Implement Terraform best practices: remote state, locking, environment isolation, secrets handling, and drift detection.
Integrate IaC into CI/CD pipelines (e.g., GitHub Actions, Azure DevOps, GitLab CI): automated validation, linting, security scanning, plan/apply workflows, approvals, and promotions.
Implement policy-as-code guardrails (OPA/Conftest, Sentinel where applicable) to prevent unsafe changes.
Define, implement, and improve SLIs/SLOs (availability, latency, error rates, saturation) for critical services and platforms.
Manage and enforce error budgets to balance reliability with delivery velocity.
Establish and continuously improve observability standards: metrics, logs, traces, dashboards, and alerting across cloud services and Kubernetes, using tooling such as CloudWatch, Azure Monitor/Log Analytics, GCP Cloud Monitoring/Logging, OpenTelemetry, and Prometheus/Grafana (where used).
Improve incident detection quality by reducing alert noise, implementing actionable alerts, and creating clear escalation paths.
Drive reliability improvements through capacity planning, performance tuning, and load testing support; resilience engineering (multi-zone design, graceful degradation, retries/timeouts, backpressure); and continuous automation to eliminate toil (self-healing, auto-remediation runbooks, ChatOps where applicable).
Provide specialized support for high-stakes production deployments (major releases, platform cutovers, migrations).
Lead incident response: triage, mitigation, recovery, communication, and post-incident review (PIR/RCA).
Troubleshoot escalations across cloud services, Kubernetes, IAM, storage, and CI/CD pipelines using evidence-driven debugging.
Build and maintain runbooks, operational playbooks, and postmortem action tracking to prevent repeat incidents.
Participate in on-call rotation and continuously improve on-call health through automation and better observability.
Implement least-privilege access controls across AWS/Azure/GCP (IAM/RBAC), including role design and permission boundaries.
Enforce secure configurations: encryption at rest/in transit, secrets management, key management (KMS/Key Vault/Cloud KMS).
Implement compliance-oriented logging and auditing, and partner with security teams to remediate findings and harden platforms.

Benefits

Ample opportunity to move about the business for those who show passion and grit in their work.
Some of the most attractive and comprehensive employee benefits and perks in the industry.
Support for employees and their families at every point along their work-life journey.
