About The Position

Goldman Sachs Engineers are innovators and problem-solvers who thrive in fast-paced global environments. We are seeking a motivated Cloud Site Reliability Engineer (SRE) to support the WM Data Engineering ecosystem. In this role, you will apply software engineering principles to operational challenges, ensuring that our cloud-native services, which run primarily on AWS, are resilient, scalable, and cost-optimized. As we transition from on-premises legacy systems to AWS, you will be the guardian of system health, moving beyond traditional dashboards to implement predictive remediation and SLOs-as-Code.

Requirements

  • 4+ years in SRE, DevOps, or Cloud Engineering roles, with a strong focus on production operations for distributed systems.
  • Deep proficiency in Amazon ECS (Fargate and EC2 launch types).
  • Experience with Docker containerization and managing service-to-service connectivity.
  • Strong proficiency in Python or Java for automation and tool development.
  • Expert-level SQL for data-driven reliability analysis.
  • Advanced knowledge of AWS core services (VPC, IAM, S3, Lambda) and networking (Transit Gateway, PrivateLink).
  • Hands-on experience with modern monitoring and tracing tools such as Prometheus, Grafana, AWS X-Ray, or Splunk.
  • Proven ability to build automated deployment pipelines for ECS using AWS CodePipeline, GitHub Actions, or Terraform, incorporating blue/green or canary deployment strategies.
  • Strong problem-solving "builder" mindset and the ability to communicate technical concepts within a team environment.
  • Bachelor’s or Master’s degree in Computer Science, Engineering, Mathematics, or a related field.

Responsibilities

  • Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) using OpenSLO or similar declarative frameworks. Manage "Error Budgets" to balance the pace of innovation with system stability.
  • Implement AI-driven observability stacks (e.g., Datadog, Amazon CloudWatch Container Insights, or OpenTelemetry) to detect p99 latency spikes and subtle configuration drift before they impact users.
  • Lead high-severity incident restoration and conduct blameless post-mortems to identify root causes and automate future prevention.
  • Support the migration of on-premises microservices to Amazon ECS (Fargate/EC2). Design and maintain task definitions, service discovery via AWS Cloud Map, and inter-service communication using Amazon ECS Service Connect.
  • Develop and maintain modular, version-controlled infrastructure using Terraform or AWS CDK, ensuring that reliability guardrails are baked into every deployment.
  • Identify and eliminate repetitive manual tasks ("toil") by developing custom automation tools in Python or Go.
  • Contribute to the migration of on-premises data workloads to AWS.

Benefits

  • opportunities to grow professionally and personally
  • training and development opportunities
  • firmwide networks
  • benefits
  • wellness and personal finance offerings
  • mindfulness programs
  • reasonable accommodations for candidates with special needs or disabilities during our recruiting process

What This Job Offers

  • Job Type: Full-time
  • Career Level: Entry Level
  • Number of Employees: 5,001-10,000 employees
