About The Position

Goldman Sachs is seeking a motivated Cloud Site Reliability Engineer (SRE) to support the WM Data Engineering ecosystem. The role applies software engineering principles to operational challenges, ensuring that cloud-native services, primarily on AWS, are resilient, scalable, and cost-optimized. The engineer will play a key part in the transition from on-premises legacy systems to AWS, focusing on system health, predictive remediation, and implementing SLOs-as-Code.

Requirements

  • 4+ years in SRE, DevOps, or Cloud Engineering roles, with a strong focus on production operations for distributed systems.
  • Deep proficiency in Amazon ECS (Fargate and EC2 launch types).
  • Experience with Docker containerization and managing service-to-service connectivity.
  • Strong proficiency in Python or Java for automation and tool development.
  • Expert-level SQL for data-driven reliability analysis.
  • Advanced knowledge of AWS core services (VPC, IAM, S3, Lambda) and networking (Transit Gateway, PrivateLink).
  • Hands-on experience with modern monitoring and tracing tools such as Prometheus, Grafana, AWS X-Ray, or Splunk.
  • Proven ability to build automated deployment pipelines for ECS using AWS CodePipeline, GitHub Actions, or Terraform, incorporating blue/green or canary deployment strategies.
  • Strong problem-solving "builder" mindset and the ability to communicate technical concepts within a team environment.
  • Bachelor’s or Master’s degree in Computer Science, Engineering, Mathematics, or a related field.

Responsibilities

  • Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) using OpenSLO or similar declarative frameworks.
  • Manage "Error Budgets" to balance the pace of innovation with system stability.
  • Implement AI-driven observability stacks (e.g., Datadog, Amazon CloudWatch Container Insights, or OpenTelemetry) to detect p99 latency spikes and subtle configuration drift before they impact users.
  • Lead high-severity incident restoration and conduct blameless post-mortems to identify root causes and automate future prevention.
  • Support the migration of on-premises microservices to Amazon ECS (Fargate/EC2).
  • Design and maintain task definitions, service discovery via AWS Cloud Map, and inter-service communication using Amazon ECS Service Connect.
  • Develop and maintain modular, version-controlled infrastructure using Terraform or AWS CDK, ensuring that reliability guardrails are baked into every deployment.
  • Identify and eliminate repetitive manual tasks ("toil") by developing custom automation tools in Python or Go.
  • Contribute to the migration of on-premises data workloads to AWS.

Benefits

  • Training and development opportunities
  • Firmwide networks
  • Wellness offerings
  • Personal finance offerings
  • Mindfulness programs
  • Reasonable accommodations for candidates with special needs or disabilities during our recruiting process

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Number of Employees: 5,001-10,000
