About The Position

In the Technology division, Morgan Stanley leverages innovation to build the connections and capabilities that power the Firm, enabling clients and colleagues to redefine markets and shape the future. This Director-level Software Engineering position is part of the job family responsible for developing and maintaining software solutions that support business needs. Morgan Stanley, a global leader in financial services since 1935, is committed to evolving and innovating to serve clients and communities in over 40 countries.

The mission of this role is to contribute to a firmwide Artificial Intelligence (AI) Development Platform that aligns with the Firm's Technology principles: driving efficiency, consistency, controls, security, and strong governance while promoting innovation. The platform enables teams to build applications that leverage AI capabilities and accelerates AI adoption across businesses.

We are seeking an experienced Site Reliability Engineer (SRE) to join the AI Platform team to support, scale, and harden the infrastructure for AI/ML systems. The SRE will collaborate with infrastructure engineering, cloud engineering, data engineering, and security teams to ensure the availability, reliability, performance, and security of production AI workloads (training, inference, data pipelines) within a regulated financial environment. This position requires deep operations, automation, and systems engineering skills to ensure reliable model and pipeline execution at scale while balancing cost, security, and compliance.

The ideal candidate will have strong hands-on experience with platforms such as Kubernetes, public cloud (AWS, Azure, and/or Google Cloud), API-based development and REST frameworks, data engineering, and large-scale API Gateway environments. Knowledge of AI/ML and hands-on experience with Generative AI solutions are also preferred.
Strong communication skills, a team-based mentality, and a passion for using AI to increase productivity and generate new product/technical improvements are essential.

Requirements

  • Bachelor’s or Master’s degree in Computer Science or related field, or equivalent job experience
  • 5 years of production experience in SRE, infrastructure, or operations roles for large-scale systems
  • Strong programming/scripting skills (Python, Go, Java, or equivalent)
  • Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
  • Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
  • Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
  • Solid experience in capacity planning, performance tuning, scaling, and incident response
  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
  • Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
  • Excellent communication, documentation, and cross-team collaboration skills
  • Proven track record of reducing operational toil via automation

Nice To Haves

  • Understanding of SRE techniques
  • Proficiency with OpenTelemetry and related observability tools, including Grafana, Loki, Prometheus, and Cortex
  • Good knowledge of microservice-based architecture and industry standards for both public and private cloud
  • Knowledge of data pipeline technologies (Kafka, Spark, Flink, etc.)
  • Good knowledge of various storage engines (SQL databases, Redis, Kafka, Snowflake, etc.) for cloud application storage
  • Experience with Generative AI development, embeddings, and fine-tuning of Generative AI models
  • Experience in high-performance computing (HPC), distributed GPU cluster scheduling (e.g. Slurm, Kubernetes GPU scheduling)
  • Understanding of ModelOps / MLOps / LLMOps
  • Experience with chaos engineering, canary deployments, blue/green rollouts

Responsibilities

  • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
  • Design and build automation for core platform capabilities, reducing manual toil
  • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
  • Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
  • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
  • Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
  • Optimize cost vs. performance tradeoffs in large-scale compute environments
  • Harden systems for security, compliance, auditability, and data governance
  • Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
  • Define disaster recovery (DR) strategies, backup/restore practices, fault tolerance mechanisms
  • Maintain runbooks, operational playbooks, documentation, and training materials
  • Participate in on-call rotations and respond to production incidents 24/7 as needed
  • Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability
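As an illustration of the SLO and error-budget work described above, here is a minimal sketch of the underlying arithmetic (illustrative only; the 30-day window and 99.9% target below are assumed examples, not the Firm's actual targets or tooling):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

def budget_remaining(slo_target: float, observed_downtime_min: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the budget is blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - observed_downtime_min) / budget

# A 99.9% availability SLO over 30 days allows ~43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))    # → 43.2
print(round(budget_remaining(0.999, 10.0), 3))  # fraction left after 10 min down
```

In practice these numbers would be derived from SLIs collected by the monitoring stack (e.g., Prometheus) rather than entered by hand, and the remaining-budget fraction would gate release velocity and alerting thresholds.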