About the position

Ready to make a global impact by industrializing AI? Visa AI as a Service (AIaS) operationalizes the delivery of AI and decision intelligence to ensure their ongoing business value. Built with composable AI capabilities, privacy-enhancing computation, and cloud-native platforms, AIaS powers and automates the industrialization of data, models, and applications for predictive and generative AI. Combined with strong governance, AIaS optimizes the performance, scalability, interpretability, and reliability of AI models and services. If you want to be in the exciting payment and AI space, learn fast, and make a big impact, Visa AI as a Service is an ideal place for you! This role is for a Sr. ML Engineer – Cloud Observability. We are seeking a talented professional with a solid background in public cloud and AI/ML production systems. This role offers ample opportunities for learning and growth, and the chance to be part of delivering the next big thing for our AI as a Service team.

Responsibilities

  • Implement and Maintain Cloud Observability Solutions: Build and maintain monitoring, logging, and tracing systems (e.g. Prometheus, Grafana, Druid, ELK Stack) for cloud-native AI services on AWS/Azure/GCP.
  • Partner with data engineers and data scientists to embed observability into ML workflows and ensure real-time insights.
  • Collaborate on AI Model Monitoring: Work closely with data scientists and product owners to design and implement observability solutions for monitoring AI/ML model performance (e.g. accuracy, latency, data drift) in production.
  • Develop dashboards and alerts to detect anomalies, model degradation, or bias, ensuring alignment with business SLAs.
  • Automate DevOps Practices: Develop tools for automated deployment, alerting, and incident response using CI/CD pipelines such as Jenkins and GitHub flows, and infrastructure as code such as Terraform.
  • Documentation & Reporting: Create and maintain clear documentation for observability processes and best practices.
  • Generate reports to track system health and performance trends for business and technology stakeholders.
  • Incident Response: Assist in diagnosing and troubleshooting issues by analyzing metrics, logs, and performance data, and collaborate with cross-functional teams to apply the lessons learned to improve system-level observability.
  • Stay Ahead of Trends: Explore emerging cloud and observability technologies to drive innovation.

Requirements

  • 2 or more years of work experience with a Bachelor’s Degree or an Advanced Degree (e.g. Masters, MBA, JD, MD).
  • Strong development experience in one or more of the following programming languages: Java, Go, Rust, C++.
  • 2 years of related experience with AWS, GCP, or Azure, preferably in an AI/ML production environment.

Nice-to-haves

  • 3 or more years of work experience with a Bachelor’s Degree or more than 2 years of work experience with an Advanced Degree (e.g. Masters, MBA, JD, MD).
  • Experience with one of the following is highly preferred: Prometheus, Grafana, Druid, ELK Stack.
  • Experience in the observability ecosystem is highly preferred.

Benefits

  • Medical
  • Dental
  • Vision
  • 401(k)
  • FSA/HSA
  • Life Insurance
  • Paid Time Off
  • Wellness Program