Senior Engineer, AIOps

Royal Caribbean Cruises Ltd
Miami, FL
Onsite

About The Position

The Royal Caribbean Group’s AI & Analytics Team has an exciting career opportunity for a full-time Senior Engineer, AIOps, reporting to the Senior Manager, Data Intelligence Operations. The position is onsite, based in Miramar, Florida, and is not eligible for work authorization sponsorship.

The Senior Engineer, AIOps serves as a technical anchor for the reliability, scalability, and continuous improvement of Royal Caribbean Group’s enterprise AI, Generative AI (GenAI), and modern data platforms. This senior-level role leads incident response, drives operational maturity, mentors junior team members, and partners with platform engineering and data science teams to shape how AI and data systems are built, deployed, and maintained at scale. The ideal candidate brings deep expertise in Microsoft Azure and Databricks, a strong command of LLM and GenAI tooling, and the judgment to make sound architectural and operational decisions independently.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field required; Master’s degree preferred.
  • 7+ years of experience in platform operations, cloud engineering, AI/data platform support, or site reliability engineering in enterprise environments.
  • Deep hands-on experience with Microsoft Azure, including Azure OpenAI Service, Azure AI Search, Azure Data Factory, Azure Monitor, and related data and AI services.
  • Expert-level experience with Databricks, including Unity Catalog administration, cluster and pool management, Delta Lake operations, and job orchestration at scale.
  • Strong command of LLM and GenAI concepts, including inference architectures, RAG pipelines, embeddings, vector databases, and model serving patterns.
  • Proficiency in Python and SQL, with experience automating operational tasks and reviewing pipeline and application code.
  • Demonstrated ability to lead incident response independently, produce high-quality RCAs, and drive cross-functional remediation.
  • Experience with ITSM platforms (ServiceNow preferred) and formal incident and change management processes.
  • Strong communication skills, able to translate complex technical issues into clear, actionable updates for both technical and non-technical stakeholders.

Nice To Haves

  • Expertise in AI and data platform operations, observability, and incident management.
  • Proficiency in cloud cost optimization and FinOps practices.
  • Experience with CI/CD pipelines, DevOps practices, and automation tools.
  • Strong understanding of platform security, governance, and compliance requirements.
  • Demonstrated ability to mentor and guide junior engineers.
  • Strong organizational, analytical, and problem-solving skills.
  • Ability to foster a culture of operational excellence and continuous improvement.
  • Effective collaborator with cross-functional teams and external partners.

Responsibilities

  • Leads the operational health and reliability of enterprise AI, GenAI, and data platforms, ensuring high availability and performance.
  • Serves as the senior technical escalation point for L2/L3 production issues across AI and GenAI-enabled applications, including LLM-based services and RAG pipelines.
  • Designs and owns observability strategies for AI platform health, covering availability, latency, throughput, cost attribution, and model behavior drift.
  • Leads root cause analysis for complex AI inference failures and drives permanent remediation across engineering and product teams.
  • Evaluates, onboards, and operationalizes new GenAI capabilities, including Azure OpenAI Service, Foundation Model APIs, and vector store solutions.
  • Defines operational standards, SLAs, and runbooks for AI platform services, championing a proactive operations culture.
  • Builds and operates AIOps pipelines that leverage GenAI to analyze incidents, identify failure causes, and recommend remediation actions.
  • Integrates AIOps insights into CI/CD pipelines, validating deployments against known failure patterns and implementing AI-driven quality gates.
  • Owns the operational health of enterprise data platforms built on Azure and Databricks, including governance, table management, and job orchestration.
  • Leads cloud cost governance efforts for Databricks and Azure services, partnering with FinOps to optimize spend.
  • Enforces and continuously improves platform security posture, including RBAC, managed identity, network policies, and secrets management.
  • Leads major incident response for platform outages, produces high-quality RCAs, and drives post-incident improvements.
  • Mentors and guides junior engineers, contributing to hiring, onboarding, and skills development within the AIOps team.

Benefits

  • Competitive compensation and benefits package
  • Excellent career development opportunities

What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Number of Employees: 5,001-10,000