Senior DevOps Engineer

Merciv · New York, NY
118d · $160,000 - $220,000

About The Position

Merciv is transforming retail through autonomous AI that doesn't just analyze businesses - it helps run them. Our platform powers some of the largest retail organizations in the world, processing data and managing workflows across millions of data points. We're building the infrastructure that enables AI agents to inform consumer-facing strategies, make million-dollar inventory decisions, optimize pricing in real time, and orchestrate complex retail operations 24/7.

As we scale our enterprise platform, we need a Senior DevOps Engineer who can build and maintain the rock-solid, secure infrastructure that autonomous commerce demands. This is your chance to architect the systems powering the future of retail AI. You'll own the infrastructure that supports AI agents making split-second decisions for major retailers. Working closely with our ML and backend engineers, you'll ensure our platform maintains 99.97%+ uptime while handling Black Friday-level traffic every day.

This is a hands-on role where you'll build the secure, scalable systems that enterprise retailers trust with their entire operations. You'll be the guardian of infrastructure that must be enterprise-grade (SOC 2, GDPR, and ISO 27001 compliant) while maintaining startup agility. Your work directly impacts whether a retailer's AI can respond to market changes in milliseconds or minutes - a difference measured in millions of dollars.

Requirements

  • 6-10+ years of industry experience with at least 4 years in hands-on DevOps roles
  • 4+ years managing cloud infrastructure in production (AWS strongly preferred)
  • 2+ years of production Kubernetes experience (EKS preferred)
  • Expert-level AWS knowledge (EC2, EKS, Lambda, S3, RDS, IAM, Secrets Manager, KMS)
  • Advanced Infrastructure-as-Code expertise with Terraform and Terragrunt
  • Strong GitOps experience and configuration management (Ansible)
  • Hands-on experience with bare metal configuration and machine templates
  • Advanced Docker knowledge and container debugging skills
  • Production Kubernetes experience with Helm, FluxCD, and KEDA
  • Strong Python and Bash scripting for automation and CLI tool development
  • CI/CD pipeline design with GitHub Actions and other platforms
  • Experience with observability stacks (New Relic preferred; CloudWatch, Prometheus/InfluxDB)
  • Deep understanding of network security, load balancing, and DNS
  • Solid Linux administration and system debugging skills

Nice To Haves

  • Backend or full-stack development experience
  • AI/ML infrastructure experience (model serving, GPU clusters, training pipelines)
  • Experience with real-time, high-throughput data systems
  • Multi-tenant SaaS platform expertise
  • Retail or e-commerce domain knowledge
  • eBPF for advanced observability
  • Experience with Terraform Cloud at scale
  • Service mesh technologies
  • Multi-region deployment expertise for global retail operations
  • SecOps experience at enterprise scale
  • Experience with event-driven architectures
  • Knowledge of streaming platforms (Kafka, Kinesis)

Responsibilities

  • Scale AI Infrastructure: Architect and optimize infrastructure supporting high-volume daily agentic decisions
  • Ensure Enterprise Reliability: Build systems that maintain 99.97% uptime for mission-critical retail operations across Fortune 500 clients
  • Automate Everything: Develop robust CI/CD pipelines for rapid ML model deployment and infrastructure updates without downtime
  • Secure Sensitive Data: Implement and maintain SOC 2, GDPR, and ISO 27001 compliant infrastructure for enterprise retail data
  • Optimize AI/ML Workflows: Partner with engineers to streamline model training, deployment, and inference pipelines at scale
  • Champion GitOps: Implement infrastructure-as-code practices that let us scale from hundreds to thousands of AI agents seamlessly
  • Monitor Autonomous Systems: Build observability into distributed agent networks processing millions of retail data points
  • Enable Multi-Tenancy: Design secure, isolated environments for enterprise clients while maintaining operational efficiency
  • Integrate Enterprise Systems: Support seamless connections with Shopify Plus, SAP, Oracle Retail, and other major platforms
  • Own Production Excellence: Lead incident response for a platform where minutes of downtime could mean millions in lost revenue

Benefits

  • Health
  • Dental
  • Vision
  • Life
  • Commuter