About The Position

DistroKid is the world’s largest distributor of music to Spotify, Apple Music, YouTube, and beyond. Most new music today is released through DistroKid. We are seeking a highly skilled Senior Systems Operations Engineer with deep expertise in cloud infrastructure, Infrastructure as Code (IaC), and AI-enhanced operations. This is a critical technical leadership role on the Systems Operations (SysOps) team, responsible for architecting and managing our cloud environment, driving IaC maturity, and integrating AI-powered practices that improve reliability, reduce toil, and scale our operational capabilities. You will serve as a subject matter expert across infrastructure domains, own complex workstreams end-to-end, and partner strategically with peers, engineering teams, and leadership to deliver impactful outcomes across the organization. This is a fully remote position, and success in the role depends on clear, open, and proactive communication that keeps distributed teammates informed, aligned, and unblocked.

Requirements

  • Bachelor’s degree in Computer Science, Information Technology, a related field, or equivalent practical experience.
  • 5+ years of experience in systems operations, platform engineering, or DevOps with a focus on cloud infrastructure and containerized environments.
  • Proven production experience with AWS services (EC2, EKS, S3, RDS, IAM, VPC, API Gateway, EventBridge, etc.) and Kubernetes.
  • 5+ years of hands-on experience with Infrastructure as Code tools, specifically Terraform and/or OpenTofu, including module design, state management, remote backends, and IaC testing.
  • Strong knowledge of Linux/Unix systems administration and shell scripting.
  • Proficiency in Python, Go, or similar programming languages.
  • Experience with CI/CD pipelines for infrastructure deployments (Bitbucket Pipelines, Jenkins, or similar).
  • Experience with monitoring and observability tools (Prometheus, Grafana, CloudWatch, or Datadog).
  • Demonstrated experience implementing or working with AIOps tools, practices, or AI-assisted operations in a professional context.
  • Experience using AI-assisted development tools (e.g., Cursor, Warp, Claude, or similar) to accelerate engineering work.
  • Strong communication skills with the ability to engage effectively across technical and non-technical audiences.
  • Commitment to open, transparent, and proactive communication in a fully remote environment, defaulting to over-communication to keep distributed teammates informed and aligned across time zones and async workflows.
  • Demonstrated ability to lead and influence without formal authority.
  • Excellent problem-solving skills with the composure to lead through incidents under pressure.
  • Ability to work in a fast-paced, dynamic environment with shifting priorities while maintaining a high-quality bar.

Nice To Haves

  • AWS Certified Solutions Architect, DevOps Engineer, or equivalent certification.
  • Prior experience designing or implementing an Internal Developer Portal (IDP) using platforms such as Backstage, Port, Cortex, or equivalent.
  • Experience with policy-as-code tools such as OPA, Checkov, or Sentinel.
  • Experience with service mesh technologies (Istio, Linkerd, or similar).
  • Familiarity with Docker and container orchestration tools beyond Kubernetes.

Responsibilities

  • Cloud & Infrastructure Architecture
      • Design, deploy, and manage scalable and highly available cloud infrastructure on AWS, with deep expertise in core services (EC2, EKS, S3, RDS, IAM, VPC, and beyond).
      • Develop and maintain disaster recovery plans leveraging AWS capabilities for backup and replication to ensure business continuity.
      • Collaborate with engineering and security teams to improve infrastructure health, security, and long-term scalability.
  • Infrastructure as Code (IaC)
      • Design reusable Terraform/OpenTofu modules following DRY principles and organizational standards; implement module versioning and lifecycle strategies.
      • Direct the migration of manual infrastructure to code; establish patterns and best practices for IaC adoption across the team.
      • Implement IaC testing strategies, including validation, linting, and integration testing, using tools such as terraform-compliance or Checkov.
      • Architect and maintain complex Bitbucket Pipelines configurations for multi-environment IaC deployments; implement pipeline security best practices.
  • AI-Enhanced Operations (AIOps)
      • Implement AIOps practices, leveraging AI tools to enhance monitoring, incident response, and predictive alerting.
      • Use AI-assisted development and operations tools (e.g., Cursor, Claude) to accelerate troubleshooting, code review, and documentation generation.
      • Evaluate and implement AI-powered automation to reduce operational toil, improve repeatability, and scale platform capabilities.
  • Reliability & Observability
      • Define and implement SLOs for services; lead and/or participate in incident response and conduct blameless postmortems.
      • Implement chaos engineering practices to proactively identify system weaknesses before they impact production.
      • Build and maintain comprehensive monitoring solutions using tools such as CloudWatch and Datadog to track performance and drive optimization.
  • Automation, Developer Experience & Internal Developer Portal
      • Develop automation scripts and tools in Python, Bash, or similar languages to streamline operations and eliminate manual toil.
      • Build self-service capabilities for development teams to reduce cognitive load and enable developer autonomy across the organization.
      • Lead the solution architecture and end-to-end implementation of DistroKid’s first Internal Developer Portal (IDP).
      • Define the IDP roadmap and success criteria in partnership with engineering leadership; establish golden paths, service catalogs, and self-service workflows that reduce deployment friction and accelerate developer productivity.
      • Drive adoption of the IDP across engineering teams; gather feedback, iterate on the platform, and measure impact through developer experience metrics and reduced time-to-deploy.
  • Cost Optimization
      • Lead cost optimization initiatives; implement rightsizing recommendations, reserved-capacity strategies, and tagging standards for cost allocation.
      • Monitor and optimize AWS resource usage; select appropriate services and configurations to meet performance requirements cost-effectively.
  • Technical Leadership & Collaboration
      • Direct planning, decision-making, and execution for infrastructure projects; own workstreams end-to-end.
      • Partner cross-functionally with engineering, security, and product teams; communicate impact in terms of company strategy and OKRs.
      • Provide technical mentorship to junior and mid-level engineers; invest in team growth and foster a culture of continuous learning.
      • Maintain and contribute to infrastructure documentation, runbooks, and architectural decision records to ensure knowledge sharing and operational consistency.

Benefits

  • Retirement plans (401(k), SIPP, etc.)
  • Health insurance
  • Generous paid time off
  • Parental leave
  • Home office allowance
  • Flexible work schedules
  • Paid and discounted subscriptions
  • Regular engagement activities