Senior Systems Operations Engineer

DistroKid

Remote

About The Position

DistroKid is the world’s largest distributor of music to Spotify, Apple Music, YouTube, and beyond. Most new music today is released through DistroKid. We are seeking a highly skilled Senior Systems Operations Engineer with deep expertise in cloud infrastructure, Infrastructure-as-Code (IaC), and AI-enhanced operations. This role is a critical technical leadership position on the Systems Operations (SysOps) team, responsible for architecting and managing our cloud environment, driving IaC maturity, and integrating AI-powered practices that improve reliability, reduce toil, and scale our operational capabilities. You will serve as a subject matter expert in infrastructure domains, own complex workstreams end-to-end, and partner strategically with peers, engineering teams, and leadership to deliver impactful outcomes across the organization. This is a fully remote position, and success in the role depends on clear, open, and proactive communication to keep distributed teammates informed, aligned, and unblocked.

Requirements

Bachelor’s degree in Computer Science, Information Technology, a related field, or equivalent practical experience.
5+ years of experience in systems operations, platform engineering, or DevOps with a focus on cloud infrastructure and containerized environments.
Proven production experience with AWS services (EC2, EKS, S3, RDS, IAM, VPC, API Gateway, EventBridge, etc.) and Kubernetes.
5+ years of hands-on experience with Infrastructure as Code tools, specifically Terraform and/or OpenTofu, including module design, state management, remote backends, and IaC testing.
Strong knowledge of Linux/Unix systems administration and shell scripting.
Proficiency in Python, Go, or similar programming languages.
Experience with CI/CD pipelines for infrastructure deployments (Bitbucket Pipelines, Jenkins, or similar).
Experience with monitoring and observability tools (Prometheus, Grafana, CloudWatch, or Datadog).
Demonstrated experience implementing or working with AIOps tools, practices, or AI-assisted operations in a professional context.
Experience using AI-assisted development tools (e.g., Cursor, Warp, Claude, or similar) to accelerate engineering work.
Strong communication skills with the ability to engage effectively across technical and non-technical audiences.
A habit of open, transparent, and proactive communication in a fully remote environment, defaulting to over-communication to keep distributed teammates informed and aligned across time zones and async workflows.
Demonstrated ability to guide and influence without formal authority.
Excellent problem-solving skills with the composure to lead through incidents under pressure.
Ability to work in a fast-paced, dynamic environment with shifting priorities while maintaining a high-quality bar.

Nice To Haves

AWS Certified Solutions Architect, DevOps Engineer, or equivalent certification.
Prior experience designing or implementing an Internal Developer Portal (IDP) using platforms such as Backstage, Port, Cortex, or equivalent.
Experience with policy-as-code tools such as OPA, Checkov, or Sentinel.
Experience with service mesh technologies (Istio, Linkerd, or similar).
Familiarity with Docker and container orchestration tools beyond Kubernetes.

Responsibilities

Cloud & Infrastructure Architecture
Design, deploy, and manage scalable and highly available cloud infrastructure on AWS, with deep expertise in core services (EC2, EKS, S3, RDS, IAM, VPC, and beyond).
Develop and maintain disaster recovery plans leveraging AWS capabilities for backup and replication to ensure business continuity.
Collaborate with engineering and security teams to improve infrastructure health, security, and long-term scalability.
Infrastructure as Code (IaC)
Design reusable Terraform/OpenTofu modules following DRY principles and organizational standards; implement module versioning and lifecycle strategies.
Direct the migration of manual infrastructure to code; establish patterns and best practices for IaC adoption across the team.
Implement IaC testing strategies, including validation, linting, and integration testing, using tools such as terraform-compliance or Checkov.
Architect and maintain complex Bitbucket Pipelines configurations for multi-environment IaC deployments; implement pipeline security best practices.
AI-Enhanced Operations (AIOps)
Implement AIOps practices, leveraging AI tools to enhance monitoring, incident response, and predictive alerting.
Use AI-assisted development and operations tools (e.g., Cursor, Claude) to accelerate troubleshooting, code review, and documentation generation.
Evaluate and implement AI-powered automation to reduce operational toil, improve repeatability, and scale platform capabilities.
Reliability & Observability
Define and implement SLOs for services; lead and/or participate in incident response and conduct blameless postmortems.
Implement chaos engineering practices to proactively identify system weaknesses before they impact production.
Build and maintain comprehensive monitoring solutions using tools such as CloudWatch and Datadog to track performance and drive optimization.
Automation, Developer Experience & Internal Developer Portal
Develop automation scripts and tools in Python, Bash, or similar languages to streamline operations and eliminate manual toil.
Build self-service capabilities for development teams to reduce cognitive load and enable developer autonomy across the organization.
Guide the solution architecture and end-to-end implementation of DistroKid’s first Internal Developer Portal (IDP).
Define the IDP roadmap and success criteria in partnership with engineering leadership; establish golden paths, service catalogs, and self-service workflows that reduce deployment friction and accelerate developer productivity.
Drive adoption of the IDP across engineering teams; gather feedback, iterate on the platform, and measure impact through developer experience metrics and reduced time-to-deploy.
Cost Optimization
Guide cost optimization initiatives; implement rightsizing recommendations, reserved-capacity strategies, and tagging standards for cost allocation.
Monitor and optimize AWS resource usage; select appropriate services and configurations to meet performance requirements cost-effectively.
Technical Leadership & Collaboration
Direct planning, decision-making, and execution for infrastructure projects; own workstreams end-to-end.
Partner cross-functionally with engineering, security, and product teams; communicate impact in terms of company strategy and OKRs.
Provide technical mentorship to junior and mid-level engineers; invest in team growth and foster a culture of continuous learning.
Maintain and contribute to infrastructure documentation, runbooks, and architectural decision records to ensure knowledge sharing and operational consistency.