Principal Site Reliability Engineer

iSpot
Bellevue, WA
Hybrid

About The Position

iSpot.tv is changing how brands, agencies, and networks measure and assess the impact of TV advertising. We deal with BIG data, operating mainly in AWS with multiple Kubernetes clusters and thousands of servers. We are looking for an experienced SRE leader with the skills and passion to make a significant impact on our ecosystem. You will have a wide array of projects to tackle, with ample opportunities for growth.

You will be a key member of our SRE leadership team, focused on empowering developers to build, test, and deploy applications faster and more efficiently. You will both lead the team and remain hands-on in designing, building, and maintaining the tools, platforms, and processes that improve our engineering teams' productivity and streamline the software development lifecycle. Your work will directly impact developer happiness and the speed at which we can deliver innovative features to our customers.

We are seeking a seasoned and strategic Lead/Principal Site Reliability Engineer to drive the reliability, scalability, and performance of our core production systems while significantly enhancing the internal developer experience. This role sits at the intersection of operations and development, requiring deep technical expertise, strong leadership, and a passion for optimizing the entire software development lifecycle (SDLC). Our team consists of senior engineers who work together with minimal supervision to attain those goals. Candidates must possess deep operational experience with AWS and Kubernetes to support teams utilizing these systems.

You will lead the technical direction of the team while remaining a key individual contributor. You will be responsible for creating a culture of engineering excellence, designing self-service platforms, and fostering alignment across all engineering teams to accelerate product delivery and maintain world-class service stability.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
  • 10+ years of relevant experience in software engineering, cloud architecture, and/or Site Reliability Engineering, with at least 3 years in a leadership or lead contributor role.
  • Deep expertise in AWS, including EKS, ECR, RDS, SQS/SNS, VPC, MWAA, and S3.
  • Strong proficiency in Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation).
  • Specialized experience in optimizing large-scale data platforms, specifically with Apache Spark.
  • Proven ability to profile, troubleshoot, and tune Spark jobs for performance, cost, and reliability.
  • 5+ years of experience with Kubernetes and containerization in general, including associated tools (kubectl, Helm, ArgoCD).
  • Strong knowledge of AWS cost optimization.
  • Strong knowledge of TCP/IP networking, including routing and AWS security groups.
  • Excellent knowledge of CI/CD concepts and experience developing associated pipelines in CircleCI.
  • Proficient in high-level scripting languages, including shell scripting, Python, and/or JavaScript.
  • Experience with OpenTelemetry (OTel) and monitoring tools such as Splunk or Datadog.
  • Excellent communication, collaboration, and stakeholder management skills, with proven experience driving technical initiatives across multiple teams.
  • Experience researching and selecting modern developer toolsets and helping teams adopt them, including vendor assessments, security assessments, and the procurement process.

Nice To Haves

  • Experience with AI-native observability tools.
  • Experience evaluating and rolling out GenAI tools to improve developer efficiency.
  • Experience in an Ad-Tech or “BIG Data” processing organization is highly preferred.

Responsibilities

  • Architect, build, and maintain scalable, highly available, and reliable cloud infrastructure in AWS leveraging modern container orchestration technologies.
  • Serve as the reliability and cost optimization expert for high-volume, data-intensive workloads.
  • Focus on optimizing and ensuring the stability of distributed data processing engines, specifically Apache Spark and related ecosystems (e.g., EMR, Databricks, Glue).
  • Establish comprehensive observability practices by defining SLIs/SLOs and implementing advanced monitoring, alerting, and logging solutions to quickly identify and resolve system anomalies.
  • Drive automation across all operational aspects, including infrastructure provisioning (Terraform), scaling, deployment, and incident response, minimizing toil and manual effort.
  • Lead and participate in the incident response lifecycle, performing thorough post-mortems to derive actionable insights and implement preventative measures to improve system resilience.
  • Define and champion the strategic roadmap for AI/ML integration within SRE, establishing organizational best practices for AIOps, automated incident remediation, toil reduction via LLMs, automated root cause analysis (RCA), and the governance of LLM-driven tooling to enhance system observability and resilience.
  • Design, implement, and champion self-service tools, internal developer portals, and services that empower engineering teams to manage their infrastructure and deployments independently and efficiently.
  • Lead the standardization of AI developer assistants by architecting and maintaining global 'steering files' and context-configuration standards, ensuring AI-generated code aligns with our specific patterns, security protocols, and architectural guardrails.
  • Own and continuously improve the CI/CD pipelines, reducing build times, streamlining deployment workflows, and integrating best practices for testing, security (Shift Left), and code quality.
  • Maintain and improve our container orchestration and deployment tools, leveraging Kubernetes, Helm, and ArgoCD to create seamless developer workflows.
  • Develop, implement, and maintain a set of key performance indicators (KPIs) to measure and improve the developer experience across all of Engineering.
  • Guide and mentor senior engineers, promoting SRE/DevEx principles.
  • Develop clear, comprehensive documentation and tutorials to ensure seamless adoption of new tools and platforms.
  • Strategically identify and implement opportunities for cloud cost optimization and resource efficiency without compromising reliability or performance.
  • Define, champion, and communicate the long-term technical roadmap for the SRE and DevEx platforms, balancing immediate operational needs with strategic, future-state goals.
  • Act as a critical liaison between infrastructure, security, and product development teams.
  • Proactively drive cross-team alignment on architectural standards, tooling choices, and development workflows to ensure consistency and shared accountability for system health.
  • Systematically identify engineering bottlenecks, friction points, and points of organizational toil within the SDLC.
  • Implement targeted solutions—whether technical, process-based, or organizational—to mitigate these constraints and enhance overall engineering velocity.
  • Collaborate with engineering leadership to transform the strategic roadmap into actionable, prioritized plans, securing cross-functional buy-in and resources for successful execution.

Benefits

  • Compensation packages consist of salary and equity in one of Seattle’s hottest start-ups.
  • Standard benefits.
  • Opportunity to participate in iSpot’s equity plan to receive stock options.
  • Non-exempt roles will also be eligible for (pre-approved) overtime pay.