Sr. Software Engineer

Blackhawk Network•Dallas, TX

48d•Hybrid

About The Position

We are looking for a Senior or Staff Software Engineer with deep Site Reliability Engineering expertise and strong Python development skills to join our SRE team. You will own the reliability, scalability, and observability of a critical data pipeline platform. This is a builder and operator role: you will write production-grade Python services, implement observability-as-code infrastructure, and drive reliability improvements across a complex AWS-native platform. You will work at the intersection of software engineering and infrastructure, eliminating toil through automation, building internal developer platforms, and ensuring our systems meet the reliability bar our partners and customers expect.

Requirements

Bachelor's degree in Information Technology, Computer Science, or related field; or equivalent experience.
7+ years of software engineering experience
4+ years of SRE or platform engineering experience in a production environment
Strong Python proficiency: production services, automation, REST APIs
Experience with AWS: EKS, EC2, RDS/Aurora, S3, IAM, VPC, CloudWatch, Security Groups
Hands-on Terraform experience: writing modules, managing state, CI/CD integration
Experience with observability platforms: Splunk, New Relic (NRQL, NerdGraph)
Strong understanding of SRE principles: SLOs/SLIs/error budgets, toil reduction, incident management, capacity planning
Experience with Kubernetes / EKS pod operations
Strong Linux/Unix fundamentals: log management, performance debugging

Nice To Haves

Experience with log pipeline tooling: FluentD, OpenTelemetry
Familiarity with Splunk, New Relic, OpenSearch, or Elasticsearch for log storage and search
Experience building internal developer platforms or tooling: CLI tools, internal APIs, automation frameworks
Familiarity with MCP (Model Context Protocol) server development or AI Gateway architecture

Responsibilities

Own and improve service reliability, availability, and performance across a distributed AWS platform (EKS, EC2, RDS, S3)
Define and track SLOs, SLIs, and error budgets for critical services and partner-facing integrations
Drive toil-reduction initiatives: identify manual, repetitive operational work and eliminate it through automation
Partner with engineering teams to embed reliability practices into the SDLC
Build and maintain observability-as-code infrastructure using Terraform and New Relic NerdGraph
Design multi-signal observability pipelines (metrics, logs, traces) across Splunk, New Relic, and OpenSearch
Architect and maintain log routing and data pipelines using FluentD and Sawmills
Build New Relic NerdGraph integrations and custom NRQL dashboards for platform health visibility
Maintain and extend Terraform modules for AWS infrastructure provisioning: Security Groups, EKS node groups, Aurora, IAM roles
Contribute to the MCP server architecture (Python-based internal developer tooling) integrating Jira, Confluence, Bitbucket, and observability platforms
Design, build, and maintain production-grade Python services and automation: REST APIs, background workers, CLI tooling, and data pipeline adapters
Mentor junior and mid-level engineers through code reviews, pair programming, and architecture guidance
Work closely with program managers and product stakeholders