Site Reliability Engineer

Kochava•Sandpoint, ID

54d•Remote

About The Position

Kochava provides a unified platform with solutions for advertisers and publishers across the omni-channel advertising ecosystem to link media investments to outcomes. Kochava is an industry leader in the advertising ecosystem, providing tools and technologies for leading brands, agencies, and premium publishers for measurement and attribution, media mix modeling (MMM), and search ads optimization. We enable visibility into, and management of, trillions of data points, hundreds of millions of users, and billions of dollars in lifetime value (LTV) and paid ad spend. Our suite of solutions is used as a growth stack by leading brands and publishers, empowering them to see and manage their data and unleash the power of their connected audiences.

Kochava is looking for enthusiastic engineers to join our Site Reliability Engineering Team. As a member of this team, you will focus on software development and infrastructure design, building services to manage, scale, and monitor our shared core infrastructure. The infrastructure and services that this team is responsible for include databases, message queues, monitoring solutions, security, and networking in the cloud and in physical data centers. Engineers on this team will be challenged in a fast-paced environment and will steer the advancement of efficient, resilient, and scalable shared resources used by many of our core production services.

This position is with Kochava, Inc., based in Sandpoint, Idaho, and can be performed remotely from one of the following approved US states: California, Colorado, Georgia, Idaho, Illinois, Montana, North Carolina, New Jersey, New York, Oregon, Texas, or Washington.

Requirements

You have spent at least 5 years as an SRE, DevOps engineer, or in an equivalent role.
You have in-depth knowledge of containerization and of managing complex workloads in Kubernetes.
Proven track record of designing, building, optimizing, and maintaining infrastructure on a large scale.
You are an expert with IaC tools such as Terraform.
A deep understanding of the Linux operating system, from the console to the kernel.
Ability to work as part of a distributed team.
Knowledge of CI/CD best practices.

Nice To Haves

Software development experience using Go, Python and Java.
Experience with on-prem Kubernetes clusters.
Experience with Kafka, MySQL, InfluxDB, Elasticsearch, Redis, and/or Memcached.

Responsibilities

Streamline and enhance the day-to-day operational workflows of critical applications and services in a 24x7x365 environment located in Google Cloud Platform, AWS, and physical data centers.
Build tools to enhance performance, scalability and observability of resources shared between multiple projects in production.
Interact with other teams across the organization to define SLOs, SLIs, and SLAs.
Evangelize the adoption of best practices in relation to performance and reliability.
Continuously improve observability to ensure the uptime and reliability of our applications and infrastructure.
Troubleshoot issues across the entire stack (hardware, software, application, and network) within physical data center and cloud-based environments.
Provide on-call support for shared services and infrastructure.
