Software Engineer

Sift

About The Position

The Core Platform team is responsible for maintaining and optimizing the data, infrastructure, messaging, and services platform that powers Sift’s online systems. We ensure these systems are always available, reliable, and performing at their best to meet customer needs. In the event of an outage or failure, we follow well-practiced recovery plans to restore services swiftly. Managing such complex, large-scale systems requires continuous monitoring and proactive maintenance to uphold these standards.

Requirements

2+ years of experience as a Software Engineer focused on infrastructure/platform services or in a Site Reliability Engineering (SRE) role.
Strong programming skills in languages such as Java, Scala, or Python.
Extensive experience building and managing cloud infrastructure on AWS or GCP.
Expertise in building infrastructure as code and automating provisioning processes using tools like CloudFormation or Terraform.
Proficiency in setting up and managing monitoring and alerting systems, both open-source and commercial.
Familiarity with Docker and container orchestration technologies like Kubernetes, GKE, or AWS ECS.
Experience troubleshooting and resolving production system issues, with a focus on building automated solutions to prevent future occurrences.
Proven expertise in automation and a solid understanding of configuration management tools.

Responsibilities

Design and build immutable infrastructure and fault-tolerant, multi-AZ/multi-region systems that are resilient and self-healing.
Implement multi-region deployments, such as Bigtable clusters spanning multiple regions, with routing strategies that send specific customers to designated regions (e.g., sticky sessions at the regional level).
Optimize local development and testing workflows to be fast, efficient, and seamless.
Create dynamic environments that enable specific services to interact with other environments in real time.
Develop automated bot solutions for deployment and monitoring, integrating with Slack for streamlined updates.
Participate in on-call support and incident response activities, providing 12/7 coverage for one calendar week approximately once every 3-4 weeks.