Software Engineer - Core Infrastructure Team

Eightfold•Santa Clara, CA

50d•$120,750 - $161,000•Hybrid

About The Position

Eightfold is a global leader in AI-native enterprise talent platforms, trusted by the world’s largest and most respected Fortune 500 organizations. Our platform is built from the ground up to operate at scale across Azure and AWS. It is deployed in multiple regions globally, including IL4-compliant environments for the US Government, and supports users in 100+ countries and 30+ languages.

Today, Eightfold is at the forefront of agentic AI, delivering intelligent agents that actively drive outcomes across hiring and talent workflows while much of the industry is still experimenting with prototypes. We are defining the next era of agentic talent systems.

What sets Eightfold apart is not just the technology and our mission, but the team behind it. We are a deeply technical, execution-driven organization that values ownership, collaboration, and high standards. Our engineers, product leaders, and go-to-market teams work closely together, in person and across functions, to build systems that scale in the real world. If you’re excited to work on hard problems, move with urgency, raise the bar every day, and help build agentic systems that transform how the world works, Eightfold is the place to do it.

About Eightfold's Core Infrastructure Team

The Core Infrastructure Team is the backbone of Eightfold, responsible for the architecture, maintenance, and enhancement of critical elements of our technology stack. This encompasses Search, Databases, Machine Learning Infrastructure, Data Warehouse, Developer Platform, and Application Infrastructure. Our work is foundational to every product at Eightfold, underpinning the services our users and customers interact with daily. The infrastructure we build and maintain is pivotal to our mission, ensuring scalability, reliability, and security across all our offerings. It is used by every team and powers every product at Eightfold.

Requirements

Demonstrable experience in designing, developing, and delivering highly scalable systems.
Strong foundation in cloud-scale distributed systems.
Deep understanding of cloud environments (AWS, GCP, or Azure).
Strong coding, data structures, algorithms, and problem-solving skills.
1-3+ years of experience building high-quality software that is secure, scalable and highly available.
Experience with containerization and orchestration technologies such as Docker and Kubernetes.
Familiarity with Cloud Operations principles. We are tools-agnostic and often build our own, or extend existing tools and services.
Experience with automation tools and languages, including Terraform, CloudFormation, Ansible, Python, and shell scripting.
Demonstrable skills in Python scripting and in triaging production issues using CloudWatch and other debugging tools.
Proven experience providing 24/7 on-call support and incident management for critical production infrastructure.
Excellent problem-solving and troubleshooting skills.
Effective communication skills, both verbal and written.

Responsibilities

Design, build, and manage secure, scalable cloud environments on AWS and Azure, ensuring high availability, reliability, observability, security, and cost-efficiency.
Build out large-scale software platforms that are used by millions of users and process terabytes of data. This team supports a multitude of services that leverage this data.
Design and support highly scalable systems, ensuring they meet our rigorous standards for quality, security, scalability, and availability.
Create algorithms and data structures to improve overall system performance.
Build out microservices using technologies such as Docker and Kubernetes to power all our products and to maximize system extensibility and performance.
Support the deployment and operations of our product across multiple environments.
Automate infrastructure deployment, configuration, monitoring, and disaster recovery using Shell scripting, Ansible, Terraform, and other automation tools to minimize manual intervention.
Implement and maintain containerized solutions using Docker, Kubernetes, and other cloud-native technologies to support large-scale, distributed systems serving millions of users and processing terabytes of data.
Monitor, maintain, and improve system performance, proactively addressing reliability, capacity, and cost issues.
Drive continuous improvement by contributing to post-incident reviews, root cause analysis, and implementing preventive solutions.
Support customers worldwide, including governments and large enterprises.
Participate in 24/7 on-call rotations, triaging and resolving production incidents, following and improving runbooks, and ensuring smooth escalation processes.
Diagnose and solve problems that can arise in a complex distributed environment.