Staff Systems Engineer - Hyperscaler

ServiceNow • Kirkland, WA

About The Position

The Cloud Advancement Team (CAT) is a highly dynamic, technically advanced group focused on enabling customers' transition to hyperscaler cloud environments such as AWS, Azure, and GCP. Whether supporting net-new deployments or migrating existing services, CAT ensures optimal performance across infrastructure layers. This team serves as the frontline for system performance and hardware validation within the hyperscalers, driving innovation in server SKUs (CPU, memory, storage), working with Azure on integration issues, and collaborating with internal teams to ensure the service is scalable and supportable. CAT is not a traditional operations or ticket-handling team; instead, it's a hybrid group with deep engineering focus. The team is responsible for designing and validating hardware configurations, developing automation for deployment, and debugging complex infrastructure issues across hyperscale environments. Every day brings a new challenge - from scaling deployments to investigating deep system-level bugs - making this team ideal for problem solvers and cloud-savvy engineers.

Requirements

Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
Typically requires a minimum of 5 years of related experience with a Bachelor's degree; or 3 years and a Master's degree; or a PhD without experience; or equivalent work experience.
Strong hands-on experience in software engineering or infrastructure software development, with proficiency in Python, Go, or a similar language.
Working knowledge of at least one hyperscaler (GCP, AWS, or Azure).
Proficient in Linux system administration and debugging.
Experience with infrastructure automation tools (Puppet, Ansible).
Solid scripting and automation skills with a developer mindset.
Familiarity with Git and code-based deployment workflows.
Ability to work in a fast-paced environment where priorities shift frequently.
Excellent problem-solving skills, curiosity, and ability to work without a predefined solution path.

Nice To Haves

Knowledge of the ServiceNow platform.
Experience writing infrastructure tests or using test frameworks in a dev-oriented capacity.
Background working on developer platforms, DevOps tooling, or internal automation systems.
Exposure to infrastructure-as-code tools (Terraform, Ansible) from a software-centric perspective.

Responsibilities

Collaborate with cloud providers (Azure, AWS, GCP) to support migrations and new deployments.
Design, test, and validate new server SKUs in partnership with Technical Account Managers and engineering teams.
Build tools and automation to streamline server configuration, validation, and deployment processes before hand-off to Quality Engineering (QE) testing.
Design and develop software integrations with cloud services and internal APIs to support automation of infrastructure operations.
Automate server deployment and scaling processes using tools such as Puppet, Ansible, and Git.
Troubleshoot and resolve tier-3 customer performance issues with a strong emphasis on debugging and root cause analysis.
Act as a technical point of contact for hyperscaler integration issues, particularly for database nodes and performance validation.
Balance short-term customer-driven priorities with long-term capacity planning and automation improvements.