About The Position

Choice Hotels has an exciting new opportunity as our Staff Software Engineer, Resiliency & Platform Engineering in the SkyTouch Technology division. SkyTouch Technology is an independently operated division of Choice Hotels that provides the most widely used cloud-based (SaaS) hotel property management system.

As a key member of our SkyTouch Technology division, you will help strengthen the resiliency, safety, and operability of a large-scale, multi-tenant SaaS platform by improving foundational platform capabilities, runtime behavior, and the developer experience used to build and operate our systems. This role sits at the intersection of software engineering, platform engineering, and resiliency. You will focus on building shared capabilities, libraries, frameworks, tooling, guardrails, and standards used by dozens of engineers across the organization. These capabilities make resilient behavior the default for application teams and reduce operational risk through better system design rather than reactive response.

This is not a traditional Site Reliability Engineering (SRE) role. In our environment, resiliency and platform engineering are proactive, year-round engineering disciplines focused on preventing failures, improving system behavior under stress, and enabling teams to build and operate services safely at scale. The emphasis is on durable, systemic improvements and developer enablement rather than pager-driven operations or feature delivery.

You will also be expected to apply AI-assisted tools and techniques pragmatically to reduce engineering toil, improve diagnostics, and accelerate resiliency and platform outcomes, prioritizing durability, correctness, and adoption over experimentation.

Are you a senior engineer who thrives on improving how software is built and operated at scale, someone who prefers fixing root causes, strengthening platforms, and improving developer experience as a reliability lever? The #SkysTheLimit when you #MakeItYourChoice!
We encourage you to apply today!

Requirements

  • Bachelor’s degree in computer science or a related technical field, or equivalent practical experience building and operating production systems.
  • Typically, 8–10+ years of hands-on experience designing, building, and supporting large-scale software systems in production environments.
  • Hands-on experience designing, building, and operating Java-based services, including Spring Boot applications running in virtualized and containerized environments.
  • Experience developing and supporting cloud-native and serverless workloads, including Python-based services and event-driven architectures.
  • Strong practical experience working in AWS public cloud environments, with an understanding of how cloud-managed services influence reliability, scalability, and operational behavior.
  • Working knowledge of relational and non-relational data stores, including how data persistence, availability, and failure characteristics impact system design and resiliency.
  • Experience using application monitoring and observability platforms to understand system behavior in production, such as application performance monitoring, centralized logging, and cloud-native telemetry tools (for example, AppDynamics, OpenSearch, Amazon CloudWatch, or similar).
  • Comfortable diagnosing complex production issues by interpreting metrics, logs, traces, and runtime signals rather than relying solely on reactive incident handling.
  • Solid understanding of Site Reliability Engineering (SRE) principles, with the judgment to apply them selectively to guide platform and resiliency improvements rather than adopting SRE practices as a one-size-fits-all operating model.
  • Demonstrated ability to choose between software design changes, platform capabilities, or developer enablement as the most effective way to improve reliability and operability.
  • Hands-on experience designing and delivering one or more platform-level capabilities such as shared libraries, frameworks, internal tooling, or enablement platforms used by multiple application teams.
  • Experience creating and rolling out paved roads, guardrails, or standardized patterns that balance safety, usability, and developer autonomy.
  • Experience using AI-assisted tools (such as code assistants, log/trace analysis, or incident analysis tools) to improve engineering effectiveness or system reliability.
  • Proven ability to influence technical direction and engineering practices across teams without direct ownership of delivery backlogs.
  • Successful candidates for this role consistently demonstrate strength in the following Korn Ferry competencies:
      • Manages Complexity: Navigates complex technical environments, synthesizes information across systems, and identifies systemic root causes.
      • Decision Quality: Makes sound technical decisions under constraints, balancing immediate needs with long-term platform health.
      • Drives Results: Delivers durable improvements in platform resiliency, stability, and developer effectiveness.

Nice To Haves

  • Cloud or technology certifications (such as AWS certifications or equivalent) are a plus and demonstrate commitment to building and operating reliable systems at scale.

Responsibilities

  • Design and implement platform-level capabilities including shared libraries, frameworks, tooling, automation, and guardrails that improve application resiliency, runtime safety, and developer experience across the ecosystem, favoring leverage and durability over short-term delivery.
  • Strengthen foundational platform and runtime behavior by identifying and eliminating systemic failure modes such as JVM memory leaks, unsafe defaults, brittle error handling, poor failure propagation, and resource exhaustion.
  • Improve how software is built and operated at scale by defining and rolling out developer-facing standards and paved roads for resiliency, observability, error handling, and operational readiness.
  • Define, standardize, and evolve logging, monitoring, alerting, and observability practices that improve signal quality, reduce noise, and enable faster diagnosis and recovery.
  • Partner closely with Principal Software Engineers, Solution Architects, and Engineering Managers to identify systemic risks and translate them into well-scoped platform and resiliency initiatives and technical work.
  • Operate across software engineering resiliency, data engineering resiliency, and platform engineering teams to identify cross-cutting risks, design shared solutions, and raise the technical bar, rather than owning individual team backlogs.
  • Engage directly in application codebases, particularly during ramp-up, to understand real-world system behavior, identify failure patterns, and validate resiliency improvements. Exit application-level work once learning is complete and systemic improvements are identified.
  • Participate in incident postmortems and operational reviews to identify recurring patterns and ensure lessons learned are translated into durable platform or resiliency improvements, not one-off fixes.
  • Evaluate, prototype, and introduce tools and technologies that measurably improve developer productivity, platform safety, and application resiliency, prioritizing adoption, simplicity, and long-term impact.
  • Apply AI-assisted development, diagnostics, and operational tools where they demonstrably improve engineering productivity, root cause analysis, signal quality, or resiliency outcomes.
  • Influence engineering practices and technical direction through design reviews, reference implementations, mentorship, and technical leadership rather than formal authority or delivery ownership.

Benefits

  • Competitive compensation and benefits, including medical, dental, and vision coverage
  • Leave and paid time off for holidays, vacation, personal, family, volunteer, sick, jury duty, bereavement, military, and religious observance
  • Financial benefits for retirement and health savings
  • Employee recognition programs
  • Discounts at Choice hotels worldwide