Principal Site Reliability Engineer

PTC•Boston, MA

21d•Hybrid

About The Position

We are looking for a Principal Site Reliability Engineer (SRE) to play a critical role in ensuring the long‑term reliability, scalability, and operational excellence of our platform. In this role, you will operate with a high degree of autonomy and influence: you will lead complex, cross‑organization reliability initiatives, shape reliability strategy, and serve as a technical authority and trusted advisor across engineering. Your work will directly shape our customers' experience and protect their trust by keeping the platform fast, resilient, and dependable across the entire system lifecycle. This role is ideal for engineers who enjoy solving ambiguous, high‑impact problems at scale, influencing system design across teams, and raising the reliability bar for an entire organization.

Requirements

Ability to commute to the Seaport Boston office 2-3 days a week
7+ years of experience in software engineering, site reliability engineering, or systems engineering roles
Extremely strong proficiency with the Java programming language and its ecosystem, including building, debugging, and operating production Java services
Deep experience operating complex, distributed systems in production environments
Strong software engineering background, with a track record of delivering high‑quality, maintainable code
Expert understanding of incident management, service reliability, and performance engineering
Strong hands‑on experience with observability (metrics, logs, traces), capacity planning, and SLO‑driven reliability
Deep familiarity with modern cloud‑based infrastructure, CI/CD pipelines, and infrastructure‑as‑code practices
Ability to reason about failure modes across application, data, and infrastructure layers
Demonstrated ability to lead complex initiatives that span teams and organizational boundaries
Comfortable making high‑impact technical decisions in ambiguous environments
Strong communicator who can influence design and operational decisions across a wide range of stakeholders
Systems thinker focused on root‑cause analysis and durable fixes
Calm and effective under pressure, especially during high‑severity incidents
Curious, data‑driven, and committed to continuous improvement

Nice To Haves

Experience operating or supporting systems using technologies such as MongoDB, ZooKeeper, and RabbitMQ
Background in performance tuning and scalability optimization of Java services
Experience setting or influencing engineering standards at the organization level
Prior involvement in evolving SRE or platform practices in a growing engineering organization
Experience designing, operating, or scaling systems in cloud environments such as AWS (preferred), including familiarity with core services, networking models, and reliability features

Responsibilities

Own Reliability at Scale
Lead design, implementation, and evolution of reliability, availability, and resiliency strategies for large‑scale distributed systems written primarily in Java
Apply deep experience operating complex, distributed systems to guide architectural decisions, reliability strategies, and long-term system evolution
Identify systemic risks in application architecture, data flows, and infrastructure, and drive architectural improvements that measurably improve availability, performance, and scalability
Set and evolve reliability standards, best practices, and operational principles across R&D
Drive Operational Excellence
Lead efforts to prevent, detect, and mitigate incidents through technical improvements and operational maturity
Serve as a senior coordination point during major incidents, helping manage response and guide long-term remediation
Champion blameless post-incident reviews and ensure learnings translate into durable system improvements
Reduce Toil Through Engineering
Apply advanced software engineering practices to eliminate manual work, reduce operational load, and improve system observability
Design and build internal platforms, automation, and tooling that support Java‑based services and their operational needs
Raise the bar on monitoring, alerting, and SLO/SLI adoption across systems
Lead Through Influence and Collaboration
Partner deeply with product engineers, architects, and engineering leadership to ensure reliability and operability are first‑class concerns in system design
Review and influence designs for complex systems involving technologies such as datastores, messaging systems, and coordination services
Serve as a technical mentor and coach for SREs and other engineers, raising overall engineering and operational maturity
Shape Strategy and Direction
Contribute to longer‑term reliability and infrastructure strategy aligned with business growth
Stay current with industry trends in SRE, distributed systems, and the Java ecosystem, turning insights into practical improvements
Help define what “great reliability” looks like for the organization and how we measure it

Benefits

Employees may be eligible for medical, dental and vision insurance, paid time off and sick leave, tuition reimbursement, 401(k) contributions and employer match, flexible spending accounts, life insurance, disability coverage and, for office-assigned employees, a generous commuter subsidy.
Employees also have the opportunity to become a PTC shareholder through our employee share purchase program (ESPP), which allows for the purchase of discounted PTC stock.
Certain roles may also be eligible for participation in our equity programs.