Lead, Service Reliability Engineer R&D

Johnson & Johnson Innovative Medicine • Raritan, NJ

10d • $94,000 - $151,800 • Hybrid

About The Position

The Service Reliability Engineer (SRE) designs, builds, and operates reliability practices and technical capabilities that ensure critical engineering and enterprise services are available, performant, secure, and resilient. This is a hands-on, non-manager role focused on improving service reliability through observability, incident response, automation, and engineering excellence.

This role partners closely with Product Owners, development teams, infrastructure/platform engineering, Quality/Validation, Security, and Enterprise Architecture to define reliability targets, implement operational controls, and maintain documentation appropriate for regulated environments. The SRE helps standardize operational patterns across environments (dev/test/prod), including monitoring baselines, access controls, runbooks, change management, and deployment readiness.

Key outcomes include establishing and measuring Service Level Indicators/Objectives (SLIs/SLOs), improving alert quality and troubleshooting speed, reducing incident frequency and Mean Time to Recovery (MTTR), and enabling safe, repeatable releases through automation and operational readiness. The SRE identifies reliability risks and technical gaps, recommends scalable and resilient designs, implements reusable operational tooling, and participates in Agile ceremonies and on-call support aligned to the team’s ways of working.

Requirements

Bachelor’s degree in Computer Science, Engineering, or related discipline, or equivalent experience.
5+ years of experience in SRE, DevOps, platform engineering, or software engineering with substantial production operations responsibilities.
Hands-on experience with observability and incident management practices, including monitoring/alerting design, on-call operations, and root-cause analysis.
Experience with infrastructure-as-code and CI/CD (e.g., Terraform/CloudFormation, Git, Azure DevOps/Jenkins or similar) and automated testing/release practices.
Experience operating services in cloud-hosted or hybrid enterprise environments (AWS and/or on-prem), including networking fundamentals, secure configuration, and environment management.
Strong communication skills with the ability to explain technical issues, incident impact, reliability risks, and tradeoffs to both technical and non-technical stakeholders.
Working knowledge of Agile delivery practices and ability to collaborate across cross-functional teams (Product, Engineering, QA/Validation, Security, Infrastructure) to deliver reliable, well-managed releases.
Experience working in MedTech, Life Sciences, or other regulated environments, including familiarity with validated systems, documentation expectations, and controlled change processes.
Demonstrated AI fluency: the ability to use and evaluate AI technologies responsibly (with a primary focus on generative AI in the workplace) to improve productivity and decision quality while maintaining human accountability, managing risk, and complying with applicable governance, privacy, security, and policy requirements.

Nice To Haves

Product Lifecycle Management (PLM)
Reliability Engineering
Agile Product Development
Analytical Reasoning
Coaching
Collaborating
Competitive Landscape Analysis
Critical Thinking
Customer Alignment
Demand Forecasting
Human-Computer Interaction (HCI)
Organizing
Product Development
Product Improvements
Product Strategies
Requirements Analysis
Research and Development
Software Development Life Cycle (SDLC)
Software Development Management
Stakeholder Management
Technical Credibility
Technical Writing
Technologically Savvy

Responsibilities

Define, implement, and continuously improve reliability standards for production services, including SLIs/SLOs, error budgets, and operational readiness criteria.
Build and maintain observability capabilities (metrics, logs, traces, dashboards) and establish actionable alerts that reflect customer impact.
Participate in on-call rotations, lead incident triage and restoration, and drive root-cause analysis with corrective and preventive actions.
Engineer reliability improvements through automation (self-healing, auto-remediation, runbook automation) and eliminate toil through scripting and tooling.
Partner with engineering teams to design and validate resilient architectures (timeouts/retries, circuit breaking, queuing, graceful degradation) and to improve deployment safety.
Perform capacity planning and performance analysis; proactively identify bottlenecks and reliability risks, and validate scaling strategies.
Establish and maintain operational runbooks, playbooks, and escalation paths; conduct game days and resilience testing (e.g., failover/chaos exercises) as appropriate.
Improve change management by defining deployment/rollback standards, validating monitoring coverage, and supporting release readiness reviews across dev/test/prod.
Create and maintain operational documentation (service catalogs, SLIs/SLOs, runbooks, monitoring standards) and ensure knowledge transfer across teams.
Support validation and audit readiness by following SDLC/IT controls, producing required evidence (e.g., monitoring/test results), and supporting controlled releases in regulated environments.
Develop reliability reporting (availability, latency, error rates, MTTR, incident trends) and present insights and recommendations to stakeholders.
Apply security-by-design principles (identity/access, secrets management, vulnerability management, data protection) and ensure operational practices meet company standards.
Collaborate with internal teams and vendors as needed to implement reliability improvements, manage platform upgrades, and continuously improve maintainability and supportability.

Benefits

Medical
Dental
Vision
Life insurance
Short- and long-term disability
Business accident insurance
Group legal insurance
Consolidated retirement plan (pension)
Savings plan (401(k))
Vacation – 120 hours per calendar year
Sick time – 40 hours per calendar year; 56 hours per calendar year for employees who reside in the State of Washington
Holiday pay, including Floating Holidays – 13 days per calendar year
Work, Personal and Family Time – up to 40 hours per calendar year
Parental Leave – 480 hours within one year of the birth/adoption/foster care of a child
Condolence Leave – 30 days for an immediate family member; 5 days for an extended family member
Caregiver Leave – 10 days
Volunteer Leave – 4 days
Military Spouse Time-Off – 80 hours