Lead, Service Reliability Engineer R&D

Johnson & Johnson Innovative Medicine | Raritan, NJ
$94,000 - $151,800 | Hybrid

About The Position

The Service Reliability Engineer (SRE) designs, builds, and operates reliability practices and technical capabilities that ensure critical engineering and enterprise services are available, performant, secure, and resilient. This is a hands-on, non-manager role focused on improving service reliability through observability, incident response, automation, and engineering excellence.

This role partners closely with Product Owners, development teams, infrastructure/platform engineering, Quality/Validation, Security, and Enterprise Architecture to define reliability targets, implement operational controls, and maintain documentation appropriate for regulated environments. The SRE helps standardize operational patterns across environments (dev/test/prod), including monitoring baselines, access controls, runbooks, change management, and deployment readiness.

Key outcomes include establishing and measuring Service Level Indicators/Objectives (SLIs/SLOs), improving alert quality and troubleshooting speed, reducing incident frequency and Mean Time to Recovery (MTTR), and enabling safe, repeatable releases through automation and operational readiness. The SRE identifies reliability risks and technical gaps, recommends scalable and resilient designs, implements reusable operational tooling, and participates in Agile ceremonies and on-call support aligned to the team’s ways of working.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related discipline, or equivalent experience.
  • 5+ years of experience in SRE, DevOps, platform engineering, or software engineering with substantial production operations responsibilities.
  • Hands-on experience with observability and incident management practices, including monitoring/alerting design, on-call operations, and root-cause analysis.
  • Experience with infrastructure-as-code and CI/CD (e.g., Terraform/CloudFormation, Git, Azure DevOps/Jenkins or similar) and automated testing/release practices.
  • Experience operating services in cloud-hosted or hybrid enterprise environments (AWS and/or on-prem), including networking fundamentals, secure configuration, and environment management.
  • Strong communication skills with the ability to explain technical issues, incident impact, reliability risks, and tradeoffs to both technical and non-technical stakeholders.
  • Working knowledge of Agile delivery practices and ability to collaborate across cross-functional teams (Product, Engineering, QA/Validation, Security, Infrastructure) to deliver reliable, well-managed releases.
  • Experience working in MedTech, Life Sciences, or other regulated environments, including familiarity with validated systems, documentation expectations, and controlled change processes.
  • Demonstrates AI fluency: the ability to use and evaluate AI technologies responsibly (with a primary focus on generative AI in the workplace) to improve productivity and decision quality while maintaining human accountability, managing risk, and complying with applicable governance, privacy, security, and policy requirements.

Nice To Haves

  • Product Lifecycle Management (PLM)
  • Reliability Engineering
  • Agile Product Development
  • Analytical Reasoning
  • Coaching
  • Collaborating
  • Competitive Landscape Analysis
  • Critical Thinking
  • Customer Alignment
  • Demand Forecasting
  • Human-Computer Interaction (HCI)
  • Organizing
  • Product Development
  • Product Improvements
  • Product Strategies
  • Requirements Analysis
  • Research and Development
  • Software Development Life Cycle (SDLC)
  • Software Development Management
  • Stakeholder Management
  • Technical Credibility
  • Technical Writing
  • Technologically Savvy

Responsibilities

  • Define, implement, and continuously improve reliability standards for production services, including SLIs/SLOs, error budgets, and operational readiness criteria.
  • Build and maintain observability capabilities (metrics, logs, traces, dashboards) and establish actionable alerts that reflect customer impact.
  • Participate in on-call rotations, lead incident triage and restoration, and drive root-cause analysis with corrective and preventive actions.
  • Engineer reliability improvements through automation (self-healing, auto-remediation, runbook automation) and eliminate toil through scripting and tooling.
  • Partner with engineering teams to design and validate resilient architectures (timeouts/retries, circuit breaking, queuing, graceful degradation) and to improve deployment safety.
  • Perform capacity planning and performance analysis; proactively identify bottlenecks and reliability risks, and validate scaling strategies.
  • Establish and maintain operational runbooks, playbooks, and escalation paths; conduct game days and resilience testing (e.g., failover/chaos exercises) as appropriate.
  • Improve change management by defining deployment/rollback standards, validating monitoring coverage, and supporting release readiness reviews across dev/test/prod.
  • Create and maintain operational documentation (service catalogs, SLIs/SLOs, runbooks, monitoring standards) and ensure knowledge transfer across teams.
  • Support validation and audit readiness by following SDLC/IT controls, producing required evidence (e.g., monitoring/test results), and supporting controlled releases in regulated environments.
  • Develop reliability reporting (availability, latency, error rates, MTTR, incident trends) and present insights and recommendations to stakeholders.
  • Apply security-by-design principles (identity/access, secrets management, vulnerability management, data protection) and ensure operational practices meet company standards.
  • Collaborate with internal teams and vendors as needed to implement reliability improvements, manage platform upgrades, and continuously improve maintainability and supportability.

Benefits

  • medical
  • dental
  • vision
  • life insurance
  • short- and long-term disability
  • business accident insurance
  • group legal insurance
  • consolidated retirement plan (pension)
  • savings plan (401(k))
  • Vacation – 120 hours per calendar year
  • Sick time – 40 hours per calendar year; for employees who reside in the State of Washington – 56 hours per calendar year
  • Holiday pay, including Floating Holidays – 13 days per calendar year
  • Work, Personal and Family Time – up to 40 hours per calendar year
  • Parental Leave – 480 hours within one year of the birth/adoption/foster care of a child
  • Condolence Leave – 30 days for an immediate family member; 5 days for an extended family member
  • Caregiver Leave – 10 days
  • Volunteer Leave – 4 days
  • Military Spouse Time-Off – 80 hours