Site Reliability Engineer

Specialized Dental Partners, Franklin, TN
$110,000 - $135,000

About The Position

Specialized Dental Partners is seeking a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, and resilience of our hybrid on-premises and Azure-based platforms. This role focuses on designing systems and automation that keep critical clinical, infrastructure, and data platforms available, observable, and recoverable. The SRE partners closely with Infrastructure, Network, Server, Microsoft, Security, and Clinical Systems Engineering teams to reduce operational risk and eliminate classes of failure through engineering. This is a hands-on engineering role focused on reliability, automation, and systemic improvement - not reactive firefighting.

About Specialized Dental Partners

Specialized Dental Partners is one of the nation’s leading dental support organizations for Endodontic, Periodontic, and Oral Surgery practices. With more than 250 practice locations and over 430 doctors across the United States, Specialized Dental Partners empowers specialists to focus on exceptional patient care while the organization delivers tailored business, operational, and strategic support.

Requirements

  • 6+ years of experience in a technology role
  • 2+ years of operational cloud infrastructure implementation
  • 2+ years working with Terraform, Ansible, or other automated deployment tooling
  • 2+ years of experience with PowerShell, scripting, or similar tooling
  • 2+ years providing operational IT support in live environments
  • 3+ years supporting a SQL-driven data infrastructure
  • 4+ years in a communication-centric business role
  • 1+ year of healthcare IT experience (HIPAA-aligned)

Nice To Haves

  • Bachelor’s degree in Computer Science, Information Technology, or related field
  • Strong direct ownership and responsibility for client experience
  • Azure Data Factory (ADF) or data pipeline reliability experience
  • Hybrid Cloud and On-Prem infrastructure support capabilities
  • Knowledge of Root Cause Analysis (RCA) processes
  • Documentation, standard setting, and lifecycle ownership
  • Proven ability to communicate technical decisions to technical and non-technical stakeholders
  • Experience with HIPAA security standards and aligning systems with least-privilege principles
  • Experience with troubleshooting, diagnostics, and systemic resolution processes
  • Fabric-driven data environment infrastructure experience
  • Familiarity with SLIs, SLOs, and error budgeting
  • Hands-on experience with core infrastructure components on-prem
  • Automation-first solution development

Responsibilities

Reliability Engineering & Platform Health:
  • Design and implement reliability patterns across hybrid platforms (on-prem + Azure)
  • Define and measure service reliability using SLIs, SLOs, and error budgets
  • Oversee configuration, upgrades, integrations, and vendor coordination
  • Improve availability, performance, and recoverability of clinical and enterprise systems
  • Identify and eliminate systemic reliability risks before they cause incidents

Automation & Toil Reduction:
  • Automate repetitive operational tasks using PowerShell, scripting, and IaC
  • Partner with engineering teams to build self-healing and self-service capabilities
  • Reduce manual intervention and operational toil through design and tooling
  • Own and drive initiatives to proactively implement automation methodology across the group

Monitoring, Observability, and Alerting:
  • Design and maintain monitoring, logging, and alerting strategies
  • Ensure alerts are actionable, meaningful, and tied to business impact
  • Evolve alerting flows into full automations across standardized tooling
  • Improve visibility across infrastructure, applications, and data pipelines

Incident Response & Postmortems:
  • Participate in and lead incident response for high-severity outages
  • Drive blameless postmortems and root cause analysis
  • Ensure incidents result in durable engineering improvements
  • Partner with leadership on incident communications and lessons learned

Data & Integration Reliability:
  • Support reliability of data pipelines and integrations, including ADF workloads
  • Ensure data movement, transformations, and dependencies are observable and recoverable
  • Bring evolutionary techniques to pipelines and integrations as a core member of the ADF infrastructure team
  • Partner with data and analytics teams to improve pipeline resilience

Security, Compliance, & Change Safety:
  • Ensure reliability practices align with security and compliance requirements (HIPAA-aligned)
  • Participate in change reviews and risk assessments
  • Design guardrails that allow teams to move fast without breaking systems