Site Reliability Engineer III - REMOTE

Net Health•Pittsburgh, PA

17d•$108,720 - $135,900•Remote

About The Position

About Net Health

Belong. Thrive. Make a Difference.

Are you looking for a meaningful and satisfying career where you have endless opportunities to grow and be financially rewarded? Net Health may be the perfect place for you.

A high-growth and profitable company, we help caregivers harness data for human health. We also honor and respect the needs of our Net Health family and staff, which is why we offer a work-from-anywhere environment and unlimited PTO. Our welcoming and collaborative culture paired with progressive benefits makes Net Health the ultimate career home!

As a leading-edge SaaS company in healthcare, we deliver solutions that help patients get better, faster, and live more fulfilling lives. Our software and predictive analytics cover the continuum of care, from hospital-to-home, across various medical specialties. Come join us and start the next chapter of your exciting career while helping others to live better lives.

World-Class Benefits That Reflect Our World-Class Culture

#WorkFromAnywhere #UnlimitedPTO #ComprehensiveBenefitsPackage #EmployeeResourceGroups #CasualDressCode #PrioritizedEmployeeWellness #DiversityAndInclusion #AVoice #NewHireSupport #CareerDevelopment #EducationalAssistance #EmployeeReferralBonus #ProgressiveParentalLeave

Job Overview

As a Site Reliability Engineer III, you will collaboratively manage the performance, stability, and redundancy of all Platform systems and infrastructure. You will be part of a team responsible for remediating system instability and slowness through monitoring, fault tolerance, tooling, capacity management, and automation. At the core of this role is the proactive, relentless pursuit of infrastructure solutions that ensure high observability, availability, and reliability. Partnering with development teams to keep NH Platforms performant, scalable, fault tolerant, and HIPAA compliant is critical.

Requirements

Bachelor’s degree in computer science or equivalent
6+ years’ progressive experience in IT Operations and/or systems management
6+ years’ direct experience in a technical role dealing with complex enterprise software landscapes (DevOps-focused development)
6+ years’ experience with scripting and automating technical activities
Experience with best-in-class application monitoring (APM) tooling (New Relic, Dynatrace, AppDynamics)
Direct, hands-on experience with automated software delivery and system management
Strong knowledge of change control best practices and methodologies
Experience with Ansible, Terraform, Python, or Docker (or similar) is a plus
Experience with Agile development methodology and/or ITIL ITSM is a plus
Servers, Workstations, Load Balancers, Switches, Routers, Firewalls, SAN, NAS and other storage hardware
PowerShell scripting and coding standards
Azure and/or AWS PaaS/IaaS
Linux OS and Apache (e.g. SALT)
Agile development methodology
Working understanding of the Platform Engineering work model in a software development environment
Proven project management skills and/or substantial exposure to project-based work structures, project lifecycle models, etc.
Proven experience in architecting and overseeing the direction, development, and implementation of technology solutions
Operating systems (Windows and Linux), VMware, PowerShell, Azure Administration, PRTG and other systems monitoring software, DNS Management, IIS, Tomcat, Docker, APM Monitoring, ITSM tools, SSL/TLS certificates, JavaScript, JSON, Python, Ansible, Terraform, vSphere, Kubernetes, Service Fabric, Azure Management, Elastic, Citrix, JIRA, New Relic, Project Management Tools, ADO, DUO, Secret Server, Qualys, PagerDuty, Couchbase, Redis, API gateways, DNS, Security, IP Routing, SSH, FTP, LDAP, HTTP/HTTPS, Email Routing, Jenkins, GitHub, AWS, Cloud development pipelines using CI/CD tooling, Bash scripting

Responsibilities

Lead emergency response efforts with the Engineering, Infrastructure, and Database teams to establish root cause
Lead efforts to build robust monitoring solutions while expanding our current monitoring and alerting footprint
Participate in designing solutions that increase the overall stability of NH Platforms and identify potential risks
Conduct Blameless Postmortems and Anomaly Investigations after incidents to further analyze root cause and create permanent solutions to improve serviceability and prevent future outages
Establish a Don’t Repeat Incidents (DRI) culture by learning from past issues and always looking to improve monitoring and dashboarding capabilities
Ensure applications perform efficiently, collaborating with development teams and architecture to resolve application performance issues
Consult with management in the analysis of short- and long-range business requirements and recommend innovations
Champion automation efforts to reduce or eliminate repetitive, manual processes
Partner with project management to define Service Level Objectives (SLOs) and identify and implement Service Level Indicators (SLIs) to track compliance
Champion capacity management and disaster recovery testing efforts