Site Reliability Engineer (Azure)

Datamaxis • Northbrook, IL

About The Position

The ecommerce Platform Operations team is responsible for the stability, reliability, release, and deployment of B2B & B2C ecommerce platforms. The team’s primary function is to increase the efficiency of the organization through well-designed automation and infrastructure. As a Site Reliability Engineer, you will work closely with infrastructure & application development teams to increase stability and reliability by enabling telemetry across the platform. You will also be responsible for effective operation of the ecommerce platform through efficient automation & execution of operational processes. This position involves participating in on-call support, troubleshooting production issues, and implementing remediation.

Requirements

Expert-level experience operating the ATG Commerce ecommerce platform (OR) building custom Java / Java EE customer-facing solutions in an Azure cloud environment (AKS)
3+ years of Azure experience
Hands-on experience with containerization, Kubernetes, and microservices
Experience with Cloud infrastructure and application monitoring following methodologies such as RED or USE
Familiarity with APM tools such as Splunk APM, AppDynamics, and/or Azure Application Insights
Familiarity with infrastructure monitoring tools such as Grafana, Prometheus, Azure Monitor, and Log Analytics (KQL queries)
Experience with log collection tools and analysis, as well as infrastructure performance and optimization practices
Experience with DevOps automation platforms such as Jenkins, Artifactory, ACR, and/or Azure DevOps
Experience with CI/CD, and with provisioning and managing Azure infrastructure
Experience performing Root Cause Analysis (RCA) for application and infrastructure related issues
A solid grasp of performance monitoring methodologies and 2+ years of hands-on experience configuring monitoring tools such as Azure Application Insights, New Relic, and Splunk are required
Must have experience enriching alerts for faster root-cause detection and incident resolution
Must have experience configuring monitors for business transactions, service endpoints, etc., as well as setting up health rules for triggering alerts
Detailed knowledge of relational databases (e.g., MS SQL, MySQL) (OR) NoSQL databases such as Cosmos DB
Must be able to construct SQL queries and configure them with telemetry
Strong scripting skills (Bash, Python, shell)
Self-starter with the ability to quickly learn new tools and tool features
Must be able to handle multiple tasks and priorities within a fast-paced work environment
Must be highly motivated and dependable with excellent communication skills
Bachelor's degree in Computer Science or another four-year degree in a relevant field is required

Nice To Haves

Strong experience with other telemetry tools, including AppDynamics, ExtraHop, vSphere, SolarWinds Orion, SAM, etc., will be considered
The top candidate will have experience with, or a thorough understanding of, incident workflows (preferably using New Relic)
Experience using Terraform for infrastructure as code
Deep working knowledge of Azure networking, Application Gateway, APIM, IAM policy, and network security
Able to deploy and manage Azure storage
Experience with Azure Active Directory management and design is a plus
Production support experience with ecommerce websites
Experience with tracking, measuring, and reporting KPIs like MTBI, MTRS, MTTD, etc.

Responsibilities

Monitor and maintain the Development, Testing/QA, Staging, and Production environments
Mitigate production performance issues effectively by taking responsibility for seeing them through to resolution, with the goal of automating to prevent recurrence
Configure monitors, alerts, and Service Level Indicators (SLIs) using various telemetry technologies
Create business-friendly dashboards to monitor the health of various production systems
Collaborate with teams within IT to implement cloud and/or hybrid systems that support the business goals
Monitor cloud-based systems and components for availability, performance, reliability, security, efficiency, and ability to meet non-functional requirements and service level agreements
Work with Infrastructure as Code pipelines to automate the deployment of Cloud resources
Serve as liaison between the application and Cloud teams, providing guidance to application teams on container/pod deployments
Investigate, troubleshoot and resolve any issues that impact the customer
Work to improve performance and reliability as the platform scales, driving continuous improvement through operational metrics
Scale Cloud operations through best practices as applicable for configuration management, resource allocation, optimizing performance and capacity, compliance with security policies and requirements, and ensuring service-level agreements are met
Work with the Azure cloud engineering team to operationalize the client’s cloud vision
Lead technical sessions, using whiteboards or other resources, to drive solution discussions that leverage published solution architectures for common infrastructure implementations
Enable proactive monitoring & alerting using Splunk log aggregation
Prepare applications to work on Kubernetes, Docker, and other hosted systems
Work on automation using scripting and be able to integrate different tools
Troubleshoot and help resolve telemetry system and software defects
Perform incident/disruption management and conduct root-cause analysis (RCA)
Work successfully within an Agile environment partnering with the Scrum Master
Document work performed and mentor our full-time employees (FTEs)
Participate in the after-hours on-call rotation and in after-hours maintenance window activities as needed