Site Reliability Engineer (Azure)

Datamaxis, Northbrook, IL

About The Position

The ecommerce Platform Operations team is responsible for the stability, reliability, release, and deployment of B2B & B2C ecommerce platforms. The team’s primary function is to increase the efficiency of the organization through well-designed automation and infrastructure. As a Site Reliability Engineer, you will work closely with infrastructure & application development teams to increase stability and reliability by enabling telemetry across the platform. You will also be responsible for effective operation of the ecommerce platform through efficient automation & execution of operational processes. This position involves participating in on-call support, troubleshooting production issues, and implementing remediation.

Requirements

  • Expert-level experience operating the ATG Commerce ecommerce platform, or building custom Java / Java EE customer-facing solutions in an Azure cloud environment (AKS)
  • 3+ years of Azure experience
  • Hands-on experience with containerization, Kubernetes, and microservices
  • Experience with Cloud infrastructure and application monitoring following methodologies such as RED or USE
  • Familiarity with APM tools such as Splunk APM, AppDynamics, and/or Azure Application Insights
  • Familiarity with infrastructure monitoring tools such as Grafana, Prometheus, Azure Monitor, and Log Analytics (KQL queries)
  • Experience with log collection tools and analysis, as well as infrastructure performance and optimization practices
  • Experience with DevOps automation platforms such as Jenkins, Artifactory, ACR, and/or Azure DevOps
  • Experience using CI/CD to provision and manage Azure infrastructure
  • Experience performing Root Cause Analysis (RCA) for application and infrastructure related issues
  • Solid grasp of performance monitoring methodologies and 2+ years of hands-on experience configuring monitoring tools such as Azure Application Insights, New Relic, and Splunk are required
  • Must have experience enriching alerts for faster root-cause detection and incident resolution
  • Must have experience configuring monitors for business transactions, service endpoints, etc., and setting up health rules that trigger alerts
  • Detailed knowledge of relational databases (e.g., MS SQL, MySQL) or NoSQL databases such as Cosmos DB
  • Must be able to construct SQL queries and configure them within telemetry tools
  • Strong scripting skills (Bash, Python, shell)
  • Self-starter with the ability to quickly learn new tools and tool features
  • Must be able to handle multiple tasks and priorities within a fast-paced work environment
  • Must be highly motivated and dependable with excellent communication skills
  • Bachelor's degree in Computer Science, or another four-year degree in a relevant field, is required

Nice To Haves

  • Strong experience with other telemetry tools, including AppDynamics, ExtraHop, vSphere, SolarWinds Orion, SAM, etc., will be considered
  • The top candidate will have experience with, or a thorough understanding of, incident workflows (preferably using New Relic)
  • Experience using Terraform for infrastructure as code
  • Deep working knowledge of Azure networking, Application Gateway, APIM, IAM policy, and network security
  • Able to deploy and manage Azure storage
  • Azure Active Directory management and design experience is a plus
  • Production support experience with ecommerce websites
  • Experience with tracking, measuring, and reporting KPIs like MTBI, MTRS, MTTD, etc.

Responsibilities

  • Monitor and maintain the Development, Testing/QA, Staging, and Production environments
  • Mitigate production performance issues by owning them through to resolution, with the goal of automating to prevent recurrence
  • Configure monitors, alerts, and Service Level Indicators (SLIs) using various telemetry technologies
  • Create business-friendly dashboards to monitor the health of production systems
  • Collaborate with teams within IT to implement cloud and/or hybrid systems that support the business goals
  • Monitor cloud-based systems and components for availability, performance, reliability, security, efficiency, and ability to meet non-functional requirements and service level agreements
  • Work with Infrastructure as Code pipelines to automate the deployment of Cloud resources
  • Serve as liaison between the application and cloud teams, guiding application teams on container/pod deployments
  • Investigate, troubleshoot and resolve any issues that impact the customer
  • Work to improve performance and reliability as the platform scales, driving continuous improvement through operational metrics
  • Scale Cloud operations through best practices as applicable for configuration management, resource allocation, optimizing performance and capacity, compliance with security policies and requirements, and ensuring service-level agreements are met
  • Work with the Azure cloud engineering team to operationalize the client’s cloud vision
  • Lead technical sessions, using whiteboards or other resources, to drive solution discussions that build on published solution architectures for common infrastructure implementations
  • Enable proactive monitoring & alerting using Splunk log aggregation
  • Prepare applications to work on Kubernetes, Docker, and other hosted systems
  • Work on automation using scripting and be able to integrate different tools
  • Troubleshoot and help resolve telemetry system and software defects
  • Perform incident/disruption management and conduct root-cause analysis (RCA)
  • Work successfully within an Agile environment partnering with the Scrum Master
  • Document work performed and mentor our full-time employees (FTEs)
  • Participate in the after-hours on-call rotation and in after-hours maintenance window activities as needed