Nice Group Co., Ltd. - posted 3 months ago
Senior
Sandy, UT
Professional, Scientific, and Technical Services

The Senior Cloud SRE works to improve the reliability and availability of our solutions. This includes providing on-call support for Major Incidents and helping us reduce the duration and occurrence of outages.

  • Create dashboards that give development teams observability into the health of their applications.
  • Consult with development workstreams on SRE services and how we can help them improve their reliability.
  • Automate activities previously done manually to reduce toil.
  • Participate in the design, definition, and scoping of new solutions to meet our internal customers' needs.
  • Thoroughly document the resulting design and ensure the participants agree to it.
  • Document findings and share with other SREs.
  • Work with teams to ensure proper monitoring is set up and enabled.
  • Identify evolutionary improvements.
  • Meet with Incident and Problem Management to discuss previous Major Incidents and help identify root causes and permanent fixes.
  • Assist other teams in doing data/performance analysis to identify why an issue is occurring.
  • Review work of other SREs and help train them.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
  • Practice sustainable incident response and blameless postmortems.
  • Assist in creation of automated end-to-end diagnostics.
  • Communicate effectively to technical and non-technical peers and customers.
  • Coordinate and work on multiple cross-functional base work initiatives and projects.
  • Participate in planning long- and short-term project efforts.
  • Lead or provide technical direction for the planning, execution, and validation of testing work.
  • Provide technical guidance and coaching/mentoring to team members.
  • Follow established processes when performing work, or help create and document processes as necessary.
  • Document troubleshooting steps and results in appropriate locations for historical access.
  • Ensure compliance with policies, procedures, and standards.
  • Implement or coordinate remediation required by audits and assessments, and document it as necessary.
  • Provide on-call support for high-priority incidents.
  • Estimate time to complete activities/projects.
  • Bachelor's degree in Computer Science, Business Information Systems, or related field (or equivalent work experience) is required.
  • 4+ years programming/scripting experience.
  • 4+ years of experience working within public or private cloud environments.
  • 4+ years of SRE or related experience.
  • Experience with Agile, Jira, GitHub, monitoring, automation, dashboarding.
  • 6+ years communicating in English in a technical field.
  • Can effectively troubleshoot supported applications.
  • Can work on complex issues which may span multiple applications or environments.
  • Proactively engages with peers to discuss issues and keep stakeholders updated.
  • Mentors co-workers with expertise.
  • Coordinates work with peers.
  • Shares discoveries and best practices.
  • Learns from others within the team.
  • Self-driven; proactively looks for ways to improve.
  • Able to work with little supervision and complete tasks and projects as directed.
  • Experience working with Prometheus, Datadog, Grafana, Splunk, BMC.
  • Experience with Application Performance Monitoring solutions: Dynatrace, AppDynamics, New Relic.
  • Experience working with Kubernetes, Docker, microservices, serverless compute.
  • Experience working with Ansible, Terraform.
  • Experience with one or more of the following: C#, C++, Java, Python, Perl, or Ruby.