Hard Rock Digital, posted 4 months ago
FL
51-100 employees

We are looking for a skilled Sr. Site Reliability Engineer (SRE) to maintain and improve the reliability, scalability, and performance of our Java-based application. You will be responsible for managing and monitoring the application’s infrastructure, using the Grafana stack (Grafana, Loki, Prometheus) to ensure a high level of observability, and implementing robust monitoring, alerting, and logging solutions.

  • Ensure the availability, reliability, and performance of a high-traffic Java-based application in a distributed environment.
  • Troubleshoot and resolve complex issues in production and non-production environments.
  • Participate in both pre- and post-deployment performance testing and monitoring efforts to improve application performance.
  • Optimize Java application performance, ensuring efficient resource utilization and scaling.
  • Deploy and manage the Grafana stack (Grafana, Prometheus, Loki) to provide real-time monitoring, logging, and alerting.
  • Implement and refine observability strategies to enhance application and infrastructure visibility.
  • Create and maintain dashboards, alerts, and logs for comprehensive monitoring of system health and performance.
  • Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes of issues to prevent recurrence.
  • Document and share lessons learned from incidents, contributing to a culture of continuous improvement.
  • Work closely with developers, architects, and other engineers to design and implement solutions that improve application reliability.
  • Collaborate closely with DevOps and NOC teams to support the application platform.
  • Communicate SRE practices and principles to technical and non-technical stakeholders.
  • Provide feedback and insights on application performance, potential improvements, and observability metrics.
  • Degree in computer science or a related field, or equivalent work experience.
  • 5+ years in SRE, DevOps, or similar infrastructure roles.
  • Experience managing large-scale, high-availability production systems.
  • Track record of incident response and post-mortem processes.
  • Experience with capacity planning and performance optimization.
  • 3+ years hands-on experience managing production Kubernetes clusters.
  • Deep understanding of Kubernetes architecture, networking, storage, and security.
  • Experience with cluster scaling (Karpenter), upgrades, and multi-cluster management.
  • Proficiency with kubectl, Helm, and Kubernetes operators.
  • Container orchestration and troubleshooting expertise.
  • Advanced expertise with the Grafana stack for dashboards, alerting, and visualization.
  • Hands-on experience with Grafana Alloy for telemetry data collection.
  • Proficiency in PromQL.
  • Experience with Loki for log aggregation and analysis.
  • Experience building comprehensive monitoring and alerting strategies.
  • Hands-on experience managing Java-based applications in large-scale, distributed environments, with a focus on JVM tuning and application optimization.
  • Cloud platform expertise (AWS, GCP, or Azure).
  • Familiarity with infrastructure as code (IaC) tools such as Terraform/Terragrunt or Ansible.
  • ArgoCD proficiency for GitOps workflows and continuous deployment.
  • Strong scripting abilities in Bash, Python, or Go.
  • Experience with CI/CD pipelines and automation tools.
  • Experience with configuration management and deployment automation.
  • Strong troubleshooting skills, with a proactive approach to diagnosing and resolving performance bottlenecks.
  • Proven experience managing on-call rotations, incident response, and root cause analysis.
  • Ability to mentor junior team members.
  • Strong communication skills (both written and verbal), positive attitude, and ability to receive constructive feedback.
  • Competitive compensation and comprehensive benefits.
  • Hybrid and remote work.
  • Flexible vacation allowance.
  • Startup culture backed by a secure, global brand.