About The Position

We are seeking a detail-oriented and analytical SRE Metrics Analyst Intern to join our Site Reliability Engineering (SRE) team. In this role, you will be responsible for establishing and managing the collection of metrics related to system performance, reliability, and incidents. You will develop and maintain reporting frameworks to provide actionable insights to stakeholders, driving improvements in our systems and processes. Your work will support the organization’s commitment to delivering high-quality, reliable services. This role is 50% telework, and candidates must be local to one of the following cities: Norfolk, VA; Jacksonville, FL; Bremerton, WA; San Diego, CA.

Requirements

  • Enrolled in a degree program in a related major, with a GPA of 3.0 or better
  • US citizenship required
  • Ability to obtain and maintain a DoD security clearance
  • Experience in metrics collection, data analysis, or reporting, preferably in a Site Reliability Engineering or DevOps environment.
  • Proven experience working with monitoring and observability tools (e.g., Prometheus, Datadog, New Relic).
  • Strong understanding of key metrics used in site reliability engineering, including SLIs, SLOs, and SLAs.
  • Proficiency in data analysis tools and languages (e.g., SQL, Python, R) for data manipulation and reporting.
  • Experience with data visualization tools (e.g., Grafana, Kibana, Tableau) to create dashboards and reports.
  • Strong analytical and problem-solving skills, with the ability to interpret complex data sets and provide actionable insights.
  • Ability to evaluate the relevance and accuracy of metrics and make recommendations for improvement.
  • Excellent communication skills, both written and verbal, with the ability to present data and findings to technical and non-technical audiences.
  • Proven ability to work collaboratively with cross-functional teams and build strong relationships with stakeholders.

Nice To Haves

  • Experience with cloud platforms (AWS, GCP, Azure) and their monitoring tools.
  • Familiarity with incident management processes and practices within an SRE context.
  • Knowledge of software development methodologies and best practices.

Responsibilities

  • Metrics Collection Framework:
      · Design and implement a comprehensive metrics collection framework that captures key performance indicators (KPIs) related to system reliability and operational efficiency.
      · Identify relevant metrics and establish methods for collecting, aggregating, and storing data from various sources, including monitoring tools, logs, and databases.
  • Data Analysis and Visualization:
      · Analyze collected metrics to identify trends, patterns, and anomalies that impact system reliability and performance.
      · Develop dashboards and visualizations to present data in a clear and actionable manner using tools such as Grafana, Kibana, or Tableau.
      · Ensure that stakeholders have access to real-time insights and reports that inform decision-making.
  • Reporting:
      · Create regular reports on system performance, reliability, incident response times, and other critical metrics for various stakeholders, including technical teams and management.
      · Provide insights and recommendations based on data analysis to drive continuous improvement initiatives.
      · Prepare and present findings to stakeholders, facilitating discussions on reliability goals and performance enhancements.
  • Collaboration with SRE Teams:
      · Work closely with SRE teams to identify their metric needs and ensure alignment with operational goals.
      · Collaborate with engineering and operations teams to ensure that metric collection is integrated into development and deployment processes.
      · Support incident response efforts by providing metrics that help identify root causes and areas for improvement.
  • Continuous Improvement:
      · Stay current with industry trends and best practices related to metrics collection, monitoring, and reporting within SRE and DevOps.
      · Continuously evaluate and enhance the metrics collection and reporting processes to improve data accuracy, relevance, and accessibility.
      · Foster a culture of data-driven decision-making within the SRE team and broader organization.

Benefits

  • Employment benefits include competitive compensation, health and wellness programs, income protection, paid leave, and retirement.