SRE Lead

Hitachi
Dallas, TX

About The Position

As an SRE Lead at Hitachi Digital Services, you will be responsible for ensuring the availability, reliability, and performance of both cloud-based and on-premises platforms. This role involves leading a team of engineers to troubleshoot and optimize systems while promoting automation and best practices in Site Reliability Engineering (SRE). You will also manage incident processes, drive initiatives in generative AI platforms, and mentor team members to uphold high operational standards.

Requirements

  • Proven experience with SRE principles and practices in managing on-premises and cloud applications.
  • Knowledge of generative AI applications and related technologies.
  • Strong leadership skills, with the ability to drive team performance and continuous improvement.
  • Analytical skills for resolving complex technical issues, ensuring system reliability, and minimizing downtime.
  • Excellent communication and collaboration skills to work effectively with cross-functional teams.
  • Expertise in SRE principles: anomaly detection, root cause analysis, and predictive maintenance.
  • Proficiency in defining SLIs, SLOs, and error budgets.
  • Experience leading an operations team in application production environments.
  • Knowledge of programming and scripting languages (e.g., Java, Python, PowerShell).
  • Hands-on experience with Kubernetes and OpenTelemetry.
  • Understanding of generative AI, large language models (LLMs), and responsible AI.
  • Familiarity with DevOps methodologies, tools, and automation (e.g., CI/CD pipelines, Terraform, Helm).
  • Experience with public and private cloud platforms (e.g., AWS, Azure, or GCP for public cloud).

Nice To Haves

  • Knowledge of fine-tuning models, prompt engineering, retrieval-augmented generation (RAG), and cost optimization techniques.

Responsibilities

  • Leading a team of platform, application, and incident SREs to manage and resolve complex production issues.
  • Improving application performance, availability, and reliability.
  • Implementing observability solutions for proactive issue identification and optimization.
  • Managing processes for incidents, changes, releases, and deployments.
  • Developing automation tools (infrastructure as code, alerts as code, dashboards as code) to enhance efficiency.
  • Conducting proofs of concept (POCs) for tools supporting generative AI platforms.
  • Analyzing trends in incidents, problems, and alerts to drive operational improvements.
  • Documenting SOPs, critical systems information, and best practices for current and future use.
  • Providing technical guidance and mentorship to junior SRE team members.
  • Staying updated on advancements in generative AI technologies and responsible AI practices.

Benefits

  • Industry-leading benefits and support for holistic health and wellbeing.
  • Flexible work arrangements to promote work-life balance.
  • Opportunities for continuous learning and professional development.
  • A diverse and inclusive work environment that values unique perspectives.

What This Job Offers

  • Job Type: Full-time
  • Career Level: Mid Level
  • Industry: Electrical Equipment, Appliance, and Component Manufacturing
  • Education Level: No Education Listed
