Lead DevOps Engineer

Thomson Reuters, Frisco, TX
$125,000 - $175,000, Hybrid

About The Position

As a Lead DevOps Engineer, you will lead the design, implementation, and optimization of our DevOps practices while enabling our teams to deliver reliable, scalable solutions. This role is crucial to building, automating, and maintaining our cloud infrastructure and CI/CD pipelines, and to ensuring the reliability and scalability of our applications. You'll play a key part in fostering a culture of operational excellence, security, and continuous delivery, working closely with development and product teams.

  • Infrastructure as Code (IaC): Design, implement, and manage scalable, secure, and highly available cloud infrastructure (e.g., using Terraform or CloudFormation), primarily on AWS, with an understanding of best practices for GCP environments.
  • Automation & CI/CD: Develop and maintain robust CI/CD pipelines (e.g., GitLab CI/CD, GitHub Actions, Jenkins, AWS CodePipeline) to automate software delivery, testing, and deployment processes.
  • Linux System Administration: Provide expert-level administration, troubleshooting, and optimization for Linux-based systems, ensuring stability, security, and performance.
  • Monitoring & Observability: Implement comprehensive monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK Stack, CloudWatch, Datadog) to ensure application health and performance and to detect issues proactively.
  • Networking & Security: Configure and manage cloud networking components (VPCs, subnets, routing, security groups, firewalls) and implement security best practices (IAM, encryption, least privilege).
  • Troubleshooting & Incident Response: Act as a subject matter expert for production issues, performing root cause analysis, implementing preventative measures, and participating in on-call rotations (as required).
  • Collaboration: Work closely with software engineers, data scientists, and product managers to understand their needs and provide reliable, efficient, and secure infrastructure solutions.
  • Continuous Improvement: Identify and implement improvements to existing systems, tools, and processes to enhance efficiency, reduce costs, and improve reliability.
  • Documentation: Create and maintain clear, concise documentation for infrastructure, processes, and playbooks.

Requirements

  • 7+ years of experience in DevOps, Site Reliability Engineering, or Infrastructure Engineering roles, with at least 2 years in a lead or senior capacity.
  • Deep expertise in Linux System Administration: Command-line proficiency, shell scripting, process management, networking, file systems, user/group management, and security best practices.
  • Strong proficiency with AWS: Experience with core AWS services such as EC2, S3, RDS, VPC, IAM, Lambda, EKS/ECS, CloudWatch, and an understanding of well-architected principles.
  • Expertise in Infrastructure as Code (IaC) tools: Proven experience with Terraform (preferred), AWS CloudFormation, Pulumi, or similar.
  • Solid experience with CI/CD tools and methodologies: e.g., GitLab CI/CD, GitHub Actions, Jenkins, or similar.
  • Proficiency in at least one scripting language: Python (preferred), Bash, Go, or similar.
  • Experience with containerization and orchestration: Docker and Kubernetes (EKS, GKE, or self-managed).
  • Understanding of networking fundamentals: TCP/IP, DNS, load balancing, VPNs.
  • Experience with monitoring and logging tools: Prometheus, Grafana, Datadog, Splunk, CloudWatch, or similar.
  • Strong problem-solving skills: Ability to diagnose complex issues across various layers of the stack.
  • Excellent communication and collaboration skills.
  • A strong willingness to learn new technologies and adapt to evolving best practices.

Nice To Haves

  • Familiarity with GCP (Google Cloud Platform): Hands-on experience with at least a few core GCP services (e.g., GCE, GCS, GKE, Cloud Functions, IAM) is a significant advantage.
  • Experience leveraging AI/ML tooling in DevOps workflows: developing and maintaining MLOps pipelines for model training, deployment, and monitoring in production environments; implementing infrastructure solutions for AI workloads, including GPU clusters, model-serving platforms, and data pipelines; collaborating with data science and engineering teams to operationalize machine learning models at scale; and establishing monitoring and observability practices for AI systems, including model performance tracking and drift detection.
  • Experience with other cloud providers (Azure).
  • Knowledge of database administration (SQL/NoSQL).
  • Certifications in AWS, GCP, or Kubernetes.
  • Contributions to open-source projects.

Responsibilities

  • Lead the design, implementation, and optimization of our DevOps practices while enabling our teams to deliver reliable, scalable solutions.
  • Build, automate, and maintain our cloud infrastructure and CI/CD pipelines, ensuring the reliability and scalability of our applications.
  • Foster a culture of operational excellence, security, and continuous delivery, working closely with development and product teams.
  • Design, implement, and manage scalable, secure, and highly available cloud infrastructure primarily on AWS, with an understanding of best practices for GCP environments.
  • Develop and maintain robust CI/CD pipelines to automate software delivery, testing, and deployment processes.
  • Provide expert-level administration, troubleshooting, and optimization for Linux-based systems, ensuring stability, security, and performance.
  • Implement comprehensive monitoring, logging, and alerting solutions to ensure application health, performance, and proactive issue detection.
  • Configure and manage cloud networking components and implement security best practices.
  • Act as a subject matter expert for production issues, performing root cause analysis, implementing preventative measures, and participating in on-call rotations (as required).
  • Work closely with software engineers, data scientists, and product managers to understand their needs and provide reliable, efficient, and secure infrastructure solutions.
  • Identify and implement improvements to existing systems, tools, and processes to enhance efficiency, reduce costs, and improve reliability.
  • Create and maintain clear, concise documentation for infrastructure, processes, and playbooks.

Benefits

  • Hybrid Work Model: We’ve adopted a flexible hybrid working environment (2-3 days a week in the office depending on the role) for our office-based roles while delivering a seamless experience that is digitally and physically connected.
  • Flexibility & Work-Life Balance: Flex My Way is a set of supportive workplace policies designed to help manage personal and professional responsibilities, whether caring for family, giving back to the community, or finding time to refresh and reset. This builds upon our flexible work arrangements, including work from anywhere for up to 8 weeks per year, empowering employees to achieve a better work-life balance.
  • Career Development and Growth: By fostering a culture of continuous learning and skill development, we prepare our talent to tackle tomorrow’s challenges and deliver real-world solutions. Our Grow My Way programming and skills-first approach ensures you have the tools and knowledge to grow, lead, and thrive in an AI-enabled future.
  • Industry-Competitive Benefits: We offer comprehensive benefit plans that include flexible vacation, two company-wide Mental Health Days off, access to the Headspace app, retirement savings, tuition reimbursement, employee incentive programs, and resources for mental, physical, and financial wellbeing.
  • Globally recognized, award-winning reputation for inclusion and belonging, flexibility, work-life balance, and more.
  • Social Impact: Make an impact in your community with our Social Impact Institute. We offer employees two paid volunteer days off annually and opportunities to get involved with pro-bono consulting projects and Environmental, Social, and Governance (ESG) initiatives.
  • We offer a comprehensive benefits package to our employees. Our benefits package includes market-competitive health, dental, vision, disability, and life insurance programs, as well as a competitive 401(k) plan with company match. In addition, Thomson Reuters offers market-leading work-life benefits with competitive vacation, sick and safe paid time off, paid holidays (including two company mental health days off), parental leave, and sabbatical leave.
  • Additional benefits include optional hospital, accident, and sickness insurance paid 100% by the employee; optional life and AD&D insurance paid 100% by the employee; Flexible Spending and Health Savings Accounts; fitness reimbursement; access to an Employee Assistance Program; a Group Legal Identity Theft Protection benefit paid 100% by the employee; access to a 529 Plan; commuter benefits; Adoption & Surrogacy Assistance; Tuition Reimbursement; and access to an Employee Stock Purchase Plan.