Ncontracts • Posted about 2 months ago
$180,000 - $230,000/Yr
Full-time • Director
Remote • Brentwood, TN

Reporting to the CTO, we are looking for an experienced Director of Platform Engineering to lead our cloud infrastructure, site reliability engineering (SRE), and DevOps enablement initiatives across the organization. This role requires a forward-thinking leader who leverages AI to transform platform operations and infrastructure management.

The Director of Platform Engineering is responsible for overseeing the reliability, scalability, and security of our multi-cloud infrastructure while enabling development teams to deploy efficiently and safely. You will pioneer the integration of AI-driven automation, intelligent observability, and predictive infrastructure management to optimize our platform operations. This strategic position works directly within the Engineering organization and with Product, Security, and Development teams to ensure maximum system uptime, optimize cloud costs, and accelerate our software delivery capabilities through innovative use of AI and automation.

  • Lead and mentor a team of SRE engineers, DevOps engineers, and cloud infrastructure specialists while fostering a culture of AI-augmented operations
  • Define and execute the cloud strategy, including multi-cloud architecture, migration plans, and integrating AI-powered tools for infrastructure optimization, cost management, and capacity planning
  • Leverage AI and machine learning for predictive incident detection, automated remediation, and intelligent alerting to enhance system reliability
  • Establish and maintain SLOs, SLIs, and error budgets, using AI-driven analytics to identify patterns and prevent issues before they impact users, and drive a culture of reliability across engineering
  • Build and optimize CI/CD pipelines, infrastructure-as-code frameworks, and deployment automation
  • Manage cloud costs and implement FinOps practices to maximize ROI on cloud investments
  • Implement AI-powered infrastructure-as-code frameworks, automated deployment pipelines, and intelligent resource allocation to enable safe, rapid releases
  • Architect and maintain infrastructure supporting our AI-powered technology ecosystem, including LLM integrations, agentic workflows, containerized applications, message queuing, data pipelines, and storage systems
  • Partner with Security teams to ensure compliance, implement security best practices, and maintain SOC2/ISO certifications
  • Drive incident response processes, post-mortem culture, and continuous improvement in system reliability and DR/BCP
  • Establish and maintain disaster recovery, business continuity, and backup strategies with documented runbooks and tested procedures
  • Stay at the forefront of AI innovations in platform engineering, evaluating and implementing emerging tools for AIOps, intelligent automation, and infrastructure optimization
  • Evangelize DevOps and SRE best practices across development teams to enable self-service capabilities
  • 10+ years of experience in cloud infrastructure, SRE, or DevOps roles, with 3+ years in leadership positions
  • Demonstrated experience leveraging AI/ML tools for infrastructure automation, observability, incident management, or platform optimization
  • Deep expertise with major cloud platforms (AWS, Azure, and/or GCP) and cloud-native architectures, including cloud storage solutions (Azure Blob Storage, S3) and data architectures
  • Proven track record of managing and scaling SRE/DevOps teams in high-growth technology environments
  • Hands-on expertise with containerization technologies (particularly Azure Kubernetes Apps and Docker), infrastructure-as-code (Terraform, CloudFormation), and observability tools (Dynatrace, Datadog, Prometheus, Grafana)
  • Experience implementing CI/CD pipelines, GitOps workflows, and automated deployment strategies (Azure DevOps, GitHub Actions)
  • Experience with message queuing systems (RabbitMQ, Kafka, Azure Service Bus, SQS)
  • Experience with data platforms and ETL tools, including Snowflake and Azure Data Factory
  • Strong knowledge of security best practices and compliance standards (OWASP, SOC 2, IAM, Secrets Management, Certificate Management), including AI security considerations
  • Demonstrated ability to balance system reliability with development velocity and business needs
  • Experience with performance tuning and scalability optimization for high-traffic applications
  • Excellent communication skills with the ability to influence technical and non-technical stakeholders
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • Experience integrating and managing AI services and APIs (OpenAI, Claude, or similar) within production infrastructure
  • Experience building or managing infrastructure for AI/ML model training, inference, and deployment at scale
  • Hands-on experience with Redis in high-throughput, horizontally scaled deployments
  • Hands-on experience with prompt engineering, RAG systems, or fine-tuning LLMs for operational use cases
  • Experience with in-memory analytics databases (DuckDB or similar)
  • Experience with distributed search and analytics platforms (Elastic or similar)
  • Experience with TeamCity and Octopus Deploy
  • Experience with microservices architecture and RESTful API design patterns
  • Experience with database optimization and data modeling for both SQL and NoSQL systems
  • Experience with version control workflows (Git flow, trunk-based development)
  • Experience with technical documentation and knowledge sharing within cross-functional teams
  • Experience with agile methodologies (Scrum, Kanban) and project management tools (Jira, Confluence)
  • A fun, fast-paced work environment
  • Responsible PTO Plan that meets or exceeds state and local medical and family leave laws
  • 11 paid holidays
  • Community and social events to keep you connected and engaged
  • Mental Health Benefits
  • Medical, Dental and Vision insurance
  • Company-paid Group Life Insurance, Short- and Long-Term Disability
  • Flexible Spending Account & Health Savings Account
  • Aflac Benefits – Critical Illness, Cancer Protection, & Hospital Choice
  • Pet Insurance
  • 401(k) with company match and eligibility on Day 1 of employment
  • 2 Paid Volunteer Time Off Days
  • And much more!