Platform Site Reliability Engineer

Nexthink
Phoenix, AZ
Hybrid

About The Position

Nexthink is looking for a strong Platform Engineer with SRE operations experience to strengthen our infrastructure and accelerate our ability to deploy, monitor, and scale systems effectively. Because we are a SaaS provider, our customers rely on us to deliver a seamless, reliable, and scalable experience 24/7. This role must be based in the West Coast or Mountain time zone.

Join Nexthink's vibrant team, where cutting-edge technology meets innovation. Be a part of Nexthink's Digital Employee Experience technological revolution, ensuring our global customers enjoy a seamless user experience. Embrace the future with Nexthink in the US; apply now and become a key player in our dynamic Platform Engineering/SRE organization.

Requirements

  • Minimum of a BS in Computer Science/Engineering.
  • 5+ years in an SRE/platform engineering role supporting SaaS platforms.
  • Strong hands-on experience with public cloud services (AWS, GCP, Azure).
  • Proficiency with Kubernetes, container-based deployment and its related ecosystem (e.g., Helm), and containerized microservices.
  • Strong programming or scripting skills (e.g., Python, Go, Bash).
  • Experience with CI/CD pipelines (e.g., GitHub Actions, GitLab CI, ArgoCD).
  • Experience with observability stacks (Prometheus, ELK/EFK, Datadog, etc.).
  • Comfort with being part of a rotating on-call schedule, including handling critical incidents and conducting post-incident reviews.
  • Strong system-level troubleshooting skills and a proactive mindset toward incident prevention.
  • Deep understanding of Linux systems, networking, and common troubleshooting practices.
  • Experience supporting multi-tenant microservices architectures.

Nice To Haves

  • Familiarity with service mesh, e.g., Istio.
  • Knowledge of zero-downtime deployment strategies, blue/green and canary releases.
  • Exposure to compliance standards such as SOC 2, ISO 27001, or HIPAA. FedRAMP experience is a big plus.
  • Experience with chaos engineering or resilience testing practices.

Responsibilities

  • Design, build, and maintain the infrastructure powering our multi-tenant SaaS platform with reliability, security, and scalability in mind.
  • Implement and manage cloud-native systems (AWS) using best-in-class tools and automation.
  • Operate and enhance Kubernetes clusters, deployment pipelines, and service meshes to support continuous delivery.
  • Establish and enforce SLOs, SLAs, and error budgets, and proactively address availability and performance issues.
  • Develop infrastructure as code (Terraform or similar) for repeatable and auditable provisioning.
  • Program solutions for platform tooling, such as automation, monitoring, and provisioning.
  • Apply a solid understanding of the network stack (TCP/IP, VPN, HTTP, SSL, routing, etc.), cloud topologies (VPC, virtual subnets, NACLs, NSG, ILB, ELB, etc.), and storage (S3, EBS, Azure Files, etc.).
  • Monitor system health, application performance, and user-facing SLAs using tools such as Datadog, Prometheus, and Grafana.
  • Play a leading role in improving incident response practices and help reduce mean time to detect (MTTD) and mean time to recover (MTTR), coordinating teams and individuals to maintain an SLA.
  • Troubleshoot, narrow down, and fix incidents with minimal intervention from other functions.
  • Participate in a shared on-call rotation, responding to incidents, troubleshooting outages, and driving timely resolution and communication.
  • Work closely with software engineers to embed reliability and observability into every service.
  • Develop automated runbooks, health checks, and alerting to support reliable operations with minimal manual intervention.
  • Support automated testing, canary deployments, and rollback strategies to ensure safe, fast, and reliable releases.
  • Contribute to security best practices, compliance automation, and cost optimization.

Benefits

  • Flexible hours and unlimited vacation (employees have unlimited paid time off on top of the 15 days of holidays we offer), 11 company-paid holidays, and 3 extra days for volunteering.
  • Hybrid work model that balances office and remote work, with structured onboarding to foster connections and team integration.
  • Free access to professional training platforms to explore your interests and enhance your skills.
  • Up to 16 weeks of paid leave for birthing parents/primary caregivers, 6 weeks for secondary caregivers.
  • Plan for the future with a 401(k) plan featuring up to 4% company matching contributions, vesting immediately, to grow your retirement savings.
  • Bonuses for referring successful hires after three months of continuous employment.