Application SRE (DevOps)

ELLKAY, LLC
Elmwood Park, NJ
Hybrid

About The Position

ELLKAY is seeking an Application Site Reliability Engineer (SRE) with strong DevOps experience to improve the reliability, scalability, and performance of its applications. The SRE serves as a technical point of contact and drives the operational maturity of the application ecosystem, collaborating across teams to support scalable systems, establish reliability standards, improve observability, and automate away operational toil. Key responsibilities include leading complex incident responses, guiding development teams on best practices, and influencing architectural decisions for resilient software delivery. The goal is smooth production operations with faster, safer releases.

Requirements

  • Strong experience as an SRE, DevOps Engineer, or Production Support Engineer
  • Solid understanding of Windows, Linux/Unix systems and networking fundamentals
  • 7 years of experience as an SRE
  • Hands-on experience with cloud platforms such as AWS, Azure, or GCP
  • Experience with containerization and orchestration tools like Docker and Kubernetes
  • Proficiency in CI/CD tools such as Jenkins, GitHub Actions, or similar
  • Experience with Infrastructure as Code tools like Terraform, CloudFormation, or ARM
  • Strong scripting skills in Python, Bash, or similar languages
  • Experience with monitoring and observability tools (Prometheus, Grafana, ELK, Datadog, etc.)
  • Understanding of reliability concepts such as SLAs, SLOs, and incident management
  • Strong problem-solving and troubleshooting skills
  • Ability to work calmly during incidents and high-pressure situations
  • Clear communication and collaboration with cross-functional teams
  • Ownership mindset with a focus on continuous improvement

Nice To Haves

  • Experience supporting microservices-based architectures
  • Knowledge of security best practices in cloud and DevOps environments
  • Experience with configuration management tools (Ansible, Chef, or Puppet)
  • Exposure to chaos engineering or resilience testing practices

Responsibilities

  • Own application reliability, availability, performance, and scalability in production and non-production environments
  • Design, build, and maintain CI/CD pipelines for application deployments
  • Automate infrastructure provisioning and configuration using Infrastructure as Code
  • Monitor application health using metrics, logs, and traces; define SLIs, SLOs, and error budgets
  • Lead incident response and root-cause analysis (RCA), ensuring corrective and preventive actions are completed and communicated
  • Improve system resilience through capacity planning, system tuning, and fault tolerance
  • Partner with development teams to ensure services meet reliability, performance, and scalability objectives
  • Reduce manual operational effort through automation and self-healing solutions
  • Serve as a point of contact for critical Sev1/Sev2 incidents, leading incident command when required

Benefits

  • Medical, Dental, and Vision benefits
  • Employer-paid Life and LTD
  • 401(k) with matching
  • Work/life balance
  • Paid Volunteer Program
  • Flexible working hours
  • Generous FTO
  • Remote work options
  • Employee Discounts
  • Parental Leave
  • Competitive compensation
  • Learning and growth opportunities in cloud, automation, and reliability engineering
  • Free daily lunches on site at HQ