Application SRE (DevOps)

ELLKAY, LLC•Elmwood Park, NJ

62d•Hybrid

About The Position

ELLKAY is seeking an Application Site Reliability Engineer (SRE) with strong DevOps experience to enhance the reliability, scalability, and performance of its applications. This role acts as a technical point of contact, driving the operational maturity of the application ecosystem. The SRE will collaborate across teams to support scalable systems, establish reliability standards, improve observability, and implement automation that reduces operational toil. Key responsibilities include leading complex incident responses, guiding development teams on best practices, and influencing architectural decisions for resilient software delivery. The goal is smooth production operations with faster and safer releases.

Requirements

Strong experience as an SRE, DevOps Engineer, or Production Support Engineer
Solid understanding of Windows and Linux/Unix systems and of networking fundamentals
7 years of experience as an SRE
Hands-on experience with cloud platforms such as AWS, Azure, or GCP
Experience with containerization and orchestration tools like Docker and Kubernetes
Proficiency in CI/CD tools such as Jenkins, GitHub Actions, or similar
Experience with Infrastructure as Code tools like Terraform, CloudFormation, or ARM
Strong scripting skills in Python, Bash, or similar languages
Experience with monitoring and observability tools (Prometheus, Grafana, ELK, Datadog, etc.)
Understanding of reliability concepts such as SLAs, SLOs, and incident management
Strong problem-solving and troubleshooting skills
Ability to work calmly during incidents and high-pressure situations
Clear communication and collaboration with cross-functional teams
Ownership mindset with a focus on continuous improvement

Nice To Haves

Experience supporting microservices-based architectures
Knowledge of security best practices in cloud and DevOps environments
Experience with configuration management tools (Ansible, Chef, or Puppet)
Exposure to chaos engineering or resilience testing practices

Responsibilities

Own application reliability, availability, performance, and scalability in production and non-production environments
Design, build, and maintain CI/CD pipelines for application deployments
Automate infrastructure provisioning and configuration using Infrastructure as Code
Monitor application health using metrics, logs, and traces; define SLIs, SLOs, and error budgets
Lead incident response and root-cause analysis (RCA), ensuring corrective and preventive actions are completed and communicated
Improve system resilience through capacity planning, system tuning, and fault tolerance
Partner with development teams to ensure services meet reliability, performance, and scalability objectives
Reduce manual operational effort through automation and self-healing solutions
Serve as a point of contact for critical Sev1/Sev2 incidents, leading incident command when required

Benefits

Medical, Dental, and Vision benefits
Employer-paid Life and LTD
401(k) with matching
Work/life balance
Paid Volunteer Program
Flexible working hours
Generous FTO
Remote work options
Employee Discounts
Parental Leave
Competitive compensation
Learning and growth opportunities in cloud, automation, and reliability engineering
Free daily lunches on site at HQ