Lead Application Reliability Engineer

Citi
Irving, TX
69d
$125,760 - $188,640

About The Position

The selected candidate will become the key engineer supporting and advancing the platform used for the threat-modeling process at Citi. Responsibilities include, among others, maintaining and supporting the threat-modeling application and developing relevant tools used throughout the threat-modeling process. The application comprises web servers and backend databases, and supporting it requires an understanding of middleware, databases, containers, and the AWS cloud environment, as well as change-control and compliance processes.

We are seeking a highly skilled and dedicated Lead Application Reliability Engineer to ensure the continuous availability, optimal performance, and security of a critical threat-modeling application. This role is central to our operational excellence and involves comprehensive support and maintenance of a robust technology stack, including middleware, databases, Linux, and AWS EKS, all within a strictly regulated and change-controlled financial environment. The ideal candidate will leverage modern DevOps principles to drive stability and efficiency.

Requirements

  • 6+ years of relevant experience in an Engineering role, preferably in Financial Services or a large, complex, and/or global environment.
  • Experience managing and troubleshooting Linux Operating Systems (e.g., Red Hat Enterprise Linux (RHEL), CentOS, Ubuntu), including System Administration Tasks like User Management, Service Restarts, and File System Checks – Must Have.
  • Proficiency in Scripting for Automation (e.g., Bash, Python) and with Configuration Management Tools (e.g., Ansible, Puppet, Chef) for system administration and infrastructure automation – Must Have.
  • Experience with container orchestration using Helm and Kubernetes on platforms like AWS EKS, GCP GKE, or OpenShift – Must Have.
  • Working knowledge of Relational Databases (e.g., PostgreSQL), including basic querying – Must Have.
  • Proven track record of maintaining applications and their technology stacks compliant with security and configuration requirements, including successfully passing internal and external security audits by demonstrating secure configuration of applications and infrastructure (e.g., implementing least privilege access, hardening OS, managing firewall rules) and ensuring continuous compliance with regulatory standards (e.g., SOX, GDPR) through automated checks and reporting – Must Have.
  • Demonstrated adherence to strict change control procedures, executing all changes (e.g., code deployments, infrastructure updates) through a formalized change management process (e.g., ITSM, ServiceNow) with proper documentation and approvals – Must Have.
  • Experience with Ticketing Systems (e.g., Jira, ServiceNow) – Must Have.
  • Working understanding of Middleware Components (e.g., Nginx, Tomcat or equivalents).
  • Familiarity with Development Concepts (e.g., Git, CI/CD, Pipelines, SDLC).
  • Strong communication skills, both written and verbal, for technical and non-technical audiences.
  • Demonstrated analytical and diagnostic skills, with an ability to identify process improvements and best practices.
  • Ability to work independently, manage multiple tasks, take ownership of initiatives, and operate effectively in a matrixed environment under pressure and tight deadlines.

Nice To Haves

  • Associate-Level Certification (at least one of the following is required): Kubernetes and Cloud Native Associate (KCNA), Certified Kubernetes Application Developer (CKAD), Certified Kubernetes Administrator (CKA), Kubernetes and Cloud Native Security Associate (KCSA), Red Hat Certified System Administrator or similar certification, AWS Certified Developer, AWS Certified SysOps Administrator, CompTIA Cloud+, Google Associate Cloud Engineer or other GCP certification, HashiCorp Certified: Terraform Associate.
  • Associate Cybersecurity Certification (not required, but any of the following would be a plus): GIAC Security Essentials (GSEC), ISC2 Systems Security Certified Practitioner (SSCP), CompTIA CySA+, Microsoft Certified: Security Operations Analyst Associate or Information Protection Administrator Associate.

Responsibilities

  • Ensure high availability and optimal performance of the threat-modeling application through proactive monitoring, incident management, and efficient troubleshooting.
  • Perform routine and emergency application and infrastructure maintenance, including patching, upgrades, and configuration management, adhering strictly to change control procedures.
  • Conduct root cause analysis (RCA) for production incidents and implement preventative measures to minimize future occurrences.
  • Develop and maintain automation scripts and tools (e.g., using Python, Bash) to streamline operational tasks, improve monitoring, and facilitate efficient deployments.
  • Proactively identify, recommend, and implement enhancements to existing application maintenance practices, operational workflows, and system reliability.
  • Serve as a technology subject matter expert for internal and external stakeholders, contributing to technology domain roadmaps and firm-mandated controls and compliance initiatives.
  • Appropriately assess and mitigate risk in all technical decisions, ensuring compliance with applicable laws, rules, regulations, and internal policies, while escalating and reporting control issues with transparency.
  • Present technical work to senior stakeholders, the team, and other technical teams.
  • Mentor and train junior team members, fostering a culture of knowledge sharing and continuous improvement.

Benefits

  • Medical, dental & vision coverage
  • 401(k)
  • Life, accident, and disability insurance
  • Wellness programs
  • Paid time off packages, including planned time off (vacation), unplanned time off (sick leave), and paid holidays