Software Engineer (SRE)

FIS Global, Atlanta, GA
Hybrid

About The Position

We are seeking a highly skilled Site Reliability Engineer (SRE) with strong expertise in platform engineering, infrastructure reliability, cloud operations, and automation. The ideal candidate will play a key role in ensuring the performance, stability, scalability, and security of our production environments while partnering closely with development and operations teams to build resilient, self‑service platforms.

Requirements

  • 10+ years of experience in SRE, DevOps, or platform engineering roles.
  • Proven experience building resilient, scalable, and highly available systems.
  • Expertise with AIX, IBM mainframe, Linux, Windows, Oracle, IIS, F5 load balancers, and Akamai.
  • Experience with Splunk, Dynatrace, BigPanda, Zabbix, SiteScope, Idera, and IBM TWS.
  • Experience with database tuning, advanced load balancer configuration, and Akamai WAF tuning.
  • Cloud and containerization experience.
  • Certifications (RHCE, Microsoft, Oracle, F5, Akamai).
  • Strong scripting and automation skills (Python, Bash, PowerShell, Go).
  • Familiarity with common enterprise application architectures, including multi‑tier (web/app/db), service‑oriented architecture (SOA), microservices, message‑driven/event‑driven systems, API‑centric integration patterns, and distributed system design principles.
  • Ability to translate complex technical concepts into clear, business‑friendly language and to convey expectations, risks, and solutions to clients.
  • Ability to communicate effectively with clients at all levels, technical and non‑technical, building trust by understanding their goals, constraints, and success criteria and by managing expectations proactively through clear, timely, and transparent dialogue.
  • Commitment to continuous improvement.
  • Ability to investigate issues across multi‑layered systems, identify root causes, and anticipate blockers before they occur.
  • Strong collaboration with cross‑functional teams (dev, ops, security, product).

Responsibilities

  • Infrastructure Reliability & Systems Engineering:
      • Manage, tune, and support enterprise environments across:
          • AIX: LPARs, PowerVM/VIOS, NIM, storage, tuning
          • Linux (RHEL, SUSE, Ubuntu): performance, security hardening, system services
          • Windows Server: IIS administration, clustering, GPO, patching
      • Support and optimize:
          • Oracle databases (RAC, Data Guard, RMAN, SQL*Net/capacity tuning)
          • IIS web infrastructure (app and thread pools, SSL/TLS, ARR, logs)
          • Load balancers (F5 BIG‑IP, HAProxy): monitors, iRules/policies
          • Akamai CDN/WAF, caching, edge configuration
      • Ensure operational excellence in backup/restore, patching, DR, and capacity management.
      • Provide enterprise application support for mission‑critical in‑house systems: performance, reliability, release operations.
  • Deployment & Release Engineering:
      • Own, manage, and execute deployments across UAT, Production, and DR.
      • Maintain and optimize deployment runbooks, build artifacts, and environment promotion workflows.
      • Implement safe‑deployment strategies including blue/green, canary, rolling, and feature‑flag‑based releases.
      • Coordinate with development and DevOps teams to ensure deployment readiness, including configuration, dependencies, and release validation.
      • Troubleshoot deployment issues, manage rollbacks, and ensure post‑release stability.
      • Enhance and maintain CI/CD pipelines to improve deployment predictability, reliability, and auditability.
      • Integrate deployment telemetry into observability tools to detect release‑related anomalies early.
      • Enforce deployment quality gates, configuration consistency, and compliance requirements.
      • Support continuous improvement of release processes, reducing manual steps and eliminating deployment toil.
  • Monitoring, Observability & Event Management:
      • Monitor system health and respond to incidents with a focus on rapid recovery, root‑cause analysis, and long‑term remediation.
      • Administer and optimize monitoring and observability tools including Splunk, Dynatrace, BigPanda, Zabbix, SiteScope, Wireshark, and Idera.
      • Build and maintain robust logging, metrics, and tracing stacks.
      • Develop dashboards, alerts, and automated remediation workflows.
      • Drive post‑incident reviews and continuous improvement initiatives.
  • Performance & Scalability:
      • Conduct capacity planning and performance tuning across infrastructure and applications.
      • Identify systemic issues and architect resilient solutions.
      • Collaborate with engineering teams to optimize systems for reliability and performance.
  • Automation & Platform Engineering:
      • Automate provisioning and operations using Ansible, PowerShell, Bash, and Python.
      • Implement Infrastructure‑as‑Code using Terraform and Ansible.
      • Build internal self‑service tools to reduce manual work.
      • Administer IBM Tivoli Workload Scheduler (TWS): workload automation, job streams, monitoring.
  • Security, Compliance & Governance:
      • Implement OS/middleware hardening, SSL/TLS certificate management, and vulnerability remediation.
      • Operate in high‑compliance environments (SOC 2, HIPAA, FedRAMP, ISO 27001).
      • Partner with security teams to remediate vulnerabilities and ensure environment hardening.
      • Support compliance requirements through automation, logging, and operational controls.
      • Follow ITIL processes for change, incident, and problem management.
  • Collaboration & Continuous Improvement:
      • Partner with DevOps, application, database, and network teams.
      • Maintain documentation, runbooks, diagrams, and standards.
      • Contribute to release planning, environment readiness, and cross‑team coordination.

Benefits

  • A voice in the future of fintech
  • Always-on learning and development
  • Collaborative work environment
  • Opportunities to give back
  • Competitive salary and benefits