Software Development Engineer - Product Reliability Engineering

Visa • Highlands Ranch, CO

About The Position

Every time someone taps, swipes, or clicks to pay, Visa infrastructure makes it happen in milliseconds, across 200+ countries. As a Software Development Engineer on the Product Reliability Engineering (PRE) team, you won’t just watch those systems run; you’ll be one of the engineers building, automating, and evolving them. PRE is not a traditional ops team. We are a software engineering organization that treats infrastructure as code, reliability as a product, and automation as a strategic advantage. You’ll write Python, build agentic AI tools, manage data platforms, and contribute to the distributed systems that process billions of real-time transactions. From day one, you are an engineer, and from day one, your work matters. If you are endlessly curious about how large-scale systems stay resilient, obsess over elegant automation, and want to launch your career at the intersection of AI, infrastructure, and global financial technology, this role was built for you.

Requirements

Bachelor's degree or 3+ years of relevant work experience.
Solid foundations in data structures, algorithms, and systems design: you can reason about complexity, tradeoffs, and failure modes.
Proficiency in Python and comfort writing scripts or tools in at least one additional language (Go, Java, or Bash).
Foundational understanding of relational databases (RDBMS): SQL, data modeling, query optimization, and database connectivity troubleshooting.
Familiarity with Linux/Unix environments and meaningful command-line fluency.
Exposure to cloud platforms (AWS, GCP, or Azure) and a conceptual understanding of containerization (Docker, Kubernetes).
Understanding of CI/CD principles and how modern software delivery pipelines are structured and maintained.
Genuine curiosity about GenAI platforms and agentic systems (OpenAI, Anthropic Claude, LangChain, or similar). Hands-on exposure is a plus; intellectual interest is a must.

Nice To Haves

Bachelor’s degree in Computer Science, Software Engineering, or a related technical field (2023–2025 graduates preferred; December 2025 graduates welcome).
Hands-on experience with infrastructure-as-code tools (Terraform, Ansible, or Pulumi), even from coursework, a capstone, or an internship.
Experience with database CI/CD tooling, particularly Liquibase for schema change management across environments.
Experience with observability tooling: Prometheus, Grafana, Splunk, ELK, or Datadog.
Database administration exposure: backup/recovery procedures, performance tuning, index management, or replication monitoring.
Familiarity with Git workflows and modern DevOps toolchains (Jenkins, GitHub Actions, ArgoCD).
Academic or project experience with ML frameworks: scikit-learn, PyTorch, or LangChain / LangGraph.
Understanding of networking fundamentals: DNS, load balancing, service mesh, or TCP/IP.
A GitHub profile, personal project, hackathon entry, or open-source contribution that shows us how you think and build.

Responsibilities

Design and ship end-to-end automation for deployment pipelines, infrastructure provisioning, and release orchestration — code that runs millions of times so engineers never have to repeat themselves.
Write clean, production-grade Python (and Go or Bash where it counts) to eliminate toil, reduce manual intervention, and make systems self-managing.
Develop modular frameworks for release scheduling, validation, rollback, and reporting that integrate across the full software delivery lifecycle.
Support the build, deployment, and operations of relational database systems, contributing to schema design, architecture decisions, and solution engineering for critical payment data infrastructure.
Gain exposure to real-time event streaming architectures that support payment processing at scale.
Perform database health operations including patching, upgrades, backups, and recovery to maintain the availability and integrity of tier-1 production databases.
Optimize query performance through index tuning, execution plan analysis, and replication monitoring — targeting metrics like query execution time, CPU usage, and replication latency.
Automate database tasks and configuration management using tools like Ansible and Liquibase, and contribute to CI/CD pipelines that safely govern schema changes across TEST and PROD environments.
Build predictive and reactive monitoring dashboards for database anomalies, surfacing health signals before they become incidents.
Build GenAI-powered engineering assistants that automate deployment orchestration, release governance, and environment lifecycle management.
Integrate LLMs into observability, incident response, and developer support workflows, transforming reactive operations into proactive, AI-driven intelligence.
Contribute to prompt engineering, model fine-tuning, and agentic automation initiatives that position PRE as one of the most AI-forward reliability organizations in financial technology.
Build dashboards, alerts, and metrics using Prometheus, Grafana, Splunk, or ELK that give engineers real-time clarity on complex, globally distributed systems.
Analyze system performance and availability data and turn insights into infrastructure improvements that prevent incidents before they occur.
Contribute to self-healing and auto-scaling capabilities that keep critical payment infrastructure resilient without human intervention.
Ensure infrastructure and data platforms meet security and compliance standards across cloud-native deployments supporting global financial services at scale.
Support zero-downtime deployment strategies and high-availability architectures that Visa’s partners and billions of cardholders depend on around the clock.
Participate in threat modeling, vulnerability remediation, and audit readiness activities as part of a team that treats security as a first-class engineering concern.
Embed within Agile squads, working alongside senior engineers, product managers, and global PRE peers across sprint planning, reviews, and release discussions.
Document runbooks, SOPs, and engineering guides that make the team smarter, faster, and more autonomous over time.
Participate in on-call rotations (with robust support structures and mentorship) to build the incident response instincts that distinguish great reliability engineers.