Site Reliability Engineer II

PROS • Houston, TX

About The Position

The Site Reliability Engineer II optimizes service performance, actively participates in reliability improvements, and conducts in-depth SLO and capacity analysis. This position exists to enhance system reliability and scalability while contributing to automation and self-service tool development.

Requirements

5+ years of experience in enterprise networking, including hands‑on work with routing, switching, firewalls, load balancers, and VPN technologies.
Strong understanding of cloud networking architectures, including VPC/VNet design, peering, private link, and hybrid connectivity models.
Experience with network security technologies, such as security groups, NACLs, firewall policies, WAF, IDS/IPS, and micro‑segmentation.
Proficiency in Layer 2 and Layer 3 network protocols, including BGP, OSPF, EIGRP, DNS, DHCP, NAT, and IP addressing/subnetting.
Hands‑on experience with load balancers and ingress technologies, including F5, NGINX, Azure Application Gateway, ALB/NLB, or equivalent.
Strong troubleshooting skills using packet analyzers, flow logs, and network monitoring platforms.
Skilled in analyzing performance trends and identifying optimization opportunities.
Skilled in analyzing trends to inform service improvements.
Ability to collaborate with teams to align SLOs with user expectations.
Ability to develop moderately complex automation tools.
Skilled in analyzing capacity data to inform scaling decisions.

Nice To Haves

Bachelor’s Degree in Computer Science, Information Technology, or a related field
Practical experience with FortiGate firewalls and F5 appliances is highly desirable
Understand core AI concepts and apply them ethically to enhance productivity, insights, and decision-making.
Craft effective prompts to optimize the quality and relevance of AI-generated outputs.
Explore and apply agentic AI systems, using or managing autonomous agents to streamline workflows and automate tasks.
Leverage AI tools to boost efficiency, creativity, and innovation in daily work.
Stay curious and adaptable, continuously experimenting with AI-driven solutions to elevate team performance and customer impact.

Responsibilities

Monitor service performance, assist in troubleshooting production issues, and learn system architecture.
Monitor service reliability, participate in resolving basic issues, and learn disaster recovery testing procedures.
Understand SLO concepts, monitor and analyze SLO patterns, and assist in implementing SLO visualization and alerting.
Perform basic capacity analysis, identify trends in system capacity, and participate in capacity planning.
Deploy and maintain existing automation tools, create simple scripts, and troubleshoot automation scripts.
Collaborate with teams to improve monitoring coverage.
Participate in structured reliability testing and analysis.
Evaluate system components for resilience.
Contribute to reliability-focused design discussions.
Build internal self-service capabilities.
Evaluate automation opportunities for operational efficiency.
Recommend improvements for resource utilization.
Ensure scalability is considered in feature development.
Follow predefined procedures to deploy PROS products and third-party applications to cloud environments.
Contribute to release management documentation.
Gain an understanding of application architecture and the interactions between system components.