INFINITE CHOICE LLC, posted 4 months ago

We're seeking an exceptional Site Reliability Engineer to architect, design, and build our SRE foundation from the ground up at InfiniteChoice. This is a rare greenfield opportunity to establish SRE practices, develop custom tooling, and create the reliability culture that will support our platform serving millions of users and billions in transaction volume.

As our SRE, you'll combine deep technical expertise with strategic vision to build world-class monitoring, observability, and automation systems. You'll have the autonomy to define our SRE processes, select technologies, and create the framework that ensures our systems are reliable, scalable, and performant.

Location: Remote, US-based.

  • Build SRE practices from scratch - define SLIs, SLOs, error budgets, and reliability metrics
  • Establish incident response procedures, on-call rotations, and post-mortem processes
  • Create reliability engineering standards and best practices across all engineering teams
  • Develop disaster recovery and business continuity strategies
  • Design and implement capacity planning and performance optimization frameworks
  • Drive architecture decisions for comprehensive application and infrastructure monitoring solutions
  • Design and develop custom SRE tools for automated monitoring, alerting, and remediation
  • Build observability platforms that provide deep insights into system performance and user experience
  • Create automation frameworks for deployment, scaling, and incident response
  • Architect logging, metrics, and tracing systems for distributed microservices environments
  • Leverage Google Cloud Platform services to build resilient, scalable infrastructure
  • Implement cloud-native monitoring using Google Cloud's operations suite (formerly Stackdriver), including Cloud Monitoring and Cloud Logging
  • Design auto-scaling and self-healing systems using GKE, Cloud Functions, and managed services
  • Optimize cloud costs while maintaining high availability and performance standards
  • Establish security and compliance frameworks within GCP environments
  • Research and implement cutting-edge SRE tools and methodologies
  • Leverage AI and machine learning for predictive analytics, anomaly detection, and automated remediation
  • Create dashboards and reporting systems that provide actionable insights to engineering and business teams
  • Establish feedback loops for continuous improvement of reliability and performance
  • Stay current with industry best practices and emerging technologies in the SRE space
  • 5+ years of experience in Site Reliability Engineering or Infrastructure Engineering
  • Proven track record designing and implementing monitoring and observability solutions at scale
  • Deep understanding of distributed systems, microservices architectures, and cloud-native patterns
  • Experience with infrastructure as code, configuration management, and deployment automation
  • Hands-on experience with Google Cloud Platform is required
  • Expertise with GCP monitoring and observability stack (Cloud Monitoring, Cloud Logging, Cloud Trace)
  • Experience with GKE, Compute Engine, Cloud Functions, and other core GCP services
  • Knowledge of GCP networking, security, and compliance capabilities
  • Understanding of GCP cost optimization and resource management
  • Strong programming skills in Python, Go, Java, or similar languages
  • Experience with monitoring tools (Prometheus, Grafana, Datadog, New Relic, or similar)
  • Proficiency with containerization (Docker, Kubernetes) and orchestration platforms
  • Knowledge of CI/CD pipelines, automated testing, and deployment strategies
  • Understanding of database performance tuning and optimization (SQL and NoSQL)
  • Familiarity with AI-driven development tools and methodologies is a huge plus
  • Experience with machine learning for operations (AIOps), anomaly detection, or predictive analytics
  • Knowledge of automated incident response and self-healing systems
  • Understanding of AI/ML tools for log analysis, pattern recognition, and intelligent alerting
  • Strong analytical and troubleshooting skills for complex distributed systems
  • Experience with high-pressure incident response and crisis management
  • Detail-oriented with commitment to operational excellence and continuous improvement
  • Comfortable with ambiguity and building processes in a fast-growing environment
  • Passion for reliability, automation, and engineering best practices
  • Demonstrated experience building SRE programs and processes from the ground up is a HUGE plus
  • Bachelor's degree in Computer Science, Engineering, or equivalent professional experience
  • Industry certifications preferred (Google Cloud Professional, SRE, or related)
  • Ground-floor opportunity to build SRE practices and culture from scratch
  • Full autonomy to define processes, select technologies, and establish best practices
  • Direct impact on platform reliability serving millions of users
  • Opportunity to create lasting engineering culture and operational excellence
  • Remote-first culture with in-person meetings in Dallas, TX on an as-needed basis
  • Collaborative environment with smart, passionate engineers and cross-functional teams
  • Access to cutting-edge technologies and AI-driven development tools
  • Competitive compensation, equity participation, and comprehensive benefits