Staff Software Engineer - Security and Reliability

Courtyard.io

116d•Remote

About The Position

We are actively recruiting a staff software engineer to own the security, reliability, and observability of the fastest-growing e-commerce startup. You will report directly to our Head of Engineering and work closely with many members of our engineering team. Your mission will include establishing and maintaining world-class observability, monitoring, and alerting systems; building systems that reduce operational toil for the entire engineering team; and conducting security audits, reviews, and mitigations across our entire platform.

We take reliability and security seriously. Doing so prepared us to scale to $500M in volume in under a year. You will help us scale the next 100x while keeping our systems secure and reliable.

About You

You have exceptionally high agency and you don't let yourself get stuck on problems: you find creative solutions to complex reliability and security challenges so the business never stops running. When systems fail, you build the automation and tooling that helps the entire team respond effectively, rather than just heroically fixing things yourself. You are a "professional hacker" in the best sense - someone who can operate without much guidance, exercise excellent judgment on when to build vs buy vs configure, and see security and reliability as fundamental enablers of business success rather than obstacles to overcome.

You understand that sometimes the rocket must be launched and completed in flight. This means you're comfortable making pragmatic security and reliability tradeoffs when needed, while always having a plan to improve things incrementally. You know when "good enough for now with monitoring" is the right answer, and when "we need to fix this before we ship" is non-negotiable.

Requirements

8+ years of experience building, securing, and operating complex distributed systems at scale. You've been on-call, you've debugged production incidents, and you've built the monitoring and automation systems that reduced toil for entire engineering organizations.
You are passionate about making systems observable, reliable, and secure. You understand that the best reliability work multiplies the effectiveness of the entire team - better monitoring means faster debugging for everyone, better automation means less manual toil, and better incident response processes mean the whole team can handle issues confidently. We don't believe in heroes; we believe in systems that make heroics unnecessary.
You understand our specific technology stack and can hit the ground running:
  - Go microservices running on Google Cloud Run
  - PostgreSQL
  - Redis
  - Google Cloud Platform infrastructure (Cloud Run, Cloud Build, Pub/Sub, Cloud Storage)
  - Terraform for infrastructure as code
  - Blockchain indexing and transaction submission
  - External service integrations
You have deep expertise in at least several of these areas:
  - Building comprehensive observability platforms (metrics, logs, traces, dashboards)
  - Designing and implementing effective alerting strategies that minimize noise while catching real issues
  - Creating automation and tooling that reduces operational toil
  - Establishing incident response processes, runbooks, and postmortem practices
  - Conducting security audits and threat modeling for distributed systems
  - Implementing security controls, authentication/authorization systems, and secrets management
  - Performance optimization and capacity planning for high-throughput systems
  - Database reliability, backup/recovery strategies, and data integrity
  - API security, rate limiting, and DDoS mitigation
  - Compliance and audit logging for financial systems

Responsibilities

Observability & Monitoring: Build and maintain comprehensive monitoring across our microservices architecture. Instrument our Go services with meaningful metrics. Create dashboards that tell the story of system health. Ensure every engineer can debug any issue in production with the data we collect.
Alerting & On-call Support: Design alerting strategies that wake people up for real problems, not noise. Every engineer is already in an on-call rotation - your job is to make their lives easier by building better alerts, better runbooks, and better automation. Reduce the toil so on-call is manageable and incidents are handled smoothly by whoever is on duty.
Security Audits & Reviews: Conduct regular security reviews of our codebase, infrastructure, and third-party integrations. Identify vulnerabilities before they become incidents. Work with the team to implement mitigations. Establish security best practices and ensure they're followed.
Incident Response Systems: Build the systems and processes that enable effective incident response across the team. Create runbooks, automate common remediation tasks, and establish postmortem practices that turn incidents into learning opportunities. Make it easy for any engineer to handle incidents confidently.
Reliability Engineering: Identify and eliminate single points of failure. Implement circuit breakers, retries, and graceful degradation. Build automation that reduces manual operational work. Ensure our systems can handle 100x growth without proportionally increasing operational burden.
Infrastructure Security: Secure our GCP infrastructure, manage secrets properly, implement least-privilege access controls, and ensure our Terraform configurations follow security best practices. Own the security of our CI/CD pipelines and deployment processes.

Benefits

A dynamic and engaging environment focused on fostering real growth and innovation
Opportunities to create amazing products that our customers truly love and value
Comprehensive health insurance packages with dependent coverage
Competitive salary with ample opportunities for career advancement and development
Enjoy the flexibility of a fully remote work environment
Access to employee wellness programs designed to support your overall well-being
401(k) plan with a 4% employer match to help you plan for the future
