Cloud SRE Engineer - Mandarin Bilingual

IntelliPro Group Inc.•Palo Alto, CA

5d•$70 - $100•Onsite

About The Position

Our North America cloud operations team is looking for a skilled Cloud SRE Engineer to own the reliability, stability, and continuous improvement of core cloud services, spanning compute infrastructure (CVM/VMs), networking, and cloud security products. You'll work in a production-critical environment where operational excellence, deep technical expertise, and a self-directed mindset are essential. Because the North America team operates independently of the teams in China and Singapore, with no overlapping working hours, we're looking for someone who can hit the ground running with minimal ramp-up time.

Requirements

Prior SRE, DevOps, or cloud operations experience; the ability to maintain application stability independently is essential given the time zone constraints
Mandarin/English bilingual preferred — ability to communicate with teams in China and Singapore is a plus
Strong networking fundamentals (TCP/IP, DNS, HTTP, ICMP, load balancing, firewalls, VPC) OR deep Linux/CVM knowledge — ability to own either the networking or compute side of operations
Hands-on experience with cloud platforms (AWS, GCP, Azure, or equivalent) — deployment, usage, and high availability
Familiarity with Kubernetes and container-based deployments
Proficiency in at least one scripting language (Python, Shell, or Go) with automation experience
Strong troubleshooting and debugging skills across infrastructure layers
Experience with monitoring and alerting tools (Grafana, Prometheus, CloudWatch, or equivalent)
Bachelor's degree or above in Computer Science or a related field
Strong self-directed work ethic — able to operate independently with minimal supervision across time zones

Responsibilities

Monitor and maintain cloud compute (CVM), networking, and security products in the North America region to ensure high availability and system stability
Respond to and resolve production incidents, customer-reported issues, and system-level outages with urgency and ownership
Perform deep troubleshooting across network, compute, security, and platform layers
Participate in the on-call rotation and handle live production issues independently
Deploy new features, bug fixes, and enhancements into production environments using CI/CD pipelines and internal tooling
Develop scripts and automation tools to improve operational efficiency and reduce toil
Build and improve monitoring, alerting, and disaster recovery systems for 24/7 operations
Document operational workflows, runbooks, and best practices
Work closely with R&D, security, and platform teams across time zones to drive service reliability
Communicate technical issues clearly to internal teams and B2B customers