Summary
We are seeking a Senior SRE Engineer (Wallet Operations Focus) to ensure the stability, availability, and performance of our core business infrastructure on AWS. The role involves managing global production environments, building scalable and highly available systems, implementing automation and observability platforms, and maintaining security and compliance standards.
Job Purpose
- Owns deployments across production environments.
- Ensures systems run reliably, efficiently, and at scale.
- Builds tools to improve uptime, performance, and incident response.
Responsibilities
- Ensure global infrastructure stability, availability, and performance on AWS for core business operations, taking ownership of production SLAs.
- Design, operate, and troubleshoot cloud-native components such as Kubernetes, Envoy, service meshes (Istio/Linkerd), and Ingress controllers.
- Improve operational efficiency through automation and platform tools (IaC, CI/CD), achieving system observability, self-healing, and fast recovery from incidents.
- Implement and maintain operational security practices, including access control (AWS IAM/K8s RBAC), network security policies, vulnerability management, and incident response.
- Build and enhance a global operations system, including capacity planning, monitoring and alerting (Prometheus/ELK), CI/CD pipelines (GitLab/Jenkins), disaster recovery, and automated fault recovery.
- Develop a deep understanding of the business architecture and take part in designing high-availability and disaster recovery solutions while continuously optimizing cost.
Qualifications
- 5+ years of Linux operations, SRE, or DevOps experience, with expertise in managing large-scale distributed systems.
- Proficient in AWS core services (EC2, S3, VPC, IAM, ELB, RDS, etc.) with architecture, operations, and cost optimization experience.
- In-depth knowledge of Kubernetes architecture, with hands-on experience managing, troubleshooting, and performance-tuning large-scale production clusters.
- Familiarity with Envoy, Istio/Linkerd service mesh, or Nginx/Istio Ingress controllers for L7 traffic management.
- Strong operational security awareness and practices, including common OS, network, and application security vulnerabilities and mitigation measures.
- Proficient in at least one programming language (Go/Python/Shell) to implement automation solutions for operational challenges.
- Strong experience with observability stacks such as Prometheus and ELK, capable of building efficient monitoring platforms.
- Proven experience in capacity planning and performance testing, with the ability to quantify system bottlenecks and plan accordingly.
Preferred:
- Experience managing SRE/tooling/platform teams.
- Familiarity with Grafana alongside the Prometheus and ELK stacks required above.
- Professional certifications such as AWS (SAA/SAP) or Kubernetes (CKA/CKAD/CKS).