Job Category: IT
Job Type: Full Time
Salary: RM12000-RM18000 per month
Job Location: Kuala Lumpur
Full job description
Responsibilities:
- Lead and mentor a team of SREs, fostering a culture of ownership, collaboration, and continuous improvement.
- Define clear goals, performance metrics, and development plans for the team.
- Design and implement strategies to improve system reliability, scalability, and performance.
- Conduct root cause analysis of production incidents and develop preventive solutions.
- Oversee the deployment, monitoring, and management of production environments.
- Collaborate with development teams to design cloud-native infrastructure and architecture.
- Drive automation of operational processes, reducing manual intervention and response times.
- Optimize CI/CD pipelines to ensure smooth and rapid deployments.
- Establish incident response protocols and lead efforts during major incidents.
- Ensure robust monitoring and alerting systems are in place to proactively detect issues.
- Act as a liaison between engineering, operations, and other teams to align objectives.
- Share insights and best practices with internal stakeholders to enhance overall system resilience.
Requirements:
- Technical Expertise:
Strong experience with cloud platforms (AWS, Azure, Google Cloud) and infrastructure-as-code tools (Terraform, Ansible, etc.).
Proficiency in programming/scripting languages (Python, Go, Shell, etc.).
Deep knowledge of Kubernetes, containerization, and distributed systems.
- Leadership Skills:
Proven track record of leading SRE or DevOps teams and managing large-scale production environments.
Strong decision-making, prioritization, and problem-solving capabilities.
- Monitoring & Metrics:
Expertise in implementing and using monitoring tools (Prometheus, Grafana, Datadog, etc.) and logging systems.
Familiarity with service-level objectives (SLOs), service-level agreements (SLAs), and error budgets.
- Soft Skills:
Excellent communication and collaboration skills to work across cross-functional teams.
Ability to mentor and upskill team members, fostering a learning-oriented culture.
- Experience:
At least 8 years of experience in SRE, DevOps, or related roles with a focus on reliability engineering.
Benefits:
- 13th month salary
- Annual Leave 18 days
- Laptop & Parking Provided
Pay: RM12000-RM18000 per month