Site Reliability Engineering Lead

Job Category: IT
Job Type: Full Time
Salary: RM12,000-RM18,000 per month
Job Location: Kuala Lumpur


Full job description

Responsibilities: 

  • Lead and mentor a team of SREs, fostering a culture of ownership, collaboration, and continuous improvement.
  • Define clear goals, performance metrics, and development plans for the team.
  • Design and implement strategies to improve system reliability, scalability, and performance.
  • Conduct root cause analysis of production incidents and develop preventive solutions.
  • Oversee the deployment, monitoring, and management of production environments.
  • Collaborate with development teams to design cloud-native infrastructure and architecture.
  • Drive automation of operational processes, reducing manual intervention and response times.
  • Optimize CI/CD pipelines to ensure smooth and rapid deployments.
  • Establish incident response protocols and lead efforts during major incidents.
  • Ensure robust monitoring and alerting systems are in place to proactively detect issues.
  • Act as a liaison between engineering, operations, and other teams to align objectives, and share insights and best practices with internal stakeholders to enhance overall system resilience.

Requirements:

  • Technical Expertise:
    Strong experience with cloud platforms (AWS, Azure, Google Cloud) and infrastructure-as-code tools (Terraform, Ansible, etc.).
    Proficiency in programming/scripting languages (Python, Go, Shell, etc.).
    Deep knowledge of Kubernetes, containerization, and distributed systems.
  • Leadership Skills:
    Proven track record of leading SRE or DevOps teams and managing large-scale production environments.
    Strong decision-making, prioritization, and problem-solving capabilities.
  • Monitoring & Metrics:
    Expertise in implementing and using monitoring tools (Prometheus, Grafana, Datadog, etc.) and logging systems.
    Familiarity with service-level objectives (SLOs), service-level agreements (SLAs), and error budgets.
  • Soft Skills:
    Excellent communication and collaboration skills for working with cross-functional teams.
    Ability to mentor and upskill team members, fostering a learning-oriented culture.
  • Experience:
    At least 8 years of experience in SRE, DevOps, or related roles with a focus on reliability engineering.

Benefits:

  • 13th month salary
  • 18 days of annual leave
  • Laptop and parking provided

Pay: RM12,000-RM18,000 per month
