Be responsible for ensuring the reliability of the business, including but not limited to monitoring and alerting, incident management, business continuity management, resource and capacity management, and campaign support.
Take part in the planning and development of operational tools to automate processes, improve efficiency, and reduce costs.
Enhance the existing stability assurance system and drive the implementation of best practices and processes for SRE operations to ensure scalability, reliability, and performance.
Collaborate with the Dev team and provide pertinent technical solutions based on their requirements. Proactively engage in effective communication to secure their support and ensure the successful delivery of relevant projects.
Provide 24/7 monitoring and response for the Games business: respond promptly to live incidents, locate faults quickly, and restore service to ensure business stability.
Requirements:
Bachelor's degree or above in Computer Science or related fields
3+ years of experience as an SRE, DevOps, or System Engineer
Expert in shell scripting; familiarity with Python or Go is a plus, and experience with React and JavaScript is also highly preferred
In-depth understanding of networking, Linux, and traffic scheduling
Familiar with Jenkins and GitLab; experienced in CI/CD process development and integration
Familiar with commonly used middleware and databases, such as Codis, Redis, MQ, and MySQL
Familiarity with Docker/k8s, including the underlying technologies and principles, is preferred
Able to respond promptly to and handle all fault incidents
An effective team player with a customer service orientation
Meticulous and attentive to detail, with strong critical thinking, data analysis, and problem-solving capabilities
Candidates with experience in independently leading technical projects will be given priority
Able to communicate effectively in English to work with stakeholders in other regions
Have a passion for reliable, high-performance systems, and care deeply about the end-user experience