
GETTR
Job Description
Challenges you will solve:
- Participate in all stages of infrastructure provisioning, primarily providing staging and production support.
- Assist in implementing security best practices and initiatives at all levels of the systems infrastructure.
- Adhere to SRE (Site Reliability Engineering) principles for incident management and service level objectives.
- Work closely with DevOps engineers to apply and refine the automation scripts and system designs they share, improving system efficiency in the production environment.
- Ensure maximum uptime and stability of cloud and on-premises environments, especially staging and production.
- Apply the latest OS and security patches while ensuring compatibility with the applications running on those systems.
- Lead routine disaster recovery/business continuity (DRBC) exercises.
- Handle help desk & JIRA tickets and mitigate any production issues.
- Keep knowledge base documentation accurate and update it in a timely manner.
Requirements:
- Strong knowledge of secure web app deployments in AWS (4+ years).
- Advanced experience as a Linux or Windows server administrator.
- The ability to work with little supervision; must be self-driven and motivated.
- Experience with continuous integration/continuous delivery (CI/CD) using Jenkins and Git.
- Experience with containerized microservices delivered with Docker, Kubernetes (Kops, AWS EKS), or OpenShift 4.x.
- Experience managing and optimizing unified logging systems and APM (Application Performance Management) monitoring tools to continually reduce MTTR (Mean Time to Recovery).
- Strong experience with hybrid infrastructure systems monitoring and proactive incident management.
- Strong scripting skills using Shell and Python or Go (a plus).
- Some knowledge of web application programming languages and runtimes (such as JavaScript, Node.js, and Java).
- Ability to triage and troubleshoot urgent production issues precisely under high time pressure.
- Experience working collaboratively with application development teams throughout the organization to resolve mission-critical problems.
- Excellent written and oral communication skills necessary to produce and process technical documents.
- Excellent problem-solving and analytical skills and the ability to translate business requirements into information systems solutions.
- Experience with IT security.
- A team player.
- Familiarity/experience with the DevOps process.
- Professional IT certifications, such as Red Hat Certified Engineer/Windows Server, and AWS certifications (a huge plus).
- Relevant work experience (8+ years), either in software development or IT infrastructure.
- Master's degree in a technology-related field, engineering, or computer science (a plus).
- Participate in a weekly on-call rotation (roughly once every 3-4 weeks) as needed.
- Provide mission-critical production support during outages outside business hours, if necessary.