10 Apr
Site Reliability Engineer (SRE) Level 3 - (Direct Hire with Client)
Vacancy expired!
- 7+ years of experience in infrastructure, system engineering, QA/testing automation.
- Demonstrable subject matter expertise in testing methodology and test automation frameworks.
- Advanced level of Linux/Unix experience.
- Full-stack software test engineer with advanced experience in at least 2-3 of the following technologies: React, Go, Git, Bitbucket, Python, SQL or NoSQL databases, Docker, Kubernetes, and monitoring tools like New Relic and Stackdriver.
- Strong experience in at least 2 of the following sets of logging and monitoring tools: ELK stack, Prometheus, Stackdriver, New Relic, Datadog, Dynatrace.
- Intermediate knowledge of software release tooling, including but not limited to Bitbucket, Jenkins, Cloud Build, Spinnaker, and GitLab.
- Experience in designing, analyzing, scaling, and troubleshooting medium-scale distributed systems.
- Well versed in SRE methodologies and passionate about solving operational problems through automation and software engineering.
- Ability to communicate effectively vertically and horizontally within the organization via demonstrated written and verbal communication skills.
- Intermediate knowledge of Docker technologies, including experience in optimizing Docker images and managing the Docker image lifecycle.
- Experience working with Google Cloud preferred, but experience with any other public cloud provider will be considered.
- API and front-end testing automation.
- Microservices lifecycle management (integration, testing, deployment).
- A systematic problem-solving approach coupled with strong communication skills and a sense of ownership and drive.
- Identify opportunities to design, build and implement innovative solutions to solve unique platform and infrastructure problems to enhance developer workflow and production stability for the products.
- Collaborate with other senior team members to evangelize the SRE mindset and system design to implement technology solutions that will maximize the performance and availability of our environment.
- Design and implement orchestration and tooling solutions to ensure that repetitive administration tasks are performed efficiently and free of defects.
- Design and implement monitoring and recovery tools to provide for site high availability (HA) and disaster recovery (DR).
- Design and develop highly available infrastructure and platform components to meet the needs of our growing and evolving product lines.
- Design and implement security engineering best practices in all our deployed platforms and environments.
- Triage alerts, diagnose and resolve critical issues, and manage the implementation of changes.
- Manage the coordination, documentation, and tracking of critical incidents, ensuring rapid and complete issue resolution and appropriate closed-loop follow-up with customers and other key stakeholders.
- Develop a continuous integration/continuous deployment (CI/CD) orchestration system to reduce friction in software delivery to production.
- Evangelize the SRE mindset and mentor others on reliability and SRE best practices.
- Work with engineering to identify and implement opportunities for automation, signal noise reduction, elimination of recurring issues, and other actions that reduce the time to mitigate service-impacting events and increase the productivity of cloud operations and development resources.
- Maintain a strong understanding of IaaS, PaaS, and SaaS offerings while building and maintaining a state-of-the-art, cloud-based environment for massive-scale data processing.
- Ensure that implementations and solutions are fully documented and that solutions are deployed with fully operationalized processes to support the solution lifecycle.
- Other tasks as assigned.