22 May
Site Reliability Engineer - Onsite, Phoenix, AZ
Vacancy expired!
- 3+ years of hands-on experience in a production/operational role
- Experience with the operation and management of cloud-based services, including operational processes, such as incident management life cycle and event management approaches
- Any technical background in digital, coding, networking, systems, and/or applications – you’ll be exposed to all of these, and your future career path could involve a deep focus on any of these areas
- Familiarity with modern operations concepts such as Agile and DevOps
- Familiarity with security as it applies to operational platforms and processes
- Proven understanding of internet technologies, especially streaming and cloud technologies and services, basic network components and how they interoperate, as well as internet protocols and configurations
- Understanding of orchestration and monitoring of cloud-native services
- Firm grasp of a scripting or high-level programming language, such as Python, Ruby, or Perl, with demonstrated experience using that language for automation
- Ability to work under pressure in a fast-paced environment, make quick decisions, and keep a cool head
- Credibility with partner-stakeholders (business, engineering, product/program etc.) for a strong team and results-oriented delivery
- Excellent oral and written communication skills – demonstrated ability to influence technical and non-technical audiences including those at the senior leadership levels
- Schedule flexibility is key to support the operational needs of the business; this role may operate on an adjusted schedule including weekends, overnights, early mornings, etc.
- Overall responsibility for Production Operations of content services, applications, and events, including event monitoring, incident management, infrastructure, network, and cloud services management
- Create and maintain response plays across a variety of incident management and monitoring tools, such as Blameless, PagerDuty, ServiceNow, Dataminr, Datadog, and New Relic
- Drive system enhancements and overall reliability through Dev, SRE, and other functions to prevent future incidents and improve system resiliency/quality; own the post-mortem/incident analysis process with involved teams to identify the root cause and remediation tasks, identify areas of concern, and drive them through to resolution
- Handle the development and reporting of key operational metrics to drive improvements over time
- Lead the creation and ongoing updating of documentation, including operational runbooks; support monitoring and remediation activities; and provide guidance as a support and enabling function for Tier 1 Operations teams
- Support marquee events and day-to-day offerings by preparing documentation to guide event readiness and participating in day-of operational coverage
- Participate in the development and implementation of Digital Operations processes and SOPs and identify opportunities for automation of operational tasks, incident remediation, and scaling activities
- Execute production changes to Digital systems, infrastructure, products, and platforms in support of event and release activities
- Act as a point of contact for incident escalation for the Tier 1 Operations team, triaging and mitigating incidents as needed
- Work closely with the information security team to ensure security requirements are effectively met