20 May
Site Reliability Engineer - Onsite, Phoenix, AZ
Phoenix, Arizona 85001, USA

Vacancy expired!

Skills you have:
  • 3+ years of hands-on experience in a production/operational role
  • Experience with the operation and management of cloud-based services, including operational processes, such as incident management life cycle and event management approaches
  • Any technical background in digital, coding, networking, systems, and/or applications – you’ll be exposed to all of these, and your future career path could involve a deep focus on any of these areas
  • Familiarity with modern operations concepts such as Agile and DevOps
  • Familiarity with security as it applies to operational platforms and processes
  • Proven understanding of internet technologies, especially streaming and cloud technologies and services, basic network components and how they interoperate, as well as Internet protocols and configurations
  • Understanding of orchestration and monitoring of cloud-native services
  • Firm grasp of a scripting or high-level programming language, such as Python, Ruby, or Perl, with demonstrated proficiency in using it for automation
  • Ability to make quick decisions and keep a cool head under pressure in a fast-paced environment
  • Credibility with partner stakeholders (business, engineering, product/program, etc.), built on strong teamwork and results-oriented delivery
  • Excellent oral and written communication skills – demonstrated ability to influence technical and non-technical audiences including those at the senior leadership levels
  • Schedule flexibility is key to supporting the operational needs of the business; this role may operate on an adjusted schedule including weekends, overnights, early mornings, etc.

Responsibilities:
  • Overall responsibility for Production Operations of content services, applications, and events, including event monitoring, incident management, and infrastructure, network, and cloud services management
  • Create and maintain response plays across a variety of incident management and monitoring tools, such as Blameless, PagerDuty, ServiceNow, Dataminr, Datadog, and New Relic
  • Drive system enhancements and overall reliability through Dev, SRE, and other functions to prevent future incidents and improve system resiliency and quality; own the post-mortem/incident analysis process with the involved teams to identify the root cause and remediation tasks; and identify areas of concern and drive them through to resolution
  • Handle the development and reporting of key operational metrics to drive improvements over time
  • Lead the creation and ongoing updating of documentation, including operational runbooks; support monitoring and remediation activities; and provide guidance as a support and enabling function for Tier 1 Operations teams
  • Support marquee events and day-to-day offerings by preparing documentation to guide event readiness and participating in day-of operational coverage
  • Participate in the development and implementation of Digital Operations processes and SOPs and identify opportunities for automation of operational tasks, incident remediation, and scaling activities
  • Execute production changes to Digital systems, infrastructure, products, and platforms in support of event and release activities
  • Act as a point of contact for incident escalations from the Tier 1 Operations team, triaging and mitigating incidents as needed
  • Work closely with the information security team to ensure security requirements are effectively met


