03 Jul
Site Reliability Engineer - 100% Remote
Dallas / Fort Worth, Texas 75342, USA

  • Performs application-specific production support, incident management, problem management, RCAs, and service restoration as needed to quickly respond to and resolve production issues.
  • Free up developer resources to focus on building new product features by handling most aspects of operating the products effectively and proactively managing the customer experience.
  • Plan and achieve high availability and performance of the product service.
  • Ensure proactive monitoring of all core services and processes to prevent unplanned service disruption.
  • Implement self-healing and scalability of technical services to avoid unplanned disruptions.
  • Establish observability of business system health by integrating with the observability platform using automation.
  • Maintains the operations runbook for business-hours and off-hours system support.
  • Partners with engineering to ensure successful change management from development to delivery.
  • Implements and trains team members on the tool consolidation strategy to optimize spend versus value for our end-to-end monitoring platform.
  • Contributes to definition of strategy, standardization of technologies, and establishment of patterns for rapid and continuous development and application of automated solutions to address reliability issues and automate manual tasks.
  • Leads and implements the capability to measure core product availability across Azure Cloud using HTTP endpoint testing and synthetic user testing, and trains team members on it.
  • Present the usability, reliability, incidents, and user experience of the core product services to senior and/or executive leadership on a weekly basis.
  • Define and report SLOs/SLAs for 99.9% availability to executive leadership and business partners.
  • Influences product delivery teams to implement usability and reliability enhancements, leading to improved user experience index scores and improved availability.
  • Provide detailed analysis and troubleshooting for system outages, giving feedback to product/software engineering.

