25 Jul
HPC Data Center Operator
Vacancy expired!
Company Description
Join us and make YOUR mark on the World!
Are you interested in joining some of the brightest talent in the world to strengthen the United States' security? Come join Lawrence Livermore National Laboratory (LLNL) where our employees apply their expertise to create solutions for BIG ideas that make our world a better place.
We are committed to a diverse and equitable workforce with an inclusive culture that values and celebrates the diversity of our people, talents, ideas, experiences, and perspectives. This is essential to innovation and creativity for continued success of the Laboratory's mission.
Job Description
Do you love High Performance Computing (HPC)? Would you like to work with four of the fastest HPC systems in the world? We have an opening for an HPC Data Center Operator to monitor, diagnose, and repair system faults on a large number of high-performance computer (HPC) systems, storage systems and networks, working under minimal supervision. You will interact with other Livermore Computing (LC) staff to remediate problems and provide advanced technical support in a complicated HPC computing and networking environment, working either Swing (4:00pm - 12:00am) or Owl Shift (12:00am - 8:00am). This position is in the Livermore Computing Operations and Networking Group in the LC Division within the Computing Department under the Computation Directorate.
This position will be filled at either the 525.2 or 525.3 level depending on your qualifications. Additional job responsibilities (outlined below) will be assigned if you are selected at the higher level.
In this role you will
- Provide general technical support and monitoring capabilities as part of the operations team for the HPC systems including Sierra, large Linux clusters, file systems, and storage systems.
- Apply Unix system knowledge and use a variety of in-house and vendor-supplied diagnostic tools to monitor systems and effect basic system repairs.
- Troubleshoot and document general software, hardware, and network issues, and then apply corrective action/repairs to the problem or escalate as per defined procedures.
- Utilize the Laboratory's trouble ticketing system, ServiceNow, for problem ticket tracking.
- Receive, document, and accommodate all customer calls, particularly during off-hours, and resolve customer issues if possible, or escalate to the appropriate level.
- Perform data center facilities monitoring, problem remediation, and emergency event response during normal daily operation and off-hours.
- Participate in the decommissioning of older HPC systems and in system relocation activities.
- Promote the use of inter-departmental resources for tools, metrics, and common solutions to team members via email and presentations.
- Perform other duties as assigned.
Additional job responsibilities at the 525.3 level
- Provide advanced technical support and monitoring capabilities for the HPC systems clusters, file systems, and storage systems.
- Troubleshoot and document moderately complex software, hardware, and network issues, apply corrective action and repairs to the problem, or notify the appropriate on-call personnel.
- Perform a variety of high-level technical tasks in the installation, diagnosis, repair and maintenance of clustered computer systems and related file systems and networks.
Qualifications
- Ability to secure and maintain a U.S. DOE Q-level security clearance, which requires U.S. citizenship.
- Associate's degree in a computer-related field or equivalent combination of technical training and experience.
- Limited experience with and knowledge of high-performance computer operating systems, networks (WAN/LAN), and/or networking protocols, such as InfiniBand.
- Proficient verbal and written communication skills necessary to interact with customers and team members, and the ability to work independently and as a member of a team.
- General working knowledge and experience working with Linux system administration, commands and utilities.
- General working knowledge of parallel systems, distributed systems and/or network protocols, along with RAID arrays.
- Ability to understand and apply mechanical concepts and principles to solve problems with attention to detail.
- Experience in a customer support role and knowledge of the skills it requires, including listening, rapport-building, a friendly and approachable nature, courtesy, and patience.
- Ability to work all shifts, including weekends and holidays.
Additional qualifications at the 525.3 level
- Advanced knowledge of HPC operating systems, local area networks, and/or networking protocols such as InfiniBand.
- Advanced knowledge of and experience working with Linux system administration, commands and utilities, including scripting skills (e.g., shell, Perl, Python).
- Significant experience with high performance systems, parallel systems, distributed systems and/or network protocols, along with RAID arrays.
- Advanced training, certifications and experience in Linux system administration.
- Advanced knowledge and experience working with electrical systems.
- Background in computer support.
- Included in 2021 Best Places to Work by Glassdoor!
- Work for a premier, innovative national laboratory
- Comprehensive Benefits Package
- Flexible schedules (depending on project needs)
- Collaborative, creative, inclusive, and fun team environment