Observability Monitoring Engineer
(CANNOT WORK C2C) Must work W2.
Candidates must be willing to work onsite 3 days a week. No exceptions.
Job Description:
We are seeking a highly skilled Observability Monitoring Engineer with expert knowledge in Prometheus, Grafana, or Git. This role involves developing and managing telemetry for large-scale datasets and implementing strategies to enhance AI system reliability and performance, as well as assisting in capacity management.
Key Responsibilities:
Develop and manage telemetry systems for large-scale datasets.
Implement monitoring and alerting solutions to ensure system reliability.
Collect and analyze data to improve AI system performance.
Automate processes to enhance efficiency and reduce manual intervention.
Manage and maintain Kubernetes clusters and Docker containers.
Utilize Prometheus and Grafana for monitoring and visualization.
Work with DCGM/DCGM Exporter (Nvidia Stack) for telemetry.
Collaborate with data scientists to support AI/ML platforms.
Troubleshoot and resolve issues related to telemetry systems.
Primary Skills:
Telemetry/Observability, Monitoring and Alerting, Data Collection and Analysis, Automation
Prometheus and Grafana
JSON/YAML
Kubernetes and Docker/Container Technologies
DCGM/DCGM Exporter (Nvidia Stack)
Solid understanding of telemetry concepts, metrics, logs, and tracing
Benefits:
Eligibility requirements apply to some benefits and may depend on your job classification and length of employment.
Benefits are subject to change and may be subject to specific elections, plan, or program terms.
If eligible, the benefits available for this temporary role may include the following:
Medical, dental & vision
Critical Illness, Accident, and Hospital
401(k) Retirement Plan – Pre-tax and Roth post-tax contributions available
Life Insurance (Voluntary Life & AD&D for the employee and dependents)
Short and long-term disability
Health Savings Account (HSA)
Transportation benefits
Employee Assistance Program
Time Off/Leave (PTO, Vacation or Sick Leave)
About TEKsystems:
We're partners in transformation. We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team of 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company. The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.