SRE Solutions Specialist - Monitoring & Observability (M&O)

Information Technology | Remote in Knoxville, TN | Contract and Temp to Full Time

Job Description

Role: SRE Solutions Specialist - Monitoring & Observability (M&O)

Duration: Long-Term Contract

Location: Remote, with occasional travel to Knoxville, TN required


As a Senior Specialist in Monitoring & Observability, you will design, implement, and standardize enterprise-grade monitoring and alerting solutions across complex, cloud-based environments. This role sits at the intersection of Observability, SRE, and Incident Management, with a focus on ensuring systems are reliable, measurable, and proactively monitored. You’ll collaborate with Cloud Operations, Architecture, and Platform Engineering teams to define best practices and build resilient, insight-driven infrastructure that supports business-critical services.


What you would do:

Implement and standardize monitoring and alerting tools across multiple cloud platforms to ensure consistent observability practices.

Architect observability solutions with Splunk, OpenTelemetry, AWS CloudWatch, GuardDuty, Wiz, and other modern monitoring stacks.

Design and build incident response workflows, playbooks, and dashboards for actionable insights and faster recovery.

Define and operationalize SLOs, SLIs, and error budgets to align with reliability goals.

Integrate observability tools with ServiceNow ITOM and CMDB for automated incident management and asset tracking.

Collaborate with Cloud Operations and Architecture teams to ensure observability is embedded in design, build, and run phases.

Automate monitoring configurations and embed observability into CI/CD pipelines.

Optimize performance and reliability through log analysis, metrics correlation, and distributed tracing.

Drive initiatives to improve MTTR, incident detection, and proactive issue prevention.

Provide technical leadership and mentorship, sharing best practices across engineering and operations teams.

Skills & Experience Required

10+ years of experience in infrastructure engineering, with significant focus on monitoring and observability.

Proven expertise with observability platforms such as Splunk, OpenTelemetry, AWS CloudWatch, GuardDuty, and Wiz.

Strong knowledge of logging, metrics, tracing, and open standards for observability.

Experience designing and managing incident response workflows and escalation processes.

Hands-on experience with ServiceNow ITOM and CMDB integrations.

Proficiency in cloud-native monitoring (AWS, Azure, GCP) and container observability (Docker, Kubernetes).

Familiarity with SRE principles: defining SLOs, SLIs, and error budgets.

Knowledge of automation practices and Infrastructure as Code (Terraform, CloudFormation, ARM templates).

Strong problem-solving skills with the ability to troubleshoot complex distributed systems.

Excellent communication, presentation, and leadership skills.

Preferred Certifications and Experience

Cloud certifications such as AWS DevOps Engineer, Azure DevOps Engineer Expert, or Google Professional Cloud DevOps Engineer.

Experience in AIOps, predictive analytics, and security-driven observability.

Exposure to chaos engineering or performance engineering practices.

Experience in multi-cloud and hybrid environments with advanced observability patterns.