SRE Lead/Manager, Monitoring & Observability (M&O): W2 role

Information Technology | Remote in Knoxville, TN | Contract

Job Description

Senior Specialist – Monitoring & Observability (M&O)

As a Senior Specialist in Monitoring & Observability, you will design, implement, and standardize enterprise-grade monitoring and alerting solutions across complex, cloud-based environments. This role sits at the intersection of Observability, SRE, and Incident Management, with a focus on ensuring systems are reliable, measurable, and proactively monitored. You’ll collaborate with Cloud Operations, Architecture, and Platform Engineering teams to define best practices and build resilient, insight-driven infrastructure that supports business-critical services.

Your Impact

  • Implement and standardize monitoring and alerting tools across multiple cloud platforms to ensure consistent observability practices.
  • Architect observability solutions with Splunk, OpenTelemetry, AWS CloudWatch, GuardDuty, Wiz, and other modern monitoring stacks.
  • Design and build incident response workflows, playbooks, and dashboards for actionable insights and faster recovery.
  • Define and operationalize SLOs, SLIs, and error budgets to align with reliability goals.
  • Integrate observability tools with ServiceNow ITOM and CMDB for automated incident management and asset tracking.
  • Collaborate with Cloud Operations and Architecture teams to ensure observability is embedded in design, build, and run phases.
  • Automate monitoring configurations and embed observability into CI/CD pipelines.
  • Optimize performance and reliability through log analysis, metrics correlation, and distributed tracing.
  • Drive initiatives to improve MTTR, incident detection, and proactive issue prevention.
  • Provide technical leadership and mentorship, sharing best practices across engineering and operations teams.

Skills & Experience

  • 10+ years of experience in infrastructure engineering, with a significant focus on monitoring and observability.
  • Proven expertise with observability platforms such as Splunk, OpenTelemetry, AWS CloudWatch, GuardDuty, and Wiz.
  • Strong knowledge of logging, metrics, tracing, and open standards for observability.
  • Experience designing and managing incident response workflows and escalation processes.
  • Hands-on experience with ServiceNow ITOM and CMDB integrations.
  • Proficiency in cloud-native monitoring (AWS, Azure, GCP) and container observability (Docker, Kubernetes).
  • Familiarity with SRE principles: defining SLOs, SLIs, and error budgets.
  • Knowledge of automation practices and Infrastructure as Code (Terraform, CloudFormation, ARM templates).
  • Strong problem-solving skills with the ability to troubleshoot complex distributed systems.
  • Excellent communication, presentation, and leadership skills.

Set Yourself Apart With

  • Cloud certifications such as AWS Certified DevOps Engineer (Professional), Microsoft Certified: DevOps Engineer Expert, or Google Professional Cloud DevOps Engineer.
  • Experience in AIOps, predictive analytics, and security-driven observability.
  • Exposure to chaos engineering or performance engineering practices.
  • Experience in multi-cloud and hybrid environments with advanced observability patterns.