Site Reliability Engineering (SRE) Architect
Information Technology | Dallas, TX | Contract
Position Title: Site Reliability Engineering (SRE) Architect (Telecom OSS/BSS & Mainframe)
Location: Dallas, TX; Basking Ridge, NJ; NC; and Tampa, FL
Work Arrangement: Hybrid/Onsite
Interview Type: Video
Must have:
15+ years of progressive experience in enterprise IT and telecommunications environments, with extensive expertise in designing, implementing, and supporting complex OSS/BSS ecosystems that enable large-scale business and network operations.
8+ years of hands-on architecture experience across IBM Mainframe z/OS and midrange platforms (Linux/Solaris), delivering scalable, secure, and highly available enterprise solutions.
Demonstrated expertise in Site Reliability Engineering (SRE) principles, including defining and managing Service Level Objectives (SLOs), Service Level Indicators (SLIs), Error Budgets, reliability governance, and continuous service improvement.
Deep functional and technical knowledge of Telcordia OSS applications, including SWITCH, TIRKS, FACS, WFA, and SOAC, with experience integrating and optimizing telecom operational support systems.
Proven ability to design and implement high-availability, fault-tolerant, resilient, and disaster recovery architectures, ensuring business continuity and mission-critical system reliability.
Strong hands-on expertise with IBM Mainframe technologies, including z/OS internals, JCL, IMS, VSAM, DB2, CICS, system utilities, workload management, performance tuning, and production diagnostics.
Extensive experience implementing observability and monitoring solutions using industry-leading tools such as Splunk, Dynatrace, Instana, IBM Netcool, Grafana, and AppDynamics to improve operational visibility and proactive incident detection.
Proven success in driving automation, self-healing capabilities, infrastructure as code, CI/CD reliability practices, and DevOps/SRE transformation across hybrid cloud and on-premises enterprise environments.
Strong understanding of end-to-end telecommunications business processes, including service provisioning, inventory management, order management, activation, network fulfillment, service assurance, and lifecycle management.
Extensive experience leading major incident management, conducting Root Cause Analysis (RCA), problem management, and implementing preventive measures to significantly improve MTTD (Mean Time to Detect), MTTR (Mean Time to Resolve), system stability, and operational excellence.
Proven ability to collaborate with cross-functional teams including Enterprise Architecture, Infrastructure, Development, Operations, Network Engineering, and business stakeholders to deliver highly reliable, business-critical technology solutions.
Excellent leadership, stakeholder management, and communication skills, with a strong track record of mentoring technical teams, driving reliability engineering best practices, and supporting large-scale enterprise transformation initiatives.
About Us
At Radiant Digital, we provide IT solutions and consulting services to government agencies and businesses in the USA, Canada, the Middle East, and Southeast Asia. On the federal side, we support agencies like NASA, the Department of State (DOS), the IRS, ACL, ACF, USDA, and many others, along with numerous state and local government agencies.
We work with industries like telecom, healthcare, entertainment, and oil and gas, offering solutions designed to meet their specific needs. We focus on improving systems, making better use of data, and updating applications to keep up with changing markets.
