Senior DevOps & Infrastructure Lead

Software Engineering | Remote | Full Time | From $100,000 to $130,000 per year

Job Description

About the Role

RV LIFE is looking for a Senior DevOps & Infrastructure Lead to help us stabilize, document, and modernize the infrastructure behind our products.

This is a hands-on senior role for someone comfortable inheriting real production systems, reducing operational risk, improving reliability, and moving us toward a documented, secure, automated, infrastructure-as-code operating model.

We run production across DigitalOcean, AWS, Cloudflare, and other hosting providers, and are consolidating onto managed, infrastructure-as-code platforms. We need deep, hands-on expertise across these environments.

RV LIFE is an AI-first engineering organization. We expect this person to use AI to accelerate discovery, documentation, runbooks, log review, scripting, and infrastructure-as-code drafting, while applying strict human judgment around security, secrets, production access, destructive commands, rollback, and correctness.

This role focuses on the infrastructure path to reliability; application-level architecture changes are handled in partnership with our engineering team. It is not just about keeping servers alive. It is about building durable practices that reduce single-person dependency, improve visibility, and make our systems safer to operate.

This is not a standard 9-to-5 role. Production issues do not keep business hours, so it carries real on-call responsibility: you need to be reachable and able to respond when unforeseen incidents arise.

What You'll Do

  • Administer and improve existing DigitalOcean infrastructure.

  • Support and improve Linux-based production server environments.

  • Migrate self-managed databases onto managed database services, with validated failover, backups, and recovery.

  • Move applications onto managed runtimes (including Laravel Cloud where it fits), replacing manual deploy processes with automated, repeatable pipelines.

  • Expand and harden our use of Cloudflare for edge, static hosting, caching, and security.

  • Build a clear inventory of servers, services, databases, domains, access paths, backups, monitoring, and operational risks.

  • Create and maintain practical runbooks for common and emergency infrastructure workflows.

  • Improve incident response, escalation paths, monitoring, logging, and alerting.

  • Review and improve backup, restore, and disaster-recovery procedures.

  • Identify recurring manual work and convert it into safer procedures, scripts, automation, or infrastructure-as-code.

  • Help define infrastructure-as-code standards and move appropriate infrastructure into repeatable, version-controlled workflows.

  • Work with AWS services where needed (Lambda, VPC, IAM, CloudWatch, S3, SSM/Secrets Manager, queues).

  • Use AI tools to accelerate discovery, documentation, scripting, troubleshooting, and automation, with strong production-safety judgment.

  • Partner with engineering leadership to prioritize infrastructure risk and modernization; track work clearly in Jira/GitHub and communicate proactively about risks, tradeoffs, and blockers.

What Success Looks Like

In the first 30-60 days, you'll take ownership of how we see and operate our infrastructure, building on what we already track and closing the gaps.

You'll validate and take ownership of what already exists:

  • Our infrastructure inventory and server map

  • Our monitoring and alerting

  • Our DNS / Cloudflare configuration

  • Our prioritized infrastructure risk register

You'll create what we're missing:

  • An access and credential map

  • Verified backup and restore status for critical systems (tested, not assumed)

  • Runbooks for the highest-risk operational workflows

In the first 90 days, you'll move us toward a durable, consolidated model. Success means:

  • The first core database migrated to a managed service, with a tested restore, plus a clear, sequenced plan for the rest.

  • The first application running on a managed runtime (DigitalOcean App Platform or Laravel Cloud).

  • The first static frontend served from Cloudflare Pages.

  • A measurably stronger edge security posture.

  • Critical systems no longer understood by only one person; common tasks have documented procedures; manual processes are being converted to automation; AI is used safely to reduce toil.

What We're Looking For

  • Senior-level experience operating production infrastructure.

  • Deep, hands-on Linux server administration (the traditional, "old-school" kind): operating, securing, and troubleshooting manually managed production servers (LAMP/LEMP, system services, cron, networking, SSH) directly at the command line, not only through a cloud console.

  • Experience with DigitalOcean, Linode, AWS EC2, bare VPS hosting, or comparable environments.

  • Senior database operations: migrating self-managed MySQL to a managed service, replication, backup validation, restore testing, and IO isolation.

  • Strong Cloudflare experience across DNS, WAF, CDN and caching behavior, page rules, Workers, Pages, and Zero Trust/Access, including traffic routing and origin protection.

  • Experience with PHP/Laravel application environments and with a managed Laravel runtime (Laravel Cloud and/or DigitalOcean App Platform).

  • Datadog or a comparable observability platform for monitoring, alerting, dashboards, logs, and incident investigation.

  • Infrastructure-as-code such as Terraform, Pulumi, AWS CDK, Serverless Framework, or CloudFormation.

  • CI/CD pipelines and deployment automation.

  • Practical AWS experience (Lambda, IAM, VPC, CloudWatch, S3, SSM/Secrets Manager, queues).

  • Good judgment around production safety, access control, secrets, backups, and incident response.

  • Willingness to carry real on-call responsibility and respond to production incidents outside normal business hours; this is not a strict 9-to-5 role.

  • A habit of documenting what you learn and creating runbooks others can follow.

  • Practical experience using AI tools (ChatGPT, Claude, Cursor, GitHub Copilot, or similar), with strong judgment about where human verification is required.

  • Ability to work independently in a small, remote engineering organization where practical ownership matters more than bureaucracy.

Nice to Have

  • Experience migrating manually managed services onto managed platforms or IaC.

  • Experience moving static frontends onto Cloudflare Pages.

  • Experience migrating MongoDB, OpenSearch, or Valkey/Redis to managed services.

  • Experience supporting Node.js, React, and React Native alongside PHP.

  • Experience helping organizations reduce infrastructure bus-factor risk.

  • Experience working with external DevOps/security partners or auditors.

Who You Are

You are someone who:

  • Takes ownership without waiting to be told every next step.

  • Is calm and practical during incidents.

  • Can inherit messy systems without being judgmental or reckless.

  • Prefers consolidating on platforms we already run over adding new vendors.

  • Documents as you go.

  • Uses AI as leverage, but does not blindly trust its output; you verify, test, and apply senior judgment before anything touches production.

  • Knows when to automate and when to stabilize first.

  • Communicates clearly with technical and non-technical stakeholders.

  • Understands that reliability is not just uptime: it is visibility, repeatability, recovery, and shared understanding.

  • Wants to leave infrastructure better than you found it.