Senior DevOps & Infrastructure Lead

Software Engineering | Remote | Full Time | From $100,000 to $130,000 per year

Job Description

About the Role

RV LIFE is looking for a Senior DevOps & Infrastructure Lead to help us stabilize, document, and modernize the infrastructure behind our products.

This is a hands-on senior role for someone comfortable inheriting real production systems, reducing operational risk, improving reliability, and moving us toward a documented, secure, automated, infrastructure-as-code operating model.

We run production across DigitalOcean, AWS, Cloudflare, and other hosting providers, and are consolidating onto managed, infrastructure-as-code platforms. We need deep, hands-on expertise across these environments.

RV LIFE is an AI-first engineering organization. We expect this person to use AI to accelerate discovery, documentation, runbooks, log review, scripting, and infrastructure-as-code drafting, while applying strict human judgment around security, secrets, production access, destructive commands, rollback, and correctness.

This role focuses on the infrastructure path to reliability; application-level architecture changes are handled in partnership with our engineering team. It is not just about keeping servers alive. It is about building durable practices that reduce single-person dependency, improve visibility, and make our systems safer to operate.

This is not a standard 9-to-5 role. Production issues do not keep business hours, so it carries real on-call responsibility: you need to be reachable and able to respond when unforeseen incidents arise.

What You'll Do

Administer and improve existing DigitalOcean infrastructure.
Support and improve Linux-based production server environments.
Migrate self-managed databases onto managed database services, with validated failover, backups, and recovery.
Move applications onto managed runtimes (including Laravel Cloud where it fits), replacing manual deploy processes with automated, repeatable pipelines.
Expand and harden our use of Cloudflare for edge, static hosting, caching, and security.
Build a clear inventory of servers, services, databases, domains, access paths, backups, monitoring, and operational risks.
Create and maintain practical runbooks for common and emergency infrastructure workflows.
Improve incident response, escalation paths, monitoring, logging, and alerting.
Review and improve backup, restore, and disaster-recovery procedures.
Identify recurring manual work and convert it into safer procedures, scripts, automation, or infrastructure-as-code.
Help define infrastructure-as-code standards and move appropriate infrastructure into repeatable, version-controlled workflows.
Work with AWS services where needed (Lambda, VPC, IAM, CloudWatch, S3, SSM/Secrets Manager, queues).
Use AI tools to accelerate discovery, documentation, scripting, troubleshooting, and automation, with strong production-safety judgment.
Partner with engineering leadership to prioritize infrastructure risk and modernization; track work clearly in Jira/GitHub and communicate proactively about risks, tradeoffs, and blockers.

What Success Looks Like

In the first 30-60 days, you'll take ownership of how we see and operate our infrastructure, building on what we already track and closing the gaps.

You'll validate and take ownership of what already exists:

Our infrastructure inventory and server map
Our monitoring and alerting
Our DNS / Cloudflare configuration
Our prioritized infrastructure risk register

You'll create what we're missing:

An access and credential map
Verified backup and restore status for critical systems (tested, not assumed)
Runbooks for the highest-risk operational workflows

In the first 90 days, you'll move us toward a durable, consolidated model. Success means:

The first core database migrated to a managed service, with a tested restore, plus a clear, sequenced plan for the rest.
The first application running on a managed runtime (DigitalOcean App Platform or Laravel Cloud).
The first static frontend served from Cloudflare Pages.
A measurably stronger edge security posture.
Critical systems no longer understood by only one person; common tasks have documented procedures; manual processes are being converted to automation; AI is used safely to reduce toil.

What We're Looking For

Senior-level experience operating production infrastructure.
Deep, hands-on Linux server administration (the traditional, "old-school" kind): operating, securing, and troubleshooting manually managed production servers (LAMP/LEMP, system services, cron, networking, SSH) directly at the command line, not only through a cloud console.
Experience with DigitalOcean, Linode, AWS EC2, bare VPS hosting, or comparable environments.
Senior database operations: migrating self-managed MySQL to a managed service, replication, backup validation, restore testing, and I/O isolation.
Strong Cloudflare experience across DNS, WAF, CDN and caching behavior, page rules, Workers, Pages, and Zero Trust/Access, including traffic routing and origin protection.
Experience with PHP/Laravel application environments and with a managed Laravel runtime (Laravel Cloud and/or DigitalOcean App Platform).
Datadog or a comparable observability platform for monitoring, alerting, dashboards, logs, and incident investigation.
Infrastructure-as-code such as Terraform, Pulumi, AWS CDK, Serverless Framework, or CloudFormation.
CI/CD pipelines and deployment automation.
Practical AWS experience (Lambda, IAM, VPC, CloudWatch, S3, SSM/Secrets Manager, queues).
Good judgment around production safety, access control, secrets, backups, and incident response.
Willingness to carry real on-call responsibility and respond to production incidents outside normal business hours; this is not a strict 9-to-5 role.
A habit of documenting what you learn and creating runbooks others can follow.
Practical experience using AI tools (ChatGPT, Claude, Cursor, GitHub Copilot, or similar), with strong judgment about where human verification is required.
Ability to work independently in a small, remote engineering organization where practical ownership matters more than bureaucracy.

Nice to Have

Experience migrating manually managed services onto managed platforms or IaC.
Experience moving static frontends onto Cloudflare Pages.
Experience migrating MongoDB, OpenSearch, or Valkey/Redis to managed services.
Experience supporting Node.js, React, and React Native alongside PHP.
Experience helping organizations reduce infrastructure bus-factor risk.
Experience working with external DevOps/security partners or auditors.

Who You Are

You are someone who:

Takes ownership without waiting to be told every next step.
Is calm and practical during incidents.
Can inherit messy systems without being judgmental or reckless.
Prefers consolidating on platforms we already run over adding new vendors.
Documents as you go.
Uses AI as leverage, but does not blindly trust its output; you verify, test, and apply senior judgment before anything touches production.
Knows when to automate and when to stabilize first.
Communicates clearly with technical and non-technical stakeholders.
Understands that reliability is not just uptime: it is visibility, repeatability, recovery, and shared understanding.
Wants to leave infrastructure better than you found it.
