
Site Reliability Engineer

Engineering | Middlesex County, MA | Full Time

Job Description

About Kolide:

Kolide is an early-stage startup focusing on infrastructure, instrumentation, analysis, detection, and response. Based on open source software and founded by members of Facebook Security and FireEye, Kolide is uniquely positioned to take a progressive approach to a market ripe for disruption.

About the role

At Kolide, we’ve been using Kubernetes as our primary infrastructure platform since the inception of the company, almost two years ago. As the company reaches the stage where we have to dramatically scale the delivery of our cloud product, we’re investing heavily in scaling our usage of Kubernetes.

Some aspects of our Kubernetes infrastructure that we’re focusing on right now are:

  • Completely isolating each customer’s traffic within our production cluster

    • Establishing Network Security Policies to enforce this isolation

  • Limiting the scope of actions that each pod can take

    • Establishing Pod Security Policies to enforce least privilege

  • Improving cluster observability and tracing using open-source tools and standards

  • Developing Custom Resource Definitions (CRDs) and using the Operator pattern to abstract aspects of each tenant’s deployment on the cluster

  • Developing custom schedulers and autoscalers to scale specific aspects of a customer instance based on observed utilization

  • Investing in more secure secret management solutions

  • Exploring mutual service authentication and authorization using something like SPIFFE or Istio

This is a remote role, open to candidates anywhere in the US.

Every day you will:

  • Help design and implement a secure, multi-tenant Kubernetes deployment strategy

  • Improve observability in both production and development Kubernetes clusters

  • Study the failure modes of production infrastructure and participate in chaos testing to validate your assumptions

  • Support product developers as they participate in our continuous deployment system

  • Troubleshoot observed anomalies in conjunction with developers

  • Automate steps in CI/CD infrastructure

To succeed in this role, you will need to have experience with either Kubernetes or PostgreSQL, and be willing to learn the other.

Kubernetes

  • Experience running stateful and stateless applications on Kubernetes

  • Knowledge of auto-scaling concepts and mechanisms for complex Kubernetes applications, such as Custom Resource Definitions (CRDs) and custom schedulers

  • Prior experience creating observable infrastructure which facilitates automated response as well as developer-friendly debugging

  • Knowledge of how to use Kubernetes primitives and custom tooling to automate infrastructure toil

  • A desire to continue learning more about the Cloud Native ecosystem and apply that knowledge in a fast-paced environment

PostgreSQL

  • Extensive knowledge, interest, and experience with PostgreSQL deployment and administration

  • Experience with tools and techniques used for automated replication and failover

  • Experience tuning PostgreSQL configurations based on expected and observed workload

  • Ability to create tooling to perform automated backup creation and verification on an ongoing basis

Inherent motivation for absolute automation and a passion for troubleshooting issues will be extremely helpful in day-to-day work.

Your chances of success in this role dramatically improve with prior exposure to the following technologies:

  • Kubernetes and Docker

  • PostgreSQL

  • Go

  • Google Cloud Platform

  • Google Cloud Datastore

  • Automating builds