At Snowplow, we are on a mission to
empower people to use data to differentiate. We provide technology that
not only enables customers to control their data, but also allows them
to do amazing things with that control.
As part of that effort, we're
changing the way that people do digital analytics by moving companies
away from relying on one-size-fits-all vendors, such as Google Analytics
and Adobe, to dictate what should be done with their data, and by
enabling them to collect and own their data themselves.
The opportunity:
Our Managed Service
offering has grown significantly over the last year, and we now
orchestrate and monitor the Snowplow event pipeline across more than 100
customer-owned AWS accounts, with individual accounts processing many
billions of events per month.
We are looking for our second Site
Reliability Engineer to help us grow to managing 1,000 and then 10,000
AWS, GCP and Azure accounts. You’ll work closely with our Tech Ops Lead
on all aspects of our proprietary deployment, orchestration and
monitoring stack.
The team and mission:
Technical Operations at Snowplow is responsible for two distinct domains:
- Snowplow’s
internal infrastructure, which powers Snowplow Insights, CI/CD, the
Snowplow website, and our support tooling, all running on AWS
- Our customers’ Snowplow-related infrastructure, running in their own AWS accounts
Within both domains, Tech Ops at
Snowplow is striving to increase service reliability, fulfil customer
requests in a timely fashion, and automate recurring tasks. Task
automation is essential as our customer base grows, because our
“infrastructure estate” scales linearly with our customer numbers,
unlike that of most software businesses.
Our roadmap includes:
- Deploying, orchestrating and monitoring Snowplow on GCP, Azure and on-premises, not just AWS
- “One click” infrastructure deployment and maintenance
- Building
self-healing and self-upgrading infrastructure, which learns how to
optimize itself for cost, performance and reliability
This is an enormously ambitious undertaking but also, we hope, a hugely exciting infrastructure automation challenge!
Technologies:
Today, our in-house stack uses
pragmatic technologies including Docker, Ansible, Consul,
CloudFormation, Bash and Golang to manage our internal and customer
infrastructure.
For our next level of automation, we are now exploring tools such as Terraform, Kubernetes and Vault.
Responsibilities:
- Developing
software to automate, monitor and maintain
client-deployed and Snowplow-internal infrastructure and services
- Providing deep technical support to internal and client teams
- Performing planned upgrades and modifications to customer infrastructure
- Handling high-severity internal or customer incidents, ensuring we meet all SLAs
On the software engineering side,
you will be responsible for the implementation, deployment and stability
of your systems and services. You will own your software end to end,
including anything that is deployed.
On the operational side, you will
join our on-call process for incident resolution and take your share of
the regular client infrastructure work, with a strong
mandate to keep automating it.
What we are looking for:
This role will be a great fit for somebody who:
- Has deep
knowledge of Linux, networking, containers and similar, and is able to
troubleshoot complex problems on individual servers and distributed
systems
- Has worked with at least one of: Amazon Web Services, GCP or Azure
- Has been part of an on-call rotation
- Has interacted directly with customers to solve their specific technical issues
- Is comfortable scripting in one or more of: Bash, Python, Ruby or Perl
- Is comfortable programming in one or more of: Java, Scala, Golang or Python
This role would be a great fit for a software engineer or systems administrator who wants to transition into a full SRE role.
Security:
The integrity of our customers'
systems and data underpins everything we do at Snowplow. As part of their
probation, candidates will be put through a full background security
check.
Out-of-hours work:
An important part of this role relates to out-of-hours work, particularly around:
- Performing planned upgrades and modifications to customer infrastructure outside of their working hours
- Being on-call to handle high-severity internal or customer incidents, ensuring we meet all SLAs
The on-call process for the Tech Ops team is still evolving; we will discuss these requirements with short-listed candidates.
What you’ll get in return:
- Competitive package based on experience
- 25 days holiday a year plus bank holidays
- The freedom to work wherever suits you best
- Two fantastic company away-weeks a year
- Working alongside a strong and talented team
Office-specific:
- Convenient central Shoreditch location
- Continuous supply of Pact coffee
- Regular mystery events
- MacBook