Site Reliability Engineer in Berlin

Chattermill

Category

DevOps

Industry

AI Industry

Salary

€70,000 - €85,000

Workplace

Onsite

Hours

Full-Time

Internship

Skills

RabbitMQ, PostgreSQL, Linux, Elasticsearch, AWS

Job Description

At Chattermill, we use cutting-edge AI technology to give leading companies the key to improving their customer experience. We work with many of the most exciting companies in the world (Uber, HelloFresh, Transferwise, Skyscanner, and Deliveroo, to name a few!) and are passionate about helping them put their customers at the heart of their decision-making.

In four short years, we’ve grown from two co-founders to a team of 40 (and counting) bright and diverse individuals. We have big plans and are now looking for a Site Reliability Engineer to join our Engineering team in Berlin to ensure the stability and scalability of our platform.

As a Site Reliability Engineer, you will:

• Conduct an in-depth audit of our current infrastructure
• Help us define and execute a plan for migrating our k8s cluster from bare metal to managed k8s in the cloud (our preference is GKE)
• Take an active part in all stages of our engineering process, from design and implementation to support and maintenance
• Help colleagues from different teams (software engineers, data scientists) set up the right infrastructure for their workloads
• Provide expertise and guidance to build self-healing systems with high availability and horizontal scalability
• Ensure the health of all environments by monitoring technical and business metrics, setting up alerts for things going wrong, acting proactively to prevent disasters, and acting fast and effectively when they happen
• Be the driver of our incident management process, apply root cause analysis when investigating incidents with other engineers, and help define preventive measures that rule out the whole class of identified issues
• Improve our CI/CD pipelines based on CircleCI, which includes, but is not limited to, improving the speed of our builds and tests, firing up prod-like test environments to run our e2e tests, and canary releases with automatic rollback based on metrics
• Play a proactive role in identifying performance bottlenecks and other architectural issues and provide guidance on how to mitigate them in a planned and timely manner
• Take an active role in the redesign of our data pipeline to build a new version that facilitates agile experimentation for our Data Science team and enables even more complex data integration with our clients

What we’re looking for:

• Extensive experience managing production k8s clusters with data-intensive workloads in the cloud (GKE, AWS)
• Experience in complex infrastructure migrations of mission critical systems with zero downtime
• Operational experience with databases in our stack (Postgres, Redis, Elasticsearch)
• Proficiency in more than one programming language (including Go) and the ability to identify and automate routine repetitive tasks
• A strong architectural background in distributed systems
• Experience setting up central logging on the ELK stack
• Experience managing highly available Prometheus (with Thanos), setting up alerts with Prometheus’s Alertmanager, and creating dashboards in Grafana
• The ability to define SLAs and provide a viable plan for sticking to them
• Experience setting up RabbitMQ and/or Kafka
• An understanding of infrastructure-as-code principles and experience applying them successfully with tools like Terraform
• The ability to explain the OSI model and to diagnose and debug network issues in a cloud environment
• Deep knowledge of Linux and the ability to explain how it works under the hood
• Good communication skills and an interest in building effective relationships with colleagues
• An interest in providing cutting-edge infrastructure: the ability to assess new technologies, evaluate the maintenance costs of different alternatives, and prove their viability, plus a willingness to facilitate adoption of new solutions within the team

Nice to have:

• Experience working as a Backend Engineer
• Experience setting up data infrastructure for AI companies

Why join us?

• A competitive salary and stock options
• Great progression opportunities - we want you to grow with us!
• 25 days holiday (in addition to public holidays)
• A big focus on personal development including a €550 development budget and biweekly Breakfast and Learns
• Flexible working conditions and the opportunity to work from home
• Central office location with free barista coffee
• Quarterly Company Socials planned by our great colleagues!

About Chattermill

Industry
AI
