Site Reliability Engineer in Madrid

Category

DevOps

Industry

Social Network Industry

Workplace

Onsite

Hours

Full-Time

Internship

Skills

Ansible Bash Graphite Linux Prometheus Python

Job Description

Refine and influence system design and implementation

Enable and support the growth and scaling of products and services. Identify inefficiencies in our current systems and plan for growth in both new and existing ones.
Apply data-driven analysis to drive engineering decisions.
Reduce manual work for our engineers by finding inefficient tasks and automating them, avoiding extra work in the future.

Build and run tools to identify, predict and mitigate failures

Design, build and implement tools to aid fault finding and debugging of incidents that occur while deploying and running applications and systems.
Introduce and maintain tools that help measure the resilience of our applications and infrastructure to help them better tolerate failures.
Monitor, analyse and predict service performance and capacity to forecast problems proactively. Apply engineering knowledge to develop or provide tools for anomaly detection and failure prediction.

Operational Support

Collaborate with our other engineering teams and lead the triage of high priority production incidents while bringing about changes to improve reliability.
Provide technical guidance for service upgrades, rollouts and enhancements.
Use tools and intuition to help support teams identify and mitigate potential problems and vulnerabilities.
Develop engineering solutions to failures and all other problems that adversely affect site reliability and uptime, including capacity, performance, stability and security issues.

Responsibilities

Understand and take responsibility for all operational workflows and standard operating procedures, down to a detailed level.
Help maintain an F5/HAProxy/Nginx/MySQL operations stack.
Understanding of networking (Ethernet, TCP/IP stack, static routing, etc.).
Knowledge of dynamic routing protocols (RIP, OSPF, BGP) and network high availability (CARP, VRRP, STP).
Measure, optimize, and tune system performance, ensuring that systems run reliably and are highly available in a 24/7 production environment.
Support deploys with the necessary configuration work and monitoring.
Be mindful of security requirements when designing solutions.
Participate in the 24/7 on-call rotation, responding to system problems and emergencies.
Maintain high standards for consumer and customer service touch-points affected by operations.
Maintain technical documentation of the services, processes and procedures used in normal operations.
Analyse, suggest, and implement release process optimizations.
Design and develop release tools, and automate installation and maintenance tasks.

Must have

We use Prometheus, Graphite and Grafana to monitor our services.
The ideal candidate has 3-5 years of experience managing application infrastructure.
Mastery of Linux including configuration, networking, hardening, shells, package management and scripting.
Experience with virtualization technologies such as VMware and OpenStack.
Strong experience with at least one scripting language (Python, Bash, Go).
Experience with configuration management tools (Puppet, Ansible).

Nice to have

We offer

Competitive, results-based compensation (30K - 45K + bonus, depending on candidate seniority).
Private Medical Insurance.
Attractive benefits such as English/Spanish lessons, free top-ups with Tuenti, a flexible spending account, and more.
Flexible working hours and the option to work remotely a few days a month.
Great opportunity to learn: training budget for attending conferences, in-company learning communities, and external events we host.
Amazing office located in the city center, stocked with free snacks, video games and foosball.