Refine and influence system design and implementation
-
Enable and support the growth and scaling of products and services. Identify inefficiencies in our current systems and plan for growth in both new and existing ones.
-
Apply data-driven analysis to drive engineering decisions.
-
Minimize manual work for our engineers by finding inefficiencies and automating them away, avoiding extra work in the future.
Build and run tools to identify, predict and mitigate failures
-
Design, build and implement tools to aid the fault finding and debugging of incidents that occur in the deployment and running of applications and systems.
-
Introduce and maintain tools that help measure the resilience of our applications and infrastructure to help them better tolerate failures.
-
Monitor, analyse and predict service performance and capacity to proactively forecast problems. Apply engineering knowledge in developing or providing tools for anomaly detection and failure prediction.
Operational Support
-
Collaborate with our other engineering teams and lead the triage of high-priority production incidents while bringing about changes to improve reliability.
-
Provide technical guidance for service upgrades, rollouts and enhancements.
-
Use tools and intuition to help support teams identify and mitigate potential problems and vulnerabilities.
-
Develop engineering solutions to failures and all other problems that adversely affect site reliability and uptime, including capacity, performance, stability and security issues.
Responsibilities
-
Understand and take responsibility for all operational workflows and standard operating procedures, down to a granular, detailed level.
-
Help maintain an F5/HAProxy/Nginx/MySQL operations stack.
-
Understand networking (Ethernet, TCP/IP stack, static routing, etc.).
-
Understand dynamic routing protocols (RIP, OSPF, BGP) and network high availability (CARP, VRRP, STP).
-
Measure, optimize and tune system performance, and ensure that systems run reliably and are highly available in a 24/7 production environment.
-
Support deploys with the necessary configuration work and monitoring.
-
Be mindful of security requirements when designing solutions.
-
Participate in the 24/7 on-call rotation, responding to system problems and emergencies.
-
Maintain high standards for consumer and customer service touch-points affected by operations.
-
Maintain technical documentation of the services, processes and procedures used throughout normal operations.
-
Analyse, suggest and implement release process optimizations.
-
Design and develop release tools, and automate installation and maintenance tasks.
Must have
-
We use Prometheus, Graphite and Grafana to monitor our services.
-
The ideal candidate has 3-5 years of experience managing application infrastructure.
-
Mastery of Linux including configuration, networking, hardening, shells, package management and scripting.
-
Experience with virtualization technologies such as VMware and OpenStack.
-
Strong experience with at least one scripting language (Python, Bash, Go).
-
Experience with configuration management tools (Puppet, Ansible).
Nice to have
-
Experience with tools such as Terraform and Rundeck.
-
Experience with CI/CD tools (Jenkins).
-
Experience with Docker containers and Kubernetes
We offer
-
Competitive, results-based compensation (30K - 45K + bonus, depending on candidate seniority).
-
Private Medical Insurance.
-
Attractive benefits such as English/Spanish lessons, free top-ups with Tuenti, flexible spending account...
-
Flexible working hours and the option to work remotely a few days a month.
-
Great opportunity to learn: training budget for attending conferences, in-company learning communities, external events we host.
-
Amazing office located in the city center, stocked with free snacks, video games and foosball.