Refine and influence system design and implementation
- Enable and support the growth and scaling of products and services by identifying inefficiencies in our current systems and planning for growth in both new and existing ones.
- Apply data-driven analysis to drive engineering decisions.
- Minimize manual work for our engineers by finding inefficient tasks and automating them to avoid extra work in the future.
Build and run tools to identify, predict and mitigate failures
- Design, build and implement tools to aid fault finding and debugging of incidents that occur while deploying and running applications and systems.
- Introduce and maintain tools that help measure the resilience of our applications and infrastructure to help them better tolerate failures.
- Monitor, analyse and predict service performance and capacity to proactively forecast problems. Apply engineering knowledge in developing or providing tools for anomaly detection and failure prediction.
- Collaborate with our other engineering teams and lead the triage of high priority production incidents while bringing about changes to improve reliability.
- Provide technical guidance for service upgrades, rollouts and enhancements.
- Use tools and intuition to help support teams identify and mitigate potential problems and vulnerabilities.
- Develop engineering solutions to failures and all other problems that adversely affect site reliability and uptime, including capacity, performance, stability and security issues.
- Understand and take responsibility for all operational workflows and standard operating procedures, down to a granular level.
- Help maintain an F5/HAProxy/Nginx/MySQL operations stack.
- Understanding of networking (Ethernet, TCP/IP stack, static routing, etc.).
- Knowledge of dynamic routing protocols (RIP, OSPF, BGP) and network high availability (CARP, VRRP, STP).
- Measure, optimize, and tune system performance, and ensure that systems run reliably and are highly available in a 24/7 production environment.
- Support deployments with the necessary configuration work and monitoring.
- Be mindful of security requirements when designing solutions.
- Participate in the 24/7 on-call rotation, responding to system problems and emergencies.
- Maintain high standards for consumer and customer service touch-points affected by operations.
- Maintain technical documentation of the services, processes and procedures used throughout normal operations.
- Analyse, suggest, and implement release process optimizations.
- Design and develop release tools, and automate installation and maintenance tasks.
- We use Prometheus, Graphite and Grafana to monitor our services.
- The ideal candidate has 3-5 years of experience managing application infrastructure.
- Mastery of Linux including configuration, networking, hardening, shells, package management and scripting.
- Experience with virtualization technologies such as VMware and OpenStack.
- Strong experience with at least one scripting language (Python, Bash, Go).
- Experience with configuration management tools (Puppet, Ansible).
Nice to have
- Experience with tools such as Terraform and Rundeck.
- Experience with CI/CD tools (Jenkins).
- Experience with Docker containers and Kubernetes.
- Competitive, results-based compensation (30K - 45K + bonus, depending on candidate seniority).
- Private Medical Insurance.
- Attractive benefits such as English/Spanish lessons, free top-ups with Tuenti, and a flexible spending account.
- Flexible work hours and the option to work remotely a few days a month.
- Great opportunity to learn: training budget for attending conferences, in-company learning communities, external events we host.
- Amazing office located in the city center stocked with free snacks, video games, foosball.