Hi, we’re Nexthink. We’re not just the leader in the digital employee experience category, we invented the category. Our solutions combine real-time analytics, automation and employee feedback across all endpoints to help IT teams delight people at work. Our cloud-native platform pinpoints issues and solutions, automates response, and helps companies continuously improve their employees’ experience, making them more productive, efficient, and happy at work. We have millions of endpoints deployed, we’ve surpassed $100M in ARR, and we’ve recently secured $180M in Series D financing for a company valuation of $1.1B, but we’re just getting started.
Nexthink is looking for passionate and innovative professionals who are keen to join a fast-growing Cloud Operations team. The team is being built to ensure our Cloud platform is operated using best-in-class methodologies and tools, allowing us to delight our clients with the best cloud experience.
The team is responsible for maintaining our Cloud solutions with top performance, availability, and service levels, while also ensuring that they run cost-efficiently. The Site Reliability Engineer will also use their Software Engineering skills to prototype and deliver tools and products that help reach those goals, and will participate in the operational requirements process.
Finally, you will be part of a fast-growing, international company with an opportunity to join the Cloud team, a strategic initiative that will help accelerate this growth.
- Monitoring: Use and own the specifications of our tooling set related to monitoring, telemetry, reliability, and automation for end-to-end service.
- Incident management and response: Detect, diagnose, and fix incidents, finding solutions (rollback, backup restoration, etc.) that achieve the required service levels. Own the post-mortem process for such incidents, writing technical content for both customers and internal stakeholders.
- Operations: Define or build automation mechanisms for cloud operations: build, deploy, update, patch, backup, restore, scale, extend, protect, etc. Use past experience to solve the most relevant issues proactively, either by writing product or platform specifications or by building the automation required to prevent the issues from resurfacing.
- Change Control: Own the product update process for live client instances.
- Reliability: Manage the availability of the production instances of our cloud services. Understand and be able to communicate the scale, capacity, security, redundancy, and performance attributes and requirements of the cloud services.
- Subject matter expert: Be the ultimate escalation point for major platform-related incidents.
- Engage in and improve the whole lifecycle of services from inception and design, through deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Minimum 5 years of experience in software development, with knowledge of best practices for professional software development, deployment, and lifecycle management in general.
- Experience with monitoring solutions such as Azure Analytics, Grafana, and others.
- Experience administering and deploying on cloud-based platforms (Azure, AWS, Google, and/or others), using infrastructure as code (CloudFormation, Terraform, etc.), configuration management tools (Ansible, Puppet), and pipeline creation tools (like Jenkins).
- Experience programming platform tools (for automation, monitoring, provisioning, etc.) in languages such as Java, Golang, Rust, C++, Python, Ruby, or Scala.
- Solid understanding of the network stack (TCP/IP, VPN, HTTP, SSL, routing, etc.), cloud topologies (VPC, Virtual Subnets, NACLs, NSG, ILB, ELB, etc.), and storage (S3, EBS, Azure Files, etc.).
- At ease with operating and managing production systems, solving issues while striking the right balance between urgency and methodology.
- Strong problem solving and analytical skills.
- Experience in coordinating teams and individuals to maintain an SLA.
- Excellent written and verbal skills in English.
We are 800+ employees strong in 21 countries across 8 different time zones, speaking 60+ languages. We are positive, we get things done, we keep growing, and we are one team: we are Nexthink. We believe actions are stronger than words when it comes to diversity, inclusion, and equity in the workplace. Nexthinkers are multinational and multilingual, and come from all walks of life. We are committed to hiring a genuinely representative workforce that can create solutions and foster innovation for the modern digital employee experience.