Site Reliability Engineering Manager
ONUM
Company
Onum is a data optimization and analytics company based in Madrid. We specialize in real-time data analysis to enable rapid decision-making regarding cybersecurity, network performance, and infrastructure management. Onum helps you optimize your data analytics costs by reducing data, avoiding vendor lock-in, and aligning the value of each dataset with actions taken.
About the Role
As an SRE Manager, you will be responsible for leading and managing a team of Site Reliability Engineers while staying actively involved in day-to-day technical operations. This is a hands-on leadership role where you will help your team solve complex problems, drive operational excellence, and ensure that our platform remains highly reliable, scalable, and efficient. You will work closely with software engineers and DevOps teams to identify opportunities to improve infrastructure reliability and automation.
Responsibilities
Team Leadership & Development:
- Manage and mentor a small team of SREs, helping them to grow their skills through coaching, feedback, and development plans.
- Foster a collaborative team environment where knowledge sharing, continuous learning, and innovation are encouraged.
- Assist in recruiting and onboarding new SRE team members, ensuring they are set up for success.
- Conduct regular one-on-ones with team members, set clear performance goals, and provide ongoing support.
Hands-on Technical Guidance:
- Lead by example by participating in technical discussions, incident resolution, and troubleshooting critical system issues.
- Provide guidance on best practices for system reliability, automation, and performance optimization.
- Support the team in designing and implementing reliable, scalable cloud infrastructure, ensuring smooth deployment pipelines and reducing manual toil.
Incident & Operations Management:
- Help the team manage the on-call rotation, and be available to support incident response when necessary.
- Ensure timely resolution of incidents, participate in post-mortems, and track follow-up actions to prevent recurrence.
- Establish effective processes for monitoring, alerting, and improving system health, working with your team to ensure high availability.
Collaboration & Cross-functional Partnership:
- Collaborate closely with software engineering, DevOps, and product teams to define reliability standards and improve the overall stability of our platform.
- Communicate technical issues, resolutions, and improvements clearly to non-technical stakeholders.
- Work with teams to set Service Level Objectives (SLOs) and use data to guide performance improvements.
Automation & Process Improvement:
- Identify opportunities for automation in daily operations, helping to improve deployment speed, incident response, and reliability of the platform.
- Ensure the team is leveraging infrastructure-as-code (e.g., Terraform) and other automation tools to reduce manual processes and increase scalability.
Operational Metrics & Monitoring:
- Work with your team to ensure systems are well-monitored and metrics are effectively captured using tools like Prometheus, Grafana, or Datadog.
- Track key performance indicators (KPIs) for system uptime, reliability, and team performance, identifying areas for continuous improvement.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role, including at least 1 year of experience leading a small team or mentoring junior engineers.
- Strong understanding of cloud platforms (AWS, GCP, or Azure) and modern infrastructure practices (e.g., containerization with Docker/Kubernetes, CI/CD pipelines).
- Hands-on experience with infrastructure-as-code tools (Terraform, Ansible, etc.) and cloud automation.
- Proven ability to troubleshoot complex infrastructure issues, perform root cause analysis, and implement system improvements.
- Experience with monitoring and alerting systems like Prometheus, Grafana, Datadog, or equivalent.
- Excellent communication and collaboration skills, with the ability to work cross-functionally and explain technical concepts to non-technical stakeholders.