Site Reliability Engineering Manager
Hybrid/remote-first - Ontario, Canada
$170,000 - $250,000 + RRSP matching
Your opportunity
Our client builds a high-throughput system powering hundreds of billions of daily transactions, each completed within milliseconds, across globally distributed infrastructure designed for reliability and efficiency. They have been quietly bootstrapping and growing in line with revenue for over two decades. They operate at a planetary scale with an exchange platform that handles nearly 500 billion daily auctions and is trending towards 1 trillion daily auctions in the next ~5 years. To the present day, the company has not raised money from venture capitalists. It maintains its independence from external stakeholders, enabling it to chart its course, maintain a long-term perspective and build an enduring, sustainable business that currently employs over 500 team members globally.
As an Engineering Manager of Site Reliability Engineering (SRE), you’ll be tasked with building and leading a team of SREs who hold their own with the best in the world. You will champion a culture of innovation and accountability with collaboration at its core. You will draw upon your depth of hands-on experience to provide mentorship and thoughtfully identify opportunities for your team members’ professional development. As a technical leader, you will help your team meet and exceed the service level objectives and indicators central to the company’s mission.
You will lead and support cross-functional, architectural collaborations that elevate operational excellence and reliability on a global scale. With a focus on low-latency systems, you’ll lead the development of strategies for proactive monitoring, automation, and incident management, helping drive initiatives that keep our client’s platform highly resilient and continuously available. You’ll work closely with globally distributed peer leaders and engineering teams to integrate best practices into the development lifecycle, ensuring each layer of their system is robust, scalable, and optimized for "real-time" performance.
Our client has been stubbornly racking and stacking infrastructure around the world for the duration of their existence, a habit that allows them to price themselves at the cost of electricity while their competitors are mired in rising cloud infrastructure costs. While the major cloud infrastructure providers’ offerings are considered in isolated, strategic instances, deep expertise with on-prem and hybrid infrastructure systems is the focus of this search. Our client’s infrastructure spans continents, supporting a growing business where every millisecond counts. As part of this team, you’ll guide projects that impact millions of users and directly shape the future of content creation and journalism across the open internet.
If you’re looking for a role that offers an opportunity to work with world-class peers, lead the innovation and optimization of systems at a massive scale, and create a lasting impact in an environment that values technical excellence and resilience, this could be for you.
Tech stack
Ansible, Terraform, Docker, Kafka, Nexus
Prometheus, ELK, Jaeger, Grafana, Nagios, Zabbix
Hadoop, HDFS, Spark, HBase
Go, Python, Bash, or Perl for automation
Bare-metal, vSphere, KVM, Kubernetes
Key responsibilities
Team leadership: Hiring, mentoring, and managing a team of SREs to a globally competitive standard and at a global scale
Vision and strategic direction: Staying up to date on the latest technological and industry innovations, and using that knowledge to identify and drive strategic initiatives that impact the entire business
Technical leadership: Leading significant architectural projects involving cross-functional teams that enhance system performance and reliability on a global scale
Operational excellence: Building and enhancing proactive automation, monitoring, and incident management technologies and processes
Collaboration with software engineering: Integrating SRE best practices, along with system scalability and resiliency considerations, into the SDLC and engineering culture
Incident management: Leading incident response efforts that drive rapid resolutions and thoughtful post-incident analysis
Providing insights: Designing and implementing reporting mechanisms that provide deep insight into system health and reliability
Your know-how
6+ years in Site Reliability Engineering (SRE) within low-latency, global-scale environments, ideally with upstream Kubernetes in on-prem or hybrid cloud contexts
3+ years of experience in technical leadership and team-building roles
Expertise in incident response and root cause analysis
Expertise with configuration management and associated tools (Ansible, Puppet, Salt, etc.)
Expertise with observability components (Prometheus, OpenTelemetry, ELK, Mimir)
Comfort with the Cloud Native Computing Foundation (CNCF) suite of SRE tools (Rook, Jaeger, Cilium, ArgoCD, OPA)
Software engineering skills, ideally (but not necessarily) with Go, Python and/or Perl
Excellent command of English and expertise in cross-functional communications
Interested in learning more?
Please upload your resume or .pdf export using the following link or send your resume or LinkedIn profile URL to talent@lutrapartners.com with “Engineering Manager, SRE” as the subject, and one of our partners will be in contact shortly!