SRE Leadership - Planetary scale, hybrid compute + bare metal, Kubernetes/K8s

Site Reliability Engineering Leadership

Hybrid/remote-first - Ontario + Montreal, Canada

$170,000 - $300,000 + RRSP matching (5%)

The search

Our client is seeking multiple technical leaders to contribute at both the Staff Engineer and Engineering Manager levels. These are high-visibility, high-impact opportunities with a talent-dense, homegrown, bootstrapped, globally competitive Canadian tech success story entering a period of rapid, transformational growth.

The client

Our client builds a high-throughput system powering hundreds of billions of daily transactions, each completed within milliseconds, across globally distributed infrastructure designed for reliability and efficiency. They have been quietly bootstrapping and growing in line with revenue for over two decades. They operate at planetary scale, with an exchange platform that handles nearly 500 billion daily auctions and is trending towards 1 trillion daily auctions in the next ~5 years. To this day, the company has not raised money from venture capitalists. It maintains its independence from external stakeholders, enabling it to chart its own course, maintain a long-term perspective, and build an enduring, sustainable business that currently employs ~600 team members globally.

As a technical leader on the TechOps team, you will collaborate cross-functionally to architect solutions that elevate operational excellence and reliability on a global scale. With a focus on low-latency systems, you’ll develop strategies for proactive monitoring, automation, and incident management, helping drive initiatives that keep our client’s platform highly resilient and continuously available. You’ll work closely with globally distributed engineering teams to integrate best practices into the development lifecycle, ensuring each layer of their system is robust, scalable, and optimized for "real-time" performance.

Our client has been stubbornly racking and stacking infrastructure around the world for the duration of their existence, a habit that allows them to price themselves at the cost of electricity while their competitors are mired in rising cloud infrastructure costs. While the major cloud infrastructure providers' offerings are considered in isolated, strategic instances, deep expertise with on-prem and hybrid infrastructure systems is the focus of these searches. Our client’s infrastructure spans continents, supporting a growing business where every millisecond counts. As part of this team, you’ll guide projects that impact millions of users and directly shape the future of content creation and journalism across the open internet.

If you’re looking for a role that offers an opportunity to innovate and optimize systems at a massive scale, creating a lasting impact in an environment that values technical excellence and resilience, this could be for you.

Role overviews

In our client’s context, both Staff Engineers and Engineering Managers remain hands-on. This is an intentional and informed cultural practice tailored to the challenge ahead of them.

As a Staff SRE, you will be a technical leader, driving strategic projects and cross-functional architectural decisions. You will:

Focus on building and refining proactive automation, monitoring, and incident management solutions that scale globally
Integrate SRE best practices into the software development lifecycle, championing reliability as a first-class concern
Develop strategies for real-time performance and resiliency, collaborating with distributed engineering teams to ensure each system layer is robust, scalable, and optimized for low-latency operation
Lead incident management efforts and conduct thorough post-incident reviews to continuously improve system reliability
Provide deep insights into system health, using a broad knowledge of observability tools and frameworks

As an Engineering Manager of SRE, you will build and lead a globally competitive SRE team. You will:

Hire, mentor, and manage a team of SREs, fostering a culture of innovation, accountability, and collaboration
Provide both technical and organizational leadership, setting vision and strategy for operational excellence, reliability, and performance at scale
Oversee cross-functional architectural initiatives, ensuring that SRE best practices are integrated across the broader engineering organization
Drive proactive monitoring, automation, and incident management strategies, enabling continuous availability of mission-critical systems
Champion professional development and continuous learning within your team, identifying growth opportunities and aligning individual career paths with organizational needs

Tech stack

Ansible, Terraform, Docker, Kafka, Nexus
Prometheus, ELK, Jaeger, Grafana, Nagios, Zabbix
Hadoop, HDFS, Spark, HBase
Bare-metal, vSphere, KVM, Kubernetes
Go, Python, Bash, or Perl for automation

Key responsibilities

Vision and strategic direction: Staying up to date on the latest technological and industry innovations so that you can identify and drive strategic initiatives that impact the entire business
Technical leadership: Leading significant architectural projects involving cross-functional teams that enhance system performance and reliability on a global scale
Operational excellence: Building and enhancing proactive automation, monitoring, and incident management technologies and processes
Collaboration with software engineering: Integrating SRE best practices, along with system scalability and resiliency considerations, into the SDLC and engineering culture
Incident management: Leading incident response efforts that drive rapid resolutions and thoughtful post-incident analysis
Providing insights: Designing and implementing reporting mechanisms that provide deep insight into system health and reliability

Your know-how

6+ years in Site Reliability Engineering (SRE) within low-latency, global-scale environments, ideally with upstream Kubernetes in on-prem or hybrid compute contexts
3+ years of experience in technical leadership and team-building roles
Expertise in incident response and root cause analysis
Expertise with configuration management and associated tools (Ansible, Puppet, Salt, etc.)
Expertise with observability components (Prometheus, OpenTelemetry, ELK, Mimir)
Comfort with the Cloud Native Computing Foundation (CNCF) suite of SRE tools (Rook, Jaeger, Cilium, ArgoCD, OPA)
Software engineering skills, ideally (but not necessarily) with Go, Python and/or Perl
Excellent command of English and expertise in cross-functional communications

Additional know-how for the Engineering Manager roles

3+ years of experience in technical leadership and team-building, showing you can mentor engineers, guide career development, and create an environment that nurtures innovation and accountability

Interested in learning more?

If you’re excited about shaping and leading SRE at planetary scale, either as a technical force multiplier (Staff SRE) or as a team-building strategist (Engineering Manager of SRE), we’d love to hear from you. Please upload your resume or a PDF export of your LinkedIn profile using the following link, or send your resume or LinkedIn profile URL to talent@lutrapartners.com with “Site Reliability” as the subject line, and one of our partners will be in contact shortly!

Apply Now