Remote SRE Engineer Jobs

SRE is one of the cleanest remote roles in tech because the work itself is async-first and distributed by nature. Outages don't care about timezones. Observability is how you communicate across time. Most mature SRE teams have evolved their practices around distributed collaboration because they had to.

Three jobs are hiding in the same keyword

SRE roles cluster around what the team actually maintains and how much they code.

Platform SRE engineer. Building and maintaining the infrastructure that other engineers depend on—Kubernetes clusters, service meshes, observability stacks, CI/CD pipelines, internal tooling. Day to day: Infrastructure-as-code, automation, designing for resilience, working closely with other engineering teams. Deep systems thinking, high leverage, narrower focus on the platform itself.

Application SRE engineer. Owning the reliability of specific applications or services—performance optimization, scaling, incident response, working closely with the application teams that own the code. Day to day: Debugging production issues, tuning systems, working with application engineers to solve problems collaboratively. Broader surface area, more varied work, higher context-switching.

Incident response and on-call SRE. Specializing in incident management, post-mortems, and reliability culture—designing alerting and escalation, running war rooms, building blameless post-mortem practices, automation for detection and recovery. Day to day: On-call rotations, incident analysis, automation, documentation. Requires deep people skills alongside technical depth.

Four employer types cover most of the market

Large cloud platforms and infrastructure companies. Amazon, Google Cloud, Azure, Datadog, New Relic—companies whose product includes reliability and observability. Work is deep, technically rigorous, at scale. Pay is competitive, interviews are challenging, culture around reliability is mature.

Mid to late-stage SaaS with mature platforms. Slack, Stripe, Shopify, PagerDuty—companies that have built real platform reliability practices. Steady scope, high bar for reliability, good pay. Hiring is selective and usually involves thoughtful technical discussions.

Startups and fast-growing companies building their first SRE practice. Earlier-stage companies that are outgrowing ad-hoc DevOps and hiring SREs to build reliable systems. Work is pioneering and sometimes chaotic. Pay varies but can be high if the company is well funded. Growth opportunity is real but risk is higher.

Managed services and hosting companies. Companies running applications for customers—Heroku, Netlify, cloud providers. Work is steady, reliability is non-negotiable, culture is mature. Less glamorous than Big Tech but often more stable.

What the stack actually looks like

Kubernetes is standard for modern shops, either managed (EKS, GKE, AKS) or self-hosted. Observability is table stakes: Prometheus for metrics, some flavor of logs (ELK, Datadog, Splunk), tracing (Jaeger, Lightstep). Infrastructure-as-code: Terraform is most common, Pulumi and CDK are growing. Foundations: Linux, at least one cloud platform (AWS, GCP, Azure), and the ability to debug systems at a low level. Automation and tooling: Python or Go, both are standard. CI/CD: GitLab, GitHub Actions, or custom systems depending on company age.

The real requirement is comfort with Linux, networking, system design at scale, and reading other people's code when things break. Cloud certification doesn't move the needle much; shipped systems do.

Six things worth checking before you apply

Whether they have a real incident response process or it's chaotic. Look for mentions of on-call rotation, escalation procedures, blameless post-mortems, or runbooks. If the listing doesn't mention any of this, ask how incidents are actually handled. Chaos is a red flag.
How they think about toil versus high-leverage work. SREs often get buried in toil—repetitive manual work that should be automated. Good listings describe how they prioritize automation and push back on toil. Teams that don't think about this burn out SREs.
Whether their SRE team is separate from platform/infrastructure or embedded in application teams. Both models work. Embedded SREs are closer to the code; platform SREs have broader leverage. The listing should be clear about which model they use.
How they measure reliability and what the actual SLOs are. Vague language about "99.9% uptime" without detail suggests they haven't really defined it. Good teams have explicit SLOs, error budgets, and they use them to drive decisions.
Who the SRE tech lead or manager is and what their background is. SREs hired from pure infrastructure backgrounds differ from those who came up through application engineering. Neither is wrong, but the leadership style differs. Look for evidence of their thinking.
Whether they use on-call, and if so, what the compensation and load look like. On-call can be fine or brutal depending on how it's structured. Good teams have explicit on-call compensation, clear escalation, and a culture that respects off-hours time. Ask directly about this.
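
To make the SLO and error-budget item above concrete: an availability target translates directly into a number of minutes the service is allowed to be down. The sketch below is only illustrative. It assumes a 30-day rolling window and a purely time-based availability SLO (many teams measure request success rates instead), and the function names are hypothetical, not taken from any particular tool.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    Assumes a time-based SLO: budget = window length * (1 - target).
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left after the downtime already spent (may be negative)."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    for target in (99.9, 99.95):
        budget = error_budget_minutes(target)
        print(f"{target}% over 30 days allows {budget:.1f} minutes of downtime")
    # A single 30-minute outage against a 99.95% target overspends the budget.
    print(f"remaining after a 30-minute outage: {budget_remaining(99.95, 30):.1f}")
```

Under these assumptions, 99.9% works out to roughly 43 minutes a month and 99.95% to roughly 22. A team that can tell you how much of that budget it spent last quarter has probably defined its SLOs for real.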

The bottleneck is different at every level

Junior SRE roles are rare because the work requires both systems depth and maturity about how to communicate problems across teams. Junior candidates need evidence of infrastructure work—public Kubernetes projects, detailed case studies of incident management and resolution, or contributions to infrastructure-as-code projects. Generic DevOps experience doesn't move the needle; systems thinking does.

Mid-level is where SRE remote work opens up. You understand how systems break, you know how to read a dashboard and think about what it means, you can own a project from design to shipping. Remote hiring at mid-level is straightforward because SRE culture is already distributed—you're used to async communication about outages anyway.

Senior SRE roles often go to people who've led incident response at scale, who've designed systems from first principles, or who've built strong reliability cultures within teams. At this level, remote is an advantage because the thinking work is the hard part, and async communication is actually a strength.

What the hiring process usually looks like

SRE interviews vary by company stage: (1) application — resume with infrastructure experience, sometimes a portfolio or GitHub; (2) phone screen — 30 minutes, context on the platform and SRE philosophy; (3) technical — usually a take-home designing a system or an incident response simulation; (4) final round — deep systems design conversation, on-call expectations, team fit; (5) offer.

Some shops ask you to debug a production issue or read through an outage post-mortem. Others give architecture design tasks. The variation depends on what they actually need—pick a team whose interview questions feel relevant to the job.

Red flags and green flags

Red flags — step carefully or pass:

No mention of SLOs, observability, or how they measure reliability.
"We need you to do DevOps" with no clarity about infrastructure versus application focus.
On-call rotation mentioned with no compensation or no description of load.
Listings that look identical to traditional DevOps jobs from five years ago—suggests the team hasn't evolved.
No mention of how they handle incidents or who leads them.

Green flags — strong signal of a healthy team:

Clear description of SLOs and error budgets—"we target 99.95% uptime, we have an error budget."
Named SRE lead or tech lead with relevant experience, ideally with links to talks or writing.
Explicit mention of on-call practices, compensation, and escalation procedures.
Description of incident management philosophy—blameless post-mortems, learning culture, etc.
Transparent compensation that reflects the on-call burden.

Gateway to current listings

RemNavi doesn't post jobs. We pull them in from public sources and link straight through to the employer's own listing, so you always apply at the source.

Frequently asked questions

What's the difference between SRE and DevOps? DevOps is a philosophy about breaking down barriers between development and operations. SRE is a specific discipline, a term coined at Google for engineers who apply software engineering practices to operations. In practice, the terms blur: many modern SRE teams do the work that "DevOps" originally described, while older organizations still treat the two as separate roles.

Do I need to know Kubernetes to be an SRE? For most modern shops, yes. Kubernetes is the default container orchestration platform. That said, some companies still use other platforms (Nomad, ECS, custom). Ask what the primary platform is during the screen. If you know Linux and systems concepts, Kubernetes is learnable.

How much on-call work is typical? Depends on the company. Some have explicit rotations—one week per month, one week per quarter. Others are more ad-hoc. The key question is whether you're compensated for on-call time and whether the on-call load is reasonable. A bad on-call rotation will burn you out faster than almost anything else.

What certifications help with SRE hiring? Certifications barely move the needle. The Kubernetes certifications (CKA, CKAD) are respected but not required. CRE certifications are newer. What actually matters: shipped systems, incident response stories, infrastructure code you can show them, and thoughtful answers to architecture questions.

RemNavi pulls listings from company career pages and a handful of remote job boards, then sends you straight to the employer to apply. We don't host the listings ourselves, and we don't stand between you and the hiring team.

Related resources

Remote DevOps Engineer Jobs — Overlapping discipline with different focus
Remote Backend Developer Jobs — Close collaborators on reliability
Remote AWS Cloud Engineer Jobs — Infrastructure platform specialization
Remote TypeScript Developer Jobs — Infrastructure-as-code tooling on the node side
Remote Python Backend Developer Jobs — Common language for SRE automation