Site reliability engineering (SRE) is the discipline that applies software engineering to infrastructure and operations, with the goal of building systems that are reliable, scalable, and observable enough to run with confidence. The SRE model, which originated at Google, has become the dominant framework for how engineering-mature companies manage production.
What the work actually splits into
Production reliability and incident response. You own the availability and latency of production systems — defining SLOs, building alerting that fires on symptoms not causes, running postmortems that produce real improvements, and being the person who leads the response when things go wrong. This is the core of the SRE function at most companies.
Infrastructure automation and tooling. You eliminate toil — the manual, repetitive operational work that doesn't improve reliability. You build the internal tooling that makes deployments safer, rollbacks faster, and environment provisioning self-service. The output is engineering time returned to product teams.
Capacity planning and performance. You model traffic growth, right-size infrastructure, identify performance bottlenecks before they become incidents, and make the economic tradeoffs between cost and reliability explicit. Common at companies where infrastructure costs are material to unit economics.
Platform reliability. You ensure the internal developer platform — CI/CD, container orchestration, service mesh — is itself reliable and doesn't become the bottleneck for product teams. Close to platform engineering; the distinction is the reliability-first orientation.
Security and compliance reliability. You own the reliability and auditability of security controls: certificate rotation, secrets management, access control systems, and the infrastructure that compliance depends on. More common in regulated industries and at security-focused companies.
The employer landscape
Large internet-scale companies (Google, Meta, Netflix, Uber, Airbnb, Stripe) originated and refined the SRE discipline. Their SRE teams are large and specialised, and they are the highest-paying in the market. These roles require deep systems knowledge and strong programming skills, and the hiring processes are typically highly competitive.
High-growth SaaS companies at Series C and beyond hire SREs to professionalise reliability as they scale past the point where heroics work. These roles often involve building the SRE function from scratch — defining SLOs for the first time, establishing on-call practices, and instrumenting systems that have never been properly observed.
Financial services and fintech companies have strict reliability and audit requirements. SREs at these companies deal with regulatory compliance alongside technical reliability — every incident has a paper trail, every change is reviewed, and availability targets are contractual.
Developer-tool and infrastructure companies hire SREs who eat their own cooking — the SRE team supports the product that engineers use, which creates an unusually direct feedback loop between reliability work and customer impact.
Consulting firms occasionally hire SREs for client-facing reliability transformation engagements. High variety, less depth per system.
What skills actually differentiate candidates
SLO-first thinking. Strong SREs define reliability in terms of user-facing objectives — latency at the 99th percentile, error rate, availability — not infrastructure metrics like CPU usage or uptime. The ability to translate a product experience into an SLO and use that SLO to make prioritisation decisions is what separates SRE from traditional ops.
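As a rough illustration of what measuring user-facing objectives can look like, the sketch below computes availability and 99th-percentile latency from a batch of request records and checks them against targets. The `Request` record, the function names, and the default targets (99.9% availability, 300 ms at p99) are assumptions made for this example, not values taken from any particular company.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one user-facing request."""
    latency_ms: float
    ok: bool  # True if the user received a successful response


def availability(requests: list[Request]) -> float:
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(r.ok for r in requests) / len(requests)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples to measure")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(
    requests: list[Request],
    availability_target: float = 0.999,  # assumed target
    p99_target_ms: float = 300.0,        # assumed target
) -> bool:
    """True only if both the availability and p99 latency targets hold."""
    p99 = percentile([r.latency_ms for r in requests], 99)
    return availability(requests) >= availability_target and p99 <= p99_target_ms
```

Note that CPU usage never appears here: both indicators are computed from what the user experienced, which is the point of the SLO-first framing.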
Programming for production, not for demos. SREs write code that runs in production infrastructure — monitoring integrations, deployment tooling, chaos engineering scripts, auto-remediation. Code quality matters: tests, code review, on-call runbooks. This is engineering, not scripting.
Distributed systems depth. Incidents at scale involve subtle failure modes: network partitions, clock skew, thundering herds, cascading failures. Engineers who can reason about these failure modes while diagnosing a live incident, not just look them up afterward, are significantly more valuable.
Blameless postmortem craft. Writing a postmortem that actually prevents recurrence is a skill that compounds over time. Such a postmortem identifies the contributing factors rather than hunting for a single root cause, produces action items that get completed, and changes the system rather than blaming people.
Five things worth checking before you apply
What is the on-call rotation? On-call is part of the SRE job, but the load varies dramatically. Ask how many alerts fire per shift, what the mean time to resolution looks like, and whether there is a follow-the-sun rotation that limits night pages. A heavy on-call load is a signal of unreliable systems or under-investment in alerting quality.
How are SLOs defined and enforced? If the company has no SLOs, the SRE role is probably ops with a better title. Ask who owns SLO definition and whether SLOs are used to make prioritisation decisions — error budget policy is the test.
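To make the error budget idea concrete, here is a minimal sketch of how a budget might be computed and turned into a release decision. The function names and the 25% warning threshold are assumptions for illustration; a real error budget policy is whatever the company has agreed in writing.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent in the window.

    A 99.9% SLO over 1,000,000 requests allows roughly 1,000 bad requests.
    The result goes negative once the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1 - slo_target) * total_events
    return 1 - bad_events / allowed_bad


def release_policy(remaining: float) -> str:
    """Hypothetical policy mapping remaining budget to a release posture."""
    if remaining <= 0:
        return "freeze feature releases and prioritise reliability work"
    if remaining < 0.25:  # assumed warning threshold
        return "slow down and review risky changes"
    return "ship normally"
```

In an interview, asking what happens when the first branch is reached is a quick way to find out whether the policy has teeth.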
What is the relationship between SRE and product engineering? In the Google model, SRE can hand a service back to product engineering if it persistently fails to meet its SLO. Understanding whether this dynamic exists tells you whether reliability has real organisational backing.
What is the toil budget? SRE teams should spend no more than 50% of their time on toil (manual operational work). If the number is higher, the team is in a reliability debt spiral. Ask what percentage of last quarter was toil versus project work.
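The toil question reduces to simple arithmetic over tracked time. The sketch below assumes a team logs hours per category; the category names and function names are hypothetical, and only the 50% cap comes from the guideline above.

```python
TOIL_CAP = 0.5  # the 50% guideline quoted above


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Share of all logged hours that went to categories counted as toil."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(h for name, h in hours_by_category.items() if name in toil_categories)
    return toil / total


def exceeds_toil_cap(fraction: float, cap: float = TOIL_CAP) -> bool:
    """True when toil takes a larger share of time than the cap allows."""
    return fraction > cap


# Hypothetical quarter for one team, in hours.
quarter = {"interrupts": 150.0, "manual deploys": 90.0, "project work": 160.0}
share = toil_fraction(quarter, {"interrupts", "manual deploys"})
print(f"toil share: {share:.0%}, over cap: {exceeds_toil_cap(share)}")
```

A team that cannot produce numbers like these at all is itself a data point: it suggests toil is not being tracked.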
What observability stack is in use? Prometheus/Grafana, Datadog, New Relic, Honeycomb — these tell you both the maturity of the observability practice and what you'll be working with day-to-day.
The bottleneck at each level
Junior SREs are bottlenecked by systems breadth. They understand their stack well enough to handle known failure modes but struggle with novel incidents that require reasoning about systems they've never touched. Breadth comes from incident participation, postmortem reading, and deliberate study of the full production stack.
Mid-level SREs are bottlenecked by influence on product engineering practices. They can fix reliability problems but struggle to shift how product engineers write code — error handling, retry logic, graceful degradation — which is where reliability actually lives. The skill is embedding reliability thinking into the development process, not the operations process.
Senior SREs are bottlenecked by organisational prioritisation. Reliability work competes with feature work for engineering resources and almost always loses when deadlines arrive. Senior SREs who can translate reliability investment into business risk language — "without this, we have a 30% chance of a major incident in the next quarter that costs X in engineering time and Y in customer churn" — are far more effective at getting their roadmap funded.
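The business risk framing in that quote is an expected-cost calculation. A minimal sketch follows, assuming the probability and cost figures are estimates supplied by the team; every name and parameter here is illustrative, not a standard formula from any tool.

```python
def expected_incident_cost(probability: float, engineering_cost: float, churn_cost: float) -> float:
    """Expected loss for the period: chance of a major incident times what it would cost."""
    if not 0 <= probability <= 1:
        raise ValueError("probability must be between 0 and 1")
    return probability * (engineering_cost + churn_cost)


def investment_pays_off(
    probability_before: float,
    probability_after: float,
    engineering_cost: float,
    churn_cost: float,
    investment_cost: float,
) -> bool:
    """True if the reduction in expected loss exceeds the cost of the reliability work."""
    before = expected_incident_cost(probability_before, engineering_cost, churn_cost)
    after = expected_incident_cost(probability_after, engineering_cost, churn_cost)
    return (before - after) > investment_cost
```

The estimates will be rough, but putting them in this form lets reliability work be compared with feature work in the same currency.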
Pay and level expectations
Remote SRE salaries in the US range from $140,000–$190,000 at mid-level to $190,000–$260,000 at senior level at most technology companies. Tier-one companies (Google, Meta, Stripe) pay significantly above this with large equity packages. On-call compensation structures vary — some companies pay additional cash for on-call shifts, others roll it into base.
European remote roles typically pay €80,000–€130,000 depending on seniority and country, with US-headquartered companies occasionally matching closer to US rates for strong candidates.
What the hiring process looks like
SRE interviews typically include systems design (design a distributed rate limiter; how would you improve the reliability of X system), coding rounds (usually in Python or Go), and an incident simulation or troubleshooting exercise. Some companies include a postmortem exercise — here's an incident timeline, write the postmortem.
The interview tests both engineering depth and operational judgment. Candidates need to demonstrate that they can write production-quality code AND reason about system failure modes AND communicate clearly under the pressure of a simulated incident.
Red flags and green flags
Red flags: No SLOs or no error budget policy. On-call is described as "we just handle things as they come up." SRE team is primarily responsible for provisioning and deployments rather than reliability. No postmortem culture — incidents are resolved and forgotten. High on-call alert volume with no stated plan to reduce it.
Green flags: SLOs are defined for all user-facing services and used to make prioritisation decisions. On-call alert quality is measured and actively improved. Postmortems are shared openly within the company and produce completed action items. SRE has explicit capacity for project work, not just toil.
Gateway to current listings
Use the listings below to find current remote site reliability engineer openings. SRE roles vary more than most engineering titles: it is common for a company to say SRE but mean DevOps, and vice versa. Ask specifically about SLOs, error budgets, and on-call load in the first conversation to calibrate quickly.
Frequently asked questions
What is the difference between SRE and DevOps? SRE is a specific implementation of DevOps principles with a software engineering orientation, defined error budgets, and a specific on-call model. DevOps is broader and more culturally defined. In practice, many companies use the titles interchangeably; the SLO/error budget framework is the reliable differentiator.
Do SREs need to know how to code? Yes, genuinely. SRE is a software engineering role applied to infrastructure. Python and Go are the most common languages. Engineers who can only configure tools (Terraform, Ansible) without writing automation logic are platform engineers or DevOps engineers — not SREs in the original sense.
Is SRE a good career path long-term? Yes — particularly as systems grow more complex and reliability becomes more business-critical. The SRE skill set (distributed systems, observability, software engineering) transfers well to platform engineering, infrastructure engineering, and technical leadership.
How important is cloud certification for SRE roles? Less important than practical experience. AWS/GCP/Azure certifications are a signal for roles that are heavy on managed services. For companies running significant Kubernetes infrastructure or building their own reliability tooling, demonstrated systems engineering depth matters more.
Related resources
- Remote SRE engineer jobs — same role, abbreviated title coverage
- Remote DevOps engineer jobs — adjacent discipline with cultural overlap
- Remote platform engineer jobs — internal developer platform counterpart
- Remote infrastructure engineer jobs — infrastructure depth without the SLO framework