LLM engineering sits in the narrow space between ML research and product engineering — close enough to the models to change how they behave, close enough to the product to own the end user experience. The role barely existed before 2023 and is now one of the most in-demand specialisations on the remote market.
Three jobs are hiding in the same keyword
"LLM Engineer" can mean very different things. The actual work depends on how deep into the model stack the team expects you to go, and you can usually tell which of the three you're looking at from the first two paragraphs of a listing.
Applied LLM engineer. Builds prompt pipelines, eval harnesses, and guardrails inside a product. Day to day: designing prompts that actually work, building evaluation suites, writing the integration code that sits around a model API, and debugging failure modes no one told you about. Moderate systems depth, very high product focus. The most common LLM role on the remote market.
LLM infrastructure engineer. Owns the layer underneath the product — inference serving, fine-tuning pipelines, model routing, latency and cost optimisation. Day to day: GPU scheduling, batching, caching, quantisation, sometimes fine-tuning on custom data. Deep systems work, narrower product focus, paid well because few engineers have done it at scale.
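The batching work mentioned above can be sketched in miniature: group prompts so that one model call amortises the per-request overhead. This is a toy illustration with invented names, not how production servers work; real serving stacks like vLLM batch continuously at the token level.

```python
# Toy micro-batching sketch: split a stream of prompts into groups and
# run each group as a single model call. All names here are invented.

def micro_batch(prompts, model_batch_fn, max_batch=4):
    """Split prompts into groups of max_batch and run each as one call."""
    results = []
    for i in range(0, len(prompts), max_batch):
        results.extend(model_batch_fn(prompts[i:i + max_batch]))
    return results

# Stub "model" that records how the requests were grouped, so the
# sketch runs offline without a provider SDK.
batch_sizes = []
def fake_model(batch):
    batch_sizes.append(len(batch))
    return [p.upper() for p in batch]

out = micro_batch([f"q{i}" for i in range(10)], fake_model)
print(batch_sizes)  # [4, 4, 2]
```

The real versions of this problem add dynamic batching windows, padding, and per-request latency budgets, which is where the "deep systems work" in these roles lives.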
Agent engineer. Builds agentic systems with tool use, memory, and orchestration across multiple model calls. Day to day: designing tool interfaces, building the state machines that coordinate multi-step reasoning, and debugging the long tail of edge cases that turn up when models make decisions. Newer specialty, growing quickly, still finding its shape.
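The "state machine that coordinates multi-step reasoning" can be shown in miniature: the model decides on a tool or a final answer, the loop executes the tool, and the result feeds the next decision. Everything here is a hypothetical stand-in; a real agent parses structured model output rather than reading from a script.

```python
# Minimal agent loop sketch: decide -> execute tool -> observe, until a
# final answer or a step limit. The scripted decisions stand in for a model.

def agent_loop(decide_fn, tools, question, max_steps=5):
    """Run the decide/execute/observe cycle until 'final' or the step cap."""
    history = [question]
    for _ in range(max_steps):
        decision = decide_fn(history)
        if "final" in decision:
            return decision["final"]
        observation = tools[decision["tool"]](*decision["args"])
        history.append(observation)   # feed the tool result back in
    return "step limit reached"       # guard against runaway loops

# Scripted decisions standing in for real model calls.
script = iter([
    {"tool": "add", "args": (2, 3)},
    {"final": "the sum is 5"},
])
tools = {"add": lambda a, b: a + b}
print(agent_loop(lambda h: next(script), tools, "what is 2+3?"))
# -> the sum is 5
```

The long tail of edge cases the role description mentions lives in exactly these lines: malformed tool calls, loops that never reach "final", and tools that fail mid-run.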
Four employer types cover most of the market
LLM roles cluster by how the company uses models, not by how big it is.
AI-native product startups. Companies whose product is built around an LLM — copilots, writing tools, customer-facing chatbots, and vertical assistants for legal, sales, or healthcare. Most applied LLM roles live here. The engineering culture varies enormously, so read carefully.
LLM infrastructure and serving companies. Companies whose product is the platform other AI teams run on — inference serving, fine-tuning tools, evaluation platforms, routing layers. These are the hardest roles and pay the most; the users are other engineers who will notice every rough edge.
Enterprise incumbents adding AI features. Larger product companies bolting LLM features onto existing software. The work is more applied than cutting-edge, the processes are slower, the pay is usually mid-range but the job security is real. Good entry point for engineers transitioning from product or backend roles.
Foundation model labs. A small market — Anthropic, OpenAI, Mistral, a handful of others. The work blurs the line between research and engineering, and the interviews are very competitive. Usually hire through their own networks, not through general job boards.
What the stack actually looks like
Very few listings spell out the full stack you'll need. What "LLM Engineer" usually implies in practice:
- Python at a comfortable working level
- at least one model provider API (OpenAI, Anthropic, or open-source models served via vLLM)
- prompt and eval tooling of some kind, home-grown or a framework
- a vector database for retrieval, even if retrieval isn't the core of the role
- observability for model calls
- on infra roles, GPU scheduling, batching, and quantisation techniques
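The "integration code that sits around a model API" is the least glamorous item on that list, so here is a sketch of what it tends to mean: pinned model versions, retries, and a fallback model. The client and model names are invented stand-ins so the sketch runs offline; no real SDK is assumed.

```python
# Sketch of API integration code: pin model versions, retry transient
# failures, fall back to a cheaper model. StubClient is a hypothetical
# stand-in for a provider SDK.

import time

class StubClient:
    """Offline stand-in for a provider SDK."""
    def __init__(self, fail_models=()):
        self.fail_models = set(fail_models)

    def complete(self, model, prompt):
        if model in self.fail_models:
            raise RuntimeError(f"{model} unavailable")
        return f"[{model}] answer to: {prompt}"

def complete_with_fallback(client, prompt,
                           models=("main-model-2024-06", "small-model-2024-06"),
                           retries=2):
    """Try each pinned model version in order, retrying transient errors."""
    last_err = None
    for model in models:
        for _ in range(retries):
            try:
                return client.complete(model, prompt)
            except RuntimeError as err:
                last_err = err
                time.sleep(0)  # real code would back off here
    raise last_err

client = StubClient(fail_models={"main-model-2024-06"})
reply = complete_with_fallback(client, "Summarise this ticket")
print(reply)  # falls back to the second pinned model
```

Pinning the version string rather than using a floating "latest" alias is what makes the regression checks discussed later in this page possible.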
Six things worth checking before you apply
These hold up better than any bullet list of tools, and they don't go stale when the model of the month changes.
- Which part of the LLM stack the role actually owns. Application layer, eval layer, inference layer, or fine-tuning layer. A good listing tells you. A weaker one just says "LLM Engineer wanted" and leaves you to guess.
- Whether the team has an evaluation story. "We test prompts" is not evaluation. Look for mentions of an eval suite, benchmark runs, automated regression checks, or offline metrics on real data. Teams without one are improvising.
- How the team handles model updates. Model providers push breaking behaviour changes on their own schedule. Teams that have thought about this will mention pinning, version tracking, or regression suites. Teams that haven't are about to find out the hard way.
- Remote-work maturity. Good remote teams put their async habits in writing: how decisions are documented, how review travels across timezones, how onboarding runs without a full-team call. AI teams are uneven on this — the good ones tend to stand out clearly.
- Product scope you can say out loud. If you can't describe in one sentence what you'd actually be building, the team probably hasn't agreed on it either. LLM roles with vague scope tend to turn into prompt-tuning marathons that nobody measures.
- How the hiring process itself reads. A take-home focused on judgement and eval design, a paid trial day, or structured pairing — these come from teams that value your time. Multi-stage live coding without context is a warning sign, especially in a field where most of the work is judgement rather than algorithmic puzzle-solving.
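The evaluation and model-update checks above can be made concrete with a tiny regression harness: golden cases run against the current model, with failures gated before a version bump ships. The case data and the grading rule here are invented for illustration; real suites grade against measured production behaviour.

```python
# Toy regression harness: run golden cases against a model function and
# gate on the failure count. Cases and grading rule are hypothetical.

GOLDEN_CASES = [
    {"prompt": "Classify: 'refund my order'", "must_contain": "refund"},
    {"prompt": "Classify: 'login broken'",    "must_contain": "login"},
]

def run_regression(model_fn, cases, max_failures=0):
    """Return (passed, failed_prompts); gate deploys on the first value."""
    failures = []
    for case in cases:
        output = model_fn(case["prompt"])
        if case["must_contain"] not in output.lower():
            failures.append(case["prompt"])
    return len(failures) <= max_failures, failures

# Stub model so the sketch runs offline; a real harness calls the API
# with a pinned model version and logs every run.
fake_model = lambda prompt: prompt.lower()
ok, failed = run_regression(fake_model, GOLDEN_CASES)
print(ok, failed)  # True []
```

A team with "an evaluation story" has something shaped like this wired into CI, so a provider's silent behaviour change shows up as a failed run rather than a support ticket.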
The bottleneck is different at every level
Remote LLM hiring is crowded at the junior end and very competitive at the senior end.
Junior is crowded because the entry point looks welcoming — an API key, a Streamlit demo, a blog post. What thins the field is evidence you've taken an LLM feature from demo to something a real user depends on. A small public project with actual evals, measured failure modes, and a write-up of what you fixed is worth more than ten viral demos.
At mid and senior, the prompting bar barely moves. What changes is systems judgement: when a RAG loop is unnecessary, when an eval is lying to you, when to pin a model and when to accept drift, when a simple rule beats a model call. That kind of judgement rarely turns up on a CV. It shows up in how someone describes the last LLM feature they shipped and what they'd do differently now.
What the hiring process usually looks like
Length varies — from two weeks at a fast startup to two months at a foundation lab. The stages themselves don't move much: (1) application — tailored CV, short intro, links to real work; (2) screen — written intake or a 20–30 minute call; (3) technical — LLM-oriented take-home, paired eval design, or systems pairing; (4) final round — LLM systems design, team fit, written or verbal deep-dive; (5) offer — comp, references, start date.
Red flags and green flags
Red flags — step carefully or pass:
- A listing that treats "LLM Engineer" as "prompt engineer with a bigger title."
- Companies claiming to "do AI" with no public artefact — no demo, no write-up, no case study.
- Tech stack lists piling on every LLM framework in the same paragraph, which usually means the team hasn't chosen one.
- Unpaid take-homes longer than a few hours, particularly ones that would produce something shippable.
- Salary bands missing entirely, or a range so wide it carries no information.
Green flags — strong signal of a healthy team:
- A specific description of the model and the product, and what failure looks like in production.
- Public engineering writing about how the team evaluates or debugs model behaviour.
- A named tech lead or research lead with a link to their public work.
- A hiring process laid out step by step with time estimates at each stage.
- Transparent compensation and location policy, ideally linked from a public handbook.
Gateway to current listings
RemNavi doesn't post jobs. We pull them in from public sources and link straight through to the employer's own listing, so you always apply at the source.
Frequently asked questions
Do I need to have trained a model from scratch to be an LLM engineer? No. Most LLM engineering work happens at the application and infrastructure layers, not the training layer. A strong applied LLM engineer is someone who understands model behaviour well enough to build evals and guardrails around it — not someone who has trained a foundation model. Fine-tuning experience helps for some roles, but it's rarely the core of the job.
What's the difference between an LLM engineer and a prompt engineer? A prompt engineer writes prompts. An LLM engineer builds the system that the prompts live inside — evals, routing, fallbacks, guardrails, monitoring, the integration code, the serving layer. The prompt is one small piece. Most "prompt engineer" listings you see now are actually asking for LLM engineers with a narrower title.
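That distinction can be sketched: the prompt is one string, and the system around it is everything else. The guardrail rule, the template, and the logging here are all invented for illustration.

```python
# One-string prompt vs. the system around it. All names and rules are
# hypothetical; a real system uses proper logging and richer guardrails.

PROMPT = "Answer in one sentence: {question}"   # the prompt-engineer part

BLOCKED_TERMS = {"ssn", "password"}             # toy guardrail rule

def answer(question, model_fn):
    """The LLM-engineer part: guardrail, model call, fallback, logging."""
    if any(term in question.lower() for term in BLOCKED_TERMS):
        return "Sorry, I can't help with that."     # guardrail short-circuits
    try:
        reply = model_fn(PROMPT.format(question=question))
    except Exception:
        reply = "Service temporarily unavailable."  # fallback path
    print(f"log: q={question!r} -> {len(reply)} chars")  # observability stub
    return reply

echo_model = lambda p: p  # offline stand-in for a real model call
print(answer("What is my password?", echo_model))
# -> Sorry, I can't help with that.
```

Routing, eval suites, and the serving layer sit in the same place in the architecture: around the prompt, not inside it.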
How fast is the field actually moving? Fast enough that the tooling changes meaningfully every six months, but slow enough that the fundamentals — evals, observability, judgement about when a model is and isn't the right answer — don't. If you're learning the field, invest in the fundamentals first. The tool-chasing will take care of itself.
Why do LLM infrastructure roles pay so much more than applied LLM roles? Because production LLM infrastructure is still genuinely hard — GPU scheduling, batching, quantisation, latency budgets measured in hundreds of milliseconds — and the set of engineers who have done it at scale is small. The pay gap follows the scarcity of production infrastructure experience, not the glamour of the model side.
Related resources
- Remote RAG Engineer Jobs — The retrieval-heavy specialisation that pairs closely with LLM work
- Remote ML Engineer Jobs — The broader ML discipline that LLM engineering grew out of
- Remote Python Backend Developer Jobs — Most LLM systems live inside a Python backend
- Remote Data Engineer Jobs — Data infrastructure underneath most LLM systems
- Remote DevOps Engineer Jobs — Infrastructure and deployment for LLM systems