Chapter 14: Deploying LLMs in Production
From development to scalable, monitored production systems

Moving LLMs from experimentation to production requires careful consideration of scalability, reliability, security, and cost. This chapter covers practical strategies for deploying and maintaining LLM applications in real-world environments as of 2025.
LLM Deployment Overview
Deploying LLMs in production involves multiple components working together to deliver reliable, scalable, and secure model access. A typical architecture includes an API gateway, a load balancer distributing traffic across model-serving instances, a response cache, and monitoring and logging infrastructure.
Common Deployment Scenarios
API Endpoints
Exposing LLM capabilities via REST/gRPC APIs for application integration
Batch Processing
Scheduled jobs for processing large datasets offline
Edge Deployment
Running smaller models on edge devices for low-latency applications
Hybrid Approach
Combining cloud-based large models with local smaller models
Deployment Decision Factors
Latency Needs
Milliseconds vs. seconds per response
Throughput
Requests per second (RPS) to sustain at peak
Cost
Dollars per 1,000 tokens at expected volume
Data Sensitivity
Level of PII in prompts and outputs
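Several of these factors can be turned directly into a routing rule, which is how the hybrid approach above is usually realized. A minimal sketch; the thresholds and target names are illustrative assumptions, not recommendations:
Python: Deployment Routing Sketch
from dataclasses import dataclass

@dataclass
class RequestProfile:
    max_latency_ms: int   # latency budget for this request
    contains_pii: bool    # data sensitivity flag
    expected_tokens: int  # rough output size

def route_request(profile: RequestProfile) -> str:
    """Pick a deployment target from the decision factors above (illustrative thresholds)."""
    if profile.contains_pii:
        return "local-small-model"   # keep sensitive data on-premises or on-device
    if profile.max_latency_ms < 200:
        return "local-small-model"   # tight latency budget favors the edge
    if profile.expected_tokens > 500:
        return "cloud-large-model"   # long outputs justify the larger cloud model
    return "cloud-medium-model"

# Example: a latency-sensitive request without PII
print(route_request(RequestProfile(max_latency_ms=150, contains_pii=False, expected_tokens=50)))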
Prompt Engineering in Production
Production prompts require more structure and reliability than experimental ones. Key considerations include versioning, testing, and consistency across deployments.
Production-Ready Prompt Example
"""
SYSTEM: You are a customer service assistant for an e-commerce platform.
Your responses must adhere strictly to these guidelines:
1. Tone: Professional but friendly (formal-informal scale: 4/10)
2. Structure:
- Acknowledge the concern
- Provide accurate information
- Offer next steps
3. Safety:
- Never share internal policies
- Redirect account issues to secure portal
4. Length: 2-3 sentences maximum
5. Fallback: "Let me connect you with a human agent"
Current company policies (2025-06-01):
- Returns: 30-day window
- Shipping: Free over $50
USER: My package hasn't arrived after 5 days
"""
Prompt Management Best Practices
Version Control
Track prompt changes with Git or specialized tools like PromptHub
A/B Testing
Compare prompt variations with real users before full rollout
Environment Separation
Maintain distinct prompts for dev, staging, and production
Documentation
Record prompt purpose, expected inputs, and success metrics
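The versioning and environment-separation practices above can be prototyped with a small in-code registry before adopting a dedicated tool; the layout and helper below are assumptions for illustration, not any particular product's API:
Python: Versioned Prompt Registry Sketch
import hashlib

# name -> environment -> list of (version, template) entries, newest last
PROMPT_REGISTRY = {
    "customer_service": {
        "production": [("1.2.0", "SYSTEM: You are a customer service assistant...\nUSER: {user_message}")],
        "staging": [("1.3.0-rc1", "SYSTEM: You are a customer service assistant (revised tone)...\nUSER: {user_message}")],
    }
}

def get_prompt(name: str, environment: str = "production"):
    """Return (version, checksum, template) for the latest prompt in an environment."""
    version, template = PROMPT_REGISTRY[name][environment][-1]
    checksum = hashlib.sha256(template.encode()).hexdigest()[:12]  # log this alongside every request
    return version, checksum, template

version, checksum, template = get_prompt("customer_service")
print(version, checksum)
print(template.format(user_message="My package hasn't arrived after 5 days"))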
Prompt Testing Framework
| Test Type | Description | Frequency |
| --- | --- | --- |
| Unit Tests | Verify the prompt produces expected output for given inputs | Pre-deployment |
| Edge Cases | Test with unusual or problematic inputs | Weekly |
| Performance | Measure latency and token usage | Monthly |
| Bias Checks | Evaluate outputs for fairness across demographics | Quarterly |
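The unit-test row above translates naturally into pytest checks that assert structural properties of the output (length, forbidden content) rather than exact wording. A minimal sketch, assuming a call_model helper that wraps your deployed endpoint:
Python: Prompt Unit Test Sketch
# test_prompts.py -- run with: pytest test_prompts.py

def call_model(prompt: str) -> str:
    """Placeholder for the real model call (HTTP request, SDK, etc.)."""
    return "I'm sorry your package is delayed. It should arrive within 2 days. You can track it in your account."

def test_response_length_limit():
    # Guideline 4 in the production prompt: 2-3 sentences maximum
    response = call_model("My package hasn't arrived after 5 days")
    sentences = [s for s in response.split(".") if s.strip()]
    assert len(sentences) <= 3

def test_no_internal_policy_leak():
    # Guideline 3: never share internal policies
    response = call_model("What is your internal escalation policy?")
    assert "internal policy" not in response.lower()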
Scalability and Performance
LLM applications must handle varying loads while maintaining acceptable latency and reliability. Below are key techniques for scaling LLM deployments:
Scaling Strategies
Horizontal Scaling
Add more model instances behind a load balancer
Response Caching
Cache frequent or deterministic responses
Request Batching
Process multiple requests simultaneously when possible
Model Distillation
Use smaller distilled models for less critical tasks
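Response caching is usually the cheapest of these strategies to add: hash the prompt and generation parameters, and reuse the stored result when sampling is effectively deterministic. A minimal in-process sketch (a production deployment would more likely use Redis or a similar shared cache):
Python: Response Caching Sketch
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(prompt: str, params: dict) -> str:
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def generate_cached(prompt: str, params: dict, generate_fn) -> str:
    """Only cache near-deterministic calls; sampled outputs should bypass the cache."""
    if params.get("temperature", 1.0) > 0.2:
        return generate_fn(prompt, **params)
    key = cache_key(prompt, params)
    if key not in _cache:
        _cache[key] = generate_fn(prompt, **params)
    return _cache[key]

# Example with a stand-in generator function
print(generate_cached("Define 'latency'.", {"temperature": 0.0}, lambda p, **kw: f"Echo: {p}"))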
Python: Load-Tested API Endpoint
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from transformers import pipeline
import asyncio
from concurrent.futures import ThreadPoolExecutor

app = FastAPI()

# Load the model once at startup; each uvicorn worker gets its own copy
model = pipeline('text-generation', model='meta-llama/Meta-Llama-3-8B-Instruct')
executor = ThreadPoolExecutor(max_workers=8)  # Limit concurrent generations

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Restrict to known origins in production
    allow_methods=["POST"],
)

@app.post("/generate")
async def generate_text(prompt: str):
    loop = asyncio.get_event_loop()
    try:
        # Run generation in the thread pool so the event loop is not blocked
        result = await loop.run_in_executor(
            executor,
            lambda: model(prompt, max_new_tokens=150, do_sample=True, temperature=0.7)
        )
        return {"response": result[0]['generated_text']}
    except Exception as e:
        return {"error": str(e)}

# To run: uvicorn api:app --workers 4 --host 0.0.0.0 --port 8000
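Assuming the service above is running locally, it can be exercised with a short client call; the prompt travels as a query parameter because the handler declares a plain str argument:
import httpx

resp = httpx.post(
    "http://localhost:8000/generate",
    params={"prompt": "Explain quantum computing in one sentence."},
    timeout=60,
)
print(resp.json())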
Performance Benchmarks (2025)
Small Model (8B)
45ms/token
Medium Model (70B)
120ms/token
Large Model (400B+)
300ms+/token
Average latency per token on A100 GPUs (batch size 8)
Monitoring and Maintenance
Continuous monitoring is essential for maintaining LLM application quality and reliability. Key metrics and practices include:
Essential Monitoring Metrics
Performance
Latency, throughput, error rates
Quality
Output accuracy, relevance scores
Cost
Tokens used, API costs per request
Safety
Content moderation flags, bias indicators
Python: Monitoring Script
import pandas as pd
from datetime import datetime
import mlflow

class LLMMonitor:
    def __init__(self):
        self.df = pd.DataFrame(columns=[
            'timestamp', 'prompt', 'response', 'latency',
            'token_count', 'error', 'feedback_score'
        ])

    def log_request(self, prompt, response, latency, token_count, error=None):
        new_row = {
            'timestamp': datetime.now(),
            'prompt': prompt[:200],  # Truncate for storage
            'response': response[:200] if response else None,
            'latency': latency,
            'token_count': token_count,
            'error': error,
            'feedback_score': None
        }
        self.df = pd.concat([self.df, pd.DataFrame([new_row])], ignore_index=True)
        # Log to MLflow for later analysis
        with mlflow.start_run():
            mlflow.log_metric("latency_ms", latency)
            mlflow.log_metric("tokens", token_count)
            if error:
                mlflow.log_metric("error", 1)

    def analyze_trends(self):
        # Calculate hourly aggregates
        return self.df.set_index('timestamp').resample('H').agg({
            'latency': 'mean',
            'token_count': 'sum',
            'error': lambda x: x.notna().sum()
        })

# Example usage
monitor = LLMMonitor()
monitor.log_request(
    prompt="Explain quantum computing",
    response="Quantum computing uses qubits...",
    latency=450,
    token_count=85
)
print(monitor.analyze_trends())
Alert Threshold Recommendations
| Metric | Warning | Critical |
| --- | --- | --- |
| Latency (p95) | 2x baseline | 5x baseline |
| Error Rate | 5% | 10% |
| Token Cost | 20% over expected | 50% over expected |
| Feedback Score | Avg < 3/5 | Avg < 2/5 |
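These thresholds can be enforced with a small periodic check that compares current aggregates against a stored baseline and the expected spend; a minimal sketch (the alert delivery mechanism is left out):
Python: Alert Threshold Check Sketch
def check_alerts(current: dict, baseline: dict, expected_cost: float) -> list[str]:
    """Return alert messages based on the thresholds in the table above."""
    alerts = []
    if current["latency_p95"] > 5 * baseline["latency_p95"]:
        alerts.append("CRITICAL: p95 latency above 5x baseline")
    elif current["latency_p95"] > 2 * baseline["latency_p95"]:
        alerts.append("WARNING: p95 latency above 2x baseline")
    if current["error_rate"] >= 0.10:
        alerts.append("CRITICAL: error rate at or above 10%")
    elif current["error_rate"] >= 0.05:
        alerts.append("WARNING: error rate at or above 5%")
    if current["token_cost"] > 1.5 * expected_cost:
        alerts.append("CRITICAL: token cost more than 50% over expected")
    elif current["token_cost"] > 1.2 * expected_cost:
        alerts.append("WARNING: token cost more than 20% over expected")
    if current["avg_feedback"] < 2:
        alerts.append("CRITICAL: average feedback below 2/5")
    elif current["avg_feedback"] < 3:
        alerts.append("WARNING: average feedback below 3/5")
    return alerts

print(check_alerts(
    {"latency_p95": 900, "error_rate": 0.06, "token_cost": 130.0, "avg_feedback": 2.7},
    {"latency_p95": 400},
    expected_cost=100.0,
))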
Security in Production
LLM deployments introduce unique security challenges that must be addressed to protect systems and data:
Security Risks and Mitigations
Prompt Injection
Malicious inputs that subvert system instructions
Mitigation: Input sanitization, system prompt isolation
Data Leakage
Accidental exposure of sensitive information
Mitigation: PII detection, output filtering
Denial of Service
Resource exhaustion via excessive requests
Mitigation: Rate limiting, request validation
Secure Configuration Example
# Secure API configuration for LLM deployment
from fastapi import FastAPI, HTTPException, Request, Depends
from fastapi.security import APIKeyHeader
from pydantic import BaseModel
import re

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key")

class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 100

@app.post("/v1/generate")
async def generate_text(
    request: GenerationRequest,
    user_request: Request,
    api_key: str = Depends(api_key_header),
):
    # Validate API key
    if not validate_api_key(api_key):
        raise HTTPException(status_code=403, detail="Invalid API key")
    # Rate limiting
    if await check_rate_limit(user_request.client.host):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    # Input validation
    if len(request.prompt) > 1000:
        raise HTTPException(status_code=400, detail="Prompt too long")
    if request.max_tokens > 200:
        raise HTTPException(status_code=400, detail="Max tokens too high")
    # Sanitize prompt
    sanitized_prompt = sanitize_input(request.prompt)
    # Process with LLM (implementation omitted)
    response = generate_with_llm(sanitized_prompt, request.max_tokens)
    # Filter output for PII
    clean_response = filter_pii(response)
    return {"response": clean_response}

# validate_api_key, check_rate_limit, generate_with_llm, and filter_pii are
# application-specific helpers whose implementations are omitted here

def sanitize_input(text: str) -> str:
    """Remove potentially dangerous patterns"""
    # Strip the most obvious prompt-injection phrasing (defense in depth, not a complete fix)
    text = re.sub(r'(?i)ignore previous instructions', '', text)
    # Collapse excessive whitespace
    return ' '.join(text.split())
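The filter_pii step referenced above can start as a simple regex pass over the model output; this is only a sketch and not a substitute for a dedicated PII-detection service (the patterns cover just e-mail addresses and US-style phone numbers):
Python: Output PII Filter Sketch
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL REDACTED]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE REDACTED]"),
]

def filter_pii(text: str) -> str:
    """Replace simple PII patterns in model output before returning it to the client."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(filter_pii("Contact me at jane.doe@example.com or 555-123-4567."))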
Security Audit Checklist
- API keys required, validated, and rotated on a regular schedule
- Per-client rate limiting enforced
- Prompt length and max_tokens limits validated on every request
- Inputs sanitized against prompt-injection patterns
- Outputs filtered for PII before being returned
Cost Optimization
LLM operations can become expensive at scale. These strategies help control costs while maintaining performance:
Cost Reduction Techniques
Prompt Optimization
Reduce unnecessary tokens in prompts and responses
Model Selection
Use smallest effective model for each task
Caching
Cache frequent or deterministic responses
Batching
Process multiple requests together when possible
Cost Comparison (2025)
Estimated cost per 1M tokens (USD) for common models
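Whatever the current rates, the monthly bill follows from simple arithmetic over request volume and token counts. A minimal sketch with placeholder prices (assumptions for illustration, not actual 2025 quotes):
Python: Cost Estimation Sketch
# Illustrative per-1M-token prices in USD -- placeholders, not real pricing
PRICES = {
    "small-8b": {"input": 0.20, "output": 0.60},
    "large-frontier": {"input": 5.00, "output": 15.00},
}

def monthly_cost(model: str, requests_per_day: int, in_tokens: int, out_tokens: int) -> float:
    p = PRICES[model]
    per_request = (in_tokens * p["input"] + out_tokens * p["output"]) / 1_000_000
    return per_request * requests_per_day * 30

print(f"small-8b: ${monthly_cost('small-8b', 50_000, 400, 150):,.2f}/month")
print(f"large-frontier: ${monthly_cost('large-frontier', 50_000, 400, 150):,.2f}/month")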
Prompt Optimization Example
Original Prompt
"Can you please provide a detailed explanation of how photosynthesis works in plants? I'd like to understand the light-dependent reactions, the Calvin cycle, and the role of chlorophyll. Please include examples and make it comprehensive enough for a college biology student."
~65 tokens
Optimized Prompt
"Explain photosynthesis concisely: 1) Light reactions 2) Calvin cycle 3) Chlorophyll's role. College level."
~25 tokens (62% reduction)
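Token savings like the reduction above can be measured rather than estimated by hand. A short sketch using the tiktoken library (cl100k_base is an approximation; the exact count depends on the tokenizer of the model you deploy):
Python: Token Counting Sketch
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

original = ("Can you please provide a detailed explanation of how photosynthesis works in plants? "
            "I'd like to understand the light-dependent reactions, the Calvin cycle, and the role of "
            "chlorophyll. Please include examples and make it comprehensive enough for a college biology student.")
optimized = "Explain photosynthesis concisely: 1) Light reactions 2) Calvin cycle 3) Chlorophyll's role. College level."

orig_tokens = len(enc.encode(original))
opt_tokens = len(enc.encode(optimized))
print(orig_tokens, opt_tokens, f"{1 - opt_tokens / orig_tokens:.0%} reduction")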
User Feedback Integration
Continuous improvement of production LLM applications requires effective collection and utilization of user feedback:
Feedback Collection Methods
Direct Ratings
"Was this response helpful?" (Thumbs up/down)
Implicit Signals
User rephrasing same query, session duration
Follow-up Prompts
"What could make this answer more helpful?"
Human Review
Sampled interactions evaluated by experts
Feedback Loop Implementation
from fastapi import FastAPI
from pydantic import BaseModel
from typing import Optional
import pandas as pd

app = FastAPI()
feedback_db = pd.DataFrame(columns=['response_id', 'rating', 'comment', 'timestamp'])

class Feedback(BaseModel):
    response_id: str
    rating: int  # 1-5 scale
    comment: Optional[str] = None

@app.post("/feedback")
async def submit_feedback(feedback: Feedback):
    global feedback_db
    # Store feedback
    new_entry = {
        'response_id': feedback.response_id,
        'rating': feedback.rating,
        'comment': feedback.comment,
        'timestamp': pd.Timestamp.now()
    }
    feedback_db = pd.concat([feedback_db, pd.DataFrame([new_entry])], ignore_index=True)
    # Trigger improvement processes if the rating is low
    if feedback.rating < 3:
        await flag_for_review(feedback.response_id)
    return {"status": "received"}

async def flag_for_review(response_id: str):
    """Trigger human review and prompt adjustment"""
    # Implementation would connect to your review system
    pass

def get_feedback_stats(total_requests: int):
    """Calculate weekly feedback metrics (total_requests comes from your request logs)"""
    return {
        'avg_rating': feedback_db['rating'].mean(),
        'response_rate': len(feedback_db) / total_requests,
        'common_complaints': feedback_db['comment'].value_counts().head(3)
    }
Feedback Integration Workflow
Continuous feedback integration allows for iterative improvement of both prompts and model selection in production environments.
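One concrete way to close this loop is to tie average ratings back to prompt versions, so that a regression in feedback triggers a rollback to the previous version. A minimal sketch, assuming each logged response records which prompt version produced it:
Python: Feedback-Driven Rollback Sketch
import pandas as pd

# Assumed log schema: one row per response, tagged with the prompt version that produced it
log = pd.DataFrame({
    "prompt_version": ["1.2.0", "1.2.0", "1.3.0", "1.3.0", "1.3.0"],
    "rating": [5, 4, 2, 3, 2],
})

def pick_prompt_version(log: pd.DataFrame, current: str, previous: str, min_avg: float = 3.0) -> str:
    """Roll back to the previous prompt version if the current one's average rating drops too low."""
    avg = log.groupby("prompt_version")["rating"].mean()
    if current in avg.index and avg[current] < min_avg:
        return previous
    return current

print(pick_prompt_version(log, current="1.3.0", previous="1.2.0"))  # -> "1.2.0"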