SLO Agent Guide¶

The SLO Agent helps you determine appropriate Service Level Objective (SLO) targets for your services by analyzing code patterns, repository structure, and deployment context. This guide covers everything from quick starts to advanced integration.

What is an SLO?¶

A Service Level Objective (SLO) is a target level of service performance. It defines what "good" looks like for your service:

Availability — "The service is up and responding" (e.g., 99.9%)
Latency — "Requests complete quickly" (e.g., p99 < 200ms)
Error Rate — "Requests succeed" (e.g., < 0.1% errors)

SLOs are not guesses—they should be based on business impact and service characteristics.

Overview¶

The SLO Agent analyzes:

Code patterns: Database calls, HTTP requests, caching, retry logic, error handling, monitoring
Repository structure: Tests, Docker, Kubernetes, monitoring setup
Service type: Explicitly provided or auto-detected from code
Context: User count, service criticality, business impact

It then recommends SLO targets with confidence scores and actionable recommendations.

Service Types¶

Understanding your service type is key to choosing appropriate SLOs.

SaaS (User-Facing Services)¶

User-visible availability and latency matter directly. Examples:

- REST APIs serving user requests
- Web dashboards
- Mobile backends
- Real-time chat or notification systems

Typical targets:

- Availability: 99.9% (~43 min/month downtime)
- Latency (p99): < 200ms
- Error rate: < 0.1%

Why these targets? Users notice delays over 100ms and errors immediately. Downtime frustrates users and damages trust.
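
The downtime figure quoted next to an availability target follows from simple arithmetic. The sketch below assumes a 30-day month; the function name `downtime_minutes_per_month` is illustrative and not part of greybeard.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes; assumes a 30-day month


def downtime_minutes_per_month(availability_pct: float) -> float:
    """Return the downtime allowed per 30-day month for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction of the month is the downtime budget.
    return (100 - availability_pct) / 100 * MINUTES_PER_30_DAY_MONTH


for pct in (99.9, 99.95, 98.0, 95.0):
    minutes = downtime_minutes_per_month(pct)
    print(f"{pct}% -> {minutes:.1f} min/month ({minutes / 60:.1f} hours)")
```

Running the same calculation on a candidate target before committing to it makes the trade-off concrete: each extra "nine" cuts the allowed downtime by a factor of ten.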

Critical Infrastructure¶

Services that other services depend on. Examples:

- Authentication/authorization services
- API gateways
- Database proxies
- Service meshes
- Load balancers

Typical targets:

- Availability: 99.95% (~21 min/month downtime)
- Latency (p99): < 50ms
- Error rate: < 0.01%

Why these targets? Failures cascade across all dependent services, and any latency added here (for example, 1 second) is added to every downstream request.

Batch Jobs¶

Time-flexible, scheduled tasks. Examples:

- Data pipelines and ETL
- Report generation
- Nightly backups
- Async processing queues

Typical targets:

- Availability: 95% (~1.5 days/month downtime)
- Job duration (p95): < 1 hour
- Error rate: < 5%

Why these targets? Batch jobs are not time-critical for individual users. Retries and delayed execution are acceptable.

Background Jobs¶

Async workers that process queue items. Examples:

- Email delivery workers
- Notification systems
- Webhook handlers
- Async task processors

Typical targets:

- Availability: 98% (~14.4 hours/month downtime)
- Task latency (p95): < 5 minutes
- Error rate: < 1%

Why these targets? Users don't wait synchronously. Eventually-consistent delivery is acceptable. Retries and backoff are expected.

Quick Start¶

Basic SLO Check¶

Analyze code from stdin:

# Check a diff against main
git diff main | greybeard slo-check

# Check a file
greybeard slo-check --file service.py

# Check stdin directly
cat api.py | greybeard slo-check

Specify Service Type¶

# Explicit service type
greybeard slo-check --context "service-type:saas"

# With service name
greybeard slo-check --context "service-type:saas" --context "service-name:user-api"

# With user count context
greybeard slo-check \
  --context "service-type:saas" \
  --context "users:50000"

View Results¶

By default, results print as a table:

greybeard slo-check < api.py

Output:

Service Type: SaaS
Service Name: user-api
Confidence: 0.75

Targets:
┌──────────────────┬────────────┬─────────────────────┐
│ Metric           │ Target     │ Range               │
├──────────────────┼────────────┼─────────────────────┤
│ Availability     │ 99.9%      │ 99.5% - 99.95%      │
│ Latency (p99)    │ < 200ms    │ < 100ms - < 500ms   │
│ Error Rate       │ < 0.1%     │ < 0.05% - < 0.5%    │
└──────────────────┴────────────┴─────────────────────┘

Recommendations:
- Add timeouts to external HTTP calls
- Implement exponential backoff for retries

CLI Usage¶

Command Syntax¶

greybeard slo-check [OPTIONS] [FILE]

Options¶

Option	Type	Description
`--file`	PATH	File to analyze (instead of stdin)
`--context`	TEXT	Context flag (repeatable). Format: `key:value`
`--repo`	PATH	Repository root path for structure analysis
`--output`	TEXT	Output format: `table` (default), `json`, `markdown`
`--service-type`	TEXT	Override service type detection. One of: `saas`, `critical-infra`, `batch`, `background-jobs`
`--help`	-	Show help and exit

Context Flags¶

Context flags provide additional information for more accurate recommendations.

Common context flags:

# Service information
--context "service-type:saas"
--context "service-name:user-api"
--context "team:platform"

# Business context
--context "users:50000"
--context "critical:true"
--context "sla:99.99%"

# Technical context
--context "db:postgresql"
--context "cache:redis"
--context "queue:kafka"

Examples:

# SaaS with business context
greybeard slo-check \
  --context "service-type:saas" \
  --context "service-name:user-api" \
  --context "users:50000" \
  --context "critical:true"

# Batch job with timing context
greybeard slo-check \
  --context "service-type:batch" \
  --context "runs:nightly" \
  --context "max-duration:1h"

# Critical infrastructure
greybeard slo-check \
  --repo /path/to/auth-service \
  --context "service-type:critical-infra" \
  --context "downstream-services:12"
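
To make the `key:value` format concrete, here is a minimal sketch of how repeatable context flags could be turned into a dictionary. It only illustrates the format: `parse_context_flags` is a hypothetical helper, and this guide does not show how greybeard parses the flags internally.

```python
def parse_context_flags(flags: list[str]) -> dict[str, str]:
    """Parse repeatable `key:value` context flags into a dictionary."""
    context: dict[str, str] = {}
    for flag in flags:
        # Split on the first colon only, so values may contain colons themselves.
        key, sep, value = flag.partition(":")
        if not sep or not key.strip() or not value.strip():
            raise ValueError(f"Expected key:value, got {flag!r}")
        # A later flag with the same key overrides an earlier one (an assumption).
        context[key.strip()] = value.strip()
    return context


flags = ["service-type:saas", "users:50000", "sla:99.99%"]
print(parse_context_flags(flags))
```

Values stay as strings here, which matches the Python API example in this guide, where the user count is passed as `"50000"`.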

Output Formats¶

Table (Default)¶

Human-readable table output:

greybeard slo-check --output table < api.py

JSON¶

Structured JSON for parsing, integration, or automation:

greybeard slo-check --output json < api.py

Parse with jq:

greybeard slo-check --output json < api.py | jq '.targets[] | .metric'

Markdown¶

Documentation-ready markdown output:

greybeard slo-check --output markdown < api.py > docs/slo-targets.md

Great for adding to ADRs or runbooks.

Code Pattern Detection¶

The SLO Agent analyzes your code for patterns that impact reliability:

Pattern	Detected By	Impact	Signal
Database calls	SELECT, INSERT, query(), execute(), ORM methods	Latency, availability	Caching needed?
HTTP requests	requests, urllib, httpx, fetch, client calls	Timeouts, retries	Exponential backoff?
Caching	redis, memcached, @cache, lru_cache, cache decorators	Latency improvement	Cache hit rate?
Retry logic	retry, backoff, exponential, jitter	Failure recovery	Idempotent?
Error handling	try, except, error handlers, middleware	Reliability	Graceful degradation?
Async/await	async, await, asyncio, coroutines	Concurrency model	Resource pooling?
Monitoring	logging, prometheus, datadog, tracing, StatsD	Observability	Alert coverage?
Timeouts	timeout, deadline, ttl, max_time	Failure isolation	All calls covered?

The agent looks for these patterns and adjusts recommendations accordingly.
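
As a rough illustration of keyword-based detection, the sketch below scans source text with regular expressions drawn from the "Detected By" column. It is an assumption about how such a scan might look, not the agent's actual implementation. The `PATTERNS` table and `detect_patterns` function are hypothetical, and the key names mirror the `code_indicators` fields in the JSON output.

```python
import re

# Hypothetical keyword patterns, loosely based on the "Detected By" column above.
PATTERNS = {
    "has_db_calls": r"\b(SELECT|INSERT)\b|\.query\(|\.execute\(",
    "has_http_calls": r"\b(requests|urllib|httpx)\b|\bfetch\(",
    "has_caching": r"\b(redis|memcached|lru_cache)\b|@cache\b",
    "has_retry_logic": r"\b(retry|backoff|exponential|jitter)\b",
    "has_error_handling": r"\b(try|except)\b",
    "has_async": r"\b(async|await|asyncio)\b",
    "has_monitoring": r"\b(logging|prometheus|datadog|tracing|statsd|StatsD)\b",
    "has_timeout": r"\b(timeout|deadline|ttl|max_time)\b",
}


def detect_patterns(code: str) -> dict[str, bool]:
    """Return one boolean per pattern: True if any keyword appears in the code."""
    return {name: re.search(rx, code) is not None for name, rx in PATTERNS.items()}


sample = '''
async def get_users():
    try:
        return await db.query(User)
    except TimeoutError:
        logging.exception("query failed")
        raise
'''

signals = detect_patterns(sample)
for name, found in signals.items():
    print(f"{name}: {found}")
```

Keyword matching of this kind is cheap but shallow: it cannot tell whether a timeout covers every call or whether a retry is idempotent, which is why the table pairs each pattern with a follow-up question.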

SLO Targets by Service Type¶

SaaS¶

Metric	Target	Range	Downtime/month
Availability	99.9%	99.5% - 99.95%	~43 minutes
Latency (p99)	< 200ms	< 100ms - < 500ms	—
Error Rate	< 0.1%	< 0.05% - < 0.5%	—

Rationale:

- Users notice latency above 100ms and errors immediately
- User-facing impact directly affects the business
- Availability expectations are high because users depend on the service

Example targets for different user counts:

- < 1,000 users: Can target 99.5% (more tolerance)
- 1,000-50,000 users: Target 99.9% (typical SaaS)
- > 50,000 users: Consider 99.95% or higher
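
The user-count tiers above can be written as a small lookup. This is a sketch of the rule of thumb only; `suggested_availability` is a hypothetical name, and the agent may weigh user count differently alongside other signals.

```python
def suggested_availability(user_count: int) -> float:
    """Map a SaaS user count to the availability tier listed above (in percent)."""
    if user_count < 0:
        raise ValueError("user_count must not be negative")
    if user_count < 1_000:
        return 99.5   # small user base: more tolerance
    if user_count <= 50_000:
        return 99.9   # typical SaaS
    return 99.95      # large user base: consider 99.95% or higher


for users in (200, 50_000, 250_000):
    print(f"{users} users -> {suggested_availability(users)}%")
```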

Critical Infrastructure¶

Metric	Target	Range	Downtime/month
Availability	99.95%	99.9% - 99.99%	~21 minutes
Latency (p99)	< 50ms	< 10ms - < 100ms	—
Error Rate	< 0.01%	< 0.001% - < 0.1%	—

Rationale:

- Failures cascade across all dependent services
- Latency adds to every downstream request (multiplier effect)
- Very low error tolerance to prevent cascading failures

Examples:

- Auth service: 99.95% availability, < 30ms latency
- API gateway: 99.99% availability, < 20ms latency
- Service mesh: 99.95% availability, < 50ms latency

Batch Jobs¶

Metric	Target	Range	Downtime/month
Availability	95%	90% - 99%	~1.5 days
Duration (p95)	< 1 hour	< 30min - < 4 hours	—
Error Rate	< 5%	< 1% - < 10%	—

Rationale:

- Time-flexible; users don't wait synchronously
- Transient failures are acceptable and retryable
- Batch windows provide flexibility for recovery

Examples:

- Data pipeline: 95% success, < 2 hours duration
- Report generation: 95% success, < 1 hour duration
- Nightly backups: 99% success (more critical), < 4 hours

Background Jobs¶

Metric	Target	Range	Downtime/month
Availability	98%	95% - 99.5%	~14.4 hours
Latency (p95)	< 5 min	< 1min - < 30min	—
Error Rate	< 1%	< 0.1% - < 5%	—

Rationale:

- Users don't wait synchronously for results
- Eventually-consistent delivery is acceptable
- Async workers can retry and apply backoff

Examples:

- Email delivery: 98% success, < 5 min delivery
- Notifications: 98% success, < 10 min delivery
- Webhook handlers: 95% success (retryable), < 5 min processing

Analysis Output¶

JSON Structure¶

When using --output json, the output contains:

{
  "service_type": "saas",
  "service_name": "user-api",
  "targets": [
    {
      "metric": "availability",
      "target": "99.9%",
      "range": ["99.5%", "99.95%"],
      "rationale": "User-facing service..."
    }
  ],
  "context_signals": {
    "code_indicators": {
      "has_db_calls": true,
      "has_http_calls": true,
      "has_caching": true,
      "has_retry_logic": true,
      "has_error_handling": true,
      "has_async": false,
      "has_monitoring": true,
      "has_timeout": true
    },
    "repo_structure": {
      "has_tests": true,
      "test_count": 156,
      "has_docker": true,
      "has_k8s": true,
      "has_prometheus_metrics": true
    }
  },
  "confidence": 0.75,
  "notes": [
    "Database calls detected with caching — good pattern",
    "HTTP calls have retries and timeouts — solid",
    "Add more granular error handling"
  ],
  "recommendations": [
    {
      "category": "monitoring",
      "level": "warning",
      "message": "No explicit latency metrics detected"
    }
  ]
}
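
For automation, the JSON document can be reduced to the few fields a script usually needs. The sketch below works on an abbreviated sample in the shape shown above; the `summarize` helper is hypothetical, and real output may carry additional fields.

```python
import json

# Abbreviated sample in the shape shown above; real output may carry more fields.
raw = """
{
  "service_type": "saas",
  "targets": [
    {"metric": "availability", "target": "99.9%", "range": ["99.5%", "99.95%"]}
  ],
  "context_signals": {
    "code_indicators": {"has_db_calls": true, "has_async": false}
  },
  "confidence": 0.75,
  "recommendations": [
    {"category": "monitoring", "level": "warning",
     "message": "No explicit latency metrics detected"}
  ]
}
"""


def summarize(result: dict) -> dict:
    """Collect targets by metric, indicators that were not detected, and warnings."""
    indicators = result.get("context_signals", {}).get("code_indicators", {})
    return {
        "targets": {t["metric"]: t["target"] for t in result.get("targets", [])},
        "missing": sorted(name for name, present in indicators.items() if not present),
        "warnings": [
            r["message"]
            for r in result.get("recommendations", [])
            if r.get("level") == "warning"
        ],
    }


summary = summarize(json.loads(raw))
print(summary)
```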

Confidence Scoring¶

Confidence reflects how certain the agent is in its recommendations (0.0 to 1.0):

0.9+ — Highly confident. Clear code patterns and repo structure align with service type.
0.75-0.9 — Confident. Most patterns detected, some ambiguity.
0.5-0.75 — Moderate. Limited patterns detected; context helps.
< 0.5 — Low confidence. Insufficient signals; user should validate.

Factors affecting confidence:

- ✅ Code patterns match service type expectations
- ✅ Repository structure is complete (tests, monitoring, deployment)
- ✅ Explicit service type provided via context
- ❌ Generic code with few patterns detected
- ❌ Missing tests or monitoring setup
- ❌ No explicit context provided
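
When consuming the score in scripts, it can help to translate it back into the bands listed above. A minimal sketch follows; `confidence_band` is a hypothetical helper. Because the listed ranges share their endpoints, it assumes each band includes its lower bound.

```python
def confidence_band(score: float) -> str:
    """Map a confidence score (0.0 to 1.0) to the band names used in this guide."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= 0.9:
        return "highly confident"
    if score >= 0.75:
        return "confident"
    if score >= 0.5:
        return "moderate"
    return "low confidence"


for score in (0.95, 0.75, 0.6, 0.3):
    print(f"{score:.2f} -> {confidence_band(score)}")
```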

Integration with greybeard¶

The SLO Agent integrates with greybeard's content pack system. Load SLO-specific packs for domain-specific guidance:

Available SLO Content Packs¶

slo-saas — User-facing SaaS perspective and heuristics
slo-critical-infra — Platform and gateway SLO thinking
slo-batch — Batch job and scheduled task guidance
slo-background-jobs — Async worker and queue guidance

Using SLO Packs¶

Combine SLO check results with pack-based analysis:

# Run SLO agent
greybeard slo-check --output json < service.py > slo_targets.json

# Run analysis with SLO pack
greybeard analyze --pack slo-saas < service.py

# Combine both for full assessment
greybeard slo-check --context "service-type:saas" < service.py
greybeard analyze --pack slo-saas < service.py

Custom Packs¶

Create a custom pack for your team's SLO philosophy:

teams/slo-mycompany.yaml:

pack_name: slo-mycompany
description: Company SLO philosophy and targets

perspectives:
  - name: slo-assessment
    context: |
      Our SaaS targets are:
      - Availability: 99.9% (peak hours), 99.5% (off-peak)
      - Latency: p99 < 150ms for user-facing, < 50ms for critical infra
      - Error rate: < 0.05% user-facing, < 0.01% critical infra

      Consider these when setting SLOs:
      1. Is this service user-facing or internal?
      2. What are downstream dependencies?
      3. How would 1-hour downtime impact users?
      4. Can failures be retried idempotently?

Load with:

greybeard analyze --pack slo-mycompany < service.py

Python API¶

Use the SLO Agent programmatically:

from greybeard.agents import SLOAgent

# Create agent
agent = SLOAgent()

# Analyze code
recommendation = agent.analyze(
    code_snippet="""
    @app.get("/api/users")
    async def get_users():
        users = await db.query(User)
        return {"users": users}
    """,
    service_type="saas",
    context={
        "service-name": "user-api",
        "users": "50000",
    }
)

# Access results
print(f"Service type: {recommendation.service_type}")
print(f"Confidence: {recommendation.confidence}")

# Iterate targets
for target in recommendation.targets:
    print(f"\n{target.metric}:")
    print(f"  Target: {target.target}")
    print(f"  Range: {target.range}")
    print(f"  Rationale: {target.rationale}")

# Get JSON for integration
import json
data = recommendation.to_dict()
print(json.dumps(data, indent=2))

# Access code signals
signals = recommendation.context_signals
print(f"Has DB calls: {signals['code_indicators']['has_db_calls']}")
print(f"Has caching: {signals['code_indicators']['has_caching']}")

# Get recommendations
for rec in recommendation.recommendations:
    print(f"[{rec.level}] {rec.message}")

Common Workflows¶

Batch analysis of multiple files:

from pathlib import Path
from greybeard.agents import SLOAgent

agent = SLOAgent()

for py_file in Path("src").glob("**/*.py"):
    with open(py_file) as f:
        code = f.read()

    rec = agent.analyze(code)
    if rec.confidence < 0.5:
        print(f"⚠️  {py_file}: Low confidence ({rec.confidence:.2f})")

Integration with CI/CD:

from greybeard.agents import SLOAgent
import sys
import json

agent = SLOAgent()

# Analyze proposed changes
recommendation = agent.analyze(
    code_snippet=sys.stdin.read(),
    service_type="saas"
)

# Fail the build if the agent's confidence is below the threshold
if recommendation.confidence < 0.7:
    print(f"Confidence too low: {recommendation.confidence}")
    sys.exit(1)

# Output for workflow
print(json.dumps(recommendation.to_dict()))

Testing & Validation¶

Local Testing¶

Test against your own services:

# Test current service
git diff main | greybeard slo-check --context "service-type:saas"

# Test API code
find src -name "*.py" -exec sh -c '
  echo "=== $1 ===" && greybeard slo-check --file "$1"
' _ {} \;

# Compare outputs
greybeard slo-check --output json < api.py > before.json
# Make changes...
greybeard slo-check --output json < api.py > after.json
diff before.json after.json
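
A plain `diff` of the two JSON files also reports changes in notes and ordering. To compare only the targets, a small script can help. The sketch below assumes the JSON shape documented in this guide, and `diff_targets` is a hypothetical helper. It uses inline dictionaries in place of the parsed files.

```python
def diff_targets(before: dict, after: dict) -> dict[str, tuple]:
    """Return {metric: (old_target, new_target)} for every target that changed."""
    old = {t["metric"]: t["target"] for t in before.get("targets", [])}
    new = {t["metric"]: t["target"] for t in after.get("targets", [])}
    return {
        metric: (old.get(metric), new.get(metric))
        for metric in sorted(old.keys() | new.keys())
        if old.get(metric) != new.get(metric)
    }


# Stand-ins for json.load(open("before.json")) and json.load(open("after.json")).
before = {"targets": [{"metric": "availability", "target": "99.9%"},
                      {"metric": "error_rate", "target": "< 0.1%"}]}
after = {"targets": [{"metric": "availability", "target": "99.95%"},
                     {"metric": "error_rate", "target": "< 0.1%"}]}

changes = diff_targets(before, after)
print(changes)
```

A metric that appears on only one side shows up with `None` as its missing value.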

Test Coverage¶

The SLO Agent has 93% test coverage:

# Run SLO Agent tests
pytest tests/test_slo_agent.py -v

# Run with coverage report
pytest tests/test_slo_agent.py --cov=greybeard.agents.slo_agent

# See detailed coverage
pytest tests/test_slo_agent.py --cov=greybeard.agents.slo_agent --cov-report=html
# Open htmlcov/index.html

Test categories (37 tests):

- Basic functionality: Initialization, analysis, serialization
- Service type detection: Auto-detect SaaS, batch, critical-infra, background
- SLO target generation: Correct targets per service type
- Code analysis: Database, HTTP, caching, retry, error handling patterns
- Repository structure: Docker, Kubernetes, tests, metrics
- Context integration: Service name, explicit type, multiple flags
- Recommendations: Note generation for missing patterns
- Confidence scoring: Appropriate confidence ranges
- CLI integration: Command registration and invocation
- Serialization: JSON round-trip

Examples¶

Example 1: User-Facing REST API¶

# Analyze user API
cat src/apis/user.py | greybeard slo-check \
  --context "service-type:saas" \
  --context "service-name:user-api" \
  --context "users:50000"

Expected output:

Service Type: SaaS
Confidence: 0.82

Targets:
┌──────────────────┬────────────┬─────────────────────┐
│ Metric           │ Target     │ Range               │
├──────────────────┼────────────┼─────────────────────┤
│ Availability     │ 99.9%      │ 99.5% - 99.95%      │
│ Latency (p99)    │ < 200ms    │ < 100ms - < 500ms   │
│ Error Rate       │ < 0.1%     │ < 0.05% - < 0.5%    │
└──────────────────┴────────────┴─────────────────────┘

Code Signals:
✅ Database calls with caching detected
✅ HTTP calls with retry logic
✅ Error handling in place
⚠️  No timeout on external calls — add!
⚠️  Limited latency instrumentation

Example 2: Batch Data Pipeline¶

# Analyze batch job
cat src/pipelines/nightly_etl.py | greybeard slo-check \
  --context "service-type:batch" \
  --context "runs:nightly" \
  --context "max-duration:1h"

Expected output:

Service Type: Batch
Confidence: 0.71

Targets:
┌──────────────────┬────────────┬─────────────────────┐
│ Metric           │ Target     │ Range               │
├──────────────────┼────────────┼─────────────────────┤
│ Availability     │ 95%        │ 90% - 99%           │
│ Duration (p95)   │ < 1 hour   │ < 30min - < 4 hours │
│ Error Rate       │ < 5%       │ < 1% - < 10%        │
└──────────────────┴────────────┴─────────────────────┘

Recommendations:
- Add idempotent handling for retries
- Implement dead-letter queue for failed items
- Add progress checkpointing for long runs

Example 3: Authentication Service¶

# Analyze auth service
greybeard slo-check \
  --repo /path/to/auth-service \
  --context "service-type:critical-infra" \
  --context "service-name:auth" \
  --context "downstream-services:12"

Expected output:

Service Type: Critical Infrastructure
Confidence: 0.88

Targets:
┌──────────────────┬────────────┬─────────────────────┐
│ Metric           │ Target     │ Range               │
├──────────────────┼────────────┼─────────────────────┤
│ Availability     │ 99.95%     │ 99.9% - 99.99%      │
│ Latency (p99)    │ < 50ms     │ < 10ms - < 100ms    │
│ Error Rate       │ < 0.01%    │ < 0.001% - < 0.1%   │
└──────────────────┴────────────┴─────────────────────┘

Critical Signals:
✅ 12 downstream services depend on this
✅ Comprehensive monitoring and tracing
⚠️  Latency budget tight — optimize cache
🔴 No circuit breaker on external auth provider

Troubleshooting¶

Low Confidence Score¶

If you're getting confidence < 0.5:

Check service type: Ensure --context "service-type:..." is correct
Add context: More context flags improve confidence
Verify code patterns: Add monitoring, error handling, timeouts
Run from repo root: --repo . helps detect structure

# Debug: see what was detected
greybeard slo-check --output json < api.py | jq '.context_signals'

Unexpected Targets¶

If targets don't match your expectations:

Verify service type: Is it correctly auto-detected or explicitly set?
Check code patterns: Run with --output json to see detected signals
Add context: Business context (user count, criticality) matters
Review rationale: Each target has a rationale explaining the choice

Missing Recommendations¶

If no recommendations appear:

Your code already has good patterns!
Or detection might be missing some patterns
Review the code signals in JSON output to see what was detected

Best Practices¶

1. Set SLOs Early¶

Don't wait until production to think about SLOs. Use the SLO Agent early:

# During design/prototyping
git diff main | greybeard slo-check --context "service-type:saas"

# Before first deployment
greybeard slo-check \
  --repo . \
  --context "service-type:saas" \
  --context "launch:critical"

2. Document SLO Decisions¶

Save SLO recommendations to your ADR:

greybeard slo-check --output markdown < api.py > docs/adr/slo-targets.md

Then add to your ADR template:

## SLO Targets

**Generated by SLO Agent on 2024-03-21**

[paste markdown output here]

**Team decision**: [Accept/Modify/Override with justification]

3. Validate Code Patterns¶

Before deploying, ensure code has the patterns your SLOs require:

SaaS 99.9%: Needs retries, timeouts, error handling ✅
Critical infra: Needs minimal latency, comprehensive monitoring ✅
Batch: Needs idempotent retries and DLQ ✅

Use the context_signals output to validate:

greybeard slo-check --output json < api.py | jq '.context_signals.code_indicators'
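
The same check can be scripted against the `code_indicators` object. The mapping below is an illustrative assumption derived from the list above, and `missing_indicators` is a hypothetical helper. Requirements such as idempotency, a dead-letter queue, or minimal latency have no matching indicator in the documented output and still need manual review.

```python
# Hypothetical mapping from service type to the indicators it should show.
REQUIRED_INDICATORS = {
    "saas": ["has_retry_logic", "has_timeout", "has_error_handling"],
    "critical-infra": ["has_monitoring"],
    "batch": ["has_retry_logic"],
}


def missing_indicators(service_type: str, code_indicators: dict) -> list[str]:
    """List required indicators that are absent or false for the service type."""
    required = REQUIRED_INDICATORS.get(service_type, [])
    return [name for name in required if not code_indicators.get(name, False)]


indicators = {"has_retry_logic": True, "has_timeout": False, "has_error_handling": True}
print(missing_indicators("saas", indicators))
```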

4. Iterate with Your Team¶

Use greybeard's mentor or coach mode (--mode mentor or --mode coach) for deeper discussion:

# Get mentoring on SLO philosophy
echo "We're building a user-facing API" | greybeard analyze --mode mentor

# Get help explaining to stakeholders
greybeard analyze --mode coach --context "audience:executives"

5. Monitor Against SLOs¶

Once targets are set, monitor them:

# prometheus/recording_rules.yaml
groups:
  - name: slo.saas
    interval: 30s
    rules:
      - record: slo:availability:4w
        expr: (1 - rate(errors_total[4w]) / rate(requests_total[4w])) * 100

      - record: slo:latency:p99:4w
        expr: histogram_quantile(0.99, rate(request_duration_seconds_bucket[4w]))

      - record: slo:error_rate:4w
        expr: rate(errors_total[4w]) / rate(requests_total[4w]) * 100

FAQ¶

Q: Should all services have 99.9% availability?

No. SLOs should match business needs:

- Low-traffic internal tools: 95% is fine
- User-facing: 99.9% is typical
- Critical infrastructure: 99.95%+

Q: What if my code doesn't match the detected service type?

Explicitly specify with --context "service-type:...". The agent auto-detects but can be overridden.

Q: How do I set SLOs for a microservice mesh?

Set SLOs per service:

1. SaaS frontend: 99.9%
2. Critical infra services (auth, gateway): 99.95%
3. Backend workers: 98%

The compound availability might be lower, so design accordingly.
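
To see how much lower, multiply the individual availabilities. The sketch below assumes a request needs every listed service and that failures are independent. Both are simplifications, and `compound_availability` is a hypothetical helper.

```python
import math


def compound_availability(availabilities_pct: list[float]) -> float:
    """Availability (in percent) of a path that needs every listed service."""
    # With independent failures, the probabilities of each service being up multiply.
    return math.prod(a / 100 for a in availabilities_pct) * 100


# Frontend, auth/gateway, and backend workers from the answer above.
path = [99.9, 99.95, 98.0]
print(f"Compound availability: {compound_availability(path):.2f}%")
```

For the three targets above, the product is roughly 97.85%, lower than any single service's target, so the weakest dependency dominates the result.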

Q: Can I use different SLOs for different time periods?

Yes! Use time-based SLOs:

- Peak hours (9am-5pm): 99.95% (stricter)
- Off-peak: 99.5% (relaxed)
- Maintenance windows: excluded from SLO calculations
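
A time-based target can be expressed as a function of the timestamp being evaluated. This is a sketch using the example hours and targets above. The names are hypothetical, timestamps are assumed to be in the service's local time, and maintenance windows are passed in as a flag rather than looked up.

```python
from datetime import datetime
from typing import Optional

PEAK_START_HOUR = 9   # 9am, from the example above
PEAK_END_HOUR = 17    # 5pm, treated as exclusive


def availability_target(moment: datetime, in_maintenance: bool = False) -> Optional[float]:
    """Return the availability target (percent) that applies at `moment`.

    Returns None during maintenance windows, which are excluded from SLO
    calculations.
    """
    if in_maintenance:
        return None
    if PEAK_START_HOUR <= moment.hour < PEAK_END_HOUR:
        return 99.95  # peak hours: stricter
    return 99.5       # off-peak: relaxed


print(availability_target(datetime(2024, 3, 21, 10, 30)))
print(availability_target(datetime(2024, 3, 21, 22, 0)))
```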

Q: How do I track error budgets?

Use monitoring:

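# If the target is 99.9%, the error budget is 0.1%
error_budget_threshold = (1 - 0.999) * 100  # 0.1%
actual_error_rate = 0.04  # hypothetical measured error rate over the SLO window, in percent
remaining_budget = error_budget_threshold - actual_error_rate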

When approaching zero budget, freeze deployments and focus on reliability.
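
Error budgets are often easier to reason about in request counts than in percentages. The sketch below shows that view, assuming request and failure totals for the SLO window are available from your monitoring system. `error_budget_report` is a hypothetical helper, and the numbers are made up for illustration.

```python
def error_budget_report(slo_target_pct: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget usage for one SLO window."""
    if not 0 < slo_target_pct < 100:
        raise ValueError("slo_target_pct must be strictly between 0 and 100")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Failures the SLO tolerates over the window.
    allowed_failures = total_requests * (100 - slo_target_pct) / 100
    consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,        # 1.0 means the budget is exhausted
        "budget_remaining": 1 - consumed,   # negative once the SLO is breached
    }


# Illustrative numbers: 1,000,000 requests in the window, 400 of them failed.
report = error_budget_report(99.9, 1_000_000, 400)
print(f"Allowed failures: {report['allowed_failures']:.0f}")
print(f"Budget remaining: {report['budget_remaining']:.0%}")
```

A negative `budget_remaining` means the window has already breached its SLO.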

Next Steps¶

Read Creating Agents for custom analysis agents
Review Content Packs to build custom SLO packs
Check CLI Reference for all commands
See Interactive Mode for deeper exploration