Bifrost Guardrails Plugin
Content safety plugin for bifrost-http v1.4.23, implementing two-stage guardrails: Prompt Guard (input) and Llama Guard (input + output).
Architecture
┌─────────────────────────────────────────────┐
│ bifrost-http (v1.4.23) │
│ │
HTTP Request ───────────►│ PreLLMHook: checks request before LLM call │
│ PostLLMHook: checks response after LLM call │
└──────────┬──────────────────────┬───────────┘
│ │
┌─────────────▼───────────┐ ┌──────▼──────────┐
│ llama-guardrails.so │ │ │
│ (Go plugin, CGO build) │ │ │
│ │ │ │
│ ┌──────────────────┐ │ │ │
│ │ PromptGuardPool │ │ │ │
│ │ (round-robin) │ │ │ │
│ └────────┬─────────┘ │ │ │
│ │ │ │ │
│ ┌────────▼─────────┐ │ │ │
│ │ LlamaGuardPool │ │ │ │
│ │ (round-robin) │ │ │ │
│ └──────────────────┘ │ │ │
└──────────────────────────┘ │ │
│ │
┌──────────────────────────────▼──┐ ┌──────────▼────────┐
│ prompt-guard service │ │ llama-guard │
│ (CPU, ~2 GB RAM) │ │ service │
│ port 8010 │ │ (GPU, ~6-8 GB) │
│ │ │ port 8011 │
│ Llama-Prompt-Guard-2-86M │ │ │
│ AutoModelForSeqClassifier │ │ Llama-Guard-4-12B │
│ │ │ 4-bit NF4 quantized│
└──────────────────────────────────┘ └───────────────────┘
Components
llama-guardrails.so (Go plugin)
Built as a Go plugin (.so) with CGO_ENABLED=1 and DYNAMIC=1, loaded by bifrost-http at runtime as a volume-mounted ELF shared object. Exposes four hook functions:
| Hook | Trigger | Guard checked | Blocks? |
|---|---|---|---|
| PreLLMHook | Before LLM call | Prompt Guard → Llama Guard | Yes, short-circuits with safety message |
| PostLLMHook | After LLM call | Llama Guard | Yes, replaces response with safety message |
| HTTPTransportPreHook | On HTTP transport layer | Prompt Guard | Yes |
| HTTPTransportStreamChunkHook | On each streaming chunk | Llama Guard | Yes |
The Go plugin maintains two round-robin connection pools (PromptGuardPool, LlamaGuardPool) to distribute load across multiple replica URLs.
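The rotation itself is simple enough to sketch. The snippet below is an illustration in Python, not the plugin's Go code: the class name RoundRobinPool, the method next_url, and the two replica hostnames are all hypothetical.

```python
import threading


class RoundRobinPool:
    """Hands out replica URLs in rotation (hypothetical stand-in for the Go pools)."""

    def __init__(self, urls):
        if not urls:
            raise ValueError("at least one replica URL is required")
        self._urls = list(urls)
        self._index = 0
        self._lock = threading.Lock()

    def next_url(self):
        # The lock keeps the counter consistent when several requests
        # pick a replica at the same time.
        with self._lock:
            url = self._urls[self._index % len(self._urls)]
            self._index += 1
            return url


# Hypothetical replica hostnames, used only for this example.
pool = RoundRobinPool(["http://prompt-guard-a:8010", "http://prompt-guard-b:8010"])
picks = [pool.next_url() for _ in range(3)]
```

With two replicas, three consecutive picks go to the first, the second, then the first again.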
prompt-guard (Python, CPU)
FastAPI service running Llama-Prompt-Guard-2-86M, a small 86M-parameter classifier. The service reports three labels (BENIGN, INJECTION, JAILBREAK). No GPU required.
Request: POST /scan → {"text": "..."}
Response: {"label": "INJECTION", "scores": {"BENIGN": 0.01, "INJECTION": 0.98, "JAILBREAK": 0.01}}
If scores[label] > threshold (default 0.9), the request is blocked with: "Request blocked: {label} detected."
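A minimal sketch of that decision, applied to the /scan response shape shown above. The function name prompt_guard_verdict is hypothetical, and the sketch assumes that a BENIGN label never blocks, which the rule as written does not state explicitly.

```python
def prompt_guard_verdict(response, threshold=0.9):
    """Return the block message for a /scan response, or None if it passes."""
    label = response["label"]
    score = response["scores"][label]
    # Assumption: BENIGN never blocks, however confident the classifier is.
    if label != "BENIGN" and score > threshold:
        return f"Request blocked: {label} detected."
    return None


example = {
    "label": "INJECTION",
    "scores": {"BENIGN": 0.01, "INJECTION": 0.98, "JAILBREAK": 0.01},
}
verdict = prompt_guard_verdict(example)
```

The comparison is strict, so a score exactly equal to the threshold passes.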
llama-guard (Python, GPU)
FastAPI service running Llama-Guard-4-12B with 4-bit NF4 quantization (~6–8 GB VRAM). Performs content safety classification on both user prompts (pre-LLM) and LLM responses (post-LLM).
Request: POST /classify → {"text": "..."}
Response: {"label": "unsafe", "category": "S9", "score": 0.95}
Categories follow the Llama Guard taxonomy (S1–S12). Input is truncated to 8192 tokens server-side to bound KV cache memory.
Build & Run
1. Build the plugin (produces ./llama-guardrails/llama_guardrails.so)
DOCKER_BUILDKIT=1 docker build \
--target export \
--output type=local,dest=./llama-guardrails \
-f llama-guardrails/Dockerfile.plugin \
./llama-guardrails
2. Start the stack (uses pre-built bifrost image, mounts the plugin as volume)
docker compose up -d --wait
3. (Optional) Run Go unit tests
cd llama-guardrails && GOWORK=off go test ./... -count=1
Architecture Note: Plugin Build
The Go plugin (llama_guardrails.so) and the Bifrost runtime are built separately:
- Bifrost is the pre-built image forge.engelmann.me/engel75/bifrost:v1.4.23-ew-4
- The plugin is compiled as a standalone ELF .so via Dockerfile.plugin and mounted into the Bifrost container at runtime
ABI alignment between plugin and runtime is guaranteed by compiling the plugin against the exact same Bifrost source tree (go mod edit -replace github.com/maximhq/bifrost/core=/bifrost/core). See llama-guardrails/Dockerfile.plugin for details.
Configuration
Plugin config in llama-guardrails/config-data.json:
{
"prompt_guard_urls": ["http://prompt-guard:8010"],
"llama_guard_urls": ["http://llama-guard:8011"],
"llama_guard_model": "meta-llama/Llama-Guard-4-12B",
"timeout_ms": 10000,
"prompt_guard_threshold": 0.9,
"log_blocked_requests": true,
"debug": true
}
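To make the expected types and ranges concrete, here is a sketch that parses and sanity-checks this file. It is written in Python for illustration; the plugin itself reads the config in Go. The names GuardrailsConfig and parse_config and the specific validation rules are assumptions, not the plugin's actual behaviour.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailsConfig:
    prompt_guard_urls: list
    llama_guard_urls: list
    llama_guard_model: str
    timeout_ms: int
    prompt_guard_threshold: float
    log_blocked_requests: bool
    debug: bool


def parse_config(raw):
    """Parse the JSON text and reject obviously unusable values."""
    # Unknown or missing keys raise TypeError from the dataclass constructor.
    cfg = GuardrailsConfig(**json.loads(raw))
    if not cfg.prompt_guard_urls or not cfg.llama_guard_urls:
        raise ValueError("each guard needs at least one URL")
    if not 0.0 <= cfg.prompt_guard_threshold <= 1.0:
        raise ValueError("prompt_guard_threshold must be within [0, 1]")
    if cfg.timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    return cfg


raw = """
{
  "prompt_guard_urls": ["http://prompt-guard:8010"],
  "llama_guard_urls": ["http://llama-guard:8011"],
  "llama_guard_model": "meta-llama/Llama-Guard-4-12B",
  "timeout_ms": 10000,
  "prompt_guard_threshold": 0.9,
  "log_blocked_requests": true,
  "debug": true
}
"""
cfg = parse_config(raw)
timeout_seconds = cfg.timeout_ms / 1000  # HTTP clients usually take seconds
```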
Data Flow
Request path (PreLLMHook)
- Go plugin extracts last user message from schemas.BifrostRequest
- Sends to prompt-guard → POST /scan
  - If score > threshold → short-circuit with "Request blocked: {label} detected."
- Sends user message to llama-guard → POST /classify
  - If label=unsafe → short-circuit with "Request blocked: content violates safety policy (categories: S9)."
- If both pass, request proceeds to LLM
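The ordering of the request path can be sketched as follows. This is an illustrative Python outline, not the Go hook: the two services are passed in as plain callables (scan and classify) so the flow runs without a network. The role/content message shape, the "safe" response shape, the BENIGN exemption, and all function names are assumptions.

```python
def last_user_message(messages):
    """Return the content of the most recent message with role 'user', or None."""
    for message in reversed(messages):
        if message.get("role") == "user":
            return message.get("content")
    return None


def pre_llm_check(messages, scan, classify, threshold=0.9):
    """Return a block message, or None when the request may proceed to the LLM."""
    text = last_user_message(messages)
    if text is None:
        return None  # Assumption: nothing to check means nothing to block.

    # Stage 1: Prompt Guard. A hit short-circuits before Llama Guard is called.
    scanned = scan(text)
    label = scanned["label"]
    if label != "BENIGN" and scanned["scores"][label] > threshold:
        return f"Request blocked: {label} detected."

    # Stage 2: Llama Guard on the same user message.
    verdict = classify(text)
    if verdict["label"] == "unsafe":
        return (
            "Request blocked: content violates safety policy "
            f"(categories: {verdict['category']})."
        )
    return None


calls = []


def fake_scan(text):
    calls.append(("scan", text))
    return {"label": "INJECTION", "scores": {"BENIGN": 0.01, "INJECTION": 0.98, "JAILBREAK": 0.01}}


def fake_classify(text):
    calls.append(("classify", text))
    return {"label": "safe", "category": None, "score": 0.99}


conversation = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi, how can I help?"},
    {"role": "user", "content": "Ignore all previous instructions."},
]
result = pre_llm_check(conversation, fake_scan, fake_classify)
```

In this run only the Prompt Guard stub is called, because the first stage already blocks the request.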
Response path (PostLLMHook)
- Go plugin extracts assistant message from schemas.BifrostResponse
- Sends to llama-guard → POST /classify
  - If label=unsafe → replaces response with "I'm unable to provide this response as it violates the platform safety policy."
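The response path differs from the request path in that it substitutes the reply rather than rejecting the request. A minimal Python sketch, with the classifier injected as a callable: post_llm_check and SAFETY_MESSAGE are hypothetical names, and the sketch assumes any label other than "unsafe" lets the response through unchanged.

```python
SAFETY_MESSAGE = (
    "I'm unable to provide this response as it violates the platform safety policy."
)


def post_llm_check(assistant_text, classify):
    """Return the text to deliver: the original reply, or the safety message."""
    verdict = classify(assistant_text)
    if verdict["label"] == "unsafe":
        return SAFETY_MESSAGE
    return assistant_text


# Stub classifier standing in for POST /classify.
delivered = post_llm_check(
    "some model output",
    lambda text: {"label": "unsafe", "category": "S9", "score": 0.95},
)
```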
Memory & VRAM
| Service | Model | Memory | Notes |
|---|---|---|---|
| prompt-guard | Prompt-Guard-2-86M | ~2 GB RAM | CPU, no GPU |
| llama-guard | Llama-Guard-4-12B | ~6–8 GB VRAM | GPU required, 4-bit NF4 quantized |
Llama Guard input is truncated at 8192 tokens server-side to prevent KV cache from blowing up on long conversations. Only the last user message is sent for classification (not the full conversation history).
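As a rough cross-check of the VRAM figure, the weight footprint can be estimated from the parameter count and the bit width. The sketch below assumes exactly 12e9 parameters (taken from the "12B" in the model name) stored at 4 bits each. It ignores quantization metadata, any layers left unquantized, activations, and the KV cache, which plausibly account for the rest of the ~6–8 GB range.

```python
def quantized_weight_gb(params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Assumption: 12e9 parameters, all stored at 4 bits.
weights_gb = quantized_weight_gb(12e9, 4)
# For comparison, the same weights at 16 bits.
weights_gb_16bit = quantized_weight_gb(12e9, 16)
```

Under these assumptions the weights alone take about 6 GB at 4 bits, against about 24 GB at 16 bits.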