# Architecture

`use-local-llm` has a layered architecture with a clear separation between streaming I/O, response parsing, and React state management.
## Layer Diagram

```
┌─────────────────────────────────────────────┐
│              React Hooks Layer              │
│                                             │
│  useOllama ──► useLocalLLM    useModelList  │
│  useStreamCompletion                        │
├─────────────────────────────────────────────┤
│           Stream Utilities Layer            │
│                                             │
│  streamChat()        streamGenerate()       │
│  parseStreamChunk()  readStream()           │
├─────────────────────────────────────────────┤
│           Endpoint Configuration            │
│                                             │
│  ENDPOINTS  CHAT_PATHS  GENERATE_PATHS      │
│  MODEL_LIST_PATHS  detectBackend()          │
├─────────────────────────────────────────────┤
│             Browser Fetch API               │
│                                             │
│  fetch() + ReadableStream + TextDecoder     │
└─────────────────────────────────────────────┘
```
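The bottom layer can be illustrated as a small async generator over the Fetch API primitives. This is a hedged sketch, not the library's actual `readStream()` implementation; the function name `readTextChunks` is illustrative.

```typescript
// Minimal sketch of the bottom layer: turn a fetch Response body into an
// async stream of decoded text chunks. Uses only Web-standard globals
// (ReadableStream, TextDecoder), available in modern browsers and Node 18+.
async function* readTextChunks(
  body: ReadableStream<Uint8Array>
): AsyncGenerator<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true prevents multi-byte UTF-8 sequences split across
      // chunk boundaries from being corrupted
      yield decoder.decode(value, { stream: true });
    }
  } finally {
    reader.releaseLock();
  }
}
```

Because everything above this layer consumes async iterables, the whole stack stays dependency-free and cancellable via the reader.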
## Data Flow

### Chat Flow (`useLocalLLM` / `useOllama`)

```
User calls send("Hello")
│
├── 1. Append user message to state
├── 2. Append empty assistant message
├── 3. Build API messages (with system prompt)
├── 4. Auto-detect backend from URL port
├── 5. Call streamChat() with AbortController
│   │
│   ├── POST to endpoint (e.g. /api/chat)
│   ├── Read response.body as stream
│   ├── Decode chunks with TextDecoder
│   ├── Split by newlines, parse each line
│   └── yield StreamChunk { content, done, model }
│
├── 6. For each chunk:
│   ├── Accumulate content
│   ├── Call onToken callback
│   └── Update assistant message in state
│
└── 7. On completion: call onResponse callback
```
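Step 4 above can be sketched as a port lookup. This is a hypothetical reconstruction: the port-to-backend mapping assumes the servers' common defaults (Ollama on 11434, LM Studio on 1234, llama.cpp on 8080), and the library's actual `detectBackend()` may use different rules.

```typescript
type Backend = "ollama" | "lmstudio" | "llamacpp";

// Hypothetical sketch of port-based backend detection (step 4).
// Port defaults assumed: Ollama 11434, LM Studio 1234, llama.cpp 8080.
function detectBackendFromUrl(endpoint: string): Backend | undefined {
  const port = new URL(endpoint).port;
  switch (port) {
    case "11434": return "ollama";
    case "1234":  return "lmstudio";
    case "8080":  return "llamacpp";
    default:      return undefined; // caller falls back to an explicit backend option
  }
}
```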
### Completion Flow (`useStreamCompletion`)

```
User calls start()
│
├── 1. Reset text and tokens
├── 2. Auto-detect backend
├── 3. Call streamGenerate() with AbortController
│   │
│   ├── POST to endpoint (e.g. /api/generate)
│   └── yield StreamChunks...
│
├── 4. For each chunk:
│   ├── Accumulate text
│   ├── Push token to tokens array
│   └── Call onToken callback
│
└── 5. On completion: call onComplete callback
```
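Steps 4–5 of the completion flow reduce to consuming an async iterable of chunks. The following is an illustrative sketch (the `consumeCompletion` helper and the `StreamChunk` field set are assumptions drawn from the flow above, not the library's exact internals):

```typescript
interface StreamChunk { content: string; done: boolean; model?: string; }

// Illustrative sketch of steps 4–5: consume the chunk stream, accumulate the
// full text and a token array, and fire the callbacks described above.
async function consumeCompletion(
  chunks: AsyncIterable<StreamChunk>,
  onToken?: (token: string) => void,
  onComplete?: (text: string) => void
): Promise<{ text: string; tokens: string[] }> {
  let text = "";
  const tokens: string[] = [];
  for await (const chunk of chunks) {
    if (chunk.content) {
      text += chunk.content;       // accumulate text
      tokens.push(chunk.content);  // push token to tokens array
      onToken?.(chunk.content);    // per-token callback
    }
  }
  onComplete?.(text);              // completion callback
  return { text, tokens };
}
```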
## Streaming Protocols

### NDJSON (Ollama)

Ollama uses Newline-Delimited JSON. Each line is a complete JSON object:

```
{"model":"gemma3:1b","message":{"content":"Hi"},"done":false}
{"model":"gemma3:1b","message":{"content":" there"},"done":false}
{"model":"gemma3:1b","message":{"content":"!"},"done":true}
```
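Parsing NDJSON is a matter of `JSON.parse`-ing each line independently. A minimal sketch, using the Ollama response shape shown above (the function name is illustrative):

```typescript
// Sketch of NDJSON line parsing: each line is a self-contained JSON object,
// with the token in message.content and a done flag on the final line.
function parseOllamaChatLine(
  line: string
): { content: string; done: boolean } | null {
  if (!line.trim()) return null; // skip blank lines between chunks
  const obj = JSON.parse(line);
  return { content: obj.message?.content ?? "", done: obj.done === true };
}
```

Accumulating `content` across the three example lines above reassembles the full reply.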
### SSE (OpenAI-compatible)

LM Studio and llama.cpp use Server-Sent Events:

```
data: {"choices":[{"delta":{"content":"Hi"},"finish_reason":null}]}
data: {"choices":[{"delta":{"content":" there"},"finish_reason":null}]}
data: {"choices":[{"delta":{"content":"!"},"finish_reason":"stop"}]}
data: [DONE]
```

The stream parser handles both formats transparently.
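One way such format transparency can be achieved is a single parser that unwraps SSE framing and then normalizes both payload shapes. A hedged sketch — not the library's exact `parseStreamChunk()`:

```typescript
interface ParsedChunk { content: string; done: boolean; }

// Hedged sketch of format-transparent parsing: SSE lines lose their "data:"
// prefix, then both the Ollama shape (message.content / done) and the
// OpenAI-compatible shape (choices[0].delta.content / finish_reason)
// normalize to the same chunk type.
function parseLine(line: string): ParsedChunk | null {
  let payload = line.trim();
  if (!payload) return null;
  if (payload.startsWith("data:")) payload = payload.slice(5).trim(); // SSE framing
  if (payload === "[DONE]") return { content: "", done: true };       // SSE terminator
  const obj = JSON.parse(payload);
  if (obj.message !== undefined || obj.done !== undefined) {
    // Ollama NDJSON shape
    return { content: obj.message?.content ?? "", done: obj.done === true };
  }
  // OpenAI-compatible SSE shape
  const choice = obj.choices?.[0];
  return { content: choice?.delta?.content ?? "", done: choice?.finish_reason != null };
}
```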
## Hook Hierarchy

```
useOllama(model, options)
└── useLocalLLM({ endpoint, model, backend: "ollama", ...options })
    └── streamChat({ endpoint, backend, model, messages, signal })
        └── fetch() → readStream() → parseStreamChunk()

useStreamCompletion(options)
└── streamGenerate({ endpoint, backend, model, prompt, signal })
    └── fetch() → readStream() → parseStreamChunk()

useModelList(options)
└── fetch(endpoint + MODEL_LIST_PATHS[backend])
```
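The `fetch() → readStream() → parseStreamChunk()` composition at the bottom of the hierarchy can be sketched for the Ollama backend as follows. This is an assumption-laden sketch: the `/api/chat` path and request body follow Ollama's chat API, and the `fetchImpl` parameter is an illustrative test seam, not a library option.

```typescript
interface StreamChunk { content: string; done: boolean; model?: string; }

// Hedged sketch of how the layers compose inside streamChat() for Ollama:
// fetch → read/decode body → split on newlines → parse each line → yield.
async function* streamChatSketch(opts: {
  endpoint: string;
  model: string;
  messages: { role: string; content: string }[];
  signal?: AbortSignal;
  fetchImpl?: typeof fetch; // illustrative test seam, not a library option
}): AsyncGenerator<StreamChunk> {
  const doFetch = opts.fetchImpl ?? fetch;
  const res = await doFetch(`${opts.endpoint}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: opts.model, messages: opts.messages, stream: true }),
    signal: opts.signal, // cancellation propagates into the fetch
  });
  if (!res.body) throw new Error("response has no body to stream");
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep a partial trailing line for the next read
    for (const line of lines) {
      if (!line.trim()) continue;
      const obj = JSON.parse(line);
      yield { content: obj.message?.content ?? "", done: obj.done === true, model: obj.model };
    }
  }
  if (buffer.trim()) {
    // flush a final line that arrived without a trailing newline
    const obj = JSON.parse(buffer);
    yield { content: obj.message?.content ?? "", done: obj.done === true, model: obj.model };
  }
}
```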
## File Structure

```
src/
├── hooks/
│   ├── useLocalLLM.ts          # Full chat hook with history
│   ├── useOllama.ts            # Zero-config Ollama wrapper
│   ├── useStreamCompletion.ts  # Low-level text completion
│   └── useModelList.ts         # Model discovery
├── utils/
│   ├── streamParser.ts         # NDJSON + SSE parsing, async generators
│   └── endpoints.ts            # Backend configs + auto-detection
├── types/
│   └── index.ts                # All TypeScript interfaces
└── index.ts                    # Barrel exports
```
## Key Design Decisions

- **No runtime dependencies** — uses only `fetch`, `ReadableStream`, and `TextDecoder`, which are available in all modern browsers
- **AsyncGenerator pattern** — stream utilities use `async function*` for composable, cancellable streaming
- **Ref-based options** — hooks use `useRef` to access the latest options without re-creating callbacks
- **Auto-abort on re-send** — calling `send()` while streaming automatically aborts the previous stream
- **AbortController integration** — every stream accepts a `signal` for cancellation
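The auto-abort-on-re-send decision can be sketched independently of React: each new `send()` aborts the previous in-flight stream before creating its own signal. In the hooks this state lives behind a ref; the plain class below is an illustrative stand-in, not the library's code.

```typescript
// Minimal sketch of auto-abort on re-send: a new send() cancels any stream
// still in flight, then hands a fresh AbortSignal to the next fetch().
class AbortOnResend {
  private controller: AbortController | null = null;

  send(): AbortSignal {
    this.controller?.abort();            // cancel the previous stream, if any
    this.controller = new AbortController();
    return this.controller.signal;       // passed down to fetch()
  }
}
```

Because `fetch()` rejects with an `AbortError` when its signal fires, the superseded stream's consumer loop exits cleanly rather than racing the new one for state updates.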