Architecture

use-local-llm has a layered architecture with a clear separation between streaming I/O, response parsing, and React state management.

Layer Diagram

┌─────────────────────────────────────────────┐
│              React Hooks Layer              │
│                                             │
│  useOllama ──► useLocalLLM    useModelList  │
│                useStreamCompletion          │
├─────────────────────────────────────────────┤
│            Stream Utilities Layer           │
│                                             │
│  streamChat()    streamGenerate()           │
│  parseStreamChunk()    readStream()         │
├─────────────────────────────────────────────┤
│            Endpoint Configuration           │
│                                             │
│  ENDPOINTS    CHAT_PATHS    GENERATE_PATHS  │
│  MODEL_LIST_PATHS    detectBackend()        │
├─────────────────────────────────────────────┤
│            Browser Fetch API                │
│                                             │
│  fetch() + ReadableStream + TextDecoder     │
└─────────────────────────────────────────────┘
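
As a rough illustration of the Endpoint Configuration layer, the sketch below maps a URL port to a backend and then to a chat path. The names `Backend`, `PORT_TO_BACKEND`, `CHAT_PATHS_SKETCH`, `detectBackendSketch`, and `chatUrl` are hypothetical. The port mapping assumes each server's usual default port (11434 for Ollama, 1234 for LM Studio, 8080 for llama.cpp), and the library's actual tables may differ.

```typescript
// Hypothetical backend identifiers; the library's real union may differ.
type Backend = "ollama" | "lmstudio" | "llamacpp" | "openai-compatible";

// Assumed default ports for each local server.
const PORT_TO_BACKEND = new Map<string, Backend>([
  ["11434", "ollama"],
  ["1234", "lmstudio"],
  ["8080", "llamacpp"],
]);

// Ollama has its own chat route; the others expose the OpenAI-style route.
const CHAT_PATHS_SKETCH: Record<Backend, string> = {
  ollama: "/api/chat",
  lmstudio: "/v1/chat/completions",
  llamacpp: "/v1/chat/completions",
  "openai-compatible": "/v1/chat/completions",
};

// Guess the backend from the port, falling back to the generic OpenAI shape.
function detectBackendSketch(endpoint: string): Backend {
  const { port } = new URL(endpoint);
  return PORT_TO_BACKEND.get(port) ?? "openai-compatible";
}

// Join the base URL and the backend-specific chat path.
function chatUrl(endpoint: string): string {
  const base = endpoint.replace(/\/+$/, ""); // drop trailing slashes
  return base + CHAT_PATHS_SKETCH[detectBackendSketch(endpoint)];
}
```

Under these assumptions, `chatUrl("http://localhost:11434")` resolves to `http://localhost:11434/api/chat`, and an unrecognized port falls back to the OpenAI-compatible route.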

Data Flow

Chat Flow (`useLocalLLM` / `useOllama`)

User calls send("Hello")
  │
  ├── 1. Append user message to state
  ├── 2. Append empty assistant message
  ├── 3. Build API messages (with system prompt)
  ├── 4. Auto-detect backend from URL port
  ├── 5. Call streamChat() with AbortController
  │     │
  │     ├── POST to endpoint (e.g. /api/chat)
  │     ├── Read response.body as stream
  │     ├── Decode chunks with TextDecoder
  │     ├── Split by newlines, parse each line
  │     └── yield StreamChunk { content, done, model }
  │
  ├── 6. For each chunk:
  │     ├── Accumulate content
  │     ├── Call onToken callback
  │     └── Update assistant message in state
  │
  └── 7. On completion: call onResponse callback
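
The chat steps above can be sketched without React as a single async function. This is a minimal illustration, not the hook's source: `runChatTurn`, `ChatStream`, `TurnOptions`, and `onUpdate` (standing in for a state setter) are hypothetical names, backend detection (step 4) is left out, and the stream is injected so the sketch needs no network.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface StreamChunk {
  content: string;
  done: boolean;
  model?: string;
}

// Stand-in for streamChat(): anything that yields chunks for a message list.
type ChatStream = (
  messages: ChatMessage[],
  signal: AbortSignal,
) => AsyncIterable<StreamChunk>;

interface TurnOptions {
  system?: string;
  onToken?: (token: string) => void;
  onResponse?: (message: ChatMessage) => void;
  onUpdate?: (messages: ChatMessage[]) => void; // plays the role of setState
}

async function runChatTurn(
  history: ChatMessage[],
  userText: string,
  stream: ChatStream,
  options: TurnOptions = {},
): Promise<ChatMessage[]> {
  // Steps 1-2: append the user message and an empty assistant placeholder.
  const user: ChatMessage = { role: "user", content: userText };
  let assistant: ChatMessage = { role: "assistant", content: "" };
  let messages = [...history, user, assistant];
  options.onUpdate?.(messages);

  // Step 3: the API payload gets the system prompt but not the placeholder.
  const apiMessages: ChatMessage[] = options.system
    ? [{ role: "system", content: options.system }, ...history, user]
    : [...history, user];

  // Step 5: one controller per turn so the stream can be cancelled.
  const controller = new AbortController();

  // Step 6: fold each chunk into the last message.
  for await (const chunk of stream(apiMessages, controller.signal)) {
    assistant = { ...assistant, content: assistant.content + chunk.content };
    messages = [...messages.slice(0, -1), assistant];
    options.onToken?.(chunk.content);
    options.onUpdate?.(messages);
    if (chunk.done) break;
  }

  // Step 7: report the finished assistant message.
  options.onResponse?.(assistant);
  return messages;
}
```

Replacing the last message with a new object on every chunk, rather than mutating it, mirrors how React state updates are usually written, which is why the sketch copies the array each time.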

Completion Flow (`useStreamCompletion`)

User calls start()
  │
  ├── 1. Reset text and tokens
  ├── 2. Auto-detect backend
  ├── 3. Call streamGenerate() with AbortController
  │     │
  │     ├── POST to endpoint (e.g. /api/generate)
  │     └── yield StreamChunks...
  │
  ├── 4. For each chunk:
  │     ├── Accumulate text
  │     ├── Push token to tokens array
  │     └── Call onToken callback
  │
  └── 5. On completion: call onComplete callback
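
The accumulation in steps 4 and 5 might look like the following sketch. `collectCompletion` and `CompletionResult` are hypothetical names, the chunk source is injected in place of `streamGenerate()`, and skipping empty chunks is an assumption rather than documented behavior.

```typescript
interface StreamChunk {
  content: string;
  done: boolean;
  model?: string;
}

interface CompletionResult {
  text: string;
  tokens: string[];
}

async function collectCompletion(
  chunks: AsyncIterable<StreamChunk>,
  onToken?: (token: string) => void,
  onComplete?: (text: string) => void,
): Promise<CompletionResult> {
  // Step 1: every run starts from empty text and an empty token list.
  let text = "";
  const tokens: string[] = [];

  for await (const chunk of chunks) {
    // Assumption: empty chunks (often the final "done" marker) are not tokens.
    if (chunk.content) {
      text += chunk.content;
      tokens.push(chunk.content);
      onToken?.(chunk.content);
    }
    if (chunk.done) break;
  }

  // Step 5: hand the full text to the completion callback.
  onComplete?.(text);
  return { text, tokens };
}
```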

Streaming Protocols

NDJSON (Ollama)

Ollama uses Newline-Delimited JSON. Each line is a complete JSON object:

{"model":"gemma3:1b","message":{"content":"Hi"},"done":false}
{"model":"gemma3:1b","message":{"content":" there"},"done":false}
{"model":"gemma3:1b","message":{"content":"!"},"done":true}
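
To make the format concrete, the sketch below parses the three sample lines and joins their content. `OllamaChatLine` and `joinNdjsonContent` are hypothetical names, and the interface covers only the fields visible in the sample, not the full Ollama response.

```typescript
// Shape of one Ollama chat line, limited to the fields in the sample above.
interface OllamaChatLine {
  model: string;
  message?: { content?: string };
  done: boolean;
}

// Parse NDJSON text and return the concatenated assistant content.
function joinNdjsonContent(ndjson: string): string {
  let text = "";
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue; // tolerate blank lines between objects
    const parsed = JSON.parse(line) as OllamaChatLine;
    text += parsed.message?.content ?? "";
    if (parsed.done) break; // the final object marks the end of the reply
  }
  return text;
}

const sample = [
  '{"model":"gemma3:1b","message":{"content":"Hi"},"done":false}',
  '{"model":"gemma3:1b","message":{"content":" there"},"done":false}',
  '{"model":"gemma3:1b","message":{"content":"!"},"done":true}',
].join("\n");

console.log(joinNdjsonContent(sample)); // prints: Hi there!
```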

SSE (OpenAI-compatible)

LM Studio and llama.cpp use Server-Sent Events:

data: {"choices":[{"delta":{"content":"Hi"},"finish_reason":null}]}

data: {"choices":[{"delta":{"content":" there"},"finish_reason":null}]}

data: {"choices":[{"delta":{"content":"!"},"finish_reason":"stop"}]}

data: [DONE]

The stream parser (`streamParser.ts`) handles both formats transparently, so the hooks receive the same `StreamChunk` shape from either kind of backend.
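
One way a single function could accept both formats is to strip the SSE `data:` prefix when present and then branch on the JSON shape. The sketch below is an assumption about how this might be done, not the library's `parseStreamChunk()`; `parseLineSketch` is a hypothetical name.

```typescript
interface StreamChunk {
  content: string;
  done: boolean;
  model?: string;
}

function parseLineSketch(line: string): StreamChunk | null {
  const trimmed = line.trim();
  // Blank lines separate SSE events; lines starting with ":" are SSE comments.
  if (!trimmed || trimmed.startsWith(":")) return null;

  // SSE frames carry their payload after a "data:" prefix; NDJSON lines do not.
  const payload = trimmed.startsWith("data:") ? trimmed.slice(5).trim() : trimmed;
  if (payload === "[DONE]") return { content: "", done: true };

  let obj: any;
  try {
    obj = JSON.parse(payload);
  } catch {
    return null; // ignore lines that are not valid JSON
  }
  if (typeof obj !== "object" || obj === null) return null;

  if (Array.isArray(obj.choices)) {
    // OpenAI-compatible shape: the text lives in choices[0].delta.content.
    const choice = obj.choices[0];
    return {
      content: choice?.delta?.content ?? "",
      done: choice?.finish_reason != null,
      model: obj.model,
    };
  }

  // Ollama shape: the text lives in message.content.
  return {
    content: obj.message?.content ?? "",
    done: obj.done === true,
    model: obj.model,
  };
}
```

Returning `null` for lines that carry no chunk lets the caller filter them out without caring which protocol produced them.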

Hook Hierarchy

useOllama(model, options)
  └── useLocalLLM({ endpoint, model, backend: "ollama", ...options })
        └── streamChat({ endpoint, backend, model, messages, signal })
              └── fetch() → readStream() → parseStreamChunk()

useStreamCompletion(options)
  └── streamGenerate({ endpoint, backend, model, prompt, signal })
        └── fetch() → readStream() → parseStreamChunk()

useModelList(options)
  └── fetch(endpoint + MODEL_LIST_PATHS[backend])
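
Both streaming branches end in the same `fetch() → readStream() → parseStreamChunk()` pipeline. The middle stage could be sketched as an async generator that turns a response body into complete lines. `readLines` is a hypothetical name, and this is an assumption about the approach rather than the library's `readStream()`.

```typescript
async function* readLines(
  body: ReadableStream<Uint8Array>,
  signal?: AbortSignal,
): AsyncGenerator<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  try {
    while (!signal?.aborted) {
      const { value, done } = await reader.read();
      if (done) {
        buffer += decoder.decode(); // flush any bytes held back by the decoder
        break;
      }
      // stream: true keeps multi-byte characters intact across chunk boundaries.
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? ""; // the last piece may be an incomplete line
      for (const line of lines) {
        if (line.trim()) yield line;
      }
    }
    // Emit a trailing line that had no final newline, unless cancelled.
    if (!signal?.aborted && buffer.trim()) yield buffer;
  } finally {
    reader.releaseLock();
  }
}
```

Network chunks do not respect line boundaries, so the generator holds back the last partial line until the next chunk completes it. A caller would typically pass `response.body` from `fetch()` and feed each yielded line to the parser.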

File Structure

src/
├── hooks/
│   ├── useLocalLLM.ts       # Full chat hook with history
│   ├── useOllama.ts          # Zero-config Ollama wrapper
│   ├── useStreamCompletion.ts # Low-level text completion
│   └── useModelList.ts       # Model discovery
├── utils/
│   ├── streamParser.ts       # NDJSON + SSE parsing, async generators
│   └── endpoints.ts          # Backend configs + auto-detection
├── types/
│   └── index.ts              # All TypeScript interfaces
└── index.ts                  # Barrel exports

Key Design Decisions

No runtime dependencies: uses only fetch, ReadableStream, and TextDecoder, which are available in all modern browsers.
AsyncGenerator pattern: stream utilities use async function* for composable, cancellable streaming.
Ref-based options: hooks use useRef to read the latest options without re-creating callbacks.
Auto-abort on re-send: calling send() while a response is still streaming aborts the previous stream.
AbortController integration: every stream utility accepts a signal for cancellation.
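
The async generator, AbortController, and auto-abort decisions fit together roughly as in the sketch below. `fakeStream` and `ChatSession` are hypothetical stand-ins for `streamChat()` and the hook's internal state, used here so the example runs without a server or React.

```typescript
// Hypothetical token source standing in for streamChat().
async function* fakeStream(
  tokens: string[],
  signal: AbortSignal,
): AsyncGenerator<string> {
  for (const token of tokens) {
    await Promise.resolve(); // simulate waiting for the next network chunk
    if (signal.aborted) return; // end the generator once cancelled
    yield token;
  }
}

class ChatSession {
  text = "";
  private controller: AbortController | null = null;

  async send(tokens: string[]): Promise<void> {
    this.controller?.abort(); // auto-abort whatever is still streaming
    const controller = new AbortController();
    this.controller = controller;
    this.text = "";
    for await (const token of fakeStream(tokens, controller.signal)) {
      if (controller.signal.aborted) break; // a newer send() took over
      this.text += token;
    }
  }
}
```

If `send()` is called twice in a row, the first stream sees its signal aborted and stops before writing anything, so only the second reply ends up in `text`.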
