PIKA Overview

PIKA (Private Intelligent Knowledge Assistant) is a self-hosted document Q&A system powered by Retrieval-Augmented Generation (RAG). Upload PDFs, DOCX files, or plain text, then ask questions and get answers grounded in your documents — all running locally via Ollama.

Key Features

| Feature | Description |
|---------|-------------|
| RAG pipeline | Documents are chunked, embedded with Sentence Transformers, stored in ChromaDB, and retrieved at query time to ground LLM responses |
| Multi-user auth | Role-based access (admin/user) with session management, CSRF protection, and rate limiting; authentication is delegated to Hub for centralised identity |
| Circuit breaker | Graceful degradation when Ollama is unavailable: queries fail fast instead of hanging |
| Streaming responses | Answers stream token-by-token via Server-Sent Events for a responsive UI |
| Query queue | FIFO queue with per-user fairness and configurable concurrency limits |
| Query cache | Repeated questions are served from cache (configurable TTL) |
| Backup / restore | Full-system ZIP export of documents, vector store, and configuration, with configurable retention |
| Prometheus metrics | Built-in /metrics endpoint for monitoring query latency, queue depth, circuit breaker state, and more |
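The circuit breaker pattern used for the Ollama dependency can be illustrated with a minimal sketch. The class and parameter names here (`CircuitBreaker`, `failure_threshold`, `reset_timeout`) are hypothetical, not PIKA's actual API:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated backend failures; allow a retry after a cooldown.

    Illustrative sketch only -- names and thresholds are assumptions.
    """

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
breaker.record_failure()
breaker.record_failure()  # threshold reached: circuit opens, queries now fail fast
```

Once the circuit is open, query handlers can return an immediate error instead of blocking on a dead Ollama connection, which is the "fail fast instead of hanging" behaviour described above.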

Architecture

┌─────────────────────────────────────────────────────┐
│                     PIKA :8000                      │
│                                                     │
│  ┌──────────┐   ┌───────────┐   ┌───────────────┐  │
│  │ Jinja2   │   │ FastAPI   │   │ Prometheus    │  │
│  │ Web UI   │──▶│ API       │   │ /metrics      │  │
│  └──────────┘   └─────┬─────┘   └───────────────┘  │
│                       │                             │
│            ┌──────────▼──────────┐                  │
│            │    RAG Engine       │                  │
│            │  chunk → embed →    │                  │
│            │  retrieve → prompt  │                  │
│            └───┬────────────┬───┘                  │
│                │            │                       │
│  ┌─────────────▼──┐  ┌─────▼──────────┐            │
│  │ ChromaDB       │  │ Sentence       │            │
│  │ (vector store) │  │ Transformers   │            │
│  │ SQLite-backed  │  │ all-MiniLM-L6  │            │
│  └────────────────┘  └────────────────┘            │
│                                                     │
└──────────────────────┬──────────────────────────────┘
                       │ http://ollama:11434
              ┌────────▼────────┐
              │  Ollama         │
              │  (shared)       │
              └────────┬────────┘
              ┌────────▼────────┐
              │  Hub            │
              │  (auth, license)│
              └─────────────────┘
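Answers travel from Ollama back to the browser as Server-Sent Events. The `data:` framing below is the standard SSE wire format; the generator itself is an illustrative sketch, not PIKA's actual streaming code:

```python
def sse_frame(token: str) -> str:
    """Wrap one token in the Server-Sent Events wire format."""
    return f"data: {token}\n\n"

def stream_answer(tokens):
    """Yield each LLM token as an SSE event, then a done marker.

    Sketch only: in practice this generator would be fed by the Ollama
    streaming response and served via a text/event-stream endpoint.
    """
    for tok in tokens:
        yield sse_frame(tok)
    yield "data: [DONE]\n\n"

frames = list(stream_answer(["Hello", ",", " world"]))
```

The browser-side `EventSource` (or a fetch-based reader) appends each `data:` payload to the answer as it arrives, which is what makes the UI feel responsive.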

Document Lifecycle

Upload (.pdf / .docx / .txt / .md)
Validate (extension, size ≤ 50 MB)
Store in documents/ directory
Index  ─── Extract text ──▶ Chunk (500 tokens, 50 overlap)
  │                              │
  │                              ▼
  │                        Embed (all-MiniLM-L6-v2)
  │                              │
  │                              ▼
  │                        Store vectors in ChromaDB
Ready for queries
Query ── Embed question ──▶ Retrieve top-K chunks
  │                              │
  │                              ▼
  │                        Build prompt with context
  │                              │
  │                              ▼
  │                        Send to Ollama (streamed)
  │                              │
  │                              ▼
  │                        Return answer + sources + confidence
Feedback (thumbs up / down) stored for quality tracking
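The chunking step above (500 tokens with 50-token overlap) amounts to a sliding window over the extracted text. This sketch treats whitespace-separated words as tokens for illustration; the real tokenizer may differ:

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens overlapping by `overlap`.

    The overlap keeps sentences that straddle a chunk boundary retrievable
    from both neighbouring chunks.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

words = ("lorem ipsum " * 600).split()  # 1200 pseudo-tokens
chunks = chunk_tokens(words, size=500, overlap=50)
```

With 1200 tokens this yields three chunks, and the last 50 tokens of each chunk reappear at the start of the next.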

Confidence Scoring

Each answer includes a confidence level based on the similarity of retrieved chunks:

| Level | Threshold | Meaning |
|-------|-----------|---------|
| high | >= 0.7 | Strong match: answer is well-supported by documents |
| medium | >= 0.5 | Moderate match: answer may be partially supported |
| low | >= 0.3 | Weak match: answer has limited document support |
| none | < 0.3 | No relevant documents found |
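The thresholds in the table map directly onto a comparison chain. A sketch (the function name is illustrative):

```python
def confidence_level(similarity: float) -> str:
    """Map a top-chunk similarity score to a confidence label,
    using the thresholds from the table above."""
    if similarity >= 0.7:
        return "high"
    if similarity >= 0.5:
        return "medium"
    if similarity >= 0.3:
        return "low"
    return "none"
```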

Improving confidence

Upload more relevant documents and experiment with CHUNK_SIZE and TOP_K settings to improve retrieval quality for your use case.
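Settings like CHUNK_SIZE and TOP_K are presumably supplied via environment variables; a hedged sketch of how such knobs could be loaded (the variable names come from the text above, but CHUNK_OVERLAP and all default values are assumptions):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalSettings:
    chunk_size: int
    chunk_overlap: int
    top_k: int

def load_settings(env=None) -> RetrievalSettings:
    """Read retrieval tuning knobs from the environment, with assumed fallbacks."""
    if env is None:
        env = os.environ
    return RetrievalSettings(
        chunk_size=int(env.get("CHUNK_SIZE", 500)),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", 50)),
        top_k=int(env.get("TOP_K", 5)),
    )

settings = load_settings({"TOP_K": "8"})  # unset knobs fall back to defaults
```

Raising TOP_K retrieves more chunks per query (broader but noisier context), while smaller chunks make retrieval more precise at the cost of less surrounding context.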

Tech Stack

| Component | Technology |
|-----------|------------|
| API framework | FastAPI (Python 3.11+) |
| Vector store | ChromaDB (SQLite-backed, persistent) |
| Embeddings | Sentence Transformers (all-MiniLM-L6-v2) |
| LLM inference | Ollama (local, shared service) |
| Web UI | Jinja2 templates + vanilla JS |
| Auth | Hub-delegated (centralised identity) |
| Metrics | Prometheus client |
| Rate limiting | SlowAPI |