PIKA Overview

PIKA (Private Intelligent Knowledge Assistant) is a self-hosted document Q&A system powered by Retrieval-Augmented Generation (RAG). Upload PDFs, DOCX files, or plain text, then ask questions and get answers grounded in your documents — all running locally via Ollama.

Key Features

| Feature | Description |
|---------|-------------|
| RAG pipeline | Documents are chunked, embedded with Sentence Transformers, stored in ChromaDB, and retrieved at query time to ground LLM responses |
| Multi-user auth | Role-based access (admin/user) with session management, CSRF protection, and rate limiting; authentication is delegated to Hub for centralised identity |
| Circuit breaker | Graceful degradation when Ollama is unavailable: queries fail fast instead of hanging |
| Streaming responses | Answers stream token-by-token via Server-Sent Events for a responsive UI |
| Query queue | FIFO queue with per-user fairness and configurable concurrency limits |
| Query cache | Repeated questions are served from cache (configurable TTL) |
| Backup / restore | Full-system ZIP export of documents, vector store, and configuration, with configurable retention |
| Prometheus metrics | Built-in /metrics endpoint for monitoring query latency, queue depth, circuit breaker state, and more |
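The circuit breaker pattern used for the Ollama dependency can be illustrated with a minimal sketch. The class and parameter names here (`CircuitBreaker`, `failure_threshold`, `reset_timeout`) are hypothetical, not PIKA's actual API:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated backend failures; allow a retry after a cooldown.

    Illustrative sketch only -- names and thresholds are assumptions.
    """

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
breaker.record_failure()
breaker.record_failure()  # threshold reached: circuit opens, queries now fail fast
```

Once the circuit is open, query handlers can return an immediate error instead of blocking on a dead Ollama connection, which is the "fail fast instead of hanging" behaviour described above.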

Architecture

┌─────────────────────────────────────────────────────┐
│                     PIKA :8000                      │
│                                                     │
│  ┌──────────┐   ┌───────────┐   ┌───────────────┐  │
│  │ Jinja2   │   │ FastAPI   │   │ Prometheus    │  │
│  │ Web UI   │──▶│ API       │   │ /metrics      │  │
│  └──────────┘   └─────┬─────┘   └───────────────┘  │
│                       │                             │
│            ┌──────────▼──────────┐                  │
│            │    RAG Engine       │                  │
│            │  chunk → embed →    │                  │
│            │  retrieve → prompt  │                  │
│            └───┬────────────┬───┘                  │
│                │            │                       │
│  ┌─────────────▼──┐  ┌─────▼──────────┐            │
│  │ ChromaDB       │  │ Sentence       │            │
│  │ (vector store) │  │ Transformers   │            │
│  │ SQLite-backed  │  │ all-MiniLM-L6  │            │
│  └────────────────┘  └────────────────┘            │
│                                                     │
└──────────────────────┬──────────────────────────────┘
                       │ http://ollama:11434
              ┌────────▼────────┐
              │  Ollama         │
              │  (shared)       │
              └────────┬────────┘
              ┌────────▼────────┐
              │  Hub            │
              │  (auth, license)│
              └─────────────────┘
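Answers travel from Ollama back to the browser as Server-Sent Events. The `data:` framing below is the standard SSE wire format; the generator itself is an illustrative sketch, not PIKA's actual streaming code:

```python
def sse_frame(token: str) -> str:
    """Wrap one token in the Server-Sent Events wire format."""
    return f"data: {token}\n\n"

def stream_answer(tokens):
    """Yield each LLM token as an SSE event, then a done marker.

    Sketch only: in practice this generator would be fed by the Ollama
    streaming response and served via a text/event-stream endpoint.
    """
    for tok in tokens:
        yield sse_frame(tok)
    yield "data: [DONE]\n\n"

frames = list(stream_answer(["Hello", ",", " world"]))
```

The browser-side `EventSource` (or a fetch-based reader) appends each `data:` payload to the answer as it arrives, which is what makes the UI feel responsive.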

Document Lifecycle

Upload (.pdf / .docx / .txt / .md)
Validate (extension, size ≤ 50 MB)
Store in documents/ directory
Index  ─── Extract text ──▶ Chunk (500 tokens, 50 overlap)
  │                              │
  │                              ▼
  │                        Embed (all-MiniLM-L6-v2)
  │                              │
  │                              ▼
  │                        Store vectors in ChromaDB
Ready for queries
Query ── Embed question ──▶ Retrieve top-K chunks
  │                              │
  │                              ▼
  │                        Build prompt with context
  │                              │
  │                              ▼
  │                        Send to Ollama (streamed)
  │                              │
  │                              ▼
  │                        Return answer + sources + confidence
Feedback (thumbs up / down) stored for quality tracking
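The chunking step above (500 tokens with 50-token overlap) amounts to a sliding window over the extracted text. This sketch treats whitespace-separated words as tokens for illustration; the real tokenizer may differ:

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` tokens overlapping by `overlap`.

    The overlap keeps sentences that straddle a chunk boundary retrievable
    from both neighbouring chunks.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

words = ("lorem ipsum " * 600).split()  # 1200 pseudo-tokens
chunks = chunk_tokens(words, size=500, overlap=50)
```

With 1200 tokens this yields three chunks, and the last 50 tokens of each chunk reappear at the start of the next.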

Confidence Scoring

Each answer includes a confidence level based on the similarity of retrieved chunks:

| Level | Threshold | Meaning |
|-------|-----------|---------|
| high | >= 0.7 | Strong match: answer is well-supported by documents |
| medium | >= 0.5 | Moderate match: answer may be partially supported |
| low | >= 0.3 | Weak match: answer has limited document support |
| none | < 0.3 | No relevant documents found |
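The thresholds in the table map directly onto a comparison chain. A sketch (the function name is illustrative):

```python
def confidence_level(similarity: float) -> str:
    """Map a top-chunk similarity score to a confidence label,
    using the thresholds from the table above."""
    if similarity >= 0.7:
        return "high"
    if similarity >= 0.5:
        return "medium"
    if similarity >= 0.3:
        return "low"
    return "none"
```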

Improving confidence

Upload more relevant documents and experiment with CHUNK_SIZE and TOP_K settings to improve retrieval quality for your use case.
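Settings like CHUNK_SIZE and TOP_K are presumably supplied via environment variables; a hedged sketch of how such knobs could be loaded (the variable names come from the text above, but CHUNK_OVERLAP and all default values are assumptions):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalSettings:
    chunk_size: int
    chunk_overlap: int
    top_k: int

def load_settings(env=None) -> RetrievalSettings:
    """Read retrieval tuning knobs from the environment, with assumed fallbacks."""
    if env is None:
        env = os.environ
    return RetrievalSettings(
        chunk_size=int(env.get("CHUNK_SIZE", 500)),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", 50)),
        top_k=int(env.get("TOP_K", 5)),
    )

settings = load_settings({"TOP_K": "8"})  # unset knobs fall back to defaults
```

Raising TOP_K retrieves more chunks per query (broader but noisier context), while smaller chunks make retrieval more precise at the cost of less surrounding context.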

Tech Stack

| Component | Technology |
|-----------|------------|
| API framework | FastAPI (Python 3.11+) |
| Vector store | ChromaDB (SQLite-backed, persistent) |
| Embeddings | Sentence Transformers (all-MiniLM-L6-v2) |
| LLM inference | Ollama (local, shared service) |
| Web UI | Jinja2 templates + vanilla JS |
| Auth | Hub-delegated (centralised identity) |
| Metrics | Prometheus client |
| Rate limiting | SlowAPI |