← Projects

llm-in-a-box

Local LLM deployment exposed as a simple API using FastAPI, focused on understanding model serving and AI systems.

LLM · FastAPI · Docker · AI · MLOps


Problem

Most LLM usage today happens through hosted APIs, which abstract away how models are actually run and served.

I wanted to understand how LLMs work at a system level — how they are deployed, exposed, and integrated into applications.

The goal was:

Run an LLM locally and expose it as a simple, reusable API, similar to any backend service.


Architecture

User / Client → FastAPI Server → LLM Runtime (Ollama / Hugging Face) → Response

Optional (future):

User → API → Embeddings → FAISS → Context → LLM → Response

This setup focuses on understanding how models are served and how an API layer interacts with them.
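The request path above (client → FastAPI → local runtime) can be sketched as a minimal stdlib client against Ollama's default HTTP endpoint. The URL, the `/api/generate` route, and the `stream` flag follow Ollama's documented defaults, but the model name is an assumption and this is a hedged sketch, not the project's actual code.

```python
import json
import urllib.request

# Assumed default: Ollama listens on localhost:11434 and exposes /api/generate.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "llama3") -> dict:
    # stream=False asks Ollama to return the whole completion in a single
    # JSON body instead of a stream of chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "llama3") -> str:
    # Forward the prompt to the local runtime and pull the completion
    # out of the "response" field of the JSON reply.
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

In the full setup this call would sit behind a FastAPI route, so the API layer stays a thin pass-through to the runtime.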


Tech Stack

  • FastAPI
  • Ollama / Hugging Face Transformers
  • Docker
  • Python
  • FAISS (planned)

Key Decisions

Local model over API-based model: full control, no per-call cost, and direct visibility into how models are actually served.

FastAPI for serving: lightweight and simple for exposing model inference as an API.

Docker for portability: ensures the setup is reproducible across environments.

Start simple: focus on the core prompt-response flow before adding embeddings or retrieval systems.


Challenges

  • Running LLMs efficiently on local hardware
  • Managing model size and performance trade-offs
  • Understanding model loading vs inference lifecycle
  • Handling latency in responses
  • Designing a clean and usable API interface

Result

A working foundation for running and exposing LLMs locally through a backend service.

The project is evolving into a system for experimenting with model serving, API design, and AI infrastructure patterns.


Future Work

  • Add embeddings and vector search (FAISS)
  • Implement context-aware responses (RAG)
  • Optimize performance and latency
  • Extend API capabilities
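The planned FAISS step reduces to nearest-neighbour search over embeddings. A brute-force cosine-similarity stand-in (pure Python, hypothetical data) shows the idea; FAISS would replace the linear scan with an indexed search such as `IndexFlatL2` once real embeddings are in play.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query: list[float], docs: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    # docs: (text, embedding) pairs; return the k texts most similar
    # to the query embedding. This linear scan is what a FAISS index
    # accelerates at scale.
    ranked = sorted(docs, key=lambda d: cosine(query, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The retrieved texts would then be prepended to the prompt, giving the User → API → Embeddings → FAISS → Context → LLM → Response flow sketched under Architecture.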

  • GitHub: (to be added)