How Trove Works

Trove is a Cloudflare Worker that ingests content from data sources, stores it across three storage layers, and makes it searchable via MCP tools and a GraphQL API.

Data Sources --> Connectors --> Ingest Pipeline --> Storage --> Search/MCP/API --> AI Assistants & Apps

Trove splits storage across three services, each optimized for a different access pattern.

Each user gets their own SQLite database (Cloudflare D1) for document metadata: titles, authors, dates, tags, and preview text. There are no user_id columns. The database itself is the user’s data.

Search results, document listings, and connector stats all read from D1 without touching object storage.

Complete document text is stored as plain text files in Cloudflare R2. Keys are prefixed with the user ID ({user_id}/{connector_id}/{document_id}.txt). Full text is loaded only when explicitly requested; list views and search results use the preview text from D1.
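The key layout can be expressed as a small helper. This is a sketch; the function name is an illustration, not Trove's actual API:

```typescript
// Build the R2 object key for a document's full text.
// Layout from the docs: {user_id}/{connector_id}/{document_id}.txt
function r2Key(userId: string, connectorId: string, documentId: string): string {
  return `${userId}/${connectorId}/${documentId}.txt`;
}
```

Because every key starts with the user ID, all of one user's objects can be listed (or deleted) with a single prefix scan over `{user_id}/`.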

Semantic embeddings are stored in Cloudflare Vectorize: a single index shared across all users, with a mandatory user_id metadata filter applied to every query. Documents are represented as 1024-dimensional vectors, and search queries are compared against them using cosine similarity.
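Vectorize computes cosine similarity server-side; for reference, the math reduces to a normalized dot product. A minimal sketch (dimension taken from the vector length rather than hard-coded to 1024):

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Score is in [-1, 1]; 1 means the vectors point the same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```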

When content arrives from a scheduled sync, a manual save, or the Sync API, it passes through a deterministic pipeline.

  1. Dedup. Check if (connector_id, external_id) already exists. Skip duplicates.
  2. Preview. Extract the first ~300 words for fast display in search results and list views.
  3. Store. Write metadata to D1, full text to R2.
  4. Chunk. Split text into overlapping chunks using a content-type-aware splitter.
  5. Embed. Generate vector embeddings via Workers AI using the bge-m3 model (1024 dimensions), batched up to 100 chunks per call.
  6. Index. Upsert vectors to Vectorize with user_id and other metadata for filtering.

The pipeline processes batches of up to 50 documents and typically completes in 5-10 seconds.
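Steps 2 and 4 of the pipeline are plain text transforms. A sketch of both, with the ~300-word preview from step 2 and an assumed chunk size and overlap (the sizes here are illustrative, and Trove's splitter is content-type-aware, which this word-based version does not model):

```typescript
// Step 2: take the first ~300 words as preview text for D1.
function extractPreview(text: string, maxWords = 300): string {
  return text.split(/\s+/).filter(Boolean).slice(0, maxWords).join(" ");
}

// Step 4: split text into overlapping word-based chunks.
// chunkSize/overlap values are assumptions, not Trove's actual tuning.
function chunkText(text: string, chunkSize = 200, overlap = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last chunk reached the end
  }
  return chunks;
}
```

The overlap means the tail of one chunk repeats at the head of the next, so a sentence straddling a chunk boundary still appears whole in at least one embedded chunk.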

When you search:

  1. Your query is embedded using the same bge-m3 model.
  2. Vectorize performs a cosine similarity search with a mandatory user_id filter plus any additional filters (connector, author, date range, content type, tags).
  3. Matching document IDs fetch metadata from D1.
  4. Results are assembled with snippets (the most relevant text chunk), document metadata, and relevance scores.
  5. Optionally, an AI reranker refines the result ordering.

Total search latency is typically 35-55ms without reranking, 55-75ms with reranking.
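Steps 3 and 4 of the search flow amount to a join between Vectorize matches and D1 rows, keeping Vectorize's score ordering. A typed sketch; the interface shapes and field names are assumptions about Trove's internals, not its real schema:

```typescript
// Shapes below are illustrative assumptions.
interface VectorMatch { documentId: string; score: number; snippet: string; }
interface DocMeta { documentId: string; title: string; author?: string; preview: string; }
interface SearchResult { documentId: string; title: string; snippet: string; score: number; }

// Join vector matches with D1 metadata, preserving the similarity order.
function assembleResults(matches: VectorMatch[], metaRows: DocMeta[]): SearchResult[] {
  const byId = new Map(metaRows.map(m => [m.documentId, m]));
  return matches
    .filter(m => byId.has(m.documentId)) // drop matches with no metadata row
    .map(m => ({
      documentId: m.documentId,
      title: byId.get(m.documentId)!.title,
      snippet: m.snippet,
      score: m.score,
    }));
}
```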

The MCP server supports two transports.

  • Streamable HTTP (POST /mcp). Stateless Workers with sessions backed by KV (1-hour TTL). Used by most MCP clients including Claude Desktop and Cursor.
  • WebSocket. Durable Objects per user with hibernation (zero cost when idle).

Both transports run the same JSON-RPC 2.0 protocol and expose the same 7 tools.
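For example, a client discovers the exposed tools with a standard JSON-RPC 2.0 tools/list request. The envelope follows the MCP specification; the client helper below is a sketch, and the Mcp-Session-Id header is the one MCP's Streamable HTTP transport defines for resuming a session:

```typescript
// A minimal JSON-RPC 2.0 request for MCP's tools/list method.
const listToolsRequest = {
  jsonrpc: "2.0" as const,
  id: 1,
  method: "tools/list",
  params: {},
};

// Over the Streamable HTTP transport this is POSTed to /mcp.
// Sketch of a client call, not Trove's server code.
async function listTools(baseUrl: string, sessionId?: string) {
  const res = await fetch(`${baseUrl}/mcp`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      ...(sessionId ? { "Mcp-Session-Id": sessionId } : {}),
    },
    body: JSON.stringify(listToolsRequest),
  });
  return res.json();
}
```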

Cloud connectors run on a schedule without user intervention.

  1. A cron trigger fires every 5 minutes and checks a global schedule index for connectors due for sync.
  2. Due connectors are enqueued as jobs via Cloudflare Queues.
  3. A queue consumer picks up each job, executes the connector (fetches new content from the data source), and runs the ingest pipeline on the results.
  4. The connector’s cursor and sync timestamp are updated for the next run.

Schedule options: "every 30 minutes", "every hour", "every 6 hours", "every 12 hours", "daily".
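Each schedule string maps to a fixed interval, and "due" simply means that interval has elapsed since the last sync. A sketch using the option strings listed above (the helper names and due-check logic are assumptions, not Trove's implementation):

```typescript
// Map each supported schedule string to milliseconds.
const SCHEDULE_MS: Record<string, number> = {
  "every 30 minutes": 30 * 60 * 1000,
  "every hour": 60 * 60 * 1000,
  "every 6 hours": 6 * 60 * 60 * 1000,
  "every 12 hours": 12 * 60 * 60 * 1000,
  "daily": 24 * 60 * 60 * 1000,
};

// A connector is due when its interval has elapsed since the last sync.
// Because the cron fires every 5 minutes, a sync can start up to ~5 minutes late.
function isDue(schedule: string, lastSyncMs: number, nowMs: number): boolean {
  const interval = SCHEDULE_MS[schedule];
  if (interval === undefined) throw new Error(`unknown schedule: ${schedule}`);
  return nowMs - lastSyncMs >= interval;
}
```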