Uploading Documents

Upload documents to your RAG collections. Documents are processed, chunked, and stored as vectors.

Upload a Document

Using cURL

curl -X POST http://localhost/api/v1/rag/upload \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@user_guide.pdf" \
-F "collection_name=documentation" \
-F "description=User guide for our product"

Using Python

import requests

with open("user_guide.pdf", "rb") as f:
    response = requests.post(
        "http://localhost/api/v1/rag/upload",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "collection_name": "documentation",
            "description": "User guide for our product"
        }
    )

result = response.json()
print(f"Document ID: {result['document_id']}")
print(f"Status: {result['status']}")

Using JavaScript

const formData = new FormData();
formData.append('file', fileInput.files[0]);
formData.append('collection_name', 'documentation');
formData.append('description', 'User guide for our product');

const response = await fetch('http://localhost/api/v1/rag/upload', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: formData
});

const result = await response.json();
console.log(`Document ID: ${result.document_id}`);

Upload Parameters

Parameter        Type    Required  Description
---------------  ------  --------  -------------------------------
file             file    Yes       Document file to upload
collection_name  string  Yes       Target collection name
description      string  No        Document description
metadata         object  No        Custom metadata key-value pairs

Upload with Metadata

import json
import requests

with open("policy_doc.pdf", "rb") as f:
    response = requests.post(
        "http://localhost/api/v1/rag/upload",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "collection_name": "policies",
            "description": "Privacy policy document",
            "metadata": json.dumps({
                "category": "legal",
                "version": "2.0",
                "effective_date": "2024-01-01",
                "department": "compliance"
            })
        }
    )

print(f"Uploaded: {response.json()['document_id']}")

Supported File Formats

Format    Extensions  Max Size
--------  ----------  --------
Text      .txt        10 MB
Markdown  .md         10 MB
PDF       .pdf        50 MB
Word      .docx       25 MB
JSON      .json       10 MB
HTML      .html       10 MB
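A quick client-side check against the limits in the table can fail fast before a round trip to the server. This is a sketch, not part of the API; the limits below mirror the table, but the server remains the authority:

```python
import os

# Size limits in bytes, taken from the supported-formats table above.
MAX_SIZES = {
    ".txt": 10 * 1024**2,
    ".md": 10 * 1024**2,
    ".pdf": 50 * 1024**2,
    ".docx": 25 * 1024**2,
    ".json": 10 * 1024**2,
    ".html": 10 * 1024**2,
}

def validate_upload(path):
    """Return (ok, reason) before attempting an upload."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in MAX_SIZES:
        return False, f"Unsupported format: {ext or '(none)'}"
    if os.path.getsize(path) > MAX_SIZES[ext]:
        limit_mb = MAX_SIZES[ext] // 1024**2
        return False, f"File exceeds {limit_mb} MB limit for {ext}"
    return True, "ok"
```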

Document Processing

When you upload a document:

  1. Parse - Extract text from file
  2. Chunk - Split into 500-1000 token pieces
  3. Embed - Convert chunks to vectors
  4. Index - Store vectors for fast search

Processing time depends on file size:

  • Small files (< 1 MB): 5-10 seconds
  • Medium files (1-10 MB): 10-30 seconds
  • Large files (10-50 MB): 30-120 seconds
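Chunking happens server-side, but a rough sketch shows why chunk counts vary with document length. This approximation uses whitespace-separated words in place of real tokens, and the overlap parameter is an illustrative assumption, not the server's actual setting:

```python
def chunk_text(text, max_tokens=800, overlap=100):
    """Split text into overlapping chunks of roughly max_tokens "tokens".

    Words stand in for tokens here; the server's tokenizer will count
    differently, so treat the numbers as an approximation.
    """
    words = text.split()
    chunks = []
    step = max_tokens - overlap  # advance leaves `overlap` words shared
    for start in range(0, len(words), step):
        chunk = words[start:start + max_tokens]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + max_tokens >= len(words):
            break
    return chunks
```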

Upload Multiple Documents

import requests

documents = ["doc1.pdf", "doc2.txt", "doc3.md"]

for doc in documents:
    with open(doc, "rb") as f:
        response = requests.post(
            "http://localhost/api/v1/rag/upload",
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            files={"file": f},
            data={
                "collection_name": "knowledge_base",
                "description": f"Document: {doc}"
            }
        )
    print(f"{doc}: {response.json()['status']}")

Upload from URL

import requests

# Download the file from an external URL
url = "https://example.com/document.pdf"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

# Upload the downloaded content
files = {"file": ("document.pdf", response.content)}
upload_response = requests.post(
    "http://localhost/api/v1/rag/upload",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    files=files,
    data={
        "collection_name": "documentation",
        "description": "Downloaded from external URL"
    }
)

print(f"Uploaded: {upload_response.json()['document_id']}")

Check Upload Status

import requests

def upload_with_status(file_path, collection_name):
    with open(file_path, "rb") as f:
        response = requests.post(
            "http://localhost/api/v1/rag/upload",
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            files={"file": f},
            data={"collection_name": collection_name}
        )
    return response.json()

result = upload_with_status("large_doc.pdf", "docs")

if result["status"] == "processing":
    print(f"Document {result['document_id']} is being processed...")
    print("You can search once processing completes.")
elif result["status"] == "completed":
    print(f"Document {result['document_id']} is ready for search.")
else:
    print(f"Error: {result.get('error', 'Unknown error')}")
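When the upload returns "processing", you can poll until the document is ready. A sketch, assuming a document-status endpoint exists; the route below is a placeholder for illustration only, so check your server's API reference for the actual path:

```python
import time
import requests

def wait_until_ready(document_id, timeout=120, interval=5):
    """Poll a document's status until processing finishes.

    NOTE: the endpoint path below is an assumption, not a documented
    route -- substitute the real status endpoint from your API reference.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        response = requests.get(
            f"http://localhost/api/v1/rag/documents/{document_id}",
            headers={"Authorization": "Bearer YOUR_API_KEY"},
        )
        status = response.json().get("status")
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    return "timeout"
```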

Upload Response

Successful upload returns:

{
  "document_id": "doc_abc123",
  "collection_name": "documentation",
  "filename": "user_guide.pdf",
  "chunk_count": 45,
  "status": "processing",
  "created_at": "2024-01-15T10:30:00Z"
}

Best Practices

File Preparation

  • Use clean, well-formatted documents
  • Remove unnecessary images and formatting
  • Ensure text is readable and accessible
  • Check for sensitive data before uploading

Metadata Usage

Add relevant metadata for better filtering:

metadata = {
    "category": "product",
    "version": "1.2",
    "language": "en",
    "audience": "developers",
    "last_updated": "2024-01-15"
}

Batch Uploads

For large document sets:

  1. Upload in small batches (5-10 documents)
  2. Monitor server performance
  3. Check processing status between batches
  4. Handle errors and retry failed uploads
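The steps above can be combined into a small helper. This is a sketch, not part of the API: it uploads files one at a time, retries each failure with exponential backoff, and collects per-file results:

```python
import time
import requests

def upload_batch(paths, collection_name, retries=3):
    """Upload files sequentially, retrying failures with backoff."""
    results = {}
    for path in paths:
        for attempt in range(retries):
            try:
                with open(path, "rb") as f:
                    response = requests.post(
                        "http://localhost/api/v1/rag/upload",
                        headers={"Authorization": "Bearer YOUR_API_KEY"},
                        files={"file": f},
                        data={"collection_name": collection_name},
                        timeout=60,
                    )
                response.raise_for_status()
                results[path] = response.json()
                break
            except (requests.RequestException, OSError) as exc:
                if attempt == retries - 1:
                    results[path] = {"status": "failed", "error": str(exc)}
                else:
                    time.sleep(2 ** attempt)  # back off before retrying
    return results
```

Between batches you can inspect `results` and re-queue only the entries marked "failed".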

Document Size

  • Keep documents under 10 MB when possible
  • Split large documents into smaller files
  • Use PDFs for preserving formatting
  • Use plain text for fastest processing
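For oversized text documents, splitting can be done on line boundaries so no part breaks mid-line. A minimal sketch, assuming UTF-8 text files (binary formats like PDF need format-aware tools instead):

```python
import os

def split_text_file(path, max_bytes=9 * 1024**2):
    """Split a large UTF-8 text file into parts under the 10 MB limit.

    Writes doc.part1.txt, doc.part2.txt, ... next to the original and
    returns the list of part paths.
    """
    base, ext = os.path.splitext(path)
    parts, current, size, index = [], [], 0, 1

    def flush():
        nonlocal current, size, index
        out = f"{base}.part{index}{ext}"
        with open(out, "w", encoding="utf-8") as o:
            o.writelines(current)
        parts.append(out)
        current, size = [], 0
        index += 1

    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            encoded = len(line.encode("utf-8"))
            if current and size + encoded > max_bytes:
                flush()
            current.append(line)
            size += encoded
    if current:
        flush()
    return parts
```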

Troubleshooting

File Too Large

Problem: {"error": "File size exceeds limit"}

Solution:

  • Split document into smaller files
  • Compress or optimize the file
  • Use a format with smaller file size

Unsupported Format

Problem: {"error": "Unsupported file format"}

Solution:

  • Convert to supported format (TXT, PDF, MD, DOCX, JSON, HTML)
  • Use PDF for complex documents
  • Use TXT for simple text documents

Processing Timeout

Problem: Upload succeeds but search returns no results

Solution:

  • Wait for processing to complete
  • Check document status endpoint
  • Verify collection name is correct
  • Review server logs for errors

Duplicate Document

Problem: Same document uploaded multiple times

Solution:

  • Check existing documents before upload
  • Use unique filenames
  • Delete duplicates from collection
  • Use metadata to identify original
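One way to spot duplicates before uploading is to hash file contents and store the hash as metadata. A sketch, assuming your server lets you filter or list documents by metadata (the `sha256` key name is an arbitrary choice):

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Content hash for spotting duplicate files before upload."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Store the hash with the upload so duplicates can be found later,
# e.g. data={"metadata": json.dumps({"sha256": file_sha256(path)})}
```

Two files with the same hash have identical content regardless of filename, so this catches renamed copies that a filename check would miss.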

Next Steps