Uploading Documents
Upload documents to your RAG collections. Documents are processed, chunked, and stored as vectors.
Upload a Document
Using cURL
curl -X POST http://localhost/api/v1/rag/upload \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@user_guide.pdf" \
  -F "collection_name=documentation" \
  -F "description=User guide for our product"
Using Python
import requests
with open("user_guide.pdf", "rb") as f:
    response = requests.post(
        "http://localhost/api/v1/rag/upload",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "collection_name": "documentation",
            "description": "User guide for our product"
        }
    )

result = response.json()
print(f"Document ID: {result['document_id']}")
print(f"Status: {result['status']}")
Using JavaScript
const formData = new FormData();
formData.append('file', fileInput.files[0]);
formData.append('collection_name', 'documentation');
formData.append('description', 'User guide for our product');

const response = await fetch('http://localhost/api/v1/rag/upload', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY'
  },
  body: formData
});

const result = await response.json();
console.log(`Document ID: ${result.document_id}`);
Upload Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Document file to upload |
| collection_name | string | Yes | Target collection name |
| description | string | No | Document description |
| metadata | object | No | Custom metadata key-value pairs, sent as a JSON-encoded string in the form data |
Upload with Metadata
import json
import requests

with open("policy_doc.pdf", "rb") as f:
    response = requests.post(
        "http://localhost/api/v1/rag/upload",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "collection_name": "policies",
            "description": "Privacy policy document",
            # Form fields are strings, so the metadata object is JSON-encoded
            "metadata": json.dumps({
                "category": "legal",
                "version": "2.0",
                "effective_date": "2024-01-01",
                "department": "compliance"
            })
        }
    )

print(f"Uploaded: {response.json()['document_id']}")
Supported File Formats
| Format | Extensions | Max Size |
|---|---|---|
| Text | .txt | 10 MB |
| Markdown | .md | 10 MB |
| PDF | .pdf | 50 MB |
| Word | .docx | 25 MB |
| JSON | .json | 10 MB |
| HTML | .html | 10 MB |
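The limits in the table can be checked on the client before any bytes are sent. The following is a minimal sketch, not part of the API: `check_upload` and `MAX_SIZE_BY_EXTENSION` are hypothetical names, and it assumes 1 MB means 1024 * 1024 bytes, which the server may count differently.

```python
from pathlib import Path

# Limits copied from the Supported File Formats table, in bytes.
# Assumption: 1 MB = 1024 * 1024 bytes; the server may use 1,000,000.
MB = 1024 * 1024
MAX_SIZE_BY_EXTENSION = {
    ".txt": 10 * MB,
    ".md": 10 * MB,
    ".pdf": 50 * MB,
    ".docx": 25 * MB,
    ".json": 10 * MB,
    ".html": 10 * MB,
}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(filename).suffix.lower()
    limit = MAX_SIZE_BY_EXTENSION.get(ext)
    if limit is None:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    elif size_bytes > limit:
        problems.append(
            f"{size_bytes} bytes exceeds the {limit // MB} MB limit for {ext}"
        )
    return problems
```

Calling `check_upload(path, os.path.getsize(path))` before each request avoids a round trip for files the server would reject anyway.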
Document Processing
When you upload a document, it passes through four stages in order:
- Parse - Extract text from file
- Chunk - Split into 500-1000 token pieces
- Embed - Convert chunks to vectors
- Index - Store vectors for fast search
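To make the Chunk stage concrete, here is a rough illustration of fixed-size chunking. It is a sketch under stated assumptions, not the service's implementation: whitespace-separated words stand in for tokens, `chunk_words` is a hypothetical name, and a real pipeline would likely use the embedding model's tokenizer and may overlap adjacent chunks.

```python
def chunk_words(text: str, max_tokens: int = 1000) -> list[str]:
    """Split text into pieces of at most max_tokens whitespace-separated words.

    Words are only a rough proxy for model tokens (an assumption of this sketch).
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]
```

With this scheme a 2,500-word document yields three chunks: two of 1,000 words and one of 500.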
Processing time depends on file size:
- Small files (< 1 MB): 5-10 seconds
- Medium files (1-10 MB): 10-30 seconds
- Large files (10-50 MB): 30-120 seconds
Upload Multiple Documents
import requests

documents = ["doc1.pdf", "doc2.txt", "doc3.md"]

for doc in documents:
    with open(doc, "rb") as f:
        response = requests.post(
            "http://localhost/api/v1/rag/upload",
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            files={"file": f},
            data={
                "collection_name": "knowledge_base",
                "description": f"Document: {doc}"
            }
        )
    print(f"{doc}: {response.json()['status']}")
Upload from URL
import requests

# Download the document from the URL
url = "https://example.com/document.pdf"
response = requests.get(url)
# Stop here on a failed download instead of uploading an error page
response.raise_for_status()

# Upload the downloaded content
files = {"file": ("document.pdf", response.content)}
upload_response = requests.post(
    "http://localhost/api/v1/rag/upload",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    files=files,
    data={
        "collection_name": "documentation",
        "description": "Downloaded from external URL"
    }
)

print(f"Uploaded: {upload_response.json()['document_id']}")
Check Upload Status
import requests

def upload_with_status(file_path, collection_name):
    # Upload the file and return the parsed JSON response
    with open(file_path, "rb") as f:
        response = requests.post(
            "http://localhost/api/v1/rag/upload",
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            files={"file": f},
            data={"collection_name": collection_name}
        )
    return response.json()

result = upload_with_status("large_doc.pdf", "docs")

if result["status"] == "processing":
    print(f"Document {result['document_id']} is being processed...")
    print("You can search once processing completes.")
elif result["status"] == "completed":
    print(f"Document {result['document_id']} is ready for search.")
else:
    print(f"Error: {result.get('error', 'Unknown error')}")
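When the upload returns `processing`, a client usually wants to wait until the document is searchable. The sketch below shows one way to poll with a deadline. This page does not give the path of a status endpoint, so the lookup is passed in as `fetch_status`, a hypothetical caller-supplied function that returns the current status string; `wait_until_ready` is likewise an illustrative name. The 120-second default mirrors the upper bound quoted for large files.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_status: Callable[[], str],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
) -> str:
    """Poll fetch_status() until it stops returning "processing".

    Returns the first non-processing status, or raises TimeoutError if the
    next poll would fall after the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status != "processing":
            return status
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"still processing after {timeout_s} seconds")
        time.sleep(interval_s)
```

Injecting the lookup keeps the loop independent of the actual endpoint, so it can be wired to whatever status call your deployment exposes.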
Upload Response
A successful upload returns:
{
  "document_id": "doc_abc123",
  "collection_name": "documentation",
  "filename": "user_guide.pdf",
  "chunk_count": 45,
  "status": "processing",
  "created_at": "2024-01-15T10:30:00Z"
}
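Client code is often easier to maintain when the response is parsed into a typed object instead of being indexed as a raw dictionary. The following sketch assumes the response contains exactly the fields shown in the example; `UploadResult` and `parse_upload_result` are hypothetical names, not part of any SDK.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UploadResult:
    document_id: str
    collection_name: str
    filename: str
    chunk_count: int
    status: str
    created_at: datetime


def parse_upload_result(payload: dict) -> UploadResult:
    """Build an UploadResult from the JSON body of a successful upload."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
    # so normalise it to an explicit UTC offset first.
    created = payload["created_at"].replace("Z", "+00:00")
    return UploadResult(
        document_id=payload["document_id"],
        collection_name=payload["collection_name"],
        filename=payload["filename"],
        chunk_count=int(payload["chunk_count"]),
        status=payload["status"],
        created_at=datetime.fromisoformat(created),
    )
```

A missing field raises `KeyError`, which surfaces unexpected response shapes early instead of letting them propagate.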
Best Practices
File Preparation
- Use clean, well-formatted documents
- Remove unnecessary images and formatting
- Ensure text is readable and accessible
- Check for sensitive data before uploading
Metadata Usage
Add relevant metadata for better filtering:
metadata = {
    "category": "product",
    "version": "1.2",
    "language": "en",
    "audience": "developers",
    "last_updated": "2024-01-15"
}
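Because the upload is a multipart form, a metadata dictionary like this has to be serialised to a string before it is sent, as the Upload with Metadata example does with `json.dumps`. The helper below is a small sketch of that step. `encode_metadata` is a hypothetical name, and restricting values to flat scalars is an assumption of this sketch, not a documented rule of the API.

```python
import json


def encode_metadata(metadata: dict) -> str:
    """Serialise flat metadata to a JSON string for the "metadata" form field."""
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise TypeError(f"metadata key {key!r} must be a string")
        # Assumption: only scalar values, which keeps filtering predictable
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError(f"metadata value for {key!r} must be a scalar")
    # sort_keys gives stable output, which helps when comparing uploads
    return json.dumps(metadata, sort_keys=True)
```

The result can be passed directly as `data={"metadata": encode_metadata(metadata), ...}` in a `requests.post` call.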
Batch Uploads
For large document sets:
- Upload in small batches (5-10 documents)
- Monitor server performance
- Check processing status between batches
- Handle errors and retry failed uploads
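The batching advice above can be sketched as a loop that uploads a few files at a time, retries failures, and pauses between batches. This is one possible shape, not a prescribed client: `upload_in_batches` is a hypothetical name, and `upload_one` is a caller-supplied function (for example, a wrapper around the `requests.post` call shown earlier) that raises on failure.

```python
import time
from typing import Callable, Iterable


def upload_in_batches(
    paths: Iterable[str],
    upload_one: Callable[[str], dict],
    batch_size: int = 5,
    max_attempts: int = 3,
    pause_s: float = 1.0,
) -> tuple[dict, dict]:
    """Upload paths in batches; return (results, failures) keyed by path."""
    paths = list(paths)
    results, failures = {}, {}
    for start in range(0, len(paths), batch_size):
        for path in paths[start:start + batch_size]:
            for attempt in range(1, max_attempts + 1):
                try:
                    results[path] = upload_one(path)
                    break
                except Exception as exc:  # broad on purpose: record and retry
                    if attempt == max_attempts:
                        failures[path] = str(exc)
                    else:
                        time.sleep(pause_s * attempt)  # simple linear backoff
        # Pause between batches so the server can work through its queue
        if start + batch_size < len(paths):
            time.sleep(pause_s)
    return results, failures
```

Returning failures separately makes it straightforward to log them or feed them into a later retry pass.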
Document Size
- Keep documents under 10 MB when possible
- Split large documents into smaller files
- Use PDFs for preserving formatting
- Use plain text for fastest processing
Troubleshooting
File Too Large
Problem: {"error": "File size exceeds limit"}
Solution:
- Split document into smaller files
- Compress or optimize the file
- Use a format with smaller file size
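For plain-text or Markdown sources, splitting can be automated. The sketch below breaks text on line boundaries so that each part stays under a byte budget. `split_text` is a hypothetical helper; it does not apply to binary formats such as PDF or DOCX, which need format-aware tools.

```python
def split_text(text: str, max_bytes: int) -> list[str]:
    """Split text on line boundaries into parts of at most max_bytes (UTF-8)."""
    parts, current, current_size = [], [], 0
    for line in text.splitlines(keepends=True):
        size = len(line.encode("utf-8"))
        if size > max_bytes:
            raise ValueError("a single line is larger than max_bytes")
        # Start a new part when adding this line would exceed the budget
        if current and current_size + size > max_bytes:
            parts.append("".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        parts.append("".join(current))
    return parts
```

Joining the parts reproduces the original text, so no content is lost; each part can then be written to its own file and uploaded separately.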
Unsupported Format
Problem: {"error": "Unsupported file format"}
Solution:
- Convert to supported format (TXT, PDF, MD, DOCX, JSON, HTML)
- Use PDF for complex documents
- Use TXT for simple text documents
Processing Timeout
Problem: Upload succeeds but search returns no results
Solution:
- Wait for processing to complete
- Check document status endpoint
- Verify collection name is correct
- Review server logs for errors
Duplicate Document
Problem: Same document uploaded multiple times
Solution:
- Check existing documents before upload
- Use unique filenames
- Delete duplicates from collection
- Use metadata to identify original
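One way to check for duplicates before uploading is to compare content hashes, which catches identical files even when filenames differ. The following is a sketch under assumptions: `file_sha256` and `is_duplicate` are hypothetical helpers, and the set of known hashes is something the client maintains itself (for example, by storing each hash in the document's metadata at upload time).

```python
import hashlib


def file_sha256(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB blocks so large files are not loaded into memory at once
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_duplicate(path: str, known_hashes: set[str]) -> bool:
    """True if the file's content hash is already in known_hashes."""
    return file_sha256(path) in known_hashes
```

After a successful upload, add the file's hash to `known_hashes` so later runs skip it.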
Next Steps
- Search Documents - Search uploaded documents
- Manage Documents - List and delete documents
- Bulk Upload Example - Upload multiple files