Local LLM Setup
Run translations completely locally using LM Studio or Ollama.
Overview
Subtide supports local LLM inference, allowing you to:
- Run translations without cloud API costs
- Keep all data on your machine
- Use any compatible model
- Avoid rate limits
LM Studio
LM Studio provides a user-friendly interface for running local models.
Installation
- Download LM Studio from lmstudio.ai
- Install and launch
- Download a model from the built-in browser
Recommended Models
| Model | Size | Quality | Speed |
|---|---|---|---|
| Llama 3.1 8B | 4.7 GB | Good | Fast |
| Mistral 7B | 4.1 GB | Good | Fast |
| Qwen 2.5 7B | 4.4 GB | Excellent | Fast |
| Llama 3.1 70B | 40 GB | Excellent | Slow |
Start Local Server
- In LM Studio, open the Local Server tab
- Select your downloaded model
- Click Start Server
- The server runs at http://localhost:1234/v1
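LM Studio's local server implements the OpenAI API, so a quick way to confirm it is reachable before configuring the extension is to list the loaded models:

```bash
# Should return the model(s) currently loaded in LM Studio
curl http://localhost:1234/v1/models
```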
Configure Extension
- Operation Mode: Tier 1 or Tier 2
- API Provider: Custom Endpoint
- API URL: http://localhost:1234/v1
- API Key: lm-studio
- Model: (leave empty or enter model name)
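The settings above point the extension at an OpenAI-compatible chat endpoint. To confirm translations will work end to end, you can send a small request by hand. This is a minimal sketch: the model identifier is a placeholder (use the one shown in LM Studio's server panel), and LM Studio typically does not validate the API key.

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer lm-studio" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [
      {"role": "system", "content": "Translate the user message to Spanish."},
      {"role": "user", "content": "Good morning"}
    ]
  }'
```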
Ollama
Ollama is a lightweight tool for running LLMs locally.
Installation
=== "macOS"

    brew install ollama

=== "Linux"

    curl -fsSL https://ollama.com/install.sh | sh

=== "Windows"

    Download from ollama.ai
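After installing, you can confirm the CLI is available:

```bash
ollama --version
```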
Download Models
```bash
# Recommended models
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull qwen2.5:7b

# For Asian languages
ollama pull qwen2.5:14b
```
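To verify the downloads, list the models Ollama has stored locally:

```bash
ollama list
```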
Start Server
Ollama runs automatically after installation. Verify:

```bash
ollama serve  # If not running
```

The server runs at http://localhost:11434/v1.
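You can confirm the server is up before configuring the extension; Ollama's native API lists the installed models, and recent releases also serve the OpenAI-compatible /v1 path used below:

```bash
# Returns the locally installed models if the server is running
curl http://localhost:11434/api/tags
```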
Configure Extension
- Operation Mode: Tier 1 or Tier 2
- API Provider: Custom Endpoint
- API URL: http://localhost:11434/v1
- API Key: ollama
- Model: llama3.1:8b
Model Recommendations
By Language
| Language | Recommended Model | Notes |
|---|---|---|
| English | Any model | All work well |
| European languages | Mistral, Llama | Good coverage |
| Chinese | Qwen 2.5 | Specifically trained |
| Japanese | Qwen 2.5 | Good Asian language support |
| Korean | Qwen 2.5 | Good Asian language support |
By Hardware
| RAM | Recommended Model |
|---|---|
| 8 GB | Not recommended |
| 16 GB | 7B models (quantized) |
| 32 GB | 7B-8B models |
| 64 GB | 13B-14B models |
| 128 GB+ | 70B models |
Warning: Memory Requirements. Models require significant RAM; the values above are rough estimates, and quantized models (Q4, Q5) use less memory.
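As a rough rule of thumb, 4-bit (Q4) weights take about 0.5–0.6 GB per billion parameters, so a 7B model needs roughly 4–5 GB for weights plus another 1–2 GB for context and runtime overhead, which lines up with the download sizes listed earlier.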
Performance Tuning
LM Studio
- GPU Layers: Increase for faster inference
- Context Length: Reduce for less memory
- Batch Size: Adjust based on hardware
Ollama
```bash
# Set number of GPU layers
OLLAMA_NUM_GPU=35 ollama run llama3.1:8b

# Set context size
ollama run llama3.1:8b --num-ctx 4096
```
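If you would rather not pass these settings on every run, Ollama's Modelfile can bake parameters into a derived model. This is a minimal sketch; the name subtide-llama is just an example:

```bash
# Define a derived model with a larger context window
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 4096
EOF

# Build and run it
ollama create subtide-llama -f Modelfile
ollama run subtide-llama
```

If you use a derived model like this, set the extension's Model field to its name.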
Combining with Local Whisper
For fully local operation:

```bash
# Backend with local Whisper
WHISPER_MODEL=base WHISPER_BACKEND=mlx ./subtide-backend
```

Extension:
- Operation Mode: Tier 2 (Enhanced)
- API Provider: Custom Endpoint
- API URL: http://localhost:1234/v1
- API Key: lm-studio

This setup:
- Transcribes audio locally with Whisper
- Translates locally with LM Studio/Ollama
- No data leaves your machine
Troubleshooting
Connection Refused
Section titled “Connection Refused”- Verify the local server is running
- Check the port (1234 for LM Studio, 11434 for Ollama); see the check below
- Ensure no firewall is blocking the port
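On macOS or Linux, a quick way to see whether anything is listening on the expected port:

```bash
lsof -i :1234   # LM Studio
lsof -i :11434  # Ollama
```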
Slow Responses
- Use a smaller model
- Increase GPU layers
- Use quantized versions
Out of Memory
- Use a smaller model
- Use more aggressive quantization
- Reduce context length
Poor Translation Quality
- Use a larger model
- Try a different model architecture
- Consider using cloud APIs for critical work
Model Comparison
Translation Quality Ranking
- GPT-4o (cloud) - Best quality
- Llama 3.1 70B (local) - Excellent
- Qwen 2.5 14B (local) - Very good for Asian languages
- Llama 3.1 8B (local) - Good
- Mistral 7B (local) - Good
Speed Ranking (local)
- Mistral 7B - Fastest
- Llama 3.1 8B - Fast
- Qwen 2.5 7B - Fast
- Qwen 2.5 14B - Medium
- Llama 3.1 70B - Slow
Cost Comparison
| Setup | Cost |
|---|---|
| OpenAI GPT-4o | ~$0.01/video |
| OpenAI GPT-4o-mini | ~$0.001/video |
| Local LLM | Electricity only |
Tip: Hybrid Approach. Use local models for routine translations and cloud APIs when you need the highest quality.
Next Steps
- Backend Overview - All backend options
- Docker Deployment - Container deployment
- Configuration - All settings