
Local LLM Setup

Run translations completely locally using LM Studio or Ollama.


Subtide supports local LLM inference, allowing you to:

  • Run translations without cloud API costs
  • Keep all data on your machine
  • Use any compatible model
  • Avoid rate limits

LM Studio provides a user-friendly interface for running local models.

  1. Download LM Studio from lmstudio.ai
  2. Install and launch
  3. Download a model from the built-in browser
| Model | Size | Quality | Speed |
| --- | --- | --- | --- |
| Llama 3.1 8B | 4.7 GB | Good | Fast |
| Mistral 7B | 4.1 GB | Good | Fast |
| Qwen 2.5 7B | 4.4 GB | Excellent | Fast |
| Llama 3.1 70B | 40 GB | Excellent | Slow |
  1. In LM Studio, click Local Server tab
  2. Select your downloaded model
  3. Click Start Server
  4. Server runs at http://localhost:1234/v1
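With the server running, it is worth sanity-checking the endpoint from a terminal before wiring up Subtide. A minimal check against the OpenAI-compatible API, assuming the default port shown above:

```bash
# List the models the local server currently exposes
curl http://localhost:1234/v1/models
```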
Then configure Subtide to point at the local server:

  • Operation Mode: Tier 1 or Tier 2
  • API Provider: Custom Endpoint
  • API URL: http://localhost:1234/v1
  • API Key: lm-studio
  • Model: (leave empty or enter the model name)
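For reference, this is roughly what a translation request to that endpoint looks like. It illustrates the OpenAI-compatible chat API that the Custom Endpoint setting targets, not necessarily the exact request Subtide sends; the model value should match the identifier LM Studio shows for your loaded model:

```bash
# Example translation-style request; replace "model" with your loaded model's identifier
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer lm-studio" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [
      {"role": "system", "content": "Translate the user message into Spanish."},
      {"role": "user", "content": "The movie starts in ten minutes."}
    ]
  }'
```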

Ollama is a lightweight tool for running LLMs locally.

=== "macOS"

    ```bash
    brew install ollama
    ```

=== "Linux"

    ```bash
    curl -fsSL https://ollama.com/install.sh | sh
    ```

=== "Windows"

    Download the installer from ollama.ai.
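After installing, confirm the CLI is available:

```bash
ollama --version
```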

Pull the models you want to use:

```bash
# Recommended models
ollama pull llama3.1:8b
ollama pull mistral:7b
ollama pull qwen2.5:7b

# For Asian languages
ollama pull qwen2.5:14b
```

Ollama's server starts automatically after installation. If it is not running, start it manually:

```bash
ollama serve
```

Server runs at http://localhost:11434/v1.
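You can confirm from a terminal that your models are installed and the server is reachable; /api/tags is Ollama's native model-list endpoint, while the /v1 prefix above is the OpenAI-compatible one Subtide uses:

```bash
# List installed models
ollama list

# Confirm the HTTP server is reachable on the default port
curl http://localhost:11434/api/tags
```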

Configure Subtide to use Ollama:

  • Operation Mode: Tier 1 or Tier 2
  • API Provider: Custom Endpoint
  • API URL: http://localhost:11434/v1
  • API Key: ollama
  • Model: llama3.1:8b
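As a quick smoke test of the model itself, independent of Subtide, you can run a one-off prompt from the terminal:

```bash
# The model is loaded on first use, so the first run takes longer
ollama run llama3.1:8b "Translate to French: The movie starts in ten minutes."
```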

Which model works best depends on the target language:

| Language | Recommended Model | Notes |
| --- | --- | --- |
| English | Any model | All work well |
| European languages | Mistral, Llama | Good coverage |
| Chinese | Qwen 2.5 | Specifically trained |
| Japanese | Qwen 2.5 | Good Asian language support |
| Korean | Qwen 2.5 | Good Asian language support |
Match the model size to your available RAM:

| RAM | Recommended Model |
| --- | --- |
| 8 GB | Not recommended |
| 16 GB | 7B models (quantized) |
| 32 GB | 7B-8B models |
| 64 GB | 13B-14B models |
| 128 GB+ | 70B models |

Warning: Memory Requirements. Models require significant RAM. The values above are rough estimates; quantized models (Q4, Q5) use less memory.
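If you are unsure what an installed Ollama model actually uses, `ollama show` prints its details, including parameter count, context length, and quantization level:

```bash
# Inspect an installed model's parameters and quantization
ollama show llama3.1:8b
```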


You can tune local inference performance in a few ways:

  1. GPU layers: increase for faster inference
  2. Context length: reduce for less memory
  3. Batch size: adjust based on hardware

With Ollama, runtime options such as num_gpu (GPU layers) and num_ctx (context size) can be set from inside the interactive prompt:

```bash
# Start an interactive session, then set options at the >>> prompt
ollama run llama3.1:8b
>>> /set parameter num_gpu 35
>>> /set parameter num_ctx 4096
```
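To make tuned values persistent, one option is to bake them into a derived model with a Modelfile rather than setting them each session. A minimal sketch, assuming the num_ctx and num_gpu runtime options; the name subtide-llama is arbitrary:

```bash
# Create a Modelfile that derives a tuned variant of llama3.1:8b
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 4096
PARAMETER num_gpu 35
EOF

# Build the derived model and run it
ollama create subtide-llama -f Modelfile
ollama run subtide-llama
```

If you go this route, set the Model field in Subtide to the derived name (subtide-llama in this sketch).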

For fully local operation:

```bash
# Backend with local Whisper
WHISPER_MODEL=base WHISPER_BACKEND=mlx ./subtide-backend
```

Extension settings:

  • Operation Mode: Tier 2 (Enhanced)
  • API Provider: Custom Endpoint
  • API URL: http://localhost:1234/v1
  • API Key: lm-studio

This setup:

  • Transcribes audio locally with Whisper
  • Translates locally with LM Studio/Ollama
  • No data leaves your machine

Troubleshooting

Connection refused:

  1. Verify the local server is running
  2. Check the port (1234 for LM Studio, 11434 for Ollama); see the quick checks below
  3. Make sure no firewall is blocking the connection
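A quick way to check both default endpoints from a terminal:

```bash
# LM Studio (default port 1234): list models via the OpenAI-compatible endpoint
curl -s http://localhost:1234/v1/models

# Ollama (default port 11434): the root path responds with "Ollama is running"
curl -s http://localhost:11434/
```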
Slow inference:

  1. Use a smaller model
  2. Increase the number of GPU layers
  3. Use quantized versions

Out of memory:

  1. Use a smaller model
  2. Use more aggressive quantization
  3. Reduce the context length

Poor translation quality:

  1. Use a larger model
  2. Try a different model architecture
  3. Consider using cloud APIs for critical work

Quality ranking:

  1. GPT-4o (cloud) - Best quality
  2. Llama 3.1 70B (local) - Excellent
  3. Qwen 2.5 14B (local) - Very good for Asian languages
  4. Llama 3.1 8B (local) - Good
  5. Mistral 7B (local) - Good

Speed ranking:

  1. Mistral 7B - Fastest
  2. Llama 3.1 8B - Fast
  3. Qwen 2.5 7B - Fast
  4. Qwen 2.5 14B - Medium
  5. Llama 3.1 70B - Slow

| Setup | Cost |
| --- | --- |
| OpenAI GPT-4o | ~$0.01/video |
| OpenAI GPT-4o-mini | ~$0.001/video |
| Local LLM | Electricity only |

Tip: Hybrid Approach. Use local models for most translations and cloud APIs when you need the highest quality.