Commercially available large language models such as OpenAI’s GPT-4, Google’s Gemini and Anthropic’s Claude continue to dominate the headlines and have been widely discussed within the testing community. However, there has also been a great deal of research and development in openly available models that can be run locally on consumer-grade hardware.
A number of exceptional language models, published by a range of organisations, are now available to download, run and enhance, including:
Meta’s Llama 3.1 – https://llama.meta.com/
Google’s Gemma 2 – https://ai.google.dev/gemma
Alibaba’s Qwen2 – https://github.com/QwenLM/Qwen2
Microsoft’s Phi-3 – https://azure.microsoft.com/en-gb/products/phi-3/
A vibrant community has emerged around these openly available models, enhancing, modifying, fine-tuning and benchmarking them to suit a myriad of use cases, and Hugging Face has become the leader in the open-model ecosystem – https://huggingface.co/
Excitingly, these open models are now as capable as the commercially available models in some use cases:
(https://www.theverge.com/2024/4/18/24134103/llama-3-benchmark-testing-ai-gemma-gemini-mistral)
(https://aimlapi.com/comparisons/qwen-2-vs-chatgpt-4o-comparison)
Getting started – LM Studio
A good starting point is LM Studio, available from https://lmstudio.ai/, which offers a familiar ‘chat’ interface and a ‘discover’ page for browsing models.
Once installed, you can download a model:
Meta Llama 3.1 8B is a good starting point. Once downloaded, you can then start to send prompts and chat with the model – all running locally on your computer without sending any data to third-party services online.
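Beyond the chat window, LM Studio also offers a local server mode that mimics the OpenAI API, which is handy if you want to call the model from scripts or test harnesses. The sketch below is illustrative rather than definitive: it assumes the server has been enabled in LM Studio, is listening on its default port of 1234, a model is already loaded, and that a bash-style shell is handling the quoting.

    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "local-model",
            "messages": [
              {"role": "user", "content": "Suggest three edge cases for testing a date-of-birth field"}
            ]
          }'

The ‘model’ value here is a placeholder – LM Studio replies with whichever model is currently loaded, and the exact request options can vary between versions.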
Getting started – Ollama
If you prefer a more minimalist command-line experience and a curated choice of models, the more development-focused Ollama platform can be downloaded from https://ollama.com/download
Once installed, everything is done from the command line. For example, to download and begin prompting with llama3.1 on Windows, open a terminal to get started:
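(The commands below are a sketch of the typical flow – the installer normally adds ollama to your PATH, and the model is downloaded automatically the first time it is run.)

    ollama run llama3.1       # downloads the model on first use, then opens an interactive chat (type /bye to exit)
    ollama list               # show the models already downloaded
    ollama run llama3.1 "Suggest some exploratory test charters for a login form"

The last form sends a one-off prompt and prints the response without starting an interactive session.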
Choosing models – parameters vs quantization
The key limiting factor in running language models locally is available memory: for reasonable performance there must be enough GPU memory to hold most, or ideally all, of the model.
When models are trained, the number of parameters governs the resulting size – at full precision (16-bit) each parameter takes two bytes, so the Llama 3.1 8B (8 billion parameters) model needs around 17GB while the 70B model needs over 148GB!
Quantization is a technique that reduces model size at the expense of some accuracy. Reducing the Llama 3.1 8B model to 4-bit (the default download for both LM Studio and Ollama) gives excellent performance and requires only 6.7GB of GPU VRAM, whereas the full 16-bit model would need 17GB – if that were not available on the GPU, the remainder of the model would have to sit in system memory, resulting in much slower responses.
Obviously, the 70B model has been trained on a richer set of data than the 8B model, but running it with very aggressive quantization would have an adverse effect on the accuracy of its responses: https://www.theregister.com/2024/07/14/quantization_llm_feature/
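If you want to experiment with this trade-off yourself, both tools let you pick a specific quantization rather than accepting the default. As an illustrative sketch, Ollama publishes tagged variants of each model – the tags below are examples only, so check the model’s page at https://ollama.com/library for the exact tags and sizes available:

    ollama run llama3.1:8b-instruct-q4_0     # 4-bit quantization, the smallest footprint
    ollama run llama3.1:8b-instruct-q8_0     # 8-bit, larger but closer to full accuracy
    ollama run llama3.1:8b-instruct-fp16     # full 16-bit precision, the largest download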
Choosing models – task specific
While Llama 3.1 is a good all-round performing LLM, there is a range of models of varying sizes, some better suited to specific tasks, including:
InternLM 2.5 – Strong reasoning across the board
Codestral – Mistral AI’s model for code generation tasks in over 80 programming languages
Deepseek-coder-v2 – An open model that is comparable to GPT-4 Turbo in code-specific tasks
Gemma 2 – Excellent performance for its size, benefiting from Google’s experience with Gemini
Phi-3 – Microsoft’s small model with a large context window
Qwen2 – A new series of models in a range of sizes
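Most of these can be found on LM Studio’s discover page and in Ollama’s library under similar names. As a rough sketch (the names below reflect Ollama’s library at the time of writing – check https://ollama.com/library for the current list):

    ollama run gemma2
    ollama run qwen2
    ollama run phi3
    ollama run codestral
    ollama run deepseek-coder-v2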
Summary
If you have a reasonably powerful desktop or laptop, you can easily run an LLM locally and expect results that are comparable with commercial offerings. Locally running LLMs are also inherently private, since nothing is sent to third-party services, so if data privacy concerns have been preventing you from using LLMs, this is a viable and cost-free solution.