10 Best LLMs You Can Run on Your Computer

Large Language Models (LLMs) have revolutionised the world of artificial intelligence, yet they are not as far out of reach for the average computer user as many believe. In this guide, I will walk you through the 10 best LLMs you can run on your personal computer.

Whether you’re a developer, researcher, or AI enthusiast, you’ll discover options that fit your computing resources and requirements.

But before that, let’s take a look at a few reasons why one would want to run LLMs locally on their computers.

Good reasons to run LLMs locally

I do it for privacy: I don’t want to hand my data over to further train someone else’s AI models. Here are some more reasons to run LLMs locally on your computer:

  • Running LLMs locally gives you more control over the model, data, and infrastructure. You can optimise and fine-tune the model for your specific use case.
  • Keeping your data and models on-premises reduces the risk of exposing sensitive information to third parties.
  • You maintain full control over access and security measures for your data and intellectual property when not relying on cloud providers.
  • If you already have the necessary computing infrastructure, running LLMs locally can be more cost-effective than paying for cloud GPU resources and API calls.

What are the best LLMs to run on a computer?

Here’s a list of the 10 best LLMs that you can run on your computer:

  • Llama 3.1 8B: Exceptional multilingual coding assistant with a massive 128K token context length.
  • StarCoder: Specialized coding powerhouse trained on 80+ programming languages with state-of-the-art performance.
  • BLOOM-7.1B: Multilingual marvel supporting 46 natural languages and 13 programming languages with strong zero-shot cross-lingual transfer.
  • Gemma 7B: Compact yet powerful open model excelling in math, science, and coding tasks.
  • Mistral-7B-Instruct-v0.2: Instruction-tuned model with advanced attention mechanisms for superior task understanding and execution.
  • GPT-J-6B: Versatile autoregressive model trained on a diverse 800GB dataset for robust natural language processing.
  • FLAN-T5-XXL: Instruction-tuned powerhouse fine-tuned on 1,800+ tasks for exceptional versatility and performance.
  • GPT-NeoX-20B: Massive 20B parameter model with impressive zero-shot capabilities across a wide range of tasks.
  • T0pp: Zero-shot generalization champion that outperforms much larger models on unseen tasks.
  • Dolly-v2-7b: Efficient, instruction-following model fine-tuned on high-quality human-generated data for practical applications.

Now, let’s address each one in detail.

1. Llama 3.1 8B

Llama 3.1 8B is part of Meta’s latest series of open-source language models, designed to provide advanced AI capabilities while being accessible for local deployment. This model is particularly suitable for developers and researchers who need a customizable and efficient AI tool that can run on consumer-grade hardware.

Key features:

  • Multilingual Capabilities: Llama 3.1 8B officially supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. This makes it versatile for a wide range of multilingual applications.
  • Large Context Length: The model has an impressive context length of 128K tokens, a significant increase from previous versions. This allows for much longer interactions and enables the processing of longer documents.
  • Enhanced Coding Skills: Llama 3.1 8B demonstrates strong performance on coding benchmarks like HumanEval, MBPP, and HumanEval+. It achieves scores of 72.6, 60.8, and 67.1 respectively on these benchmarks, showcasing its coding capabilities.
  • Improved Tool Usage: The model has been fine-tuned to effectively use various tools, enabling it to perform a wide range of tasks in a chat setup, including multi-turn dialogues and complex problem-solving scenarios.
  • Open-Source Availability: As an open-source model, Llama 3.1 8B’s weights are available for download. Developers can fully customize the model for their needs, train on new datasets, and conduct additional fine-tuning. This enables the broader developer community to realize the power of generative AI.
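If you want to try Llama 3.1 8B locally, here’s a minimal sketch using the Hugging Face transformers library. The repository ID, memory figure, and prompt are my own assumptions: you need a recent transformers release, access to the gated meta-llama/Meta-Llama-3.1-8B-Instruct repo, and roughly 16 GB of VRAM for fp16 weights (quantisation helps on smaller cards).

```python
# Minimal sketch: chatting with Llama 3.1 8B Instruct via transformers.
# Assumes access has been granted to the gated meta-llama repository.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,   # roughly 16 GB of VRAM for the 8B weights
    device_map="auto",           # place layers on the available GPU(s)
)

messages = [
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Write a Python function that reverses a string."},
]

# Recent transformers versions let the pipeline apply the chat template
# to a list of messages and return the full conversation.
output = generator(messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
```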

2. StarCoder

With an impressive 15.5 billion parameters, StarCoder is designed to assist developers with a wide range of coding tasks across more than 80 programming languages. At its core, StarCoder is based on a GPT-2-style decoder-only architecture, but it incorporates key modifications such as multi-query attention and an 8K-token context window to optimize its performance on code.

StarCoder was trained in two stages. First, the base model (StarCoderBase) was trained on roughly 1 trillion tokens of code drawn from The Stack, a large dataset of permissively licensed source code spanning a diverse set of programming languages.

To further refine StarCoder’s capabilities, the BigCode team performed an additional fine-tuning step using 35 billion tokens of Python code. This targeted fine-tuning allows StarCoder to excel at Python-specific coding tasks and achieve state-of-the-art performance on various benchmarks.

Key features:

  • Code-Specific Training: StarCoder is a large language model specifically trained on code from 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks. This code-centric training allows StarCoder to excel at a wide range of coding tasks compared to general-purpose language models.
  • Exceptional Performance on Coding Benchmarks: StarCoder outperforms existing open-source code language models and matches or surpasses closed models like OpenAI’s code-cushman-001 on popular programming benchmarks. It demonstrates strong performance on tasks like code completion, generation, translation, and summarization.
  • Extended Context Length: With a context length of over 8,000 tokens, StarCoder can process more input than most open code models available at its release. This enables it to handle longer code snippets, understand more context, and perform complex coding tasks.
  • Versatile Coding Assistant: StarCoder can be used as a versatile coding assistant, helping with tasks like generating code snippets, completing partially written code, fixing bugs, formatting code, translating between programming languages, and providing explanations in natural language.
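As a quick illustration, here’s a minimal code-completion sketch with transformers. The repo ID and memory estimate are assumptions on my part: you must accept the BigCode licence for the gated bigcode/starcoder repository, and the 15.5B weights need roughly 31 GB in fp16 (8-bit or 4-bit quantisation brings that down considerably).

```python
# Minimal sketch: code completion with StarCoder via transformers.
# Assumes access to the gated bigcode/starcoder repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # roughly 31 GB for 15.5B parameters in fp16
    device_map="auto",
)

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding is usually enough for short code completions.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```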

3. BLOOM-7.1B

BLOOM-7.1B is a powerful open-source multilingual language model developed by BigScience, a collaborative research workshop involving over 1,000 researchers from around the world. With 7.1 billion parameters, it uses a decoder-only transformer architecture consisting of 30 layers and 32 attention heads.

Key Features:

  • Multilingual Capabilities: BLOOM-7.1B supports 46 natural languages and 13 programming languages out of the box. This makes it highly versatile for multilingual applications across many languages.
  • Strong Zero-Shot Cross-Lingual Transfer: BLOOM-7.1B demonstrates impressive zero-shot performance on cross-lingual benchmarks like XNLI without requiring language-specific fine-tuning. This highlights its ability to effectively transfer knowledge across languages.
  • Decoder-Only Architecture: Like Llama 3.1 8B, BLOOM-7.1B uses a decoder-only transformer architecture. This architecture choice suits it to tasks such as language generation and completion.
  • Efficient Inference: With 7.1 billion parameters, BLOOM-7.1B provides a good balance between model capability and computational efficiency. It requires around 14 GB of VRAM, making it feasible to run inference on a single high-end GPU.
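Here’s a minimal sketch of multilingual generation with BLOOM-7.1B. It assumes transformers and torch are installed and a GPU with about 14 GB of VRAM for fp16 weights; the prompt and decoding settings are just illustrative.

```python
# Minimal sketch: multilingual text generation with BLOOM-7.1B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-7b1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # ~14 GB of VRAM, matching the figure above
    device_map="auto",
)

# BLOOM was pretrained on 46 natural languages, so prompts need not be English.
prompt = "La capitale de la France est"  # French: "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```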

4. Gemma 7B

Gemma 7B is a state-of-the-art open language model developed by Google with a focus on open accessibility and responsible AI practices.

Under the hood, Gemma 7B utilizes a decoder-only transformer architecture with 28 layers and 16 attention heads. It was trained on a vast corpus of web documents, code, and mathematical text that was carefully filtered for safety and quality.

Key features:

  • Strong Text Capabilities: Gemma 7B is a text-in, text-out model that shares research and technology with Google’s Gemini family. It is well suited to text tasks such as chat assistance, summarization, question answering, and code generation.
  • Compact Model Size: Despite its powerful capabilities, Gemma 7B has a relatively compact model size of just 7 billion parameters. This lightweight architecture enables more efficient deployment and inference on a wider range of hardware compared to larger models.
  • Strong Math, Science & Coding Performance: Gemma 7B demonstrates impressive capabilities on technical tasks related to math, science, and programming. It achieves high scores on benchmarks in these domains, showcasing its analytical reasoning and code-generation abilities.
  • Permissive Open-Weights License: Gemma 7B’s weights are openly released under Google’s Gemma Terms of Use, which allow usage, modification, and redistribution, including for commercial purposes, subject to a use policy. This relatively permissive licensing encourages wide adoption and enables the community to adapt and build upon the model.
  • Large Vocabulary Size: The model boasts an extensive vocabulary of around 256,000 tokens. This is substantially larger than competitors like Llama 2, which uses a 32,000-token vocabulary. The broad vocabulary helps Gemma 7B understand and generate more diverse and nuanced language.
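To run Gemma 7B on a smaller card, 4-bit quantisation is a common approach. The sketch below is illustrative rather than official: it assumes you have accepted Google’s terms for the gated google/gemma-7b-it repository and installed bitsandbytes alongside transformers, and the VRAM figures are rough estimates.

```python
# Minimal sketch: running the instruction-tuned Gemma 7B in 4-bit to save VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-7b-it"
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # roughly 5-6 GB of VRAM instead of ~16 GB in fp16
    device_map="auto",
)

messages = [{"role": "user", "content": "Solve 2x + 3 = 11 and show your steps."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```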

5. Mistral-7B-Instruct-v0.2

Mistral-7B-Instruct-v0.2 is an instruction-tuned version of the base Mistral-7B model, fine-tuned to better understand and follow natural language instructions.

At its core, Mistral-7B-Instruct-v0.2 employs a decoder-only transformer architecture with 32 layers and 32 attention heads. The model was trained on a large corpus of high-quality web-scraped data, carefully filtered for safety and relevance. It uses a byte-fallback SentencePiece tokenizer with a vocabulary of around 32,000 tokens.

Key features:

  • Instruction Fine-Tuning: Mistral-7B-Instruct-v0.2 is an instruction fine-tuned version of the base Mistral-7B model. This fine-tuning allows the model to better understand and follow instructions, making it more suitable for task-oriented applications.
  • Grouped-Query Attention: The model incorporates grouped-query attention, in which groups of query heads share a single key/value head. This shrinks the key/value cache and speeds up inference while preserving output quality, making instruction-following responses cheaper to serve.
  • Long-Context Attention: The original Mistral 7B introduced sliding-window attention to process long sequences efficiently, and the v0.2 release extends the context window to 32K tokens. This helps the model capture long-range dependencies and handle longer inputs effectively.
  • Strong Zero-Shot Performance: The model demonstrates impressive zero-shot performance on various benchmarks and tasks. It can generate high-quality responses without the need for task-specific fine-tuning, showcasing its versatility and adaptability.
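Here’s a minimal instruction-following sketch for Mistral-7B-Instruct-v0.2. The prompt and the ~15 GB fp16 memory estimate are my own assumptions; the tokenizer’s chat template wraps the message in Mistral’s [INST] … [/INST] instruction format, which is what apply_chat_template handles below.

```python
# Minimal sketch: instruction following with Mistral-7B-Instruct-v0.2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# apply_chat_template produces the [INST] ... [/INST] formatted input ids.
messages = [{"role": "user", "content": "Summarise what grouped-query attention does in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=150, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```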

6. GPT-J-6B

GPT-J-6B is a powerful autoregressive language model developed by EleutherAI, a research group dedicated to open-source AI development. GPT-J-6B was trained on a massive dataset consisting of 800GB of high-quality text data.

This diverse training corpus, known as The Pile, was curated by EleutherAI and encompasses a wide range of domains, from academic papers to web content.

  • Autoregressive Language Modeling: GPT-J-6B is an autoregressive, decoder-only transformer model. It is designed to predict the next token in a sequence, making it highly capable of language generation tasks.
  • Versatile Natural Language Processing: GPT-J-6B can be applied to a wide range of natural language processing tasks beyond just text generation, such as text classification, question answering, summarization, and more. Its strong language understanding capabilities make it a versatile tool.
  • Deep Architecture: The model consists of 28 transformer layers, each with a hidden dimension of 4096 and 16 attention heads. The feedforward dimension is 16384. This deep architecture allows GPT-J-6B to capture intricate patterns and long-range dependencies in text.
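Since GPT-J-6B is a base model rather than a chat model, the natural way to use it locally is plain text continuation. Here’s a minimal sketch, assuming transformers and torch are installed and roughly 12 GB of VRAM is available for fp16 weights; the prompt and sampling settings are illustrative.

```python
# Minimal sketch: plain text generation with GPT-J-6B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # roughly 12 GB of VRAM for the 6B weights
    device_map="auto",
)

prompt = "The Pile is a large, diverse training dataset because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# GPT-J is not instruction-tuned, so it continues text rather than answering
# questions; sampling tends to give more natural continuations.
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```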

7. FLAN-T5-XXL

FLAN-T5-XXL is a state-of-the-art language model developed by Google. The model builds upon the T5 (Text-to-Text Transfer Transformer) architecture, which has proven to be highly effective for a wide range of natural language tasks.

FLAN-T5-XXL employs an encoder-decoder structure with 24 layers in both the encoder and the decoder, a hidden size of 4096, and a feed-forward size of 10240. One of the key features of FLAN-T5-XXL is its instruction fine-tuning: the model has been fine-tuned on a diverse set of more than 1,800 tasks, each expressed through natural language instructions.

  • Instruction Fine-Tuning: FLAN-T5-XXL has been fine-tuned on a mixture of more than 1,800 diverse tasks that were expressed via natural language instructions. This instruction fine-tuning allows the model to better understand and follow a wide variety of instructions, making it highly versatile.
  • State-of-the-Art Performance: FLAN-T5-XXL achieves state-of-the-art performance on several benchmarks. For example, the larger variant Flan-PaLM 540B achieves 75.2% on five-shot MMLU. FLAN-T5-XXL itself outperforms the much larger PaLM 62B model on some challenging BIG-Bench tasks.
  • Efficient Serving with Quantization: FLAN-T5-XXL can be served efficiently by quantizing the model weights to int8 precision. The 11B-parameter model needs roughly 45 GB of memory in full fp32 precision, while int8 cuts that to around 11-12 GB, letting it fit on a single high-end GPU for more cost-effective deployment (see the loading sketch below).
  • Broad Language Coverage: The model has been trained on a broad set of tasks covering multiple languages including English, German, and French. This allows FLAN-T5-XXL to perform well on a variety of language tasks across several languages.
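Here’s the kind of int8 loading mentioned above, sketched with transformers and bitsandbytes. The memory figures are rough estimates, and you will also need the accelerate package installed.

```python
# Minimal sketch: loading FLAN-T5-XXL in int8 and running an instruction.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/flan-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # ~11-12 GB vs ~45 GB in fp32
    device_map="auto",
)

# FLAN-T5 responds directly to natural-language instructions.
prompt = "Translate to German: The weather is nice today."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```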

8. GPT-NeoX-20B

GPT-NeoX-20B is a 20-billion-parameter autoregressive language model developed by EleutherAI. Its architecture is based on the GPT-3 design but adds several notable enhancements, with 44 transformer layers, each with a hidden size of 6144 and 64 attention heads.

One of the key strengths of GPT-NeoX-20B is its training data. The model was trained on The Pile, a massive and diverse dataset curated by EleutherAI, containing over 800GB of high-quality text from a wide range of domains, including academic papers, books, websites, and more.

Key features:

  • Massive Scale: GPT-NeoX-20B is a truly massive language model with 20 billion parameters. This enormous parameter count allows the model to capture and generate complex linguistic patterns, resulting in strong performance across a wide range of natural language tasks.
  • Diverse Training Data: GPT-NeoX-20B was trained on the Pile, a large-scale, curated dataset created by EleutherAI. The Pile spans a wide range of domains, from academic writing to web content. This diverse training data allows the model to perform well on a broad spectrum of language tasks and adapt to different contexts.
  • Strong Zero-Shot Performance: The model demonstrates impressive zero-shot capabilities, meaning it can perform well on tasks it was not explicitly trained on. For example, GPT-NeoX-20B exceeded GPT-3’s one-shot performance on the challenging MATH dataset without task-specific fine-tuning. This adaptability makes it a powerful general-purpose language model.
  • Improved Architecture: While largely following the GPT-3 architecture, GPT-NeoX-20B incorporates several notable improvements. These include using rotary positional embeddings, computing attention and feedforward layers in parallel, and employing a different initialization scheme. These optimizations contribute to the model’s strong performance.
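Because the 20B weights come to roughly 40 GB in fp16, running GPT-NeoX-20B locally usually means spreading the model across several GPUs or offloading part of it to CPU RAM. Here’s a minimal sketch under those assumptions; the offload folder name is an arbitrary choice of mine, used only if accelerate decides to spill weights to disk.

```python
# Minimal sketch: loading GPT-NeoX-20B with automatic device placement.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",        # shard across GPUs / offload to CPU RAM as needed
    offload_folder="offload", # arbitrary scratch directory for disk offload
)

prompt = "In number theory, a prime number is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```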

9. T0pp

T0pp is a powerful language model developed by BigScience, a collaborative research workshop involving AI researchers from around the world. T0pp, pronounced “T Zero Plus Plus,” is part of the T0 model family that showcases impressive zero-shot task generalization capabilities.

T0pp is built upon the T5 (Text-to-Text Transfer Transformer) architecture, utilizing an encoder-decoder structure. With 11 billion parameters, it is one of the larger models in the T0 series, enabling it to capture and understand complex language patterns and nuances.

Key features:

  • Instruction Fine-Tuning: T0pp has been fine-tuned on a large set of diverse tasks specified via natural language instructions. This allows the model to understand and follow instructions for a wide variety of NLP tasks, making it highly versatile.
  • Zero-Shot Task Generalization: T0pp demonstrates impressive zero-shot performance, outperforming GPT-3 on many tasks while being 16x smaller. It can perform well on completely unseen tasks specified through natural language prompts without requiring task-specific fine-tuning.
  • Prompt Engineering: The training data for T0pp was created by converting supervised datasets into prompts using multiple templates and varying formulations. This showcases the importance of prompt engineering in eliciting strong zero-shot performance from language models.
  • Broad Task Coverage: T0pp was fine-tuned on a multitask mixture covering a wide range of NLP tasks including question answering, natural language inference, coreference resolution, word sense disambiguation, and sentence completion. This broad coverage contributes to its strong zero-shot generalization.
  • Efficient Serving: Despite its large size (11B parameters), T0pp can be served efficiently by using techniques like parallelization across multiple GPUs. The model was trained with bf16 activations, so using fp32 or bf16 precision is recommended over fp16 for inference.
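Here’s a minimal zero-shot prompting sketch for T0pp, loaded in bf16 as recommended above. The prompt is just one example of specifying a task in natural language, and the ~22 GB bf16 memory estimate is my own, so device_map="auto" may need to spread the weights across devices.

```python
# Minimal sketch: zero-shot prompting with T0pp in bf16.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "bigscience/T0pp"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained with bf16 activations
    device_map="auto",           # 11B params may need to span multiple devices
)

# The task is specified entirely through a natural-language prompt.
prompt = ("Is the word 'bank' used in the same sense in these two sentences? "
          "1) She sat on the river bank. 2) He deposited money at the bank.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```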

10. Dolly-v2-7b

Dolly-v2-7b is an open-source, instruction-following language model developed by Databricks. With 6.9 billion parameters, it is based on EleutherAI’s Pythia-6.9b architecture and has been fine-tuned on a curated dataset of approximately 15,000 human-generated instruction/response pairs called databricks-dolly-15k.

  • Instruction Fine-Tuning: Dolly-v2-7b has been fine-tuned on a ~15K record instruction dataset called databricks-dolly-15k, generated by Databricks employees. This instruction fine-tuning allows the model to follow instructions and exhibit behaviour beyond what is typical of the Pythia base model it was derived from.
  • Efficient Serving with Quantization: Despite its size of 6.9B parameters, Dolly-v2-7b can be served efficiently by quantizing the model weights to 8-bit precision. In 8-bit the weights take roughly 7 GB, so the model fits comfortably on a single consumer GPU (a minimal loading sketch follows this list).
  • Strong Zero-Shot Performance: While not state-of-the-art, Dolly-v2-7b demonstrates strong zero-shot performance on various benchmarks and tasks. It can generate coherent responses to instructions without task-specific fine-tuning.
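Here’s a minimal sketch of the pipeline-based loading commonly used for Dolly. The trust_remote_code=True flag pulls Databricks’ custom instruction pipeline from the model repository so you don’t have to reproduce Dolly’s prompt format yourself; treat the exact behaviour as something to verify against the model card, and the prompt is only an example.

```python
# Minimal sketch: instruction following with Dolly-v2-7b via transformers.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # loads the custom instruction pipeline from the repo
    device_map="auto",
)

result = generate_text("Explain the difference between a list and a tuple in Python.")
print(result[0]["generated_text"])
```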

That said, Dolly-v2-7b does have some limitations. The model may struggle with syntactically complex prompts, programming problems, and math operations. It can occasionally make factual errors or hallucinate information that is not grounded in reality.

Additionally, Dolly-v2-7b is not considered state-of-the-art in overall performance when compared to larger language models.

Wrapping Up…

This was my take on the 10 best LLMs that you can run locally on your computer. I hope you find this guide helpful. If you have any queries, feel free to reach out to us and we will get back to you ASAP 🙂
