AI Reference Guide
As organisations explore how to boost productivity using Artificial Intelligence, understanding the basics of how Large Language Models (LLMs) work is a crucial first step. This AI Reference Guide introduces key terms, concepts, and technical variables to help you make informed decisions when considering the adoption of AI in your workplace.
Whether you're evaluating local deployment options, planning integrations with internal systems, or simply curious about how models like ChatGPT function under the hood, this guide will give you a clear, jargon-free foundation.
You can read below or download a PDF copy for printing and sharing with your team.
Common Terms & Concepts
- Model: A mathematical system trained on data to perform tasks like generating text, translating languages, or answering questions. In AI, a language model learns patterns in language so it can predict what comes next in a sentence.
- Training / Fine-Tuning:
  - Training is the initial process of teaching an AI model by feeding it massive amounts of data so it can learn patterns.
  - Fine-tuning happens after initial training: the model is further adjusted using smaller, specialised datasets to improve performance on specific tasks (e.g., medical chatbots, legal assistants).
- Inference: The process of using a trained model to generate predictions or responses. For example, when you ask ChatGPT a question, it performs inference to produce the answer.
- Parameter Count: Number of trainable weights in a model (e.g., 7B = 7 billion). These parameters determine how well the model can learn and generalise from data.
- Token: A piece of text (a word or subword) that the model processes sequentially. For example, “cat” might be one token, while “predictable” could be broken into several subword tokens (a tokenization sketch follows this list).
- Context Window (Token Limit): The maximum length of input plus output the model can handle at once (e.g., 128k tokens). Longer windows allow models to handle bigger documents or longer conversations.
- Accuracy / Benchmarks: Performance on standardised tasks like MMLU (Massive Multitask Language Understanding), HellaSwag (commonsense reasoning), HumanEval (code generation), etc. Higher scores indicate stronger real-world understanding.
- Quantization: A technique to reduce model size and speed up performance by using lower precision numbers (e.g., INT8 instead of FP16), making it easier to run large models on smaller hardware.
- RAG: Retrieval-Augmented Generation enhances an LLM by retrieving relevant documents or chunks from your knowledge base and injecting them into the prompt before generating a response, automating the retrieval of relevant context (a minimal sketch follows this list).
- MCP: MCP (Model Context Protocol) is an open protocol that standardises how applications provide context to LLMs. MCP gives you a repeatable and structured format to inject context into the model.
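To make tokens and context windows concrete, here is a minimal sketch of counting tokens with the open-source tiktoken package. It assumes tiktoken is installed and uses the GPT-style "cl100k_base" encoding; other models (LLaMA, Qwen, etc.) ship their own tokenizers, so exact counts will differ.

```python
# Illustrative only: uses the "tiktoken" package and the GPT-style
# "cl100k_base" encoding. Token counts vary between model families.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The cat sat on the mat. Predictable, really."
tokens = enc.encode(text)

print(tokens)              # a list of integer token IDs
print(len(tokens))         # how many tokens the text costs
print(enc.decode(tokens))  # round-trips back to the original text

# Rough context-window check: prompt plus expected response must fit
# inside the model's token budget (e.g., 128k tokens).
context_window = 128_000
max_response_tokens = 2_000
print(len(tokens) + max_response_tokens <= context_window)  # True
```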
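The following is a minimal sketch of the RAG pattern described above. The keyword-overlap retriever is a toy stand-in and the final model call is only indicated in a comment; production systems use embedding search over a vector store and a real model API.

```python
# Minimal RAG sketch: retrieve relevant chunks, inject them into the prompt.
# The retriever here is a toy keyword-overlap scorer, not a vector search.

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Score documents by shared words with the question; return the best matches."""
    q_words = set(question.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Inject the retrieved context into the prompt ahead of the question."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

knowledge_base = [
    "Staff can claim travel expenses within 30 days of the trip.",
    "The VPN must be used when accessing internal systems remotely.",
    "Annual leave requests are approved by your line manager.",
]

question = "How long do I have to claim travel expenses?"
prompt = build_prompt(question, retrieve(question, knowledge_base))
print(prompt)
# The assembled prompt is then sent to the model for inference:
# answer = ask_llm(prompt)   # placeholder for your model API call
```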
Open‑Source LLM Comparison
| Model | Params | Benchmarks | Possible Hardware |
| --- | --- | --- | --- |
| LLaMA 3.1 (Meta) | 8B / 70B / 405B | MMLU: 87.3% (405B), 82.0% (70B), 69.4% (8B); HumanEval: 89.0% (405B) | 8B: Mac Mini M4 (24GB), RTX 4070; 70B: Mac Studio M3 Ultra (128GB), 2x RTX 4090; 405B: Enterprise GPU clusters |
| Qwen 3 (Alibaba) | 1B – 235B (22B active MoE) | MMLU: 64.3% (8B), strong multilingual, coding excellence | 8B: Mac Mini M4 (24GB), RTX 4060 Ti; 22B: Mac Studio M3 Max (64GB), RTX 4080; 235B: Mac Studio M3 Ultra (512GB), Multi-GPU workstation |
| DeepSeek-V3 (DeepSeek) | 671B total / 37B active | MMLU: 88.5%, MATH: 90.2%, HumanEval: 82.6%, state-of-the-art performance | 37B active: Mac Studio M3 Ultra (128GB), 2x RTX 4090; Full model: Enterprise clusters |
| Mistral Large 2 | 123B (dense) | MMLU: ~84.0%, strong instruction following, 128K context | Mac Studio M3 Ultra (256GB), 4x RTX 4090, H100 |
| DeepSeek-R1 (DeepSeek) | 671B total / 37B active | MMLU: 90.8%, MATH: 97.3%, reasoning specialist competitive with OpenAI o1 | 37B active: Mac Studio M3 Ultra (128GB), 2x RTX 4090; Full model: Enterprise clusters |
| Kimi K2 (Moonshot AI) | 1T total / 32B active (MoE) | LiveCodeBench: 53.7%, SWE-bench: 65.8%, GPT-4-class performance | 32B active: Mac Studio M3 Ultra (128GB), RTX 4090; Full model: Multi-GPU clusters |
GPU Size & Hardware Requirements
| Model Size | Precision | Memory Needed | Mac Options | PC/GPU Options | Performance Notes |
| --- | --- | --- | --- | --- | --- |
| 3–7B | INT4 | ~3.5–4 GB | Mac Mini M4 (16GB), MacBook Pro M4 (16GB) | RTX 3060, RTX 4060 | Budget-friendly, excellent Mac performance |
| 3–7B | FP16 | ~14–16 GB | Mac Mini M4 (24GB), Mac Studio M3 Max (32GB) | RTX 4090, RTX 5090 | High-end consumer setup |
| 8–13B | INT4 | ~6.5–7 GB | Mac Mini M4 (32GB), Mac Studio M3 Max (64GB) | RTX 4070, RTX 5070 | Good balance of cost/performance |
| 8–13B | FP16 | ~26–28 GB | Mac Studio M3 Max (64GB), Mac Studio M3 Ultra (128GB) | 2x RTX 4090, RTX 5090 | Professional workstation level |
| 20–30B | INT4 | ~15–20 GB | Mac Studio M3 Max (64GB), Mac Studio M3 Ultra (128GB) | RTX 4090, A6000 | High-end workstation |
| 20–30B | FP16 | ~60–65 GB | Mac Studio M3 Ultra (128GB), Mac Studio M3 Ultra (256GB) | 4x RTX 4090, A6000, H100 | High-memory workstation/server |
| 65–70B | INT4 | ~35–42 GB | Mac Studio M3 Ultra (128GB), Mac Studio M3 Ultra (256GB) | A6000 (48GB), H100 | Great Mac performance at this size |
| 65–70B | FP16 | ~140–150 GB | Mac Studio M3 Ultra (256GB), Mac Studio M3 Ultra (512GB) | 4x A100, 2x H100 | Mac now viable for 70B FP16 |
| 120–200B | INT4 | ~60–100 GB | Mac Studio M3 Ultra (256GB), Mac Studio M3 Ultra (512GB) | 2x H100, 4x A6000 | Mac competitive for large models |
| 405B+ | INT4 | ~200+ GB | Mac Studio M3 Ultra (512GB) for smaller 405B variants | 4x H100 (80GB) | Enterprise clusters preferred |
| 405B+ | FP16 | ~800+ GB | N/A | 8x H100 (80GB) | Enterprise clusters / Cloud GPUs |
Note: Quantization (e.g., INT4) can reduce memory needs dramatically; for example, a 70B model drops from ~140–150 GB at FP16 to ~35–42 GB at INT4. A back-of-envelope calculation is sketched below.
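The memory figures in the table can be approximated from the parameter count times the bytes per parameter. The sketch below shows that arithmetic; it covers model weights only, and real deployments need extra headroom for activations and the KV cache, which depends on context length and runtime.

```python
# Back-of-envelope weight-memory estimate: parameters x bytes per parameter.
# Weights only; real usage needs extra headroom for activations / KV cache.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def estimate_weight_memory_gb(params_billion: float, precision: str) -> float:
    # 1 billion parameters is roughly 1 GB per byte of precision.
    return params_billion * BYTES_PER_PARAM[precision]

for name, size in [("7B", 7), ("70B", 70), ("405B", 405)]:
    for precision in ("FP16", "INT8", "INT4"):
        print(f"{name} @ {precision}: ~{estimate_weight_memory_gb(size, precision):.0f} GB")
# e.g. 70B @ FP16 -> ~140 GB and 70B @ INT4 -> ~35 GB,
# matching the ranges shown in the table above.
```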
FAQs
1. What’s the difference between an LLM and general AI?
A Large Language Model (LLM) is a type of AI trained to understand and generate human language. It excels at tasks like writing, summarising, and answering questions. Artificial Intelligence (AI) is a broader field that includes LLMs but also covers vision, robotics, decision-making systems, and more.
2. Do LLMs think or understand like humans?
No. LLMs generate text based on statistical patterns learned from massive datasets. They don’t have intentions, self-awareness, or true understanding, but they often appear intelligent due to the quality of their training data.
3. How do I connect to my organisation’s documents and knowledge?
You can connect to your organisation’s documents by combining Retrieval-Augmented Generation (RAG), which retrieves relevant internal content at runtime, with the Model Context Protocol (MCP), which provides a standard, structured way to supply that context to the model. This setup enables grounded, auditable answers from local AI systems without sending data to the cloud. A simple illustration of an audit record for a grounded answer is sketched below.
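The sketch below illustrates what "grounded, auditable" can mean in practice: keeping a record of the model, the retrieved context, and the prompt alongside each answer. It is an illustration of the idea only, not the actual MCP message format, and the model name and field layout are assumptions.

```python
# Illustrative audit record for a grounded answer. Not the MCP wire format;
# it simply captures model, context, prompt, and answer so each response
# can be traced back to the internal documents that informed it.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class GroundedAnswer:
    model: str                   # which local model produced the answer
    retrieved_chunks: list[str]  # internal document chunks injected as context
    prompt: str                  # the full prompt sent to the model
    answer: str                  # the model's response
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = GroundedAnswer(
    model="llama-3.1-8b-instruct",  # hypothetical local model name
    retrieved_chunks=["Staff can claim travel expenses within 30 days of the trip."],
    prompt="Answer using only the context below...\nQuestion: How long do I have to claim travel expenses?",
    answer="Travel expenses must be claimed within 30 days of the trip.",
)
print(record)
```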
4. Where do you find datasets for training?
Training datasets are often sourced from public internet data such as webpages, books, scientific articles, GitHub code, and forums. Common repositories include Hugging Face, The Pile, Common Crawl, and OpenWebText. For fine-tuning, organisations may use curated internal data or domain-specific corpora.
5. Why are some LLMs good at some tasks and not others?
Performance varies based on a model’s training data, architecture, and number of parameters. LLMs trained on diverse, high-quality datasets tend to generalise well. Others may specialise—for example, coding models are often fine-tuned on code. Larger models typically perform better but can be less efficient or harder to deploy.