Large Language Models
Large Language Models (LLMs) are everywhere these days, from chatbots and writing assistants to coding helpers and search engines. However, not all LLMs are built the same. As these models become increasingly embedded in our daily tools and decision-making processes, understanding how they work is not just for techies; it is essential for anyone who relies on them. In this post, let’s break down the main types of LLMs, how they work, and what they’re good (or not so good) at.
What are LLMs?
First, what is an LLM? As discussed earlier on Aura’s blog, an LLM is an AI model trained to understand and generate human language. These models use a type of neural network called a transformer, which allows them to recognise and reproduce patterns in massive amounts of text. And when we say massive, we mean truly massive: LLMs are typically trained on huge datasets, ranging from web pages and books to forums and codebases, far more than smaller models (often called SLMs, or Small Language Models). Where an SLM might be trained on domain-specific datasets, LLMs rely on large-scale, diverse sources to capture the breadth of human language. An important caveat, though, is that LLMs don’t actually ‘think’ like humans do. They predict which word (or rather, which token, a small piece of text) should come next, based on what they have already seen.
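To make that ‘predict the next token’ idea concrete, here is a minimal sketch using the open-source Hugging Face transformers library and the small, publicly available gpt2 checkpoint (both chosen purely for illustration; any causal language model behaves similarly):

```python
# A minimal next-token prediction sketch, assuming the Hugging Face
# `transformers` library and the small "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The cat sat on the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # a score for every token in the vocabulary

next_token_id = int(logits[0, -1].argmax())  # the single most likely continuation
print(tokenizer.decode([next_token_id]))     # prints the predicted next token
```

In practice, an LLM repeats this step over and over, feeding each predicted token back in as input, which is how a short prompt turns into a whole paragraph.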
Autoregressive Models
The best-known models are the writers: autoregressive models. These models generate text word by word, based on the input so far. Models like GPT (OpenAI), Claude (Anthropic), and LLaMA (Meta) fall into this category. They can be seen as the creative writers of AI: give them a prompt and they will keep going. These LLMs are great at writing things that sound natural, from blog posts, conversations, and stories to code and even poems. However, since they do not fact-check what they say but simply predict likely continuations, they can sometimes make things up. This is called a ‘hallucination’, and it can sound perfectly convincing unless you already know the topic or take the time to verify it yourself.
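As a small illustration of that ‘give it a prompt and it keeps going’ behaviour, the sketch below uses the Hugging Face transformers pipeline API with the gpt2 checkpoint (example choices, not a recommendation of any particular model):

```python
# A small autoregressive generation sketch, assuming the Hugging Face
# `transformers` pipeline API; "gpt2" is just an example checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

result = generator(
    "Once upon a time, in a data centre far away,",
    max_new_tokens=40,  # keep the continuation short
    do_sample=True,     # sample from likely tokens instead of always taking the top one
)
print(result[0]["generated_text"])
```

Notice that nothing in this loop checks whether the generated claims are true; the model only ever produces plausible continuations, which is exactly where hallucinations come from.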
Masked Language Models
But a different type of model exists as well: the Masked Language Model (MLM), such as BERT and RoBERTa. Instead of generating text, these models are trained to fill in missing words in a sentence, a bit like solving word puzzles. This training setup gives them a strong understanding of meaning and context, making them ideal for tasks like classification, spam filtering, and Q&A systems.
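To see that ‘word puzzle’ training objective in action, here is a minimal fill-in-the-blank sketch, assuming the Hugging Face transformers library and the standard bert-base-uncased checkpoint:

```python
# A minimal masked-language-model sketch, assuming the Hugging Face
# `transformers` library and the "bert-base-uncased" checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT was trained to recover the word hidden behind the [MASK] token.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```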
Unlike GPT-style models, these models are not meant for generating full articles or holding conversations. However, when it comes to interpreting and analysing text, they are incredibly efficient and often more accurate. So, are these models ‘better’ than something like GPT? In their niche, yes, they are typically more compact, faster, and less prone to hallucinations. But they are not generalists; you would not ask BERT to write you a newsletter.
General Encoder-Decoder Models
Some transformer models are built to take one kind of text and turn it into another, such as T5 or MarianMT. These are the general encoder-decoder models. They work in two parts: the encoder reads and understands the input, and the decoder generates the output. This makes them ideal for structured language tasks like translation, summarisation, or rewriting. Compared to other types of LLMs, they are often heavier and slower, but more controlled and task-specific. So, if you want a precise translation or summary, a model like T5 might outperform something like GPT, which is more general-purpose.
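As a rough illustration of the encoder-decoder setup, the sketch below runs a summarisation task with the compact t5-small checkpoint via the Hugging Face transformers library (again, example choices rather than a recommendation):

```python
# A small encoder-decoder sketch, assuming the Hugging Face `transformers`
# library and the compact "t5-small" checkpoint.
from transformers import pipeline

summariser = pipeline("summarization", model="t5-small")

article = (
    "Large Language Models are neural networks trained on huge amounts of text. "
    "Autoregressive models generate text, masked models analyse it, and "
    "encoder-decoder models transform one piece of text into another."
)
summary = summariser(article, max_length=30, min_length=10)
print(summary[0]["summary_text"])  # the decoder's rewritten, shortened version
```

The encoder first turns the whole input into an internal representation; only then does the decoder start writing the output, which is why these models stay closer to the source text than a free-running autoregressive writer.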
Retrieval-Augmented Generation Models
Some of the newer models do not rely only on what they learned during training; they also look things up. These hybrid models use external tools, such as search engines or databases, to retrieve facts while they generate answers. This is called retrieval-augmented generation (RAG). Models like Cohere’s Command R use this approach. Because the answers are grounded in retrieved documents, they tend to hallucinate less, but these systems are more complex to build, usually slower, and thus more costly to use.
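To show the shape of a RAG system without tying it to any particular product, here is a deliberately simplified sketch: the retrieval step uses scikit-learn’s TF-IDF vectoriser over a toy document list, and ask_llm is a hypothetical placeholder for whichever generative model or API you would call in practice:

```python
# A highly simplified retrieval-augmented generation (RAG) sketch.
# `documents` is toy data and `ask_llm` is a hypothetical placeholder,
# standing in for a real vector database and generative model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Our office is open Monday to Friday, 9:00 to 17:00.",
    "The support desk can be reached at support@example.com.",
    "Refunds are processed within 14 days of the original purchase.",
]

def retrieve(question, top_k=1):
    """Return the documents most similar to the question (the 'R' in RAG)."""
    vectors = TfidfVectorizer().fit_transform(documents + [question])
    scores = cosine_similarity(vectors[len(documents)], vectors[:len(documents)])[0]
    return [documents[i] for i in scores.argsort()[::-1][:top_k]]

def answer(question):
    """Put the retrieved facts into the prompt before generating (the 'AG')."""
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return ask_llm(prompt)  # hypothetical call to any generative LLM

# answer("How long do refunds take?")  # would ground the reply in the refunds document
```

Production RAG systems replace the TF-IDF step with dense embeddings and a vector database, but the overall flow, retrieve first and then generate with the retrieved context in the prompt, is the same.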
Summary
Large Language Models are powerful tools, but picking the right one (or designing the right system) depends on what you want to do. Some LLMs are better at creative writing, some excel at deep reasoning, some are smaller and faster, and others are specialised for safety or retrieval. As LLM technology continues to evolve, we will likely see even smarter, faster, and more specialised models that fit right into our lives, whether we are chatting with an AI, searching for information, or getting real-time translations on our phones. So, as a decision-maker, it pays to understand these different model types: they determine what is required, and what needs to be built, when you start your project. The table below summarises the advantages and disadvantages of each type of model.
| Model | Advantages | Disadvantages | Cost | Comments |
| --- | --- | --- | --- | --- |
| Autoregressive models (GPT, Claude, LLaMA) | Creative, fluent text generation; versatile across writing, coding, and conversation | Can hallucinate (make up facts); limited factual accuracy | Medium to high | Great for creative use cases, but less reliable for factual information |
| Masked language models (BERT, RoBERTa) | Strong contextual understanding; good for classification, Q&A, and analysis; less prone to hallucinations | Not designed for full text generation; less creative | Low to medium | Ideal for text interpretation and analysis, not generative tasks |
| Encoder-decoder models (T5, MarianMT) | Precise translation, summarisation, and rewriting; more controlled output | Slower and heavier; less flexible outside specific tasks | Medium to high | Preferred for accurate, structured language tasks like translation or summarisation |
| Retrieval-augmented models (Cohere Command R) | Reduces hallucinations by retrieving external facts; up-to-date and factual | Complex architecture; slower and more expensive to run | High | Excellent for factual and real-time applications, but higher development and usage costs |