AI Terminology Glossary

50+ Terms Every Leader Must Know

v1
April 17, 2026
🔄 Auto-updated weekly

Transformer
The neural network architecture that powers all modern LLMs. Introduced in the 2017 paper "Attention Is All You Need" by Google Brain researchers. Processes words in parallel rather than sequentially, enabling massive scale. Every major model – GPT, Claude, Gemini, LLaMA, Qwen – is transformer-based. It is the single most important architectural innovation in modern AI.
Token
The basic unit of text that an LLM processes. A token is roughly three-quarters of an English word: common words like "the" are a single token, while a longer word like "hamburger" may split into several ("ham", "bur", "ger"). Models are priced and limited by token counts; a typical English word averages about 1.3 tokens. Understanding tokens is essential for managing API costs and context window limits.
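The rough "four characters per token" rule of thumb can be sketched in a few lines of Python. This is only an estimate for English text; real tokenizers (BPE and similar) will differ, especially for other languages.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text.

    Uses the common rule of thumb that one token is about four
    characters (~0.75 English words). A real tokenizer will differ.
    """
    return max(1, round(len(text) / 4))

prompt = "Understanding tokens is essential for managing API costs."
print(estimate_tokens(prompt))  # 14 (57 characters / 4, rounded)
```

Estimates like this are useful for budgeting; for exact counts, use the tokenizer shipped with the model you are calling.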
Context Window
The maximum amount of text (measured in tokens) that a model can process in a single interaction. Early models had 2K-token windows; current frontier models support 128K to 1M+ tokens. A larger context window allows processing entire documents, codebases, or conversation histories – but increases computational cost quadratically in standard transformer architectures.
Parameters
The numerical values (weights) inside a neural network that are adjusted during training. Model size is measured by parameter count: 7B (7 billion) is considered small, 70B mid-range, and 400B+ frontier. More parameters generally mean greater capability but higher computational cost for both training and inference. GPT-4 class models are estimated at over 1 trillion parameters.
RLHF (Reinforcement Learning from Human Feedback)
A training technique where human raters evaluate model outputs, creating a reward signal that teaches the model to produce more helpful, harmless responses. Pioneered in research by OpenAI and DeepMind, and popularized by ChatGPT. Now used with variations by Anthropic (constitutional AI), Google, Meta, and most major AI companies. RLHF is what transforms a raw language model into a useful assistant.
Fine-tuning
The process of taking a pre-trained model and training it further on a specific dataset to adapt it for particular tasks or domains. A medical company might fine-tune a general LLM on clinical literature to create a specialized medical assistant. Fine-tuning requires far less data and compute than training from scratch โ€” often achievable in hours on a single GPU.
RAG (Retrieval-Augmented Generation)
A technique that enhances LLM responses by first retrieving relevant documents from an external knowledge base, then feeding those documents to the model as context. RAG addresses hallucination and staleness problems without retraining. Most enterprise AI deployments use RAG to ground model outputs in authoritative, up-to-date company data.
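The retrieve-then-generate flow can be sketched with a toy retriever. Real systems score documents with embeddings rather than the word overlap used here, and the document texts and prompt wording are purely illustrative.

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Score documents by word overlap with the query (a stand-in
    for embedding-based retrieval) and return the top k."""
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Assemble the retrieved passages plus the question into one prompt."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refund requests require an order number.",
]
print(build_rag_prompt("How long do refunds take?", docs))
```

The assembled prompt is then sent to the LLM, which answers from the supplied context rather than from memory – the grounding step that reduces hallucination.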
Prompt Engineering
The practice of crafting input text (prompts) to elicit desired outputs from an LLM. Includes techniques like system prompts, few-shot examples, chain-of-thought instructions, and role assignment. Prompt engineering is the primary interface between humans and foundation models – a new form of programming that uses natural language rather than code.
Hallucination
When an LLM generates plausible-sounding but factually incorrect information with apparent confidence. Hallucination is the most significant reliability problem in production AI systems. It occurs because models generate statistically likely text rather than retrieving verified facts. Mitigation strategies include RAG, grounding, and output verification systems.
Multimodal
A model that can process and/or generate multiple types of data – text, images, audio, video, or code – within a single architecture. GPT-4o, Gemini, and Claude 3.5 are multimodal. This capability enables applications like analyzing images, transcribing audio, generating visual content, and understanding documents with charts and diagrams.
Foundation Model
A large AI model trained on broad data that can be adapted to many downstream tasks. The term was coined by Stanford's HAI Institute in 2021. Foundation models (GPT-4, LLaMA, Gemini) serve as the base layer upon which applications are built. The foundation model paradigm has largely replaced the older practice of training specialized models for each task from scratch.
LLM (Large Language Model)
A neural network trained on massive text corpora to understand and generate human language. LLMs are trained to predict the next token in a sequence; at sufficient scale, this simple objective gives rise to seemingly intelligent behavior. The "large" refers to both parameter count (billions to trillions) and training data (trillions of tokens). ChatGPT, Claude, and Gemini are all LLMs.
Mixture of Experts (MoE)
An architecture that activates only a subset of a model's parameters for each input, reducing computational cost while maintaining large total parameter counts. DeepSeek-V3 (671B total, 37B active) and Qwen3-235B (22B active) use MoE. This approach allows models to be large and knowledgeable while remaining efficient at inference time.
Quantization
Reducing the precision of model weights (e.g., from 16-bit to 4-bit numbers) to decrease memory usage and inference cost with minimal quality loss. A quantized 70B model might run on consumer hardware that the full-precision version could not. Common formats include INT8, INT4, and GGUF. Quantization is essential for deploying AI on edge devices and controlling cloud costs.
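A minimal sketch of symmetric integer quantization shows the core trade: production formats such as INT4 or GGUF add per-group scales and other refinements that this toy version omits, and the weight values are invented for illustration.

```python
def quantize(weights: list[float], bits: int = 4):
    """Symmetric quantization: map floats onto integers in
    [-2**(bits-1), 2**(bits-1) - 1] using a single scale factor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats by multiplying back by the scale."""
    return [v * scale for v in q]

weights = [0.42, -1.37, 0.08, 0.91]
q, scale = quantize(weights, bits=4)
restored = dequantize(q, scale)
# 4-bit storage loses some precision but preserves the overall shape.
print(q, [round(w, 2) for w in restored])
```

Each weight now fits in 4 bits instead of 16, a 4x memory saving, at the cost of the small rounding errors visible in the restored values.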
Agentic AI
AI systems that can autonomously plan, execute multi-step tasks, use tools, and make decisions with minimal human oversight. Unlike chatbots that respond to single prompts, agents can break complex goals into subtasks, call external APIs, browse the web, write and execute code, and iterate on their own output. Agentic AI is widely seen as the next major paradigm in AI applications.
MCP (Model Context Protocol)
An open protocol introduced by Anthropic in late 2024 that standardizes how AI models connect to external data sources and tools. MCP provides a universal interface for models to access databases, file systems, APIs, and other services. Similar to how USB standardized hardware connections, MCP aims to standardize AI-to-tool integration, reducing custom development work.
Embedding
A numerical representation (vector) of text, images, or other data in a high-dimensional space where similar items are close together. Embeddings enable semantic search, clustering, and similarity comparison. When you search for "car" and get results about "automobile," embeddings are why. They are the foundation of modern search and retrieval systems.
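Similarity between embeddings is usually measured with cosine similarity. A hand-rolled version with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions) shows the idea:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means
    similar direction (similar meaning), near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional "embeddings", invented for illustration.
car = [0.9, 0.1, 0.0]
automobile = [0.85, 0.15, 0.05]
banana = [0.0, 0.2, 0.95]

print(round(cosine_similarity(car, automobile), 3))  # high: same concept
print(round(cosine_similarity(car, banana), 3))      # low: unrelated
```

This single number is what a vector database computes, at scale, to decide which stored items best match a query.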
Vector Database
A specialized database optimized for storing and querying embedding vectors. Examples include Pinecone, Weaviate, Milvus, and Chroma. Vector databases power RAG systems by finding the most semantically relevant documents for a given query. They are the infrastructure layer that makes AI-grounded retrieval possible at scale.
Semantic Search
Search that understands the meaning of queries rather than just matching keywords. Powered by embeddings and vector databases, semantic search returns results based on conceptual relevance. Asking "How do I reduce expenses?" can find documents about "cost optimization" even without keyword overlap. It represents a fundamental upgrade over traditional keyword-based search.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that trains small adapter modules alongside a frozen base model rather than modifying all weights. LoRA reduces fine-tuning costs by 90%+ while achieving comparable results to full fine-tuning. It has become the standard approach for customizing foundation models for specific domains, tasks, or organizational needs.
Inference
The process of using a trained model to generate outputs – the production phase as opposed to training. Inference is what happens every time you send a prompt to ChatGPT. It is the primary ongoing cost of deployed AI systems. Inference optimization (quantization, batching, caching, specialized hardware) is a major focus for reducing AI operational costs.
Training
The process of teaching a model by adjusting its parameters on large datasets. Pre-training learns general knowledge from internet-scale data. Fine-tuning adapts the model for specific tasks. Training a frontier model costs millions to hundreds of millions of dollars and requires thousands of GPUs running for months. It is the capital-intensive foundation of all AI capability.
Neural Network
A computational architecture loosely inspired by biological brains, consisting of layers of interconnected nodes (neurons) that process information. Neural networks are the backbone of modern AI. They learn patterns from data rather than being explicitly programmed. Deep neural networks with many layers power everything from image recognition to language generation.
Deep Learning
A subset of machine learning using neural networks with many layers (hence "deep"). Deep learning drove the AI revolution starting around 2012, achieving breakthroughs in image recognition, speech processing, and natural language understanding. All modern LLMs, image generators, and speech systems use deep learning. The term distinguishes modern multi-layer architectures from earlier shallow machine learning approaches.
Backpropagation
The algorithm used to train neural networks by calculating how much each parameter contributed to errors in the output, then adjusting parameters to reduce those errors. Proposed in the 1980s and enabled by GPU computing in the 2010s, backpropagation is the mathematical engine that makes deep learning work. Every trained neural network you use was optimized through backpropagation.
Gradient Descent
The optimization algorithm at the heart of neural network training. It iteratively adjusts model parameters in the direction that reduces errors, like rolling a ball downhill to find the lowest point in a landscape. Variants like SGD (Stochastic Gradient Descent), Adam, and AdamW are standard in modern training. Understanding gradient descent helps explain why training requires massive compute – each step updates billions of parameters.
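A worked one-dimensional example makes the "ball rolling downhill" picture concrete: minimizing f(x) = (x - 3)^2, whose gradient is 2(x - 3), with plain gradient descent. The learning rate and step count are arbitrary illustrative choices.

```python
def gradient_descent(grad, x0: float, lr: float = 0.1, steps: int = 100) -> float:
    """Repeatedly step opposite the gradient to minimize a function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # move downhill by lr times the slope
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 4))  # converges to 3.0, the true minimum
```

Training an LLM is this same loop, except x is billions of parameters and the gradient comes from backpropagation over batches of training data.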
Attention Mechanism
The core innovation behind transformers that allows a model to focus on relevant parts of the input when generating each part of the output. Self-attention lets every token attend to every other token, capturing relationships across the entire context. Multi-head attention enables the model to attend to different types of relationships simultaneously. Attention is what makes transformers so powerful – and so computationally expensive.
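Scaled dot-product attention can be sketched for a single query in pure Python. The two-dimensional vectors are hand-picked toys; real models use hundreds of dimensions and process all queries at once as matrices.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector:
    weight each value by how well its key matches the query."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the first key most strongly, so the output
# is pulled toward the first value vector.
out = attention(query=[1.0, 0.0],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[10.0, 0.0], [0.0, 10.0]])
print([round(v, 2) for v in out])
```

Because every token's query is scored against every other token's key, cost grows quadratically with context length – the expense noted above.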
Encoder-Decoder
A neural network architecture with two components: an encoder that processes input into a compressed representation, and a decoder that generates output from that representation. The original Transformer paper used this design for translation. Modern LLMs typically use decoder-only architectures, but encoder-decoder structures persist in specialized models for translation, summarization, and multimodal tasks.
Autoregressive
A generation method where the model produces output one token at a time, with each new token conditioned on all previous tokens. GPT, Claude, and virtually all modern LLMs are autoregressive – they literally predict the next word, then the next, building coherent text token by token. This sequential generation is why inference takes time proportional to output length.
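The loop itself is simple. Here a toy lookup table stands in for a real model, which would output a probability distribution over tens of thousands of tokens at each step; the vocabulary is invented for illustration.

```python
# Toy stand-in for a language model: maps each token to its
# single most likely successor.
NEXT_TOKEN = {
    "<start>": "the", "the": "model", "model": "writes",
    "writes": "text", "text": "<end>",
}

def generate(model: dict, max_tokens: int = 10) -> list:
    """Autoregressive loop: each step conditions on what has been
    generated so far (here, just the previous token)."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        nxt = model.get(tokens[-1])
        if nxt is None or nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens[1:]

print(generate(NEXT_TOKEN))  # ['the', 'model', 'writes', 'text']
```

Each pass through the loop corresponds to one forward pass through the model in real inference, which is why longer outputs take proportionally longer.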
Temperature
A parameter (typically 0 to 1, though some APIs allow up to 2) that controls the randomness of model outputs. Temperature 0 always selects the most likely next token (deterministic, focused). Higher temperature increases randomness, producing more creative but less predictable outputs. Temperature is the primary dial for adjusting the tradeoff between consistency and creativity in AI-generated content.
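The effect is easy to see by applying a temperature-scaled softmax to some made-up raw model scores (logits). Note that temperature 0 is handled as a plain argmax in practice, since dividing by zero is undefined.

```python
import math

def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits into next-token probabilities, scaled by
    temperature. Lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                 # invented scores for 3 tokens
cold = apply_temperature(logits, 0.2)    # near-deterministic
hot = apply_temperature(logits, 1.5)     # flatter, more random
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At low temperature almost all probability mass lands on the top token; at high temperature the alternatives get real chances of being sampled.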
Top-p (Nucleus Sampling)
A sampling technique that restricts token selection to the smallest set of tokens whose cumulative probability exceeds threshold p (e.g., 0.9). Unlike temperature, which scales all probabilities, top-p dynamically adjusts the candidate pool based on the distribution shape. Top-p = 0.9 means the model chooses from tokens covering 90% of the probability mass. It is often combined with temperature for fine-grained output control.
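The filtering step can be sketched with an invented toy distribution; a real sampler would then draw randomly from the renormalized survivors.

```python
def top_p_filter(probs: dict, p: float = 0.9) -> dict:
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize so the survivors sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "rock": 0.05}
print(top_p_filter(probs, p=0.9))  # the low-probability "rock" is excluded
```

When the model is confident, the kept set is tiny; when the distribution is flat, many candidates survive – the dynamic behavior that distinguishes top-p from a fixed cutoff.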
Chain-of-Thought (CoT)
A prompting technique that instructs the model to show its reasoning step-by-step before giving a final answer. CoT dramatically improves performance on mathematical, logical, and complex reasoning tasks. Research showed that even simply adding "think step by step" to a prompt can improve results. Reasoning models like OpenAI's o1 and DeepSeek's R1 automate this process internally.
Zero-shot
Asking a model to perform a task without providing any examples – relying entirely on its pre-trained knowledge and the task description. Zero-shot capability is a key measure of a foundation model's generality. Strong zero-shot performance means less need for task-specific training data or examples, reducing deployment friction for new use cases.
Few-shot
Providing a small number of examples (typically 2-5) in the prompt to demonstrate the desired task format and style before asking the model to perform it. Few-shot prompting often dramatically improves output quality compared to zero-shot. It is the simplest and most commonly used technique for steering model behavior without fine-tuning.
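Assembling a few-shot prompt is plain string construction. The sentiment-labeling task and the "Review:/Sentiment:" format here are invented purely to illustrate the pattern.

```python
def few_shot_prompt(examples: list, query: str) -> str:
    """Prepend labeled input/output examples so the model can infer
    the task and answer format before seeing the real query."""
    blocks = [f"Review: {text}\nSentiment: {label}"
              for text, label in examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

examples = [
    ("The food was amazing.", "positive"),
    ("Terrible service, never again.", "negative"),
]
print(few_shot_prompt(examples, "Decent, but overpriced."))
```

The prompt ends mid-pattern, so the model's most natural continuation is a label in the demonstrated format – no fine-tuning required.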
Benchmark
A standardized test used to evaluate and compare AI model capabilities. Benchmarks assess areas like general knowledge (MMLU), coding (HumanEval), mathematical reasoning (GSM8K), and instruction following. Rankings on benchmarks drive competitive dynamics and purchasing decisions. Critics note that benchmark performance may not reflect real-world utility and that models can be over-optimized for benchmark scores.
MMLU (Massive Multitask Language Understanding)
The most widely cited benchmark for LLM evaluation, testing knowledge across 57 subjects including STEM, humanities, law, and medicine at varying difficulty levels. Scores range from 0-100%; frontier models now exceed 90%. MMLU has become the de facto standard for comparing model intelligence, though its limitations (static questions, potential data contamination) are well-documented.
HumanEval
A benchmark that tests AI coding ability by presenting 164 Python programming problems of varying difficulty. Models must generate function implementations that pass a suite of unit tests. HumanEval is the standard measure of code generation capability. Top models now pass over 90% of problems, reflecting AI's growing competence as a programming assistant.
FLOPS (Floating Point Operations Per Second)
A measure of computational performance – how many mathematical operations a processor can perform per second. AI training and inference are measured in FLOPS (or petaFLOPS, exaFLOPS). An H100 GPU delivers approximately 2,000 teraFLOPS for FP8 operations. Understanding FLOPS helps executives evaluate hardware requirements and compare the computational economics of different AI approaches.
GPU (Graphics Processing Unit)
The hardware workhorse of AI. Originally designed for rendering video games, GPUs excel at the parallel mathematical operations required for neural network training and inference. Nvidia dominates the market with its H100, H200, and Blackwell series. GPU scarcity and cost are the primary constraints on AI development capacity worldwide. A single H100 costs $25,000-40,000.
TPU (Tensor Processing Unit)
Google's custom-designed AI accelerator chip, purpose-built for neural network workloads. TPUs power Google's internal AI services (Search, Gemini, YouTube) and are available to cloud customers via Google Cloud. While less versatile than GPUs, TPUs can offer superior price-performance for specific training and inference workloads. Recent generations include TPU v5p and Trillium.
CUDA
Nvidia's parallel computing platform and programming model that enables developers to use GPUs for general-purpose computing, including AI. CUDA is the dominant software ecosystem for AI development – virtually all major AI frameworks (PyTorch, TensorFlow, JAX) are CUDA-optimized. CUDA's ecosystem lock-in is one of Nvidia's most powerful competitive moats.
Model Weights
The numerical parameters (often billions of them) that encode a model's learned knowledge. Weights are what you download when you obtain an open-source model. They are the product of training – the compressed representation of all patterns learned from training data. Protecting proprietary weights is a major security concern for closed-source AI companies.
Open-weight
Models whose trained parameters are publicly available for download, inspection, and modification. Examples include Meta's LLaMA, Alibaba's Qwen, and DeepSeek's models. Open-weight does not necessarily mean open-source in the traditional software sense – the training code and data may remain proprietary. The open-weight vs. closed-source debate is the defining ideological divide in the AI industry.
Closed-source
AI models whose architecture, training data, and weights are kept proprietary by their creators. OpenAI's GPT-4, Google's Gemini, and Anthropic's Claude are closed-source, accessible only through paid APIs. Proponents argue this enables better safety oversight and commercial protection. Critics contend it creates dangerous concentration of power and prevents independent safety auditing.
Constitutional AI
A training methodology developed by Anthropic where an AI model is guided by a set of principles (a "constitution") that govern its behavior. The model critiques and revises its own outputs against these principles, reducing the need for human feedback. It is Anthropic's alternative to RLHF, designed to create models that are helpful, harmless, and honest by design rather than by human-imposed constraints.
Red Teaming
The practice of systematically attacking or probing AI systems to discover vulnerabilities, biases, dangerous capabilities, and failure modes before deployment. Named after military adversarial testing, AI red teaming involves trying to make the model produce harmful outputs, reveal training data, or behave in unintended ways. Major AI companies now conduct red teaming as a standard pre-release safety measure.
Alignment
The challenge of ensuring that AI systems pursue goals that match human values and intentions. An aligned model does what humans actually want, not what they literally say, and avoids harmful behaviors even when prompted. Alignment research encompasses RLHF, constitutional AI, interpretability, and safety constraints. It is considered one of the most important unsolved problems in AI as systems grow more capable.
AGI (Artificial General Intelligence)
AI that can perform any intellectual task a human can, with the flexibility to learn and adapt across domains. AGI does not yet exist. Major AI companies (OpenAI, Google DeepMind, Anthropic) explicitly state AGI as their goal. There is no consensus definition, which allows companies to claim progress toward a moving target. If achieved, AGI would represent one of the most significant technological milestones in human history.
Superintelligence
AI that significantly exceeds human cognitive performance across all domains – not just matching but vastly surpassing human intelligence. Superintelligence remains theoretical but is the subject of intense research and policy debate. Concerns center on whether humans could control or even comprehend a system dramatically smarter than any person. Leading AI scientists have called superintelligence risk an existential concern requiring proactive governance.
Diffusion Model
The architecture behind modern AI image and video generation (Midjourney, DALL-E, Stable Diffusion). Diffusion models learn to create data by reversing a noise-adding process – starting from random static and gradually refining it into coherent images. The approach displaced GANs for visual generation due to superior quality and stability. Diffusion models also power emerging applications in protein design and molecular simulation.
GAN (Generative Adversarial Network)
An architecture where two neural networks – a generator and a discriminator – compete against each other. The generator creates synthetic data, while the discriminator tries to distinguish it from real data. GANs powered the first wave of AI-generated images (deepfakes) but have been largely superseded by diffusion models for visual content. They remain important in data augmentation, scientific simulation, and specialized generation tasks.
Tokenization
The process of converting raw text into the numerical tokens that a model processes. Tokenizers break words into subword units using algorithms like BPE (Byte Pair Encoding) or SentencePiece. The tokenizer determines how efficiently a model represents different languages โ€” English is typically efficient, while many other languages require more tokens per word, increasing costs and reducing effective context length for non-English users.
KV Cache
A memory optimization technique used during autoregressive inference that stores previously computed key-value pairs from the attention mechanism. Without KV caching, generating each new token would require recomputing attention over all previous tokens, making inference quadratically more expensive. KV caching reduces this to linear cost, dramatically speeding up text generation. Managing KV cache memory is a key engineering challenge in serving LLMs at scale.
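A back-of-the-envelope cost model shows the saving. This counts only query-key dot products and ignores everything else (value aggregation, feed-forward layers, memory traffic), so treat it as an illustration of the scaling, not a real profile.

```python
def attention_cost(n_tokens: int, cached: bool) -> int:
    """Count query-key dot products needed to generate n_tokens.

    Without a cache, step t re-attends over all t prefix positions
    for every position (~t*t work per step); with a KV cache, only
    the new token's query attends over the stored keys (~t work).
    """
    cost = 0
    for t in range(1, n_tokens + 1):
        cost += t if cached else t * t
    return cost

print(attention_cost(1000, cached=True))   # grows like n^2 / 2 in total
print(attention_cost(1000, cached=False))  # grows like n^3 / 3 in total
```

For a 1,000-token generation the cached version does roughly 500 thousand score computations versus roughly 334 million without – which is why every production LLM server uses KV caching.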
Reasoning Models
AI models specifically designed to perform extended internal thinking before producing an answer, mimicking human deliberation. OpenAI's o1/o3, DeepSeek's R1, and similar models generate hidden "chain-of-thought" reasoning traces that improve performance on math, coding, and logic problems. Reasoning models represent a paradigm shift from fast pattern-matching to slower, more deliberate problem-solving – trading inference cost for accuracy.
Tool Use
The ability of an AI model to invoke external tools, APIs, or services to accomplish tasks beyond its native capabilities. A model might use a calculator for arithmetic, a search engine for current information, or a code interpreter for data analysis. Tool use transforms LLMs from isolated text generators into components of larger automated workflows, dramatically expanding their practical utility.
Function Calling
A structured interface that allows LLMs to generate formatted requests to call specific external functions or APIs. Rather than outputting free-form text, the model produces a structured JSON object specifying which function to call and with what parameters. Function calling is the technical backbone of tool use, enabling reliable integration between AI models and enterprise systems, databases, and third-party services.
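The round trip looks roughly like this. The `get_weather` tool, its parameters, and the JSON shape are illustrative, not any vendor's exact schema, and the weather value is a hard-coded stub rather than a real API call.

```python
import json

def get_weather(city: str, unit: str) -> str:
    """Stub standing in for a real weather API."""
    return f"22 degrees {unit} in {city}"

def dispatch(raw: str, tools: dict) -> str:
    """Parse the model's structured JSON output and route it to the
    matching Python function with the supplied arguments."""
    call = json.loads(raw)
    return tools[call["name"]](**call["arguments"])

# The kind of structured call an LLM emits instead of free-form text.
model_output = (
    '{"name": "get_weather", '
    '"arguments": {"city": "Berlin", "unit": "celsius"}}'
)

print(dispatch(model_output, {"get_weather": get_weather}))
```

In a full agent loop, the function's return value is fed back to the model as a new message, letting it compose a final natural-language answer from live data.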

This entry is part of the CXO Academy AI Encyclopedia – updated weekly.