OpenAI released GPT-3 in 2020, and in the data and AI space, it was a significant moment. A language model that could generate coherent, surprisingly relevant text at a scale nobody had seen before.
But GPT-3 was largely inaccessible to most people; it was available through an API, so if you were accessing it, you were already quite technical.
ChatGPT changed that. When it launched in November 2022, it put a conversational interface on top of GPT-3.5 and made it available to anyone with a browser.
No API, no code, just a chat window. It was no longer just a significant moment for the data and AI space; AI was now becoming a household topic.
The underlying model was still stateless; there was no memory, no tools, nothing carried over between sessions. While it has always had its strengths and weaknesses, it has changed and improved drastically since 2022.
Here, we’ll go through how ChatGPT works and the journey that got it to where it is today.
What happens inside ChatGPT when you send a message?
When you hit send, a lot happens before you see a single word back.
First, your message goes through a moderation check. There have always been moderation checks, but the moderation checks of today have been iteratively updated significantly based on the behavior of users and discovering guardrails that should be added.
OpenAI runs it against a classifier to catch anything that violates their usage policies before the model ever sees it. If it clears, your message gets assembled into a prompt along with the entire conversation history up to that point.
The base model has no persistent awareness between turns. Every time you send a message,the full transcript gets resent to the model from scratch. What feels like a flowing conversation is really just the model being handed the whole thread and asked to continue it.
Separately, ChatGPT's memory feature (added in 2024) is a layer on top that stores facts about you across conversations, but the base model itself remains stateless.
From there, the model generates a response token by token, which is why you see the answer appear word by word rather than all at once.
It's not thinking and then typing. It's generating each token based on everything that came before it, one piece at a time. Before that response reaches you, it goes through another moderation pass on the output side. A token is roughly 3-4 characters of text.
The thing that's easy to miss in all of this is that the model isn't retrieving answers or reasoning through a problem the way a human would. It's predicting the most probable next token given the context it's been handed.
This is why you’ve seen these models produce a confident, well-formatted answer that's completely wrong, because confidence and accuracy are not the same thing in a system built on prediction.
The most modern systems add layers on top (like search, code execution, retrieval) specifically to address the “confidently wrong” limitation by grounding responses in additional information rather than purely the information the model was trained on.
ChatGPT of today has added a lot of features:
The context window has grown dramatically
The underlying models have been replaced multiple times
Tool use and agentic behavior didn't exist at launch
Memory wasn't a feature until 2024
Multimodality, including voice, image input, and image generation came later
The moderation and safety layer has been continuously updated
The model's instruction-following, reasoning quality, and refusal behavior have all shifted
How is the model trained?
Understanding how the GPT models are trained helps explain both the capabilities and their limitations, and training happens in three stages:
First, the model goes through pre-training.
Here, it reads web pages, books, and articles. This includes a large mixture of licensed, human-created, and publicly available text, with a single task: predict the next word.
Given a sentence like "Instead of turning left, she turned _", the model learns that "right" is far more probable than "purple." The model isn't being taught facts or logic directly; it's learning patterns by exposure to massive amounts of text.
This is where it picks up grammar, general knowledge, and reasoning patterns. But at this stage, it's still just a text completion engine.
The second stage is supervised fine-tuning.
Human trainers create example conversations alongside prompts drawn from early API users. They provide ideal responses to real questions, and the model is trained to match them. This teaches it to follow instructions and behave conversationally rather than just predict probable text continuations.
The final stage is RLHF (reinforcement learning from human feedback).
Human raters rank multiple responses to the same prompt, and the model learns to prefer the highly-ranked answers (you’ve probably seen the thumbs up/thumbs down feedback prompts).
This is the stage that pushed ChatGPT toward refusing harmful requests and sounding helpful rather than just plausible. It also nudged the model toward expressing uncertainty in some cases, though this is inconsistent in practice.
ChatGPT still regularly sounds confident when it shouldn't. In practice, modern models go through additional post-training beyond RLHF, including safety fine-tuning, evaluation cycles, and system-level controls that guide behavior at runtime.
For analysts, this training process explains both why ChatGPT is so good at understanding context and following complex instructions and why it still hallucinates facts. It learned patterns from text, not how to verify whether something is true.
How does ChatGPT handle memory?
Memory is one of the areas where ChatGPT has changed most visibly since launch, and it's worth being precise about what "memory" actually means here because the word gets used loosely.
At the base level, ChatGPT has never had persistent memory. Every conversation starts fresh — the model has no idea who you are, what you talked about last week, or what you told it yesterday.
What feels like continuity within a single conversation is just the full transcript being resent with every message, as we covered above. Close the window and that context is gone.
The memory feature that OpenAI added in 2024 is a separate layer built on top of that stateless base.
It works by storing facts about you across conversations. Things like your job, your preferences, context you've shared, and retrieving the relevant ones to slot into your prompt at the start of a new session.
If that sounds familiar, it should: it's essentially RAG applied to personal context. The model itself hasn't changed; it's still stateless underneath. What's changed is that the prompt it receives now includes a summary of what it "knows" about you before you've typed a word.
To sum it up: conversation memory is the transcript that exists within a single session and disappears when it ends. Persistent memory is the feature that carries facts forward across sessions, and it's opt-in, viewable, editable, and deletable from your settings.
How ChatGPT uses retrieval to ground responses (RAG)
One of the biggest limitations of a standalone language model is that it relies only on what it learned during training. To address this, modern ChatGPT often uses retrieval-augmented generation (RAG).
In a RAG setup, relevant information is fetched from external sources at the time the question is asked and is inserted into the prompt before the model generates a response. This allows the model to ground its answers in real, up-to-date, or domain-specific data rather than relying purely on learned patterns.
You can see this in features like web search, document uploads, or enterprise knowledge integrations. The model still generates the response, but it does so using retrieved context rather than memory alone. This is how ChatGPT can cite sources from web search or answer questions about documents you just uploaded.
The key distinction is that RAG retrieves context first, then generates a response based on that retrieved information, rather than relying purely on what the model learned during training.
Why ChatGPT responses are probabilistic, not deterministic
One of the most important things to understand about GPT models is that their outputs are probabilistic.
When the model generates a response, it isn’t selecting a single “correct” answer from a database. Instead, at each step, it assigns probabilities to all possible next tokens and samples from that distribution.
This means there are often multiple valid continuations, and small changes in phrasing, context, or system settings can lead to different outputs. This is why you can ask the same question twice and get slightly different answers. The model is not retrieving a fixed response, it is generating one in real time based on likelihood.
How is model quality measured? (evaluations and feedback loops)
Training doesn’t end when the model is deployed.
ChatGPT relies heavily on evaluation frameworks to measure performance and guide improvements. These evaluations test how well the model behaves across a wide range of scenarios, including correctness, reasoning quality, safety, and instruction-following.
Some evals are automated, while others involve human reviewers assessing outputs. The results are used to fine-tune the model, adjust system behavior, and catch regressions over time.
For analysts, this is an important shift: model performance isn’t static, it’s continuously measured and iterated on like any other production system.
What does the context window limit (and why this matters)
Although ChatGPT can feel like it has memory, it can only “see” a limited amount of text at once, known as the context window. This includes your current message, prior conversation history, and any additional information inserted into the prompt.
If the conversation gets too long or too much data is included, older or less relevant information may be truncated. This limitation affects how much context the model can use when generating a response and can impact accuracy in longer interactions. It’s one of the reasons why summarization, retrieval, and careful prompt design become important in real-world applications.
How ChatGPT uses tools (and what the model is actually doing)
When ChatGPT uses tools like code execution, web search, or file analysis, the model itself isn't running Python or browsing the web. Instead, it's acting as the decision-maker and instruction-writer.
Here's how it works: the model decides a tool is needed, generates structured instructions specifying what to do (like "search for 'inflation data 2026'" or "run this Python code on the uploaded CSV"), and those instructions get sent to external systems that actually execute them.
The results, whether search snippets, code output, or file contents, are then appended to the conversation as additional context, feeding back into the same context window mechanism described earlier. The model then generates its final response based on that expanded context.
This separation is crucial. The model is responsible for deciding when to use a tool and how to use it, but the actual execution happens outside the model in specialized systems.
A language model can't actually execute code or fetch web pages. It can only predict text. What makes it agentic is that it's learned to predict instructions that trigger real actions, then incorporate the results into its reasoning.
The decision of when to invoke a tool isn't purely the model's independent judgment. The available tools are presented to the model in the system prompt, so it knows what's accessible. It's been trained to recognize when tool use is appropriate and to invoke the right one, but that training is what enables the behavior, not some innate awareness of its own capabilities.
This is why new tools can be added without retraining the base model from scratch. Once a model has learned tool-use behavior in general, individual tools are largely plug-and-play. The model learns to invoke tools; the tools themselves are external systems that can be swapped or extended independently.
When did ChatGPT become an agent?
The shift from chatbot to agent happened gradually, and each addition changed what ChatGPT was actually capable of in a meaningful way.
The code interpreter (now called Advanced Data Analysis), introduced in July 2023 for Plus subscribers, was the first real signal that something different was happening. Rather than generating code and handing it back to you to run, the code interpreter runs Python in a secure, isolated sandbox environment with no internet access.
When you upload a CSV or other data file, ChatGPT writes Python code to analyze it and executes that code server-side. The environment comes pre-loaded with common data libraries like pandas, numpy, matplotlib, and seaborn, so you can get statistical analysis, visualizations, and data transformations without installing anything locally.
The resulting charts, cleaned datasets, or summary tables are generated and available for download directly from the chat. This means you can go from raw data to insights without ever leaving the conversation or touching a code editor. For analysts, this was the most immediately practical addition; you could drop in a CSV and get real analysis back.
Web search followed, giving the model access to current information at query time rather than relying purely on what it learned during training. Image generation, file reading, voice input, and image understanding followed in stages after that, each one expanding the range of inputs the model could work with and the types of tasks it could complete.
What makes all of this "agentic" is less about any individual tool and more about the decision layer that sits underneath them.
When you send a message today, the model doesn't just generate a response; it decides whether it needs to do something before it can answer well. And increasingly, it's chaining these actions together, searching for information, writing code based on what it finds, running that code, and adjusting if something doesn't work. These decision-making steps are what separate an agent from a chatbot.
Why hallucination is a structural risk, not a bug
Hallucinations aren't a fully fixable flaw in ChatGPT, they're a consequence of how generative models work.
A language model is trained to predict plausible continuations. It has no internal mechanism to verify whether a statement is true; it only knows whether it sounds true based on patterns in its training data.
When the model encounters a gap in knowledge or an ambiguous question, modern ChatGPT has been trained to refuse to answer or say "I don't know" — and it does this far more often now than in earlier versions. But it can still fill gaps with the most statistically probable response, which can be confidently wrong.
Final Thoughts
ChatGPT has changed a lot since 2022, and understanding what's happening under the hood makes it easier to use and to know when to trust it vs double-check the answer.
And if you're thinking about building your own RAG app or agentic system, a lot of what we covered here—context windows, tool use, retrieval layers—will help you with your own projects.
If you want to learn about the actual GPT model architecture, I suggest reading this article.

Up to 50% Off Maven Pro Plans
Spring Savings Sale
Take advantage of this limited-time offer and save up to 50% off unlimited Maven access!

Kristen Kehrer
Data Science & AI Expert
I love building coding demos and educating others around topics in AI and machine learning. This past year I've leveraged computer vision to build things like a school bus detector that I use during the school year to get my kids on the bus. I've most recently been playing with semantic video search, vector databases, and building simple chatbots using OpenAI and LangChain.
Frequently Asked Questions
How does ChatGPT differ from a search engine?
Search engines retrieve documents. ChatGPT generates responses. When you query Google, it crawls indexed web pages and returns a ranked list of links to sources that match your keywords. The search engine doesn't create new text, it surfaces existing documents and lets you decide which to trust based on source credibility, recency, and your own evaluation. ChatGPT, on the other hand, predicts the most probable next token given your prompt and generates a brand-new response word by word. Even when ChatGPT Search is enabled and the model retrieves web results, the final answer is still generated, it's a synthesized summary based on what the model predicts is most relevant.
Can ChatGPT analyze my data without uploading it to OpenAI's servers?
No. When you use ChatGPT's Advanced Data Analysis feature, your files are uploaded to OpenAI's servers and processed there. The Python code runs in OpenAI's sandboxed environment, not on your local machine, which means every row of your data is visible to their infrastructure during the session. For most users on the free or Plus tiers, this data can be used to train future models unless you've opted out in your privacy settings. Enterprise and Business accounts have training disabled by default, but the data still passes through OpenAI's servers. If you're working with sensitive data (customer PII, financial records, proprietary business information), you should not upload it to ChatGPT. The risk isn't that OpenAI is malicious, it's that your data governance policies likely prohibit sending regulated or confidential data to third-party servers. For truly sensitive analysis, use local tools like Python on your own machine or SQL databases that never transmit data externally.
Why does ChatGPT sometimes give different answers to the same question?
Because the model's outputs are probabilistic, not deterministic. When ChatGPT generates a response, it doesn't retrieve a fixed answer from a database. Instead, at each step in the generation process, it assigns probabilities to thousands of possible next tokens and samples from that distribution. This means there are often multiple valid continuations, and small variations in context, phrasing, or even randomness can lead to different outputs. The model uses a parameter called "temperature" that controls how much randomness is introduced during sampling. Higher temperature means more creative, varied responses. Lower temperature means more consistent, focused outputs. ChatGPT's default settings use moderate temperature to balance helpfulness with variety. This is also why you'll sometimes see ChatGPT confidently give two contradictory answers to the same question asked minutes apart. It's not changing its mind; it's sampling from a probability distribution where both answers had reasonable likelihood. The model has no memory of what it said last time unless that prior response is still in the conversation history.
How does ChatGPT decide whether to search the web or run code?
The model itself makes this decision based on your prompt. When you send a message, the model first evaluates what's needed to answer well. If it determines that current information is required, it generates instructions to trigger a web search. If it sees that you've uploaded a CSV and asked for analysis, it decides to write and execute Python code. This decision-making happens through the model predicting tool invocation instructions rather than directly generating an answer. It's been trained (through fine-tuning and RLHF) to recognize patterns like "analyze this file" or "what's happening in the news today" and respond by calling the appropriate tool first. The model doesn't have explicit if-then rules; it's learned through examples that certain types of questions benefit from external information or computation before generating the final response. This is why ChatGPT will sometimes search when you didn't expect it to, or why it might write code when a direct answer would have sufficed. The model is predicting what a helpful assistant would do, and sometimes it gets that prediction wrong.




































