AI Chatbots: Giving relationship advice since 1966
By Xavier Collantes
Created: 8/1/2025; Updated: 8/1/2025
AI generated movie poster for when this blog gets a movie deal (including both CEOs of Alphabet and Google, respectively).
This 60-year evolution of AI fundamentally changed how humans interact with machines. As you read, notice how the milestones come closer and closer together, highlighting the near-exponential pace of AI advancement.
1966: ELIZA
Developed in 1966 at MIT by Joseph Weizenbaum, ELIZA was the first chatbot, written for the IBM 7094. It was a simple program that simulated a psychotherapist.
In Weizenbaum's paper, he describes how the program could be improved, since its "language abilities may be continually improved by a 'teacher'". He was truly prophetic about how chatbots would be used today. Even back then, the ELIZA program:
Accepted input where the "user types in some statement... in natural language using normal punctuation and sentence structures"
Applied transformation rules to filter and edit the returned statements
Encoded the input prompt and formed an intent by matching it against a set of keywords
Example of a woman using ELIZA to seek advice about her boyfriend (some things never change):
ELIZA used pattern matching and transformation rules on the "prompt" to generate outputs, as opposed to an alternative method such as Context-Free Grammars, which I learned about at Google.
Notice how the computer reuses words from the input prompt such as 'BOYFRIEND' and 'DEPRESSED'.
```txt
Human: Men are all alike.
Computer: IN WHAT WAY?
Human: They’re always bugging us about something or other.
Computer: CAN YOU THINK OF A SPECIFIC EXAMPLE?
Human: Well, my boyfriend made me come here.
Computer: YOUR BOYFRIEND MADE YOU COME HERE
Human: He says I’m depressed much of the time.
Computer: I AM SORRY TO HEAR YOU ARE DEPRESSED
Human: It’s true. I am unhappy.
```
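To make the pattern-matching-and-transformation idea concrete, here is a minimal Python sketch. The regular expressions and reply templates are my own illustrations, not Weizenbaum's actual DOCTOR rules, but they show how the computer can reuse words like "BOYFRIEND" from the input.

```python
import re
import random

# Illustrative ELIZA-style rules: a "decomposition" pattern and
# "reassembly" templates that reuse fragments of the user's input.
RULES = [
    (re.compile(r"my (boyfriend|girlfriend|husband|wife) (.*)", re.I),
     ["YOUR {0} {1}", "TELL ME MORE ABOUT YOUR {0}"]),
    (re.compile(r"i am (.*)", re.I),
     ["I AM SORRY TO HEAR YOU ARE {0}", "WHY DO YOU THINK YOU ARE {0}?"]),
    (re.compile(r"(.*) all alike", re.I),
     ["IN WHAT WAY?"]),
]

def respond(statement: str) -> str:
    """Match the input against each rule and fill a reply template with its words."""
    for pattern, templates in RULES:
        match = pattern.search(statement)
        if match:
            template = random.choice(templates)
            return template.format(*(g.upper() for g in match.groups()))
    return "CAN YOU THINK OF A SPECIFIC EXAMPLE?"  # generic fallback

print(respond("Men are all alike."))
print(respond("Well, my boyfriend made me come here."))
print(respond("I am unhappy."))
```

No understanding of language is involved; the program only shuffles the user's own words into canned templates, which is exactly why it breaks down outside the situations its author anticipated.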
1972: PARRY
Kenneth Mark Colby at Stanford University develops PARRY, described as "ELIZA with attitude", the first chatbot designed to model a specific mental illness.
Written in LISP with approximately 500 heuristics, PARRY implements a crude
model of paranoid schizophrenic behavior using concepts, conceptualizations, and
beliefs.
In early testing, psychiatrists could only correctly identify PARRY as
a computer program 48% of the time, demonstrating successful simulation of
human-like responses.
1990s: Shift from Hand-Written Rules to Statistical Models
1960s-1980s: The Old Way: Hand-Crafted Linguistic Rules
Early NLP systems like ELIZA and PARRY were built using explicitly programmed
rules written by linguists and programmers. These systems worked like elaborate
"if-then" statements:
"If the user says 'I am sad,' then respond with 'Why do you think you are sad?'"
"If the sentence contains 'because,' then extract the reason that follows"
"If the input has the pattern 'I _ you,' then respond with 'Why do you _ me?'"
The Problems:
Brittleness: Rules only worked for exact situations anticipated by programmers
Scalability: Adding new capabilities required manually writing thousands more rules
Maintenance nightmare: Rules often conflicted with each other
Limited coverage: No system could handle the full complexity of natural language
Example: ELIZA's DOCTOR script had about 200 rules. To handle even basic
conversation, you'd need millions of rules covering every possible linguistic
pattern.
1990s: The New Way: Statistical Approaches
Instead of trying to manually encode language rules, why not let computers learn
patterns from actual language data? This was revolutionary, moving from human
intuition about language to mathematical analysis of how language actually
works.
Feed the computer massive amounts of text, and let it discover statistical
patterns:
Which words commonly appear together?
What's the probability that "the" is followed by a noun?
How often does "New" precede "York"?
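Here is a toy Python sketch of that statistical idea, using a tiny made-up corpus; real systems of the era learned these counts from millions of sentences.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; real systems used massive text collections.
corpus = "new york is a big city . the city never sleeps . new york is busy"
tokens = corpus.split()

# Count how often each word follows each other word (bigram counts).
following = defaultdict(Counter)
for current_word, next_word in zip(tokens, tokens[1:]):
    following[current_word][next_word] += 1

def prob(next_word: str, given: str) -> float:
    """Estimate P(next_word | given) from the bigram counts."""
    total = sum(following[given].values())
    return following[given][next_word] / total if total else 0.0

print(prob("york", given="new"))   # 1.0: in this corpus, "new" is always followed by "york"
print(prob("city", given="big"))   # 1.0
print(prob("sleeps", given="new")) # 0.0: never observed
```

Nobody wrote a rule saying "New" precedes "York"; the pattern falls out of the data, and it improves automatically as you add more text.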
The Three Enabling Factors
Steady Increases in Computational Power: Moore's Law in action, with computer processing power doubling approximately every two years throughout the 1980s and 1990s.
1980: Typical computers had ~1 MHz processors
1990: 25-33 MHz processors became common
2000: 1+ GHz processors were standard
1993: NVIDIA Founded With Gaming Focus
Jensen Huang, Chris Malachowsky, and Curtis Priem founded NVIDIA at a Denny's restaurant in East San Jose with $40,000 in capital. After near-bankruptcy in
1996, the 1997 RIVA 128 became their breakthrough product, followed by the
1999 invention of the GPU (Graphics Processing Unit) and successful IPO.
2006: NVIDIA Releases CUDA
NVIDIA releases CUDA (Compute Unified Device Architecture), a parallel computing
platform that fundamentally opened GPUs to general-purpose scientific computing.
This $1 billion investment enabled developers to use familiar programming
languages (C, C++, Fortran, Python) to harness GPU parallel processing power.
This opened the door for developers outside of video games to use GPUs for other purposes.
2017: "Attention Is All You Need"
Arguably the most important paper triggering the current AI revolution.
Eight researchers at Google Brain publish the groundbreaking paper, Attention
Is All You Need, introducing the
transformer architecture, fundamentally replacing Recurrent Neural Networks with
Attention-Based Processing.
Authors Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin demonstrate that attention
mechanisms alone are sufficient for sequence modeling, eliminating the need for
recurrence and enabling parallel processing, thus creating the "transformer" architecture.
The innovation was in how text was processed. Before transformers, the dominant approach for processing text or other sequences was Recurrent Neural Networks (RNNs):
Each word was processed one at a time in order
Each word depends on the previous word's processing
Information flows like a chain: Word 1 → Word 2 → Word 3 → Word 4
Critical Problems
Sequential bottleneck: You could not process Word 5 until you had finished processing Word 4, making training extremely slow.
Vanishing gradients: Information from early words got "forgotten" by the time you reached later words. For example, in "The cat that lived in the house on the hill was hungry," an RNN might "forget" about "cat" by the time it reaches "was hungry" because there are so many words in between.
Limited parallelization: Since each step depended on the previous step, you could not make efficient use of GPUs, which excel at parallel processing.
The Google Brain team made a radical proposal: What if we could let every word
in a sentence directly "look at" and "talk to" every other word simultaneously?
Instead of processing sequentially (word by word), the transformer processes all
words at once using something called attention mechanisms.
The transformer architecture solved these problems by:
Allowing each word to attend to every other word directly, capturing complex
relationships
Using multi-head attention to capture different aspects of the input
Using positional encoding to maintain sequence order without recurrence
Using self-attention to capture the context of the entire sequence at once,
allowing for parallel processing
Revolutionizing AI
Massive Parallelization: Before (RNNs): Processing a 100-word sentence
required 100 sequential steps. After (Transformers): Processing a 100-word
sentence happens in parallel across all 100 words.
Training time: Went from weeks to days, enabling much larger models.
Long-range dependencies: Every word has direct access to every other word's information, so models could handle much longer sequences and maintain coherence across them.
You could actually visualize what the model was paying attention to. In
translating "The animal didn't cross the street because it was too tired," you
could see the model correctly attending from "it" back to "animal" rather than
"street."
Attention Is All You Need did not just introduce a new technique; it established the architectural foundation that powers all modern AI systems. When you
interact with ChatGPT, Claude, or any modern AI assistant, you are interacting
with a direct descendant of the transformer architecture introduced in this
paper.
Me, really happy for my fellow Googlers who wrote a cool paper.
Transformer Architecture In Detail
You can skip if you are not interested in the math.
Every word can directly attend to every other word in the sentence.
Every word asks every other word: "How relevant are you to me?"
Words respond with relevance scores
Each word gets updated information based on all the relevant words
How it works mathematically:
Each word gets converted into three vectors: Query (Q), Key (K), and Value (V)
Each word's Query is compared against every other word's Key to produce relevance scores
The scores are scaled, normalized with a softmax, and used to take a weighted average of the Values
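Below is a minimal NumPy sketch of the resulting scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The tiny dimensions and random weight matrices are only for illustration; in a real transformer the W_q, W_k, W_v matrices are learned.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # how relevant each word is to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V, weights                 # weighted sum of Values, plus the attention map

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                         # 4 "words", 8-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))  # random, not trained
output, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(attn.round(2))   # each row sums to 1: how much each word attends to every other word
```

The attention map `attn` is exactly the kind of matrix you can visualize to see, for example, "it" attending back to "animal".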
Multi-Head Attention is another key component of the transformer architecture.
One attention mechanism might focus on grammar, another on meaning, another on
relationships.
How it works:
Run 8 different attention mechanisms in parallel
Each "head" learns to focus on different types of relationships
Example heads might specialize in:
Subject-verb relationships
Adjective-noun relationships
Long-distance dependencies
Syntactic patterns
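Continuing the NumPy sketch above (illustrative dimensions, not trained weights), multi-head attention simply splits the Q, K, and V vectors into slices, runs attention within each slice, and concatenates the results:

```python
def multi_head_attention(X, W_q, W_k, W_v, num_heads=8):
    """Split Q, K, V into heads, attend per head, then concatenate the head outputs."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads              # each head works on a smaller slice
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    outputs = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        head_out, _ = scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl])
        outputs.append(head_out)
    # (seq_len, d_model); in a real transformer this is followed by a learned linear layer.
    return np.concatenate(outputs, axis=-1)
```

Because each head sees a different slice of the vectors, each can specialize in a different kind of relationship.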
Since all words are processed simultaneously, how does the model know word order? The answer is Positional Encoding: the transformer architecture still needs to know the order of the words in the sentence.
The solution is to add mathematical "position stamps" to each word that encode
its location in the sequence.
Instead of processing sequentially to maintain order, they encoded position
information directly into the word representations.
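Here is a short NumPy sketch of the sinusoidal "position stamps" used in the paper; each position gets a unique pattern of sine and cosine values that is added to the corresponding word embedding (the dimensions are illustrative, and d_model is assumed even).

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # even embedding dimensions
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even slots get sines
    pe[:, 1::2] = np.cos(angles)                     # odd slots get cosines
    return pe

# Added to the word embeddings so the model can tell "word 1" from "word 5":
# embeddings = embeddings + positional_encoding(seq_len, d_model)
```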
2018: OpenAI GPT-1
OpenAI releases GPT-1, its first decoder-only transformer, with 117 million parameters trained on BookCorpus.
This established the GPT paradigm of unsupervised pre-training followed by
supervised fine-tuning, achieving state-of-the-art performance. Most LLMs today
follow this pattern of pre-training and fine-tuning.
2019: HuggingFace Transforms Model Accessibility
HuggingFace launches the Transformers library, democratizing access to pre-trained models with simple APIs. The platform becomes the "GitHub of machine learning," hosting over 500,000 models by 2024.
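To give a sense of how simple those APIs are, the example below loads a small pre-trained model through the library's pipeline interface. The model name and prompt are just examples; running it downloads weights from the Hub.

```python
from transformers import pipeline

# One line to load a pre-trained model from the Hugging Face Hub ("gpt2" is just an example).
generator = pipeline("text-generation", model="gpt2")

result = generator(
    "AI chatbots have been giving relationship advice since",
    max_new_tokens=30,
)
print(result[0]["generated_text"])
```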
2020: n8n Emerges as a Crucial Workflow Automation Platform
n8n emerges as a crucial workflow automation platform that enables non-technical users to integrate AI capabilities into business processes. Created by Jan Oberhauser, n8n's visual workflow builder allows users to connect AI models to databases, APIs, and business applications without coding, opening up these capabilities to a new demographic of users.
2022: ChatGPT Launches the Modern AI Era
OpenAI releases ChatGPT, built on GPT-3.5 and optimized using Reinforcement
Learning from Human Feedback (RLHF), a three-step process including supervised
fine-tuning, reward model training, and proximal policy optimization. In other words, OpenAI was able to tune the model's parameters to make it more "human-like".
ChatGPT reached 1 million users in 5 days and 100 million users in 2 months,
becoming the fastest-growing consumer application in history and bringing AI
capabilities to mainstream awareness.
2023: LangChain Launches Application Development
Harrison Chase creates LangChain, the first comprehensive framework for building
applications with language models. LangChain addresses critical gaps in LLM
development: memory management, chain-of-thought reasoning, document
processing, and agent behaviors. The framework introduces concepts like:
Chains: Combining LLM calls with other tools
Agents: LLMs that can use tools and make decisions
Memory: Persistent conversation and context management
Retrievers: Integration with vector databases for knowledge retrieval
LangChain was to AI development as ReactJS was to web development, introducing
reusable components and established patterns that significantly reduced
development complexity. The framework enabled developers to focus on application
logic rather than infrastructure concerns.
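To illustrate the chain and memory concepts, here is a plain-Python sketch rather than LangChain's actual API (which has changed considerably across versions); the fake_llm function stands in for a call to a real model.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # stand-in for a call to any language model

def fake_llm(prompt: str) -> str:
    return f"<model answer to: {prompt[:40]}...>"   # placeholder instead of a real model call

class ConversationChain:
    """A 'chain': format the prompt, call the model, remember the exchange."""
    def __init__(self, llm: LLM):
        self.llm = llm
        self.memory: List[str] = []                 # persistent conversation context

    def run(self, user_input: str) -> str:
        history = "\n".join(self.memory)
        prompt = f"Conversation so far:\n{history}\nUser: {user_input}\nAssistant:"
        answer = self.llm(prompt)
        self.memory += [f"User: {user_input}", f"Assistant: {answer}"]
        return answer

chain = ConversationChain(fake_llm)
print(chain.run("Summarize the transformer paper."))
print(chain.run("Now explain it to a five-year-old."))  # second call sees the remembered history
```

The value of the framework is that components like this (plus retrievers, agents, and tool use) come pre-built and composable instead of hand-rolled.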
2023: LlamaIndex Enables Document Intelligence
Jerry Liu creates LlamaIndex (formerly GPT Index) to solve the "LLM + your data"
problem. The framework specializes in ingesting, structuring, and querying large
document collections, enabling applications like querying your own data with Retrieval-Augmented Generation (RAG).
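Here is a minimal sketch of the RAG pattern itself, independent of LlamaIndex's API: embed the documents, retrieve the chunks most similar to the question, and pass them to the model as context. The embed_text and ask_llm helpers are toy placeholders for a real embedding model and LLM.

```python
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Toy embedding (letter frequencies); real systems use a learned embedding model."""
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec

def ask_llm(prompt: str) -> str:
    return f"<answer based on a prompt of {len(prompt)} characters>"   # placeholder

def answer_with_rag(question: str, documents: list[str], top_k: int = 2) -> str:
    """Retrieve the chunks most similar to the question, then ask the LLM with them as context."""
    q = embed_text(question)
    scores = []
    for doc in documents:
        d = embed_text(doc)
        scores.append(float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-9)))
    best = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:top_k]
    context = "\n\n".join(documents[i] for i in best)
    return ask_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```

Frameworks like LlamaIndex handle the hard parts this sketch skips: chunking documents, storing vectors efficiently, and keeping indexes up to date.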
2023: Meta's Open-Source Strategy
Meta released LLaMA 1 to researchers, which subsequently leaked and enabled
widespread open-source development. The July 2023 Llama 2 release with
commercial licensing and the April 2024 Llama 3 series democratized access to
high-quality language models, with the 405B parameter Llama 3.1 matching
closed-source model performance.
2023: GPT-4 Introduces Multimodal Capabilities
OpenAI releases GPT-4 with text and image inputs. The model underwent six
months of iterative alignment using adversarial testing, demonstrating improved
factuality, steerability, and safety.
2023: Google's Bard Responds to ChatGPT
Just 47 days after ChatGPT's launch, Google announced Bard in a "code red" response. A factual error in the demonstration cost Alphabet $100 billion in market value, highlighting the competitive pressure and stakes involved in the AI race.
2024: Model Context Protocol, The Universal Standard
Anthropic releases the Model Context Protocol (MCP), an open standard providing:
Universal connectivity between AI models and external systems
Dynamic tool discovery and capability expansion
Secure, scalable integration patterns
MCP has been described as the "USB-C for AI" because it standardizes how AI systems connect with external tools and data sources, eliminating the fragmentation that previously required custom integrations.
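As a rough illustration of that standardization (simplified; see the MCP specification for the exact message shapes), clients speak JSON-RPC messages such as tools/list and tools/call to any compliant server. The tool name and arguments below are hypothetical.

```python
import json

# Simplified sketch of MCP-style JSON-RPC messages (fields abridged).
list_tools_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_customer_db",                      # hypothetical tool exposed by some server
        "arguments": {"query": "orders from last week"},
    },
}

# The same two message shapes work against any MCP server, which is the point of the standard.
print(json.dumps(call_tool_request, indent=2))
```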
The Transformation: From Technical to Accessible
This evolution mirrors the broader democratization of technology:
Manual Era (2019): Specialists built connections to AI models and
external tools and data sources
Framework Era (2023): Software Engineers used reusable components and
established patterns to build applications
No-Code Era (2024): Non-technical users could build business
applications and LLMs are used for coding entire applications
Agentic Era (2024-2025): LLMs are enabled to use tools and data sources
to perform actions and solve problems
Standardization Era (2025+): Universal protocols enable seamless
interoperability between AI models and external tools and data sources
As we move along this progression, the same thing is happening as in any other arena, such as web development or databases:
The gap between technical and non-technical users is closing
There is growing interconnectivity between AI models and external tools
For example, in the early days of web development:
You had to know how to code HTML, CSS, and JavaScript
Frameworks like ReactJS and AngularJS emerged
Wix and other no-code platforms emerged
The same will happen with AI development.
Conclusion
The rapid evolution from ChatGPT's launch in 2022 to the establishment
of universal AI integration protocols in 2024 demonstrates the extraordinary
pace of AI development.
The future promises even greater accessibility and capability as these standards
mature and new abstractions emerge.
At my core, I call myself a Software Engineer, but specifically, like 99% of today's Software Engineers, I am a Web Developer. All the technology and products I make are related to the World Wide Web, even the Backend Engineering work. For the future, I need to be a Web Developer, but for AI. The parallels are clear, and the natural evolution of the field is toward AI.
Berry, David M. (2018). "Weizenbaum, ELIZA and the End of Human Reason". In
Baranovska, Marianna; Höltgen, Stefan (eds.). Hello, I'm Eliza: Fünfzig Jahre
Gespräche mit Computern [Hello, I'm Eliza: Fifty Years of Conversations with
Computers] (in German) (1st ed.). Berlin: Projekt Verlag. pp. 53–70.