Intelligence in AI: Define It, Measure It, Build It





Beyond the Hype: Why Current AI Falls Short of AGI and What Comes Next



Based on a 2024 presentation about the limitations of current AI systems and the path toward true artificial general intelligence.



The Peak AGI Hype of 2023

Do you remember what February 2023 felt like? ChatGPT had been released just a couple of months prior, GPT-4 was weeks away from launching, and Bing Chat was being hailed as the "Google killer." We were told that ChatGPT would make us 100x more productive, that it would replace most jobs, and that AGI was no longer a decade away—it was just around the corner.

The reasoning seemed sound: if AI could pass the bar exam, it could be a lawyer. If it could solve programming puzzles, it could be a software engineer. Many predicted that lawyers, software engineers, doctors, and most desk jobs would disappear within the year.

That was a year and a half ago. Today, US employment rates are actually higher than they were then.



The Reality Check: LLMs Have Fundamental Limitations

Away from the headlines and hype, it becomes clear that large language models (LLMs) might be a bit short of "general." They suffer from several inherent problems that stem from the paradigm we use to build these models—problems that haven't been solved in over five years of development.


Pattern Matching Without Understanding

LLMs are autoregressive models that always respond with what seems likely to follow your question, without necessarily analyzing the content of your question. For months after ChatGPT's release, if you asked "What's heavier, 10 kilos of steel or one kilo of feathers?" it would answer "They weigh the same"—because it had memorized the classic trick question about one kilo of each, without actually parsing the numbers in your specific question.
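To make that failure mode concrete, here is a deliberately toy sketch of autoregressive generation. The lookup table below is a stand-in for the learned next-token statistics of a real model; the point is only that the output is whatever tends to follow the prompt, with no step that checks the numbers in the question.

```python
# Toy sketch of autoregressive generation (not a real LLM).
# NEXT stands in for learned next-token statistics.
NEXT = {
    "feathers?": "they",
    "they": "weigh",
    "weigh": "the",
    "the": "same",
}

def generate(prompt, steps=4):
    tokens = prompt.lower().split()
    for _ in range(steps):
        nxt = NEXT.get(tokens[-1])   # pick the statistically likely continuation
        if nxt is None:
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("What's heavier, 10 kilos of steel or one kilo of feathers?"))
# -> "... feathers? they weigh the same", regardless of the numbers asked about
```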



Extreme Sensitivity to Phrasing

LLMs show remarkable brittleness when you change names, places, or variable names in text. This sensitivity suggests superficial pattern matching rather than robust understanding. For any LLM query that seems to work, there's usually an equivalent rephrasing that a human would readily understand but that will break the model's performance.

The Memorization Problem

LLMs appear capable of in-context learning and adapting to new problems, but what actually happens is that they fetch memorized problem-solving templates and map them to current tasks. When faced with something slightly unfamiliar—even if it's very simple—they cannot analyze it from first principles.

Take Caesar ciphers as an example: state-of-the-art LLMs can solve Caesar ciphers, but only for specific key sizes (like 3 and 5) that appear commonly in online examples. Show them a cipher with a key size of 13, and they fail completely. They have no actual understanding of the algorithm—only memorized solutions for specific cases.
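For contrast, a first-principles implementation is a few lines of Python and works identically for every key, not just the handful of key sizes that show up in web examples:

```python
# A first-principles Caesar cipher: shift each letter by `key` positions.
def caesar_shift(text: str, key: int) -> str:
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

ciphertext = caesar_shift("attack at dawn", 13)   # key = 13, or any other key
print(caesar_shift(ciphertext, -13))              # -> "attack at dawn"
```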



Limited Generalization

Even for problems LLMs have seen millions of times, like number multiplication or list sorting, they struggle with generalization and typically need external symbolic systems for assistance. Research shows that LLMs don't actually handle composition—instead, they perform "linearized subgraph matching."

Perhaps most surprisingly, there's the "reversal curse": if you train an LLM that "A is B," it cannot infer the reverse relationship that "B is A."


The Paradox: Great Benchmarks, Limited Understanding

This creates a puzzling situation: LLMs are beating every human benchmark we throw at them, yet they're not demonstrating robust understanding. The resolution lies in understanding that skill and benchmarks aren't the primary lens through which we should evaluate these systems.


Two Views of Intelligence

There have been two main approaches to defining AI's goals:

1. **The Minsky/Big Tech View**: AGI as a system that can perform most economically valuable tasks—focused on task-specific performance scaled to many tasks.

2. **The McCarthy/Locke View**: Intelligence as the ability to handle problems you haven't been prepared for—a general-purpose learning mechanism rather than a collection of specialized skills.

The key insight is this: **skill is not intelligence**. Displaying skill at any number of tasks doesn't demonstrate intelligence. It's always possible to be skillful at any given task without requiring intelligence.

Think of it as the difference between having a road network versus having a road-building company. A road network lets you travel between specific predetermined points, but a road-building company lets you connect arbitrary points as your needs evolve.

Intelligence is the ability to deal with new situations and blaze fresh trails—not just travel existing roads.


The Core Problem: How We Measure Intelligence

The way we define and measure intelligence reflects our understanding of cognition and limits the answers we can get. If we have a bad feedback signal, we won't make progress toward actual generality.

Three key concepts are essential for defining and measuring intelligence:


1. Static Skill vs. Fluid Intelligence

There's a crucial distinction between having access to a large collection of static programs (like LLMs) versus being able to synthesize brand new programs on the fly for problems you've never seen before.
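A toy sketch of the distinction (the primitives and tasks here are made up for illustration): a fixed library of memorized programs can only handle tasks it already has an entry for, whereas even a crude synthesizer can compose primitives on the fly to fit demonstration pairs it has never seen.

```python
from itertools import product

# "Static skill": a fixed lookup of memorized solutions.
MEMORIZED = {"reverse": lambda xs: xs[::-1], "sort": sorted}

# "Fluid" synthesis: search over compositions of primitives until one
# reproduces the demonstration pairs.
PRIMITIVES = {
    "reverse": lambda xs: xs[::-1],
    "sort": sorted,
    "drop_first": lambda xs: xs[1:],
    "double": lambda xs: [2 * x for x in xs],
}

def synthesize(examples, max_depth=2):
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(xs, names=names):
                for n in names:
                    xs = PRIMITIVES[n](xs)
                return xs
            if all(program(inp) == out for inp, out in examples):
                return names
    return None

# A task no single memorized entry covers: "drop the first element, then sort".
examples = [([3, 1, 2], [1, 2]), ([5, 4, 9, 0], [0, 4, 9])]
print(synthesize(examples))  # -> ('drop_first', 'sort')
```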


2. Operational Area
There's a big difference between being skilled only in situations very close to what you're familiar with versus being skilled in any situation within a broad scope. True intelligence should generalize broadly.


3. Information Efficiency
How much data was required for your system to acquire a new skill? More information efficiency indicates higher intelligence.

All three concepts are linked by **generalization**—the central question in AI, not skill.



The ARC Challenge: A Better Way to Measure Intelligence

To address these measurement problems, I developed the Abstraction and Reasoning Corpus (ARC)—an intelligence test that can be taken by humans or AI agents. ARC is designed around several key principles:

- **Every task is novel**: You cannot prepare for ARC by memorizing solutions
- **Few-shot learning**: You see 2-3 examples and must infer the underlying program
- **Core knowledge grounding**: Based only on fundamental cognitive systems like objectness, geometry, numbers, and basic physics—no acquired knowledge like language required
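
Concretely, each ARC task is a small JSON record of demonstration pairs plus held-out test inputs, where grids are lists of lists of integers 0-9. The tiny task and candidate program below are illustrative, not taken from the actual dataset:

```python
task = {
    "train": [  # 2-3 demonstration pairs; colors are integers 0-9
        {"input": [[0, 1], [2, 3]], "output": [[3, 2], [1, 0]]},
        {"input": [[5, 0], [0, 7]], "output": [[7, 0], [0, 5]]},
    ],
    "test": [{"input": [[1, 2], [3, 4]]}],
}

def candidate(grid):
    # Hypothesized program: rotate the grid by 180 degrees.
    return [row[::-1] for row in grid[::-1]]

# A solver must find a program consistent with every demonstration pair...
assert all(candidate(p["input"]) == p["output"] for p in task["train"])
# ...and then apply it to the held-out test input.
print(candidate(task["test"][0]["input"]))   # -> [[4, 3], [2, 1]]
```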

Current state-of-the-art LLMs perform poorly on ARC:

- Most LLMs: 5-9% accuracy
- Claude 3.5: 21% accuracy  
- Basic program search: ~50% accuracy
- Humans: 97-98% accuracy



Understanding Abstraction: The Engine of Generalization

The universe is full of analogies—everything is similar to everything else in some way. Intelligence is the ability to mine experience, identify reusable bits (abstractions), and recombine them to make sense of novel situations.

There are two key types of abstraction:


Value-Centric Abstraction (System 1/Type 1 Thinking)
- Operates in continuous domains
- Compares things via distance functions
- Powers perception, intuition, and pattern recognition
- **LLMs excel at this**


Program-Centric Abstraction (System 2/Type 2 Thinking)

- Operates in discrete domains
- Compares discrete programs through exact subgraph matching
- Powers explicit reasoning and planning
- **LLMs struggle with this**
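
A minimal sketch of the two comparison modes, using made-up toy data: value-centric abstraction compares things by distance in a continuous space, while program-centric abstraction compares discrete structures by exact matching of shared sub-parts.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Type 1: "how close are these?" -- graded, approximate, geometric.
cat, kitten, car = [0.9, 0.8, 0.1], [0.85, 0.75, 0.2], [0.1, 0.2, 0.9]
print(cosine(cat, kitten), cosine(cat, car))   # high vs. low similarity

# Type 2: "do these share an identical sub-structure?" -- exact, discrete.
def subtrees(expr):
    yield expr
    if isinstance(expr, tuple):
        for child in expr[1:]:
            yield from subtrees(child)

e1 = ("add", ("mul", "x", 2), 1)      # x*2 + 1
e2 = ("sub", ("mul", "x", 2), 5)      # x*2 - 5
print(set(subtrees(e1)) & set(subtrees(e2)))   # shares ('mul', 'x', 2), 'x', 2
```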


The Path Forward: Merging System 1 and System 2

Human intelligence combines both types of thinking. When playing chess, you use System 2 for step-by-step calculation, but System 1 intuition to narrow down which moves are worth calculating. This combination allows humans to play chess with minimal cognitive resources compared to computers.

The next breakthrough in AI will likely come from merging machine learning (System 1) with program synthesis (System 2). The key idea is to use fast but approximate neural network judgments to fight the combinatorial explosion that plagues discrete program search.

Think of it as drawing a map: you embed discrete objects and relationships into a geometric manifold where you can make fast inferences about relationships, then use this to guide more precise discrete search.
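Here is a rough sketch of that idea, reusing the toy primitives from the earlier synthesis example. A real system would use a trained neural network as the scorer; the stand-in heuristic below merely illustrates how fast, approximate judgments can prune the branches a discrete search has to expand.

```python
PRIMITIVES = {
    "reverse": lambda xs: xs[::-1],
    "sort": sorted,
    "drop_first": lambda xs: xs[1:],
    "double": lambda xs: [2 * x for x in xs],
}

def run(names, xs):
    for n in names:
        xs = PRIMITIVES[n](xs)
    return xs

def score(partial, examples):
    # Stand-in for a learned model: how promising does this partial program
    # look? (Here: fraction of examples whose output length it already matches.)
    return sum(len(run(partial, i)) == len(o) for i, o in examples) / len(examples)

def guided_search(examples, depth=3, beam=2):
    frontier = [()]
    for _ in range(depth):
        candidates = [p + (n,) for p in frontier for n in PRIMITIVES]
        for c in candidates:
            if all(run(c, i) == o for i, o in examples):
                return c
        # Keep only the few highest-scoring partial programs (beam search)
        # instead of expanding every branch of the combinatorial tree.
        frontier = sorted(candidates, key=lambda c: score(c, examples),
                          reverse=True)[:beam]
    return None

examples = [([3, 1, 2], [2, 4, 6]), ([4, 4], [8, 8])]   # "sort, then double"
print(guided_search(examples))   # -> ('sort', 'double')
```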


Two Promising Research Directions

1. Deep Learning as Components in Discrete Programs

- Use deep learning as a perception layer to parse the real world into discrete building blocks

- Add symbolic components to deep learning systems (like external verifiers and tool use for LLMs)




2. Deep Learning Models to Guide Discrete Search

- Use neural networks to provide intuitive program sketches
- Reduce the space of possible branching decisions at each search node



Current Progress and Future Outlook

Recent work combining LLMs with discrete program search shows promise. For example, using an LLM to generate natural-language hypotheses about an ARC task and then implementing candidate programs in Python yields roughly a 2x improvement in ARC performance.

The current state-of-the-art approach uses sophisticated prompting with GPT-4 to generate thousands of candidate programs per task, achieving 42% accuracy on ARC's public leaderboard.
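A sketch of the generate-and-verify loop this style of approach relies on. The `llm_propose_programs` function is a placeholder for whatever prompting pipeline produces candidate Python programs; the verification harness is the model-independent part.

```python
def llm_propose_programs(task, n=1000):
    # Placeholder: prompt an LLM with the demonstration pairs and ask for
    # n candidate Python functions, returned as source strings.
    raise NotImplementedError

def verify(source, task):
    env = {}
    try:
        exec(source, env)                      # expected to define transform(grid)
        return all(env["transform"](p["input"]) == p["output"]
                   for p in task["train"])
    except Exception:
        return False

def solve(task):
    # Keep the first candidate program that reproduces every demonstration
    # pair, then apply it to the held-out test inputs.
    for source in llm_propose_programs(task):
        if verify(source, task):
            env = {}
            exec(source, env)
            return [env["transform"](t["input"]) for t in task["test"]]
    return None
```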



The Road Ahead

We know that LLMs fall short of AGI—they excel at System 1 thinking but lack System 2 capabilities. Progress toward AGI has stalled because the fundamental limitations we see today are the same ones we've been dealing with for five years.

We need new ideas and breakthroughs. My prediction is that the next major advance will likely come from an outsider, while big labs remain focused on scaling current approaches.

The ARC challenge isn't just about creating better AI—it's about solving the fundamental question of how to make machines that can approach problems they've never seen before and figure them out. That's the key missing piece for achieving artificial general intelligence.

Perhaps the breakthrough will come from someone reading this post. The future of AI depends not just on bigger models, but on fundamentally new approaches that combine the best of both worlds: the pattern recognition power of deep learning with the flexible reasoning capabilities of program synthesis.

---

*This blog post is based on a presentation about the current limitations of AI systems and pathways toward AGI. The ARC Challenge offers over $1 million in prizes for researchers who can solve this fundamental problem in artificial intelligence.*
