The recent developments in AI, particularly around a new algorithm called "Q*," have sparked considerable intrigue in the machine learning community. This interest arose amid significant changes at OpenAI, marked by the controversial exit of CEO Sam Altman and rumors hinting at a critical AI breakthrough, potentially inching closer to artificial general intelligence (AGI). In this blog post, we will explain what Q-star is, why it's such a big deal, and how it may change the future of AI. Note that the information about Q* does not come from any paper or product released by OpenAI; it is the result of speculation and analysis within the AI community.
What is Q-star (Q*)?
Some AI researchers believe that Q* is a synthesis of A* (a pathfinding/search algorithm) and Q-learning (a reinforcement learning technique) that can solve math problems it never saw during training, without relying on external tools. This may not sound that impressive, since computers are designed to be good at math, but there is a reason OpenAI scientists are reportedly concerned about Q*. The algorithm is said to achieve 100% accuracy on such math problems, surpassing the benchmarks set by GPT-family models.
The current large language models are great at language-related tasks like translation or summarization but aren't good at mathematical logic and long-term strategy. They rely heavily on training data and can be considered 'information repeaters.' Q-star, on the other hand, is said to showcase impressive logic and long-term strategizing, which could be the next big step toward revolutionizing scientific research. The discussion around Q* extends beyond machine learning, touching on neuroscience and cognitive architecture, suggesting it could be more than a technical achievement: a significant breakthrough in AI research and a possible concern for humanity.
While this sounds like a cool scientific advancement, it might also be the reason behind the turbulent events at OpenAI that led the board – Adam D'Angelo, Tasha McCauley, Ilya Sutskever, and Helen Toner – to fire Sam Altman and then rehire him just a few days later.
Why is Q-star so "scary"?
It's no secret that rapid advancements in artificial intelligence may raise significant ethical concerns. The letter reportedly written by OpenAI researchers is said to express worries about the system's rapid progress, going as far as framing it as a potential "threat to humanity." To understand this better, let's talk about artificial general intelligence.
Artificial general intelligence (AGI)
Artificial general intelligence (AGI) is a highly advanced form of AI that's trying to replicate the way humans think and learn. Imagine a computer program that not only does specific tasks, like translating languages or playing games, but also figures out entirely new tasks on its own, just like a person would. AGI would be smart enough to know when it doesn't know something and then go out and learn it by itself. It could even change its own programming to better match what happens in the real world. Basically, AGI is about creating a machine that can do any intellectual job a human can and adapt and learn as flexibly as we do.
AGI is about the future of AI, where models are good at complex reasoning, making decisions under uncertainty, and possessing emotional and social intelligence. AGI could potentially innovate, create original content, and understand context and nuance in ways that current AI systems cannot. This level of intelligence would enable AGI systems to perform tasks ranging from composing music to conducting scientific research, essentially embodying the versatility and depth of human intelligence in a machine. Many researchers believe that Q* is a big step towards AGI and that serious AI regulation must be put in place before it's too late.
But before seeing Q* as a significant threat to humanity, give a quick listen to Shane Legg, chief scientist at Google DeepMind, who shares his doubts about models going beyond their training data.
The idea of AGI taking over sparks controversial opinions in the AI community. Here’s a tweet from Geoffrey Hinton who shared his thoughts on this and received interesting responses from Andrew Ng and Yann LeCun.
A* and Q-learning
To understand the concepts of A* and Q-learning, let's imagine the problem of navigating from a current state to a goal state, not in a physical space, but in an AI agent's environment. This process involves planning and decision-making, where the agent needs cognitive functions like brainstorming possible steps and evaluating them. Given the current state and the problem we want to solve, brainstorming the steps involves prompting strategies like tree of thoughts (ToT) and chain of thought (CoT).
Understanding these concepts will also help grasp the ideas of A* and Q-learning – both fundamental in AI goal-directed and decision-making behaviors.
What is A*?
The A* search algorithm is a powerful tool used in computer science to find the most efficient path between two points. It's especially useful in situations with many possible routes, like in a road network or a game map. A* works by exploring various paths, calculating the cost of each path based on factors like distance and any obstacles, and then using this information to predict the most efficient route to the goal. This prediction is based on a heuristic, which is a way of estimating the distance from any point on the map to the destination. As A* progresses, it refines its path choices until it finds the most efficient route, balancing exploring new paths and extending known ones. This makes A* highly efficient for tasks like GPS navigation, game AI for character movement, and solving complex puzzles.
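To make this concrete, here is a minimal A* sketch in Python on a toy grid map (the grid, start, and goal are made-up inputs). The key idea is that each candidate cell is ranked by f = g + h: the cost already paid plus a heuristic estimate (here, Manhattan distance) of the cost remaining to the goal.

```python
import heapq

def a_star(grid, start, goal):
    """Minimal A* on a 2D grid: 0 = free cell, 1 = obstacle."""
    def h(cell):  # Manhattan-distance heuristic: estimated cost to the goal
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_set = [(h(start), 0, start, [start])]  # entries: (f = g + h, g, cell, path)
    visited = set()
    while open_set:
        f, g, cell, path = heapq.heappop(open_set)   # expand the most promising cell
        if cell == goal:
            return path                              # cheapest path found
        if cell in visited:
            continue
        visited.add(cell)
        r, c = cell
        for nr, nc in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] == 0 and (nr, nc) not in visited):
                new_g = g + 1                        # each step costs 1
                heapq.heappush(open_set, (new_g + h((nr, nc)), new_g,
                                          (nr, nc), path + [(nr, nc)]))
    return None                                      # no path exists

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))   # route that detours around the obstacles
```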
The logic of A* in language models is rather complex. Although generative models don't navigate physical spaces, they still search through a complex space of possible reasoning steps and responses to find the most relevant answer to a given prompt. Here's where Q-learning comes in.
What is Q-learning?
Q-learning is a method in machine learning where an 'agent' learns to make decisions or take actions that lead to the best possible outcome in a given situation. This technique is part of reinforcement learning, which is about learning through interactions with an environment.
In Q-learning, the 'Q' stands for 'quality,' which refers to the value or benefit of taking a certain action in a specific state. The agent is rewarded for good actions and penalized for bad ones. Through repeated trials and learning from these rewards and penalties, the agent gradually understands the best series of actions to achieve its goal.
For example, if you were teaching a robot to navigate a maze, Q-learning would involve trying different paths and learning from each attempt. It keeps track of which actions (like turning left, right, or moving forward) in various parts of the maze led to success. Over time, the robot learns the most efficient path to the exit. This process is similar to how humans learn from their experiences, gradually improving their decision-making over time.
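Here is a minimal sketch of tabular Q-learning on a made-up one-dimensional "maze" (states 0 to 4, with a reward for reaching state 4). The heart of it is the standard update rule Q(s, a) ← Q(s, a) + α[r + γ·max Q(s', ·) − Q(s, a)], which nudges the stored value toward the reward plus the best value reachable from the next state.

```python
import random
from collections import defaultdict

# Hypothetical 1-D "maze": states 0..4, reaching state 4 gives reward 1.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                  # move left or right

alpha, gamma, epsilon = 0.1, 0.9, 0.2
Q = defaultdict(float)              # Q[(state, action)] -> estimated value

for episode in range(500):
    s = 0
    while s != GOAL:
        # Epsilon-greedy action selection: explore sometimes, exploit otherwise.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: nudge Q(s, a) toward reward + discounted best future value.
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# After training, the greedy policy should move right (+1) toward the goal from every state.
print([max(ACTIONS, key=lambda act: Q[(st, act)]) for st in range(N_STATES - 1)])
```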
Think of Q-learning as giving the AI system a cheat sheet of its successful and failed actions. In complex situations, however, this sheet can get too long and complicated, and that's where deep Q-learning comes in. Deep Q-learning uses a neural network to approximate the Q-value function instead of storing it explicitly in a table.
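As a rough sketch of the "network instead of a table" idea, assuming PyTorch as the framework: the network maps a state vector to one Q-value per action, and a single training step pulls the predicted Q(s, a) toward the usual target r + γ·max Q(s', ·). The dimensions and transition values here are illustrative, not anything specific to Q*.

```python
import torch
import torch.nn as nn

n_state_features, n_actions, gamma = 4, 3, 0.99

# The Q-table is replaced by a small network: state features in, one Q-value per action out.
q_net = nn.Sequential(nn.Linear(n_state_features, 64), nn.ReLU(),
                      nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(state, action, reward, next_state, done):
    """One deep Q-learning step on a single transition (state tensors are 1-D floats)."""
    q_pred = q_net(state)[action]                    # Q(s, a) predicted by the network
    with torch.no_grad():
        q_next = q_net(next_state).max()             # max_a' Q(s', a')
        target = reward + gamma * q_next * (1.0 - done)
    loss = nn.functional.mse_loss(q_pred, target)    # pull prediction toward the target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```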
Tree-of-thoughts (ToT) reasoning: Linking back to AlphaGo
The research around A* and Q-learning has raised interest in the search mechanism that might be used in the context of LLMs. Nathan Lambert speculates that Q* works by searching over language/reasoning steps via tree-of-thoughts (ToT) reasoning. The goal is to link large language model training and usage to the core components of deep reinforcement learning that enabled successes like AlphaGo: self-play and look-ahead planning.
Self-play is about the agent playing against different versions of itself and encountering challenging cases, thus improving its play. In the context of LLMs, you can think of reinforcement learning from AI feedback (RLAIF) as the "competing" element that improves the model's performance.
Look-ahead planning is the idea of using a model of the world to plan better future actions. There are two main variants of such planning: Model Predictive Control (MPC), which is typically used with continuous states, and Monte-Carlo Tree Search (MCTS), which works with discrete actions and states.
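To make the link a bit more concrete, here is a toy sketch of tree-of-thoughts-style look-ahead over reasoning steps. `propose_steps` and `score_state` are hypothetical stand-ins for an LLM that proposes candidate next steps and a value model that scores partial solutions; the search keeps only the best-scoring partial chains at each depth, a simple beam-search stand-in for the MCTS-style planning behind systems like AlphaGo.

```python
def tree_of_thoughts_search(problem, propose_steps, score_state,
                            depth=3, beam_width=2, branch=3):
    """Greedy beam search over chains of reasoning steps.

    propose_steps(problem, chain, k) -> k candidate next steps (hypothetical LLM call)
    score_state(problem, chain)      -> estimated value of a partial chain (hypothetical value model)
    """
    beams = [[]]                                   # start with an empty chain of thoughts
    for _ in range(depth):
        candidates = []
        for chain in beams:
            for step in propose_steps(problem, chain, branch):
                new_chain = chain + [step]
                candidates.append((score_state(problem, new_chain), new_chain))
        # Look-ahead pruning: keep only the most promising partial chains.
        candidates.sort(key=lambda x: x[0], reverse=True)
        beams = [chain for _, chain in candidates[:beam_width]]
    return beams[0]                                # best chain of reasoning steps found
```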
There’s still a lot of research to be done to thoroughly understand how these concepts link together in the realm of large language models.
New leaks on Q*
Last week, Matthew Berman dropped a video on his YouTube channel discussing new leaks about Q*. In the video, he discusses a tweet by an X user named Jimmy Apples, who has recently gained attention for accurate leaks and updates, primarily about OpenAI. In the tweet, the user says that the lawsuit between Elon Musk and OpenAI is delaying the already planned release of Q* and a model update ahead of the release of GPT-5.
Moving on to the actual leak, another X user posted a very intriguing tweet about Q*, noting that they couldn't confirm its authenticity and that the source of the information is unknown. The tweet says:
"Q* is a dialog system conceptualized by OpenAI, designed to enhance the traditional dialog generation approach through the implementation of an energy-based model (EBM)."
Traditional token prediction methods predict one word at a time, but Q* takes a different approach. It tries to mimic how humans think things through when faced with tricky situations, like deciding on the best move in a chess game: taking a moment to consider all the options deeply, which often leads to smarter decisions than just going with a gut reaction. By inferring hidden factors, like the latent variables in probabilistic graphical models, Q* is said to change how chat systems work, pushing towards conversations that feel more thought-out and insightful.
Energy-based model for dialog generation
Q*'s secret sauce is something called the EBM, or energy-based model. Think of it as the model's way of scoring how well an answer fits a question. It gives a score that represents the 'energy' of a response. Here's the principle: the lower the score, the better the fit between the question and the answer. So, a low score means a top-notch answer, while a high score points to a poorer answer. This feature lets Q* take a big-picture approach to picking responses. Instead of just piecing together words one by one, it gets a sense of the overall relevance and fittingness of an answer to the question at hand.
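Here is a minimal sketch, assuming PyTorch, of what such scoring could look like. The encoder that turns prompts and responses into vectors is left out, and every name here is illustrative rather than anything confirmed about Q*; the point is simply that the model outputs one scalar energy per (prompt, response) pair, and lower means a better fit.

```python
import torch
import torch.nn as nn

class DialogEnergyModel(nn.Module):
    """Toy energy-based scorer: E(prompt, response) -> scalar, lower = better fit."""
    def __init__(self, dim=128):
        super().__init__()
        self.energy_head = nn.Sequential(
            nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, prompt_vec, response_vec):
        pair = torch.cat([prompt_vec, response_vec], dim=-1)
        return self.energy_head(pair).squeeze(-1)    # one energy value per pair

# Hypothetical usage: pick the candidate response with the lowest energy.
model = DialogEnergyModel()
prompt = torch.randn(1, 128)                         # stand-in for an encoded prompt
candidates = torch.randn(5, 128)                     # stand-ins for encoded candidate responses
energies = model(prompt.expand(5, -1), candidates)
best = candidates[energies.argmin()]                 # lowest energy = best-fitting answer
```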
Optimization in abstract representation space
The interesting twist with Q* is that it doesn't search for better answers in the space of possible token strings. Instead, it works in an abstract representation space. Q* reportedly uses gradient descent, a method for finding the minimum of a function, to iteratively refine these abstract representations toward those that yield the lowest energy in relation to the prompt.
Once an optimal abstract representation—one that minimizes the EBM's output—is identified, Q* employs an autoregressive decoder to transform this abstract thought into a coherent text output. This step bridges the gap between the dialog system's non-linguistic, conceptual understanding and the linguistic output required for human interaction.
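A rough sketch, under the assumptions of the leak, of what "gradient descent in an abstract representation space" could look like: `energy_model` scores a (prompt, latent) pair, the latent z is refined by gradient descent rather than by generating tokens, and a hypothetical autoregressive `decoder` then turns the optimized latent into text.

```python
import torch

def optimize_latent(energy_model, prompt_vec, latent_dim=128, steps=100, lr=0.05):
    """Refine an abstract representation z so that E(prompt, z) is as low as possible."""
    z = torch.randn(1, latent_dim, requires_grad=True)   # initial abstract "thought"
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        energy = energy_model(prompt_vec, z)             # lower energy = better fit to the prompt
        optimizer.zero_grad()
        energy.sum().backward()                          # gradient descent on z, not on tokens
        optimizer.step()
    return z.detach()

# Hypothetical pipeline: optimize the latent, then decode it into text.
# z_star = optimize_latent(energy_model, prompt_vec)
# response_text = decoder.generate(z_star)               # autoregressive decoding step
```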
Q* training process
Within Q*, the EBM gets better through training with pairs of prompts and responses. The goal is to tweak the system's parameters to lower the energy for pairs that go well together while ensuring that mismatched pairs end up with higher energy. This training can use both contrastive methods, which help the system tell apart good from bad matches, and non-contrastive methods, which regularize the energy landscape so that low-energy responses are well-distributed among all possible answers.
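The contrastive part of such training could look like the following sketch (again assuming PyTorch, with illustrative names): a hinge-style loss pushes the energy of matched (prompt, response) pairs down and the energy of mismatched pairs up until they are separated by at least a margin.

```python
import torch

def contrastive_energy_loss(energy_model, prompt_vec,
                            good_response_vec, bad_response_vec, margin=1.0):
    """Hinge-style contrastive loss for an energy-based dialog model.

    Pushes E(prompt, good response) down and E(prompt, bad response) up
    until the two are separated by at least `margin`.
    """
    e_good = energy_model(prompt_vec, good_response_vec)   # should become low
    e_bad = energy_model(prompt_vec, bad_response_vec)     # should become high
    return torch.clamp(margin + e_good - e_bad, min=0).mean()

# Hypothetical training step:
# loss = contrastive_energy_loss(model, prompt, matched_resp, mismatched_resp)
# loss.backward(); optimizer.step()
```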
Implications for dialog systems
The way Q* uses EBMs to create conversations is a big leap from how things were done before. By focusing on optimizing over an abstract representation space using gradient descent for inference, Q* offers a smarter, potentially more effective way to come up with dialog responses. This approach doesn't just mean better-quality text; it could also pave the way for future AI systems that can think and chat more like humans.
What makes Q* work so well is the detail in its EBM, the complex landscape it has to navigate, and how accurately it can represent ideas. Its ability to mimic deep, thoughtful reasoning, like what humans do when we deliberate, sets a new standard for chat systems. However, training Q* to hit the right balance—making sure it can give specific, correct answers without losing variety—brings its own set of challenges and exciting possibilities for AI development.
Final thoughts
General language models are good at language-related tasks but bad at mathematical reasoning. Math requires formal logic and planning. Math is also a fundamental component of physics, chemistry, cryptography, and, finally, artificial intelligence itself. If Q* truly has the "talent" to solve math problems, it could open a new era of generative models that tackle an entirely new set of problems.
And it's not only about math. With its assumed new way of energy-based reasoning, Q* may change the whole game of teaching AI systems to act more like humans.
Q* may be the indicator of another round of possible AI breakthroughs. While these are exciting times for AI enthusiasts and researchers, it's essential to highlight the need for AI regulations and ethical norms, which have become increasingly critical.
Disclaimer: Readers are advised to take the presented information with a grain of salt. While these are the thoughts of many AI researchers and the results of their analyses, no official letter or announcement about Q* has been made by the board of OpenAI.