The frontier in AI models is shifting from general purpose LLMs (like GPT-4o and Llama 3.3) to advanced reasoning models. Models like o1 and QwQ-32B-Preview can now “think” through a question, and they excel in math and coding benchmarks.
At Ubicloud, we recently launched our inference APIs that serve open models. One of those models is QwQ-32B. We serve QwQ at $0.60 per M tokens versus o1’s $60 per M tokens, a 99% cost reduction. Since we already served QwQ, we figured it was a good opportunity to study reasoning models.
In quantitative math and coding benchmarks, QwQ-32B ties with o1 and outperforms Claude 3.5 Sonnet. In our qualitative tests, we found o1 to perform better.
Nonetheless, we’re super excited about both o1 and QwQ-32B. For verifiable tasks, these models notably outperform general purpose LLMs. They do this by using dynamic inference strategies and “thinking longer” on harder problems at inference time. This blog post provides some background on these models, presents our empirical comparison of o1 and QwQ, and describes how we see things playing out in the future.
Google’s paper on “Chain-of-Thought Prompting” initiated a new line of research on improving model performance. The paper showed that you could improve PaLM 540B’s solve rate on GSM8K from 18% to 57%, just by adding eight chain-of-thought examples to the prompt.
That is, before asking the model a math word problem, you prepend to the prompt a triplet <different problem, reasoning steps, different answer>. Doing this surfaces an emergent property in large AI models (over 10B parameters): the model starts laying out reasoning steps when answering the question, and through those steps it can invoke the relevant knowledge it already contains.
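To make this concrete, here is a minimal sketch of a chain-of-thought prompt. The worked example mirrors the style of the paper’s few-shot prompts, and complete() is a hypothetical helper standing in for whatever inference API you use.

```python
# Minimal chain-of-thought prompt: one worked <problem, reasoning, answer>
# example is prepended before the question we actually want answered.
# complete() is a hypothetical helper that calls your inference API.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""

print(complete(cot_prompt))  # the model now imitates the step-by-step style
```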
Other types of prompt engineering also lead to accuracy improvements. For example, if you keep asking a model to refine its thoughts or correct its mistakes, its performance tends to improve on simpler tasks. Academia calls this “iterative self-refinement”; Hacker News calls it “make it better.”
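In code, iterative self-refinement is little more than a loop that feeds the model’s previous answer back to it. The sketch below is our own illustration; complete() is again a hypothetical inference helper, and the critique prompt is our own wording.

```python
def self_refine(question: str, rounds: int = 3) -> str:
    """Naive "make it better" loop: answer, then repeatedly ask for a revision."""
    answer = complete(question)  # complete() is a hypothetical inference call
    for _ in range(rounds):
        critique_prompt = (
            f"Question: {question}\n"
            f"Your previous answer: {answer}\n"
            "Find any mistakes in the answer above and write an improved answer."
        )
        answer = complete(critique_prompt)
    return answer
```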
Now the trillion-dollar question is: given that AI models have this emergent reasoning property, how do you generalize it?
The current answer is “teaching” the model how to search over a solution space in steps, and then how to reward or critique each step and verify its results. To do this, you create datasets that follow your desired reasoning strategy and fine-tune an existing model on them.
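As an illustration of what such a dataset record might look like (our own hypothetical format, not the schema either lab actually used), each example pairs a problem with explicit, individually scorable reasoning steps and a verifiable answer:

```python
# One hypothetical training record for reasoning fine-tuning. Real datasets
# differ in schema, but the shape is the same: problem -> scored steps -> answer.
record = {
    "problem": "What is 15% of 240?",
    "steps": [
        {"text": "10% of 240 is 24.",              "reward": 1.0},
        {"text": "5% of 240 is half of that, 12.", "reward": 1.0},
        {"text": "24 + 12 = 36.",                  "reward": 1.0},
    ],
    "answer": "36",
}
```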
QwQ-32B-Preview and o1 are two models that do this today. To infer how they reason, we first asked QwQ-32B to “add a pair of parentheses to the incorrect equation: 1 + 2 * 3 + 4 * 5 + 6 * 7 + 8 * 9 = 479, to make the equation true.”
QwQ’s answer shows iterative reasoning steps, with the model using trial and error as its main heuristic. o1 follows a similar strategy, enumerating candidate solutions by trial and error.
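That trial-and-error strategy is easy to mimic mechanically. The brute-force sketch below is our own code, not what either model runs internally; it tries every valid placement of a single pair of parentheses and checks the result:

```python
# Brute-force "trial and error": insert one "(" and one ")" at every valid
# position in the expression and evaluate each candidate.
expr = "1 + 2 * 3 + 4 * 5 + 6 * 7 + 8 * 9"
tokens = expr.split()  # operands sit at even indices, operators at odd ones

solutions = []
for i in range(0, len(tokens), 2):        # "(" goes right before an operand
    for j in range(i, len(tokens), 2):    # ")" goes right after a later operand
        candidate = tokens[:i] + ["("] + tokens[i:j + 1] + [")"] + tokens[j + 1:]
        if eval(" ".join(candidate)) == 479:  # eval is fine for this fixed input
            solutions.append(" ".join(candidate))

print(solutions)  # ['1 + 2 * ( 3 + 4 * 5 + 6 ) * 7 + 8 * 9']
```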
We then gave a more complex math question to both models, one that involved polynomial equations. QwQ continued to use iterative reasoning with light heuristics. o1 answered the same question using a deepen-and-test (though not purely depth-first) search. On questions from different domains, we felt that o1 could employ a variety of search strategies. QwQ at times got stuck in recursive reasoning loops.
If you’re interested in this field, “Scaling Test Time Compute” talks about the latest research in open models. For o1, a recent paper on “Meta Chain-of-Thought” credibly argues that the model uses a higher level reasoning process in an auto-regressive fashion.
On verifying results, we also saw that both models checked their work. Human-annotated feedback, process reward models (PRMs), and critic models are three ways to teach a model how to do this today. However, neither model’s authors disclose their training method. The holy grail in advanced reasoning is self-verification, and there is an impression in the research community that o1 found a way to do this for verifiable domains.
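One common open recipe that uses a PRM at inference time (a sketch under our own assumptions, with hypothetical generate_step() and prm_score() helpers; it is not how o1 or QwQ were trained) samples several candidate next steps and keeps whichever one the reward model scores highest:

```python
def prm_guided_solve(problem: str, max_steps: int = 8, n_candidates: int = 4) -> list[str]:
    """Greedy step-by-step search guided by a process reward model (PRM).

    generate_step(problem, steps) -> str        # hypothetical step sampler
    prm_score(problem, steps, step) -> float    # hypothetical PRM scorer
    """
    steps: list[str] = []
    for _ in range(max_steps):
        proposals = [generate_step(problem, steps) for _ in range(n_candidates)]
        best = max(proposals, key=lambda step: prm_score(problem, steps, step))
        steps.append(best)
        if best.strip().endswith("[DONE]"):  # assumed stop marker from the sampler
            break
    return steps
```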
Finally, we found that o1 had other advantages over QwQ. For example, we asked both models to write example Python programs. Looking at the answers, we could tell that o1 was trained on a larger data set and that it knew about Python libraries that QwQ-32B didn’t.
We think OpenAI has two unfair advantages over QwQ-32B.
If we think that OpenAI has these unfair advantages, why are we serving QwQ-32B (and other open weight models)? Two reasons.
First, QwQ is still comparable to o1, and Ubicloud offers it for 100x less. You can run a dozen QwQ-32B instances, prompt them with different search strategies, use VMs to verify their results, and still come in under what o1 costs. In the short term, combining these classic AI search strategies with AI models feels much more efficient than trying to “teach” an uber AI model.
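As a rough sketch of that setup (our own illustration; qwq_complete() and run_in_vm() are hypothetical helpers, and the strategy prompts are our own wording), you fan the same problem out across several differently prompted QwQ instances and keep only the answers that pass verification:

```python
STRATEGIES = [
    "Solve the problem step by step, checking each intermediate result.",
    "Enumerate candidate answers by trial and error, then verify the best one.",
    "Work backwards from the required answer.",
]

def ensemble_solve(problem: str) -> list[str]:
    """Fan a problem out to several QwQ-32B calls and keep verified answers."""
    verified = []
    for strategy in STRATEGIES:
        # qwq_complete() is a hypothetical call to a QwQ-32B inference endpoint.
        answer = qwq_complete(f"{strategy}\n\nProblem: {problem}")
        # run_in_vm() is a hypothetical sandbox that executes the model's proposed
        # solution (e.g., generated code) and reports whether it checks out.
        if run_in_vm(answer):
            verified.append(answer)
    return verified
```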
Second, we think open source fosters collaboration and trust, and that is an unfair advantage that compounds over time. We foresee a future where open source AI not only delivers top-quality results, but also surpasses proprietary models in some areas. If you believe in that future and are looking for someone to partner with on the infrastructure side, please hit us up at [email protected]!