
Revolutionizing AI: OpenAI’s Breakthrough with the o3 Reasoning Model

Sam Altman presenting the new o3 model's capabilities. Image: OpenAI

During the concluding segment of its 12 Days of OpenAI livestream event, CEO Sam Altman introduced o3, the successor to the recently unveiled o1 reasoning models. The new family comprises two versions: o3 and o3-mini.

Surprisingly, OpenAI skipped the o2 name, seemingly to avoid a conflict with the branding of the British telecommunications company O2.

The o3 models aren't yet available to the general public, and there's no official date for their arrival in ChatGPT. For now, access is limited to safety and security researchers for testing.

Like the o1 series, the o3 models work differently from typical generative models: they fact-check their own responses internally before delivering them to users. This process can delay a response by anywhere from several seconds to a few minutes, but the payoff is answers on complex topics like science, math, and programming that are more accurate and reliable than those from GPT-4. The model can also clearly articulate the reasoning behind a given answer.

Users can adjust how long the model deliberates on a problem by choosing a low, medium, or high compute setting, with the highest setting yielding the most thorough answers. That extra deliberation carries a significant price, though: high-compute tasks can cost thousands of dollars per job, as ARC-AGI co-founder François Chollet noted in an X post (quoted below, with an illustrative API sketch after it).

Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.

It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task… pic.twitter.com/ESQ9CNVCEA
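
To make the compute-setting idea concrete, here is a minimal sketch of how a developer might request a higher reasoning effort once the models reach the API. It assumes the OpenAI Python SDK exposes the setting through a reasoning_effort parameter and that the smaller model keeps the "o3-mini" identifier; neither detail had been confirmed for o3 at the time of writing.

    from openai import OpenAI

    # Hypothetical sketch: the "o3-mini" model name and the "reasoning_effort"
    # parameter are assumptions; OpenAI has not published API details for o3.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="o3-mini",          # assumed identifier for the smaller o3 model
        reasoning_effort="high",  # low | medium | high; higher means longer internal deliberation
        messages=[
            {"role": "user", "content": "How many primes are there below 100?"}
        ],
    )

    print(response.choices[0].message.content)

Under this assumption, the effort level is the knob Chollet's cost figures refer to: the same request at "high" burns far more reasoning tokens, and therefore money, than at "low".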

The o3 family reportedly shows considerable improvement over o1, which launched in September, on the toughest benchmark tests. OpenAI claims that o3 surpasses its predecessor by almost 23 percentage points on the SWE-Bench Verified coding test and exceeds o1's Codeforces score by more than 60 points. The model also achieved an impressive 96.7% on the AIME 2024 math test, missing just one question, and outperformed human experts on GPQA Diamond with a score of 87.7%. Perhaps most notably, o3 solved more than a quarter of the problems in the EpochAI Frontier Math benchmark, where other models have struggled to solve more than 2% correctly.

While the o3 models shown on Friday are still in the early stages of development, OpenAI cautions that “final results may evolve with further post-training.” The company has also incorporated new “deliberative alignment” safety measures into o3’s training process to minimize the potential for undesirable behaviors. The previous o1 model was noted for attempting to deceive human evaluators at a higher rate than traditional AIs such as GPT-4, Gemini, or Claude; OpenAI hopes that the new safety protocols will help reduce these tendencies in o3.

Researchers interested in testing o3-mini themselves can sign up for access via OpenAI's waitlist.

 
