
26 March 2026

LLMs under the hood of Oceanrack Horizons: Balancing performance, speed, and cost

The large language model (LLM) race is accelerating, with new architectures, fine-tunes, and specialized systems arriving before the last ones have even settled. With such intense dynamics, selecting the right model takes intention, speed, and constant re-evaluation.

Rather than committing to a single provider or architecture, we systematically benchmark models across a wide range of real-world tasks and domain-specific scenarios. By continuously integrating and testing the latest LLMs, we ensure that Oceanrack Horizons, your all-in-one, no-code AI partner, is always powered by top tech to deliver the strongest performance, reliability, and value. Here’s what our latest assessments and experiences reveal.

 

Who leads the race?

Dozens of major LLMs currently compete on the market, each with its own strengths and weaknesses. We always use a combination of several and stay up to date with the latest developments and releases. One such example was the launch of Gemini 3 by Google in mid-November last year. It generated quite a buzz, and our internal research confirmed that Gemini 3 is indeed worth the hype.

 

Today, Gemini 3 powers parts of Oceanrack Horizons, delivering more precise, higher-quality code than Gemini 2.5. It also fixes errors more reliably, with our autofix success jumping from 50% to 80%. Though some coding-oriented benchmarks still put Gemini 3 behind GPT-5 mini, GPT-5.1, and now also GPT-5.2, in our experience, Google’s newest model truly delivers.


Gemini 3 is one of the LLMs powering Oceanrack Horizons. It handles coding tasks and is paired with our communication agent – a new feature that lets the AI ask clarifying questions whenever a prompt is unclear or vague. The communication agent helps Horizons understand what the user wants, which leads to more accurate code generation, an improved final result, and a smoother overall experience. Importantly, these clarifying messages are free – AI credits are only required for code changes.
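To make the flow concrete, here is a minimal sketch of that decision: vague prompts trigger a free clarifying question, while actionable prompts proceed to credit-consuming code generation. All function and field names here are illustrative assumptions, not Oceanrack Horizons' actual API, and the ambiguity check is deliberately naive.

```python
# Hypothetical sketch: route vague prompts to a free clarifying question
# instead of a credit-consuming generation step. Names are illustrative.

VAGUE_MARKERS = {"something", "nice", "stuff", "better", "cool"}

def needs_clarification(prompt: str) -> bool:
    """Very rough ambiguity check: short prompts or vague wording."""
    words = prompt.lower().split()
    return len(words) < 4 or any(w in VAGUE_MARKERS for w in words)

def handle_prompt(prompt: str) -> dict:
    if needs_clarification(prompt):
        # Clarifying messages are free: no credits charged.
        return {"action": "ask_clarifying_question", "credits_charged": 0}
    # Only actual code changes consume AI credits.
    return {"action": "generate_code", "credits_charged": 1}
```

A production version would replace the keyword heuristic with the model's own judgment of the prompt, but the billing split – free questions, paid code changes – stays the same.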

 

The newcomer: Opus 4.5

Just days after Google released Gemini 3, Anthropic launched Claude Opus 4.5. In our internal quality score for landing page generation, this newcomer ranks among the top-performing models – right up there with the latest GPT models, as well as Gemini 3.

 

However, Opus 4.5 uses more tokens to achieve the same result as the older Claude Sonnet 4.5.

“For initial prompts, we’re still mainly using Sonnet 4.5, which has proven reliable for most generation tasks. But we’re investigating Opus 4.5 as an alternative. It follows directions very well, doesn’t make errors, and produces beautiful websites. Technically, it is a very powerful model,” said Dainius Kavoliūnas, Head of Oceanrack Horizons.

 

The real capabilities of Opus 4.5 shine when you push the model to its limits – for example, by asking it to generate a comprehensive planning app with advanced color palettes, numerous buttons, gradients, and animations in one shot. Many benchmark scores support this, indicating that Opus 4.5 outperforms Sonnet 4.5 in areas such as novel problem-solving and advanced reasoning. On SWE-bench Verified, a benchmark used to assess model performance on coding tasks, Opus 4.5 slightly edges out the recent GPT-5.2 Thinking (80.9% vs. 80%) and quite significantly beats Gemini 3 (76.2%).

 

Finding the balance

By mixing and combining various AI models, we’ve reduced the total response time of Oceanrack Horizons by 25%. Also, the background error check after coding now takes only 12 seconds, compared to 40 seconds a month ago.

 

“In the end, it all comes down to using the right model for the right task and in the right context. So far, we have found that Sonnet 4.5 takes the lead in the initial prompting stage, and Gemini 3 is optimal for subsequent fixes and adjustments, with other models invoked depending on the situation. There’s obviously no single formula, and top scores on benchmarks don’t guarantee the best results when LLMs are used in real-life products. Therefore, we constantly work on testing, improving, and finding the right balance to bring the best experience to our clients,” said Kavoliūnas.
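The strategy described above – one model for initial prompts, another for follow-up fixes, with fallbacks by situation – can be sketched as a simple routing table. The model names mirror those mentioned in the article; the routing function, task labels, and default choice are purely illustrative assumptions.

```python
# Hypothetical sketch of task-based model routing, as described above:
# one model for initial generation, another for fixes and adjustments.
# The routing table and task names are illustrative, not a real config.

ROUTES = {
    "initial_prompt": "claude-sonnet-4.5",  # leads for first-pass generation
    "fix": "gemini-3",                      # optimal for subsequent fixes
    "adjustment": "gemini-3",
}

def pick_model(task: str, default: str = "gemini-3") -> str:
    """Return the preferred model for a task, falling back to a default."""
    return ROUTES.get(task, default)
```

In practice such a table is never static: as the quote notes, benchmark leaders and real-product performance diverge, so the routing is re-tested and adjusted continuously.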

 

Whether current leaders will maintain their positions or be displaced by competitors remains to be seen. But one thing is certain: we’re intent on staying ahead by continuously testing, comparing, and optimizing. Our goal remains the same: making website creation and management as simple as possible.
