Taalas is replacing programmable GPUs with hardwired AI chips to achieve 17,000 tokens per second for ubiquitous inference
In the high-stakes world of AI infrastructure, the industry has operated under a single assumption: flexibility is king. We build general-purpose GPUs because AI models change every week, and we need programmable silicon that can adapt to the next research breakthrough. But Taalas, a Toronto-based startup, argues that flexibility is exactly what is holding AI back. According to the Taalas team, if we want AI to be as common and cheap as plastic, we have to stop 'simulating' intelligence on general-purpose computers and start 'casting' it directly into silicon.

The Problem: The 'Memory Wall' and the GPU Tax

The current cost of running a Large Language Model (LLM) is driven by a physical bottleneck: the Memory Wall. Traditional processors (GPUs) are 'Instruction Set Architecture' (ISA) based: they separate compute from memory. When you run an inference pass on a model like Llama-3, the chip spends the vast majority of its time and energy shuttling weights from High Bandwidth Memory (HBM) into the compute cores.
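To see why the Memory Wall dominates, consider a rough, bandwidth-bound estimate of decode throughput. The figures below are assumptions for illustration, not from Taalas: a 70B-parameter model stored in FP16 and an H100-class accelerator with roughly 3.35 TB/s of HBM bandwidth. If every generated token requires streaming all the weights past the compute units once (batch size 1), throughput is capped at bandwidth divided by model size:

```python
# Back-of-envelope estimate of memory-bandwidth-bound decode speed.
# Assumed numbers (illustrative, not from the article):
PARAMS = 70e9            # 70B-parameter model, Llama-3-70B-sized
BYTES_PER_PARAM = 2      # FP16 weights
HBM_BANDWIDTH = 3.35e12  # ~3.35 TB/s, an H100-class HBM figure

model_bytes = PARAMS * BYTES_PER_PARAM  # ~140 GB of weights

# With batch size 1, each token streams every weight once, so the
# ceiling is bandwidth / model size -- regardless of how much raw
# FLOP/s the chip has.
tokens_per_sec = HBM_BANDWIDTH / model_bytes

print(f"~{tokens_per_sec:.0f} tokens/sec per stream")  # ~24
```

On these assumptions, a single stream tops out around two dozen tokens per second, three orders of magnitude below the 17,000 tokens/sec figure Taalas is targeting, which is why escaping the compute/memory split matters.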
