YouZum

Google’s ‘world-model’ bet: building the AI operating layer before Microsoft captures the UI

Google doubles down on its ‘world-model’ vision, racing to build an AI operating layer that can drive a universal personal assistant with Gemini, even as Microsoft moves to capture the enterprise UI. Here’s what’s at stake.

NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific Tasks

NVIDIA has released Llama Nemotron Nano 4B, an open-source reasoning model designed to deliver strong performance and efficiency across scientific tasks, programming, symbolic math, function calling, and instruction following, while being compact enough for edge deployment. With just 4 billion parameters, it achieves higher accuracy and up to 50% greater throughput than comparable open models with up to 8 billion parameters, according to internal benchmarks. The model is positioned as a practical foundation for deploying language-based AI agents in resource-constrained environments. By focusing on inference efficiency, Llama Nemotron Nano 4B addresses a growing demand for compact models capable of supporting hybrid reasoning and instruction-following tasks outside traditional cloud settings.

Model Architecture and Training Stack

Nemotron Nano 4B builds upon the Llama 3.1 architecture and shares lineage with NVIDIA’s earlier “Minitron” family. The architecture follows a dense, decoder-only transformer design, optimized for reasoning-intensive workloads while maintaining a lightweight parameter count.

The post-training stack includes multi-stage supervised fine-tuning on curated datasets for mathematics, coding, reasoning tasks, and function calling. In addition to traditional supervised learning, Nemotron Nano 4B has undergone reinforcement learning optimization using Reward-aware Preference Optimization (RPO), a method intended to enhance the model’s utility in chat-based and instruction-following environments. This combination of instruction tuning and reward modeling helps align the model’s outputs more closely with user intent, particularly in multi-turn reasoning scenarios. The training approach reflects NVIDIA’s emphasis on aligning smaller models to practical usage tasks that traditionally require significantly larger parameter counts.

Performance Benchmarks

Despite its compact footprint, Nemotron Nano 4B exhibits robust performance in both single-turn and multi-turn reasoning tasks. According to NVIDIA, it provides 50% higher inference throughput compared to similar open-weight models within the 8B parameter range. The model supports a context window of up to 128,000 tokens, which is particularly useful for tasks involving long documents, nested function calls, or multi-hop reasoning chains. While NVIDIA has not disclosed full benchmark tables in the Hugging Face documentation, the model reportedly outperforms other open alternatives in benchmarks across math, code generation, and function-calling precision. Its throughput advantage suggests it can serve as a viable default for developers targeting efficient inference pipelines with moderately complex workloads.

Edge-Ready Deployment

One of the core differentiators of Nemotron Nano 4B is its focus on edge deployment. The model has been explicitly tested and optimized to run efficiently on NVIDIA Jetson platforms and NVIDIA RTX GPUs. This enables real-time reasoning capabilities on low-power embedded devices, including robotics systems, autonomous edge agents, and local developer workstations. For enterprises and research teams concerned with privacy and deployment control, the ability to run advanced reasoning models locally, without relying on cloud inference APIs, can provide both cost savings and greater flexibility.

Licensing and Access

The model is released under the NVIDIA Open Model License, which permits commercial usage. It is available through Hugging Face at huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1, with all relevant model weights, configuration files, and tokenizer artifacts openly accessible. The license structure aligns with NVIDIA’s broader strategy of supporting developer ecosystems around its open models.
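For developers who want to experiment with the checkpoint locally, the snippet below is a minimal sketch of loading it with the Hugging Face transformers library and running a single reasoning prompt. It assumes the checkpoint works with the standard causal-LM and chat-template interfaces and that a CUDA-capable GPU is available; the model card documents the officially recommended prompts and generation settings.

```python
# Minimal sketch: load Llama-3.1-Nemotron-Nano-4B-v1.1 and run one reasoning prompt.
# Assumes the standard transformers causal-LM interface applies to this checkpoint;
# see the model card on Hugging Face for the officially recommended settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # keep memory low on edge-class GPUs
    device_map="auto",
)

messages = [
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```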
Conclusion

Nemotron Nano 4B represents NVIDIA’s continued investment in bringing scalable, practical AI models to a broader development audience, especially those targeting edge or cost-sensitive deployment scenarios. While the field continues to see rapid progress in ultra-large models, compact and efficient models like Nemotron Nano 4B provide a counterbalance, enabling deployment flexibility without compromising too heavily on performance.

Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project. The post NVIDIA Releases Llama Nemotron Nano 4B: An Efficient Open Reasoning Model Optimized for Edge AI and Scientific Tasks appeared first on MarkTechPost.

Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers

LLMs have shown impressive capabilities across various programming tasks, yet their potential for program optimization has not been fully explored. While some recent efforts have used LLMs to enhance performance in languages like C++ and Python, the broader application of LLMs to optimizing code, especially in low-level programming contexts, remains limited. Existing LLM benchmarks largely focus on code generation from natural language or solving GitHub issues, as seen in HumanEval, MBPP, APPS, SWE-bench, and SWE-agent. Moreover, models such as Codex, AlphaCode, and Code Llama primarily aim to improve code generation quality rather than performance. Some research has begun addressing optimization, including parallelization and code efficiency improvements, though many of these approaches are constrained by the need for formal verification, which limits scalability. In contrast, newer methods embrace test-based validation, allowing optimization of more complex programs with loops.

Learning-based strategies in compiler optimization, such as AutoPhase, which uses reinforcement learning for pass sequencing, and Coreset, which applies graph neural networks, have shown promise in improving performance. Superoptimization techniques aim to find the most efficient version of a program but are typically restricted to small-scale problems. Frameworks like AutoTVM and Ansor have focused on optimizing GPU kernel code through statistical modeling and search. Recently, LLM-driven optimization has gained attention, with reinforcement learning approaches guiding LLMs using feedback from test cases. Techniques like CodeRL and PPOCoder leverage policy optimization methods to fine-tune models for better performance, even in resource-constrained programming languages like Verilog.

Researchers from Stanford, UIUC, CMU, and Visa Research explore using LLMs to optimize assembly code performance, an area traditionally handled by compilers like GCC. They introduce a reinforcement learning framework using Proximal Policy Optimization (PPO), guided by a reward that balances correctness and speedup over the gcc -O3 baseline. Using a dataset of 8,072 real-world programs, their model, Qwen2.5-Coder-7B-PPO, achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. The results show that, with RL training, LLMs can outperform conventional compiler optimizations.

The methodology involves optimizing compiled C programs for performance using an RL approach. Given a C program C, it is compiled to an assembly program P using gcc -O3. The goal is to generate a new assembly program P’ that is functionally equivalent but faster. Correctness is verified against a test set, and speedup is measured by the improvement in execution time. Using CodeNet as the dataset, the authors apply PPO to train a language model that generates improved code. Two reward functions, Correctness-Guided Speedup and Speedup-Only, are used to guide training based on program validity, correctness, and performance gains.
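The exact reward definitions are given in the paper; as a rough illustration only, the sketch below shows one plausible way a correctness-guided speedup reward could be computed from test results and timing measurements. The helper names and the specific shaping (zero reward on a failed build, partial credit for partially passing programs, speedup credit only once all tests pass) are assumptions for illustration, not the authors’ exact formulation.

```python
# Illustrative sketch (not the paper's exact reward): score a candidate assembly
# program against the gcc -O3 baseline using test outcomes and measured runtimes.
# `compile_and_run` and `time_program` are hypothetical helpers supplied by the caller.

def correctness_guided_speedup_reward(candidate, baseline, tests,
                                      compile_and_run, time_program):
    """Hypothetical reward: correctness first, speedup only once all tests pass."""
    results = [compile_and_run(candidate, t) for t in tests]
    if any(r is None for r in results):            # failed to assemble or crashed
        return 0.0
    pass_rate = sum(r.passed for r in results) / len(tests)
    if pass_rate < 1.0:
        return pass_rate                           # partial credit for correctness only
    speedup = time_program(baseline) / time_program(candidate)
    return pass_rate + max(0.0, speedup - 1.0)     # reward improvement over gcc -O3


def speedup_only_reward(candidate, baseline, tests, compile_and_run, time_program):
    """Hypothetical alternative: score speedup directly, gated on passing all tests."""
    results = [compile_and_run(candidate, t) for t in tests]
    if any(r is None or not r.passed for r in results):
        return 0.0
    return time_program(baseline) / time_program(candidate)
```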
The study evaluates various language models on optimizing assembly code, revealing that most models struggle with low test pass rates and minimal speedups. However, Qwen2.5-Coder-7B-PPO, trained with reinforcement learning, significantly outperforms the others, achieving 96% accuracy and a 1.47× average speedup. Ablation studies show that using gcc -O3 as a reference aids performance, while removing it leads to sharp declines. Notably, models like Claude-3.7-sonnet can surpass compilers by identifying hardware-specific optimizations, such as replacing loops with a single popcnt instruction, demonstrating an ability to perform semantic-level code transformations beyond traditional compiler capabilities.

In conclusion, the study explores using LLMs to optimize assembly code, a domain where traditional compilers struggle due to the complexity of low-level performance tuning. The authors fine-tune Qwen2.5-Coder-7B using PPO, rewarding both correctness (via test cases) and speedup over gcc -O3. They introduce a benchmark of 8,072 real-world C programs to evaluate performance. The model achieves a 96.0% test pass rate and a 1.47× average speedup, outperforming 20 other models, including Claude-3.7-sonnet. While effective, limitations include the lack of formal correctness guarantees and variability in hardware performance across systems.

Check out the Paper. All credit for this research goes to the researchers of this project. The post Optimizing Assembly Code with LLMs: Reinforcement Learning Outperforms Traditional Compilers appeared first on MarkTechPost.

Encoders and Decoders in Transformer Models

This article is divided into three parts; they are:

• Full Transformer Models: Encoder-Decoder Architecture
• Encoder-Only Models
• Decoder-Only Models

The original transformer architecture, introduced in “Attention Is All You Need,” combines an encoder and decoder specifically designed for sequence-to-sequence (seq2seq) tasks like machine translation.
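As a concrete illustration of the three families, the sketch below instantiates one publicly available checkpoint of each kind with the Hugging Face transformers library: an encoder-decoder model for seq2seq tasks, an encoder-only model for producing contextual representations, and a decoder-only model for autoregressive generation. The specific model names are illustrative choices, not models discussed in the article.

```python
# Illustrative sketch: one representative checkpoint per transformer family.
# The specific model names are examples chosen for illustration.
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,   # encoder-decoder (e.g., translation, summarization)
    AutoModel,               # encoder-only (e.g., sentence representations)
    AutoModelForCausalLM,    # decoder-only (e.g., text generation)
)

# Encoder-decoder: the full original architecture, suited to seq2seq tasks.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
ids = t5_tok("translate English to German: The house is small.", return_tensors="pt")
print(t5_tok.decode(t5.generate(**ids, max_new_tokens=20)[0], skip_special_tokens=True))

# Encoder-only: bidirectional attention, produces contextual embeddings.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
emb = bert(**bert_tok("Transformers encode context.", return_tensors="pt")).last_hidden_state
print(emb.shape)  # (batch, sequence_length, hidden_size)

# Decoder-only: causal attention, generates text left to right.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gpt_tok("The encoder-decoder transformer was designed for", return_tensors="pt")
print(gpt_tok.decode(gpt.generate(**ids, max_new_tokens=20, do_sample=False)[0]))
```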

This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM Inference

A prominent area of exploration involves enabling large language models (LLMs) to function collaboratively. Multi-agent systems powered by LLMs are now being examined for their potential to coordinate on challenging problems by splitting tasks and working simultaneously. This direction has gained attention due to its potential to increase efficiency and reduce latency in real-time applications.

A common issue in collaborative LLM systems is the agents’ sequential, turn-based communication. In such systems, each agent must wait for others to complete their reasoning steps before proceeding. This slows down processing, especially in situations demanding rapid responses. Moreover, agents often duplicate effort or generate inconsistent outputs, because they cannot see the evolving thoughts of their peers during generation. This latency and redundancy reduce the practicality of deploying multi-agent LLMs, particularly when time and computation are constrained, as on edge devices.

Most current solutions rely on sequential or independently parallel sampling techniques to improve reasoning. Methods like Chain-of-Thought prompting help models solve problems in a structured way but often come with increased inference time. Approaches such as Tree-of-Thoughts and Graph-of-Thoughts expand on this by branching reasoning paths, yet they still do not allow real-time mutual adaptation among agents. Multi-agent setups have explored collaborative methods, but mostly through alternating message exchanges, which again introduces delays. Some advanced systems propose complex dynamic scheduling or role-based configurations that are not optimized for efficient inference.

Researchers at MediaTek Research introduced a new method called Group Think. This approach enables multiple reasoning agents within a single LLM to operate concurrently, observing each other’s partial outputs at the token level. Each reasoning thread adapts to the evolving thoughts of the others mid-generation. This mechanism reduces duplication and enables agents to shift direction if another thread is better positioned to continue a specific line of reasoning. Group Think is implemented through a token-level attention mechanism that lets each agent attend to the previously generated tokens of all agents, supporting real-time collaboration.

The method works by assigning each agent its own sequence of token indices, allowing the agents’ outputs to be interleaved in memory. These interleaved tokens are stored in a shared cache accessible to all agents during generation. This design allows efficient attention across reasoning threads without architectural changes to the transformer model. The implementation works both on personal devices and in data centers. On local devices, it makes effective use of idle compute by batching multiple agent outputs, even with a batch size of one. In data centers, Group Think allows multiple requests to be processed together, interleaving tokens across agents while maintaining correct attention dynamics.
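The paper describes this mechanism at the level of attention and cache layout; the following is only a conceptual sketch of what interleaved, shared-cache generation could look like. The round-robin scheduling, the SharedKVCache class, and the model.next_token call are hypothetical stand-ins introduced for illustration and do not reflect MediaTek’s actual implementation.

```python
# Conceptual sketch only: it mirrors the described mechanism (interleaved tokens in a
# shared cache that every reasoning thread attends over), not MediaTek's actual code.
# `model.next_token` is a hypothetical stand-in for one decoding step of an LLM.

class SharedKVCache:
    """Toy stand-in for a shared cache of interleaved (agent_id, token) entries."""
    def __init__(self, prompt_tokens):
        self.entries = [(None, t) for t in prompt_tokens]   # prompt visible to all agents
    def append(self, agent_id, token):
        self.entries.append((agent_id, token))              # immediately visible to peers


def group_think_generate(model, prompt_tokens, num_agents=4, max_steps=256):
    """Round-robin, token-level interleaving of several reasoning threads."""
    cache = SharedKVCache(prompt_tokens)
    agent_tokens = [[] for _ in range(num_agents)]
    for _ in range(max_steps):
        for agent_id in range(num_agents):
            # Each agent decodes its next token while attending over *all* tokens
            # in the shared cache, including those its peers just produced.
            token = model.next_token(cache=cache, agent_id=agent_id)
            agent_tokens[agent_id].append(token)
            cache.append(agent_id, token)
        if all(t and t[-1] == model.eos_token_id for t in agent_tokens):
            break
    return agent_tokens
```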
Performance tests demonstrate that Group Think significantly improves latency and output quality. In enumeration tasks, such as listing 100 distinct names, it achieved near-complete results more rapidly than conventional Chain-of-Thought approaches, and the acceleration was proportional to the number of thinkers; for example, four thinkers reduced latency by a factor of about four. In divide-and-conquer problems, using the Floyd–Warshall algorithm on a graph of five nodes, four thinkers cut completion time to half that of a single agent. In programming tasks, Group Think solved code generation challenges more effectively than baseline models, and with four or more thinkers it produced correct code segments much faster than traditional reasoning models.

This research shows that existing LLMs, though not explicitly trained for collaboration, can already exhibit emergent group reasoning behaviors under the Group Think setup. In experiments, agents naturally diversified their work to avoid redundancy, often dividing tasks by topic or focus area. These findings suggest that Group Think’s efficiency and sophistication could be enhanced further with dedicated training on collaborative data.

Check out the Paper. All credit for this research goes to the researchers of this project. The post This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and Collaborative LLM Inference appeared first on MarkTechPost.
