Localmax dynamics for attention in transformers and its asymptotic behavior

arXiv:2509.15958v1 Announce Type: new Abstract: We introduce a new discrete-time attention model, termed the localmax dynamics, which interpolates between the classic softmax dynamics and the hardmax dynamics, where only the tokens that maximize the influence toward a given token have a positive weight. As in hardmax, uniform weights are determined by a parameter controlling neighbor influence, but the key extension lies in relaxing neighborhood interactions through an alignment-sensitivity parameter, which allows controlled deviations from pure hardmax behavior. As we prove, while the convex hull of the token states still converges to a convex polytope, its structure can no longer be fully described by a maximal alignment set, prompting the introduction of quiescent sets to capture the invariant behavior of tokens near vertices. We show that these sets play a key role in understanding the asymptotic behavior of the system, even under time-varying alignment sensitivity parameters. We further show that localmax dynamics does not exhibit finite-time convergence and provide results for vanishing, nonzero, time-varying alignment-sensitivity parameters, recovering the limiting behavior of hardmax as a by-product. Finally, we adapt Lyapunov-based methods from classical opinion dynamics, highlighting their limitations in the asymmetric setting of localmax interactions and outlining directions for future research.
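The abstract does not spell out the update rule, so the following is a purely illustrative sketch under one plausible reading: each token places uniform weight on the tokens whose alignment with it lies within an alignment-sensitivity margin eps of the maximum (eps = 0 would recover hardmax-style selection). The function names and the convex-combination update are assumptions for illustration, not the paper's definition.

import numpy as np

def localmax_weights(scores: np.ndarray, eps: float) -> np.ndarray:
    # Uniform weight on tokens whose alignment is within eps of the maximum
    # (assumption: one way to "relax" the hardmax selection set).
    mask = scores >= scores.max() - eps
    return mask.astype(float) / mask.sum()

def localmax_step(x: np.ndarray, eps: float) -> np.ndarray:
    # One discrete-time update: every token moves to the average of its selected neighbors.
    scores = x @ x.T                                   # pairwise alignments
    w = np.stack([localmax_weights(scores[i], eps) for i in range(len(x))])
    return w @ x

x = np.random.default_rng(0).normal(size=(6, 2))       # 6 toy token states in 2D
for _ in range(100):
    x = localmax_step(x, eps=0.1)                      # convex hull contracts toward a polytope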

Tag&Tab: Pretraining Data Detection in Large Language Models Using Keyword-Based Membership Inference Attack

arXiv:2501.08454v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have become essential tools for digital task assistance. Their training relies heavily on the collection of vast amounts of data, which may include copyright-protected or sensitive information. Recent studies on detecting pretraining data in LLMs have primarily focused on sentence- or paragraph-level membership inference attacks (MIAs), usually involving probability analysis of the target model’s predicted tokens. However, these methods often exhibit poor accuracy, failing to account for the semantic importance of textual content and word significance. To address these shortcomings, we propose Tag&Tab, a novel approach for detecting data used in LLM pretraining. Our method leverages established natural language processing (NLP) techniques to tag keywords in the input text, a process we term Tagging. Then, the LLM is used to obtain probabilities for these keywords and calculate their average log-likelihood to determine input text membership, a process we refer to as Tabbing. Our experiments on four benchmark datasets (BookMIA, MIMIR, PatentMIA, and the Pile) and several open-source LLMs of varying sizes demonstrate an average increase in AUC scores ranging from 5.3% to 17.6% over state-of-the-art methods. Tag&Tab not only sets a new standard for data leakage detection in LLMs, but its outstanding performance is a testament to the importance of words in MIAs on LLMs.
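A hedged sketch of the two-step recipe described in the abstract, using GPT-2 as a stand-in target model. The keyword "Tagging" here is approximated by taking the longest non-stopword word types; the paper's actual NLP-based tagging, thresholds, and scoring details may differ.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on", "that", "with"}

def tag_keywords(text: str, k: int = 8) -> set:
    # Proxy "Tagging": longest non-stopword words stand in for NLP keyword selection.
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return set(sorted((w for w in words if w and w not in STOPWORDS), key=len, reverse=True)[:k])

def tag_and_tab_score(text: str, model_name: str = "gpt2") -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    keywords = tag_keywords(text)
    enc = tok(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]
    with torch.no_grad():
        logits = model(**enc).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = enc["input_ids"][0, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # "Tabbing": average log-likelihood over tokens that fall inside a tagged keyword.
    kept = []
    for i, (start, end) in enumerate(offsets[1:]):
        piece = text[int(start):int(end)].strip(" .,;:!?\"'()").lower()
        if piece and any(piece in kw for kw in keywords):
            kept.append(token_lp[i])
    score = torch.stack(kept).mean() if kept else token_lp.mean()
    return float(score)   # higher average log-likelihood suggests the text was seen in pretraining

print(tag_and_tab_score("The quick brown fox jumps over the lazy dog near the riverbank."))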

Xiaomi Released MiMo-Audio, a 7B Speech Language Model Trained on 100M+ Hours with High-Fidelity Discrete Tokens

Xiaomi's MiMo team released MiMo-Audio, a 7-billion-parameter audio-language model that runs a single next-token objective over interleaved text and discretized speech, scaling pretraining beyond 100 million hours of audio.

What's actually new?
Instead of relying on task-specific heads or lossy acoustic tokens, MiMo-Audio uses a bespoke RVQ (residual vector quantization) tokenizer that targets both semantic fidelity and high-quality reconstruction. The tokenizer runs at 25 Hz and outputs 8 RVQ layers (≈200 tokens/s), giving the LM access to "lossless" speech features it can model autoregressively alongside text.

Architecture: patch encoder → 7B LLM → patch decoder
To handle the audio/text rate mismatch, the system packs four timesteps per patch for LM consumption (downsampling 25 Hz → 6.25 Hz), then reconstructs full-rate RVQ streams with a causal patch decoder. A delayed multi-layer RVQ generation scheme staggers predictions per codebook to stabilize synthesis and respect inter-layer dependencies. All three parts—patch encoder, MiMo-7B backbone, and patch decoder—are trained under a single next-token objective.

https://xiaomimimo.github.io/MiMo-Audio-Demo/

Scale is the algorithm
Training proceeds in two big phases: (1) an "understanding" stage that optimizes text-token loss over interleaved speech-text corpora, and (2) a joint "understanding + generation" stage that turns on audio losses for speech continuation, S2T/T2S tasks, and instruction-style data. The report emphasizes a compute/data threshold where few-shot behavior appears to "switch on," echoing emergence curves seen in large text-only LMs.

Benchmarks: speech intelligence and general audio
MiMo-Audio is evaluated on speech-reasoning suites (e.g., SpeechMMLU) and broad audio understanding benchmarks (e.g., MMAU), reporting strong scores across speech, sound, and music and a reduced "modality gap" between text-only and speech-in/speech-out settings. Xiaomi also releases MiMo-Audio-Eval, a public toolkit to reproduce these results. Listen-and-respond demos (speech continuation, voice/emotion conversion, denoising, and speech translation) are available online.

Why this is important
The approach is intentionally simple—no multi-head task tower, no bespoke ASR/TTS objectives at pretraining time—just GPT-style next-token prediction over lossless audio tokens plus text. The key engineering ideas are (i) a tokenizer the LM can actually use without throwing away prosody and speaker identity; (ii) patchification to keep sequence lengths manageable; and (iii) delayed RVQ decoding to preserve quality at generation time. For teams building spoken agents, those design choices translate into few-shot speech-to-speech editing and robust speech continuation with minimal task-specific finetuning.
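The rate arithmetic in the architecture section above (25 Hz frames, 8 RVQ layers, 4 timesteps per patch → 6.25 Hz) is easy to sanity-check with a short sketch. This is illustrative only: the real patch encoder embeds the codes rather than concatenating raw token ids, and the shapes and names below are assumptions.

import torch

def patchify_rvq(codes_25hz: torch.Tensor, patch_size: int = 4) -> torch.Tensor:
    # codes_25hz: [T, L] integer RVQ codes, T frames at 25 Hz, L = 8 codebook layers.
    # Returns [T // patch_size, patch_size * L] patches at 25 / patch_size = 6.25 Hz.
    T, L = codes_25hz.shape
    T_trim = (T // patch_size) * patch_size          # drop a ragged tail for simplicity
    return codes_25hz[:T_trim].reshape(T_trim // patch_size, patch_size * L)

codes = torch.randint(0, 1024, (250, 8))             # ~10 s of audio: 250 frames x 8 layers
patches = patchify_rvq(codes)                        # -> [62, 32], i.e. ~6.25 patches per second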
6 Technical Takeaways:
1. High-Fidelity Tokenization: MiMo-Audio uses a custom RVQ tokenizer operating at 25 Hz with 8 active codebooks, ensuring speech tokens preserve prosody, timbre, and speaker identity while keeping them LM-friendly.
2. Patchified Sequence Modeling: The model reduces sequence length by grouping 4 timesteps into one patch (25 Hz → 6.25 Hz), letting the 7B LLM handle long speech efficiently without discarding detail.
3. Unified Next-Token Objective: Rather than separate heads for ASR, TTS, or dialogue, MiMo-Audio trains under a single next-token prediction loss across interleaved text and audio, simplifying architecture while supporting multi-task generalization.
4. Emergent Few-Shot Abilities: Few-shot behaviors such as speech continuation, voice conversion, emotion transfer, and speech translation emerge once training surpasses a large-scale data threshold (~100M hours, trillions of tokens).
5. Benchmark Leadership: MiMo-Audio sets state-of-the-art scores on SpeechMMLU (S2S 69.1, T2S 71.5) and MMAU (66.0 overall), while minimizing the text-to-speech modality gap to just 3.4 points.
6. Open Ecosystem Release: Xiaomi provides the tokenizer, 7B checkpoints (base and instruct), MiMo-Audio-Eval toolkit, and public demos, enabling researchers and developers to test and extend speech-to-speech intelligence in open-source pipelines.

Summary
MiMo-Audio demonstrates that high-fidelity, RVQ-based "lossless" tokenization combined with patchified next-token pretraining at scale is sufficient to unlock few-shot speech intelligence without task-specific heads. The 7B stack—tokenizer → patch encoder → LLM → patch decoder—bridges the audio/text rate gap (25 → 6.25 Hz) and preserves prosody and speaker identity via delayed multi-layer RVQ decoding. Empirically, the model narrows the text-speech modality gap, generalizes across speech/sound/music benchmarks, and supports in-context S2S editing and continuation.

Check out the Paper, Technical details and GitHub Page. The post Xiaomi Released MiMo-Audio, a 7B Speech Language Model Trained on 100M+ Hours with High-Fidelity Discrete Tokens appeared first on MarkTechPost.

LLM-as-a-Judge: Where Do Its Signals Break, When Do They Hold, and What Should “Evaluation” Mean?

What exactly is being measured when a judge LLM assigns a 1–5 (or pairwise) score?
Most "correctness/faithfulness/completeness" rubrics are project-specific. Without task-grounded definitions, a scalar score can drift from business outcomes (e.g., "useful marketing post" vs. "high completeness"). Surveys of LLM-as-a-judge (LAJ) note that rubric ambiguity and prompt template choices materially shift scores and human correlations.

How stable are judge decisions to prompt position and formatting?
Large controlled studies find position bias: identical candidates receive different preferences depending on order; list-wise and pairwise setups both show measurable drift (e.g., repetition stability, position consistency, preference fairness). Work cataloging verbosity bias shows longer responses are often favored independent of quality; several reports also describe self-preference (judges prefer text closer to their own style/policy).

Do judge scores consistently match human judgments of factuality?
Empirical results are mixed. For summary factuality, one study reported low or inconsistent correlations with humans for strong models (GPT-4, PaLM-2), with only partial signal from GPT-3.5 on certain error types. Conversely, domain-bounded setups (e.g., explanation quality for recommenders) have reported usable agreement with careful prompt design and ensembling across heterogeneous judges. Taken together, correlation seems task- and setup-dependent, not a general guarantee.

How robust are judge LLMs to strategic manipulation?
LLM-as-a-Judge (LAJ) pipelines are attackable. Studies show universal and transferable prompt attacks can inflate assessment scores; defenses (template hardening, sanitization, re-tokenization filters) mitigate but do not eliminate susceptibility. Newer evaluations differentiate content-author vs. system-prompt attacks and document degradation across several families (Gemma, Llama, GPT-4, Claude) under controlled perturbations.

Is pairwise preference safer than absolute scoring?
Preference learning often favors pairwise ranking, yet recent research finds protocol choice itself introduces artifacts: pairwise judges can be more vulnerable to distractors that generator models learn to exploit; absolute (pointwise) scores avoid order bias but suffer scale drift. Reliability therefore hinges on protocol, randomization, and controls rather than a single universally superior scheme.

Could "judging" encourage overconfident model behavior?
Recent reporting on evaluation incentives argues that test-centric scoring can reward guessing and penalize abstention, shaping models toward confident hallucinations; proposals suggest scoring schemes that explicitly value calibrated uncertainty. While this is a training-time concern, it feeds back into how evaluations are designed and interpreted.

Where do generic "judge" scores fall short for production systems?
When an application has deterministic sub-steps (retrieval, routing, ranking), component metrics offer crisp targets and regression tests. Common retrieval metrics include Precision@k, Recall@k, MRR, and nDCG; these are well-defined, auditable, and comparable across runs. Industry guides emphasize separating retrieval and generation and aligning subsystem metrics with end goals, independent of any judge LLM.
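For reference, the component metrics named above are straightforward to compute; the sketch below assumes binary relevance labels and a single query (illustrative only).

import math

def precision_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / max(len(relevant), 1)

def mrr(retrieved, relevant):
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    dcg = sum(1.0 / math.log2(r + 1) for r, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["d3", "d7", "d1", "d9"]       # ranked ids from the retriever
relevant = {"d1", "d3"}                    # binary ground-truth labels
print(precision_at_k(retrieved, relevant, 3), recall_at_k(retrieved, relevant, 3),
      mrr(retrieved, relevant), ndcg_at_k(retrieved, relevant, 4))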
If judge LLMs are fragile, what does "evaluation" look like in the wild?
Public engineering playbooks increasingly describe trace-first, outcome-linked evaluation: capture end-to-end traces (inputs, retrieved chunks, tool calls, prompts, responses) using OpenTelemetry GenAI semantic conventions and attach explicit outcome labels (resolved/unresolved, complaint/no-complaint). This supports longitudinal analysis, controlled experiments, and error clustering—regardless of whether any judge model is used for triage. Tooling ecosystems (e.g., LangSmith and others) document trace/eval wiring and OTel interoperability; these are descriptions of current practice rather than endorsements of a particular vendor.

Are there domains where LLM-as-a-Judge (LAJ) seems comparatively reliable?
Some constrained tasks with tight rubrics and short outputs report better reproducibility, especially when ensembles of judges and human-anchored calibration sets are used. But cross-domain generalization remains limited, and bias/attack vectors persist.

Does LLM-as-a-Judge (LAJ) performance drift with content style, domain, or "polish"?
Beyond length and order, studies and news coverage indicate LLMs sometimes over-simplify or over-generalize scientific claims compared to domain experts—useful context when using LAJ to score technical material or safety-critical text.

Key Technical Observations
- Biases are measurable (position, verbosity, self-preference) and can materially change rankings without content changes. Controls (randomization, de-biasing templates) reduce but do not eliminate effects.
- Adversarial pressure matters: prompt-level attacks can systematically inflate scores; current defenses are partial.
- Human agreement varies by task: factuality and long-form quality show mixed correlations; narrow domains with careful design and ensembling fare better.
- Component metrics remain well-posed for deterministic steps (retrieval/routing), enabling precise regression tracking independent of judge LLMs.
- Trace-based online evaluation described in industry literature (OTel GenAI) supports outcome-linked monitoring and experimentation.

Summary
In conclusion, this article does not argue against the existence of LLM-as-a-Judge but highlights the nuances, limitations, and ongoing debates around its reliability and robustness. The intention is not to dismiss its use but to frame open questions that need further exploration. Companies and research groups actively developing or deploying LLM-as-a-Judge (LAJ) pipelines are invited to share their perspectives, empirical findings, and mitigation strategies—adding valuable depth and balance to the broader conversation on evaluation in the GenAI era.

The post LLM-as-a-Judge: Where Do Its Signals Break, When Do They Hold, and What Should "Evaluation" Mean? appeared first on MarkTechPost.

xAI launches Grok-4-Fast: Unified Reasoning and Non-Reasoning Model with 2M-Token Context and Trained End-to-End with Tool-Use Reinforcement Learning (RL)

xAI introduced Grok-4-Fast, a cost-optimized successor to Grok-4 that merges "reasoning" and "non-reasoning" behaviors into a single set of weights controllable via system prompts. The model targets high-throughput search, coding, and Q&A with a 2M-token context window and native tool-use RL that decides when to browse the web, execute code, or call tools.

Architecture note
Previous Grok releases split long-chain "reasoning" and short "non-reasoning" responses across separate models. Grok-4-Fast's unified weight space reduces end-to-end latency and tokens by steering behavior via system prompts, which is relevant for real-time applications (search, assistive agents, and interactive coding) where switching models penalizes both latency and cost.

Search and agentic use
Grok-4-Fast was trained end-to-end with tool-use reinforcement learning and shows gains on search-centric agent benchmarks: BrowseComp 44.9%, SimpleQA 95.0%, Reka Research 66.0%, plus higher scores on Chinese variants (e.g., BrowseComp-zh 51.2%). xAI also cites private battle-testing on LMArena where grok-4-fast-search (codename "menlo") ranks #1 in the Search Arena with 1163 Elo, and the text variant (codename "tahoe") sits at #8 in the Text Arena, roughly on par with grok-4-0709.

Performance and efficiency deltas
On internal and public benchmarks, Grok-4-Fast posts frontier-class scores while cutting token usage. xAI reports pass@1 results of 92.0% (AIME 2025, no tools), 93.3% (HMMT 2025, no tools), 85.7% (GPQA Diamond), and 80.0% (LiveCodeBench Jan–May), approaching or matching Grok-4 but using ~40% fewer "thinking" tokens on average. The company frames this as "intelligence density," claiming a ~98% reduction in price to reach the same benchmark performance as Grok-4 when the lower token count and new per-token pricing are combined.

Deployment and price
The model is generally available to all users in Grok's Fast and Auto modes across web and mobile; Auto will select Grok-4-Fast for difficult queries to improve latency without losing quality, and—for the first time—free users access xAI's latest model tier. For developers, xAI exposes two SKUs—grok-4-fast-reasoning and grok-4-fast-non-reasoning—both with 2M context. Pricing (xAI API) is $0.20 / 1M input tokens (<128k), $0.40 / 1M input tokens (≥128k), $0.50 / 1M output tokens (<128k), $1.00 / 1M output tokens (≥128k), and $0.05 / 1M cached input tokens.

https://x.ai/news/grok-4-fast

5 Technical Takeaways:
1. Unified model + 2M context. Grok-4-Fast uses a single weight space for "reasoning" and "non-reasoning," prompt-steered, with a 2,000,000-token window across both SKUs.
2. Pricing for scale. API pricing starts at $0.20/M input, $0.50/M output, with cached input at $0.05/M and higher rates only beyond 128K context.
3. Efficiency claims. xAI reports ~40% fewer "thinking" tokens at comparable accuracy vs Grok-4, yielding a ~98% lower price to match Grok-4 performance on frontier benchmarks.
4. Benchmark profile. Reported pass@1: AIME-2025 92.0%, HMMT-2025 93.3%, GPQA-Diamond 85.7%, LiveCodeBench (Jan–May) 80.0%.
5. Agentic/search use. Post-training with tool-use RL; positioned for browsing/search workflows with documented search-agent metrics and live-search billing in docs.
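As a quick unit-economics check, the published per-token rates above can be plugged into a small estimator. This is a simplified sketch: it assumes an entire request is billed at either the <128k or ≥128k tier and ignores any separate live-search or tool billing.

RATES_PER_M = {                          # USD per 1M tokens, from the pricing listed above
    "input_lt_128k": 0.20, "input_ge_128k": 0.40,
    "output_lt_128k": 0.50, "output_ge_128k": 1.00,
    "cached_input": 0.05,
}

def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0, context_ge_128k: bool = False) -> float:
    in_rate = RATES_PER_M["input_ge_128k" if context_ge_128k else "input_lt_128k"]
    out_rate = RATES_PER_M["output_ge_128k" if context_ge_128k else "output_lt_128k"]
    fresh_input = input_tokens - cached_tokens
    return (fresh_input * in_rate
            + cached_tokens * RATES_PER_M["cached_input"]
            + output_tokens * out_rate) / 1_000_000

# e.g. a 200k-token prompt with 50k cached tokens and 4k output, billed in the >=128k tier
print(f"${estimate_cost(200_000, 4_000, cached_tokens=50_000, context_ge_128k=True):.4f}")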
Summary
Grok-4-Fast packages Grok-4-level capability into a single, prompt-steerable model with a 2M-token window, tool-use RL, and pricing tuned for high-throughput search and agent workloads. Early public signals (LMArena #1 in Search, competitive Text placement) align with xAI's claim of similar accuracy using ~40% fewer "thinking" tokens, translating to lower latency and unit cost in production.

Check out the Technical details. The post xAI launches Grok-4-Fast: Unified Reasoning and Non-Reasoning Model with 2M-Token Context and Trained End-to-End with Tool-Use Reinforcement Learning (RL) appeared first on MarkTechPost.

Building AI agents is 5% AI and 100% software engineering

Production-grade agents live or die on data plumbing, controls, and observability—not on model choice. The doc-to-chat pipeline below maps the concrete layers and why they matter.

What is a "doc-to-chat" pipeline?
A doc-to-chat pipeline ingests enterprise documents, standardizes them, enforces governance, indexes embeddings alongside relational features, and serves retrieval + generation behind authenticated APIs with human-in-the-loop (HITL) checkpoints. It's the reference architecture for agentic Q&A, copilots, and workflow automation where answers must respect permissions and be audit-ready. Production implementations are variations of RAG (retrieval-augmented generation) hardened with LLM guardrails, governance, and OpenTelemetry-backed tracing.

How do you integrate cleanly with the existing stack?
Use standard service boundaries (REST/JSON, gRPC) over a storage layer your org already trusts. For tables, Iceberg gives ACID, schema evolution, partition evolution, and snapshots—critical for reproducible retrieval and backfills. For vectors, use a system that coexists with SQL filters: pgvector collocates embeddings with business keys and ACL tags in PostgreSQL; dedicated engines like Milvus handle high-QPS ANN with disaggregated storage/compute. In practice, many teams run both: SQL+pgvector for transactional joins and Milvus for heavy retrieval.
Key properties:
- Iceberg tables: ACID, hidden partitioning, snapshot isolation; vendor support across warehouses.
- pgvector: SQL + vector similarity in one query plan for precise joins and policy enforcement.
- Milvus: layered, horizontally scalable architecture for large-scale similarity search.

How do agents, humans, and workflows coordinate on one "knowledge fabric"?
Production agents require explicit coordination points where humans approve, correct, or escalate. AWS A2I provides managed HITL loops (private workforces, flow definitions) and is a concrete blueprint for gating low-confidence outputs. Frameworks like LangGraph model these human checkpoints inside agent graphs so approvals are first-class steps in the DAG, not ad hoc callbacks. Use them to gate actions like publishing summaries, filing tickets, or committing code. Pattern: LLM → confidence/guardrail checks → HITL gate → side-effects. Persist every artifact (prompt, retrieval set, decision) for auditability and future re-runs.

How is reliability enforced before anything reaches the model?
Treat reliability as layered defenses:
- Language + content guardrails: Pre-validate inputs/outputs for safety and policy. Options span managed (Bedrock Guardrails) and OSS (NeMo Guardrails, Guardrails AI; Llama Guard). Independent comparisons and a position paper catalog the trade-offs.
- PII detection/redaction: Run analyzers on both source docs and model I/O. Microsoft Presidio offers recognizers and masking, with explicit caveats to combine with additional controls.
- Access control and lineage: Enforce row-/column-level ACLs and audit across catalogs (Unity Catalog) so retrieval respects permissions; unify lineage and access policies across workspaces.
- Retrieval quality gates: Evaluate RAG with reference-free metrics (faithfulness, context precision/recall) using Ragas/related tooling; block or down-rank poor contexts.

How do you scale indexing and retrieval under real traffic?
Two axes matter: ingest throughput and query concurrency.
- Ingest: Normalize at the lakehouse edge; write to Iceberg for versioned snapshots, then embed asynchronously. This enables deterministic rebuilds and point-in-time re-indexing.
- Vector serving: Milvus's shared-storage, disaggregated compute architecture supports horizontal scaling with independent failure domains; use HNSW/IVF/Flat hybrids and replica sets to balance recall/latency.
- SQL + vector: Keep business joins server-side (pgvector), e.g., WHERE tenant_id = ? AND acl_tag @> … ORDER BY embedding <-> :q LIMIT k. This avoids N+1 trips and respects policies (a fuller sketch follows just below this list).
- Chunking/embedding strategy: Tune chunk size/overlap and semantic boundaries; bad chunking is the silent killer of recall.
For structured+unstructured fusion, prefer hybrid retrieval (BM25 + ANN + reranker) and store structured features next to vectors to support filters and re-ranking features at query time.
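Below is a minimal sketch of the policy-aware pgvector query pattern from the SQL + vector item above, written in Python with psycopg2. The table and column names (chunks, tenant_id, acl_tags, embedding), the array-containment predicate, and the embed() helper in the usage comment are assumptions for illustration, not a prescribed schema; it presumes the pgvector extension is installed.

import psycopg2

def retrieve_chunks(conn, tenant_id: str, required_tags: list, query_embedding: list, k: int = 8):
    # Keep the ACL filter and the ANN ordering in one server-side query plan.
    sql = """
        SELECT id, chunk_text, embedding <-> %s::vector AS distance
        FROM chunks
        WHERE tenant_id = %s
          AND acl_tags @> %s          -- row must carry all tags the caller is scoped to
        ORDER BY embedding <-> %s::vector
        LIMIT %s
    """
    vec = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(sql, (vec, tenant_id, required_tags, vec, k))
        return cur.fetchall()

# conn = psycopg2.connect("dbname=rag user=app")                      # connection details are assumptions
# rows = retrieve_chunks(conn, "tenant_42", ["hr_docs"], embed("vacation policy"), k=5)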
How do you monitor beyond logs?
You need traces, metrics, and evaluations stitched together:
- Distributed tracing: Emit OpenTelemetry spans across ingestion, retrieval, model calls, and tools; LangSmith natively ingests OTEL traces and interoperates with external APMs (Jaeger, Datadog, Elastic). This gives end-to-end timing, prompts, contexts, and costs per request.
- LLM observability platforms: Compare options (LangSmith, Arize Phoenix, LangFuse, Datadog) by tracing, evals, cost tracking, and enterprise readiness. Independent roundups and matrixes are available.
- Continuous evaluation: Schedule RAG evals (Ragas/DeepEval/MLflow) on canary sets and live traffic replays; track faithfulness and grounding drift over time.
Add schema profiling/mapping on ingestion to keep observability attached to data shape changes (e.g., new templates, table evolution) and to explain retrieval regressions when upstream sources shift.

Example: doc-to-chat reference flow (signals and gates)
- Ingest: connectors → text extraction → normalization → Iceberg write (ACID, snapshots).
- Govern: PII scan (Presidio) → redact/mask → catalog registration with ACL policies.
- Index: embedding jobs → pgvector (policy-aware joins) and Milvus (high-QPS ANN).
- Serve: REST/gRPC → hybrid retrieval → guardrails → LLM → tool use.
- HITL: low-confidence paths route to A2I/LangGraph approval steps.
- Observe: OTEL traces to LangSmith/APM + scheduled RAG evaluations.

Why "5% AI, 100% software engineering" is accurate in practice
Most outages and trust failures in agent systems are not model regressions; they're data quality, permissioning, retrieval decay, or missing telemetry. The controls above—ACID tables, ACL catalogs, PII guardrails, hybrid retrieval, OTEL traces, and human gates—determine whether the same base model is safe, fast, and credibly correct for your users. Invest in these first; swap models later if needed.
References: https://iceberg.apache.org/docs/1.9.0/evolution/ https://iceberg.apache.org/docs/1.5.2/ https://docs.snowflake.com/en/user-guide/tables-iceberg https://docs.dremio.com/current/developer/data-formats/apache-iceberg/ https://github.com/pgvector/pgvector https://www.postgresql.org/about/news/pgvector-070-released-2852/ https://github.com/pgvector/pgvector-go https://github.com/pgvector/pgvector-rust https://github.com/pgvector/pgvector-java https://milvus.io/docs/four_layers.md https://milvus.io/docs/v2.3.x/architecture_overview.md https://milvus.io/docs/v2.2.x/architecture.md https://www.linkedin.com/posts/armand-ruiz_ https://docs.vespa.ai/en/tutorials/hybrid-search.html https://www.elastic.co/what-is/hybrid-search https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch https://docs.cohere.com/reference/rerank https://docs.cohere.com/docs/rerank https://cohere.com/rerank https://opentelemetry.io/docs/concepts/signals/traces/ https://opentelemetry.io/docs/specs/otel/logs/ https://docs.smith.langchain.com/evaluation https://docs.smith.langchain.com/evaluation/concepts https://docs.smith.langchain.com/reference/python/evaluation https://docs.smith.langchain.com/observability https://www.langchain.com/langsmith https://arize.com/docs/phoenix https://github.com/Arize-ai/phoenix https://langfuse.com/docs/observability/get-started https://langfuse.com/docs/observability/overview https://docs.datadoghq.com/opentelemetry/ https://langchain-ai.github.io/langgraph/concepts/human_in_the_loop/ https://langchain-ai.github.io/langgraph/tutorials/get-started/4-human-in-the-loop/ https://docs.langchain.com/oss/python/langgraph/add-human-in-the-loop https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-use-augmented-ai-a2i-human-review-loops.html https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-start-human-loop.html https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-a2i-runtime.html https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-monitor-humanloop-results.html https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html https://aws.amazon.com/bedrock/guardrails/ https://docs.aws.amazon.com/bedrock/latest/APIReference/API_CreateGuardrail.html https://docs.aws.amazon.com/bedrock/latest/userguide/agents-guardrail.html https://docs.nvidia.com/nemo-guardrails/index.html https://developer.nvidia.com/nemo-guardrails https://github.com/NVIDIA/NeMo-Guardrails https://docs.nvidia.com/nemo/guardrails/latest/user-guides/guardrails-library.html https://guardrailsai.com/docs/ https://github.com/guardrails-ai/guardrails https://guardrailsai.com/docs/getting_started/quickstart https://guardrailsai.com/docs/getting_started/guardrails_server https://pypi.org/project/guardrails-ai/ https://github.com/guardrails-ai/guardrails_pii https://huggingface.co/meta-llama/Llama-Guard-4-12B https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/ https://microsoft.github.io/presidio/ https://github.com/microsoft/presidio https://github.com/microsoft/presidio-research https://docs.databricks.com/aws/en/data-governance/unity-catalog/access-control https://docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/ https://docs.databricks.com/aws/en/data-governance/unity-catalog/abac/ https://docs.ragas.io/ https://docs.ragas.io/en/stable/references/evaluate/ https://docs.ragas.io/en/latest/tutorials/rag/ https://python.langchain.com/docs/concepts/text_splitters/ 
https://python.langchain.com/api_reference/text_splitters/index.html https://pypi.org/project/langchain-text-splitters/ https://milvus.io/docs/evaluation_with_deepeval.md https://mlflow.org/docs/latest/genai/eval-monitor/ https://mlflow.org/docs/2.10.1/llms/rag/notebooks/mlflow-e2e-evaluation.html The post Building AI agents is 5% AI and 100% software engineering appeared first on MarkTechPost.

MIT’s LEGO: A Compiler for AI Chips that Auto-Generates Fast, Efficient Spatial Accelerators

Table of contents
- Hardware Generation without Templates
- Input IR: Affine, Relation-Centric Semantics (Deconstruct)
- Front End: FU Graph + Memory Co-Design (Architect)
- Back End: Compile & Optimize to RTL (Compile & Optimize)
- Outcome
- Importance for each segment
- How the "Compiler for AI Chips" Works—Step-by-Step
- Where It Lands in the Ecosystem
- Summary

MIT researchers (Han Lab) introduced LEGO, a compiler-like framework that takes tensor workloads (e.g., GEMM, Conv2D, attention, MTTKRP) and automatically generates synthesizable RTL for spatial accelerators—no handwritten templates. LEGO's front end expresses workloads and dataflows in a relation-centric affine representation, builds FU (functional unit) interconnects and on-chip memory layouts for reuse, and supports fusing multiple spatial dataflows in a single design. The back end lowers to a primitive-level graph and uses linear programming and graph transforms to insert pipeline registers, rewire broadcasts, extract reduction trees, and shrink area and power. Evaluated across foundation models and classic CNNs/Transformers, LEGO's generated hardware shows 3.2× speedup and 2.4× energy efficiency over Gemmini under matched resources.

https://hanlab.mit.edu/projects/lego

Hardware Generation without Templates
Existing flows either: (1) analyze dataflows without generating hardware, or (2) generate RTL from hand-tuned templates with fixed topologies. These approaches restrict the architecture space and struggle with modern workloads that need to switch dataflows dynamically across layers/ops (e.g., conv vs. depthwise vs. attention). LEGO directly targets any dataflow and combinations, generating both architecture and RTL from a high-level description rather than configuring a few numeric parameters in a template.

Input IR: Affine, Relation-Centric Semantics (Deconstruct)
LEGO models tensor programs as loop nests with three index classes: temporal (for-loops), spatial (par-for FUs), and computation (pre-tiling iteration domain). Two affine relations drive the compiler:
- Data mapping f_{I→D}: maps computation indices to tensor indices.
- Dataflow mapping f_{TS→I}: maps temporal/spatial indices to computation indices.
This affine-only representation eliminates modulo/division in the core analysis, making reuse detection and address generation a linear-algebra problem. LEGO also decouples control flow from dataflow (a vector c encodes control signal propagation/delay), enabling shared control across FUs and substantially reducing control logic overhead.

Front End: FU Graph + Memory Co-Design (Architect)
The main objective is to maximize reuse and on-chip bandwidth while minimizing interconnect/mux overhead.
- Interconnection synthesis. LEGO formulates reuse as solving linear systems over the affine relations to discover direct and delay (FIFO) connections between FUs. It then computes minimum-spanning arborescences (Chu-Liu/Edmonds) to keep only necessary edges (cost = FIFO depth). A BFS-based heuristic rewrites direct interconnects when multiple dataflows must co-exist, prioritizing chain reuse and nodes already fed by delay connections to cut muxes and data nodes.
- Banked memory synthesis. Given the set of FUs that must read/write a tensor in the same cycle, LEGO computes bank counts per tensor dimension from the maximum index deltas (optionally dividing by GCD to reduce banks). It then instantiates data-distribution switches to route between banks and FUs, leaving FU-to-FU reuse to the interconnect.
- Dataflow fusion. Interconnects for different spatial dataflows are combined into a single FU-level Architecture Description Graph (ADG); careful planning avoids naïve mux-heavy merges and yields up to ~20% energy gains compared to naïve fusion.

Back End: Compile & Optimize to RTL (Compile & Optimize)
The ADG is lowered to a Detailed Architecture Graph (DAG) of primitives (FIFOs, muxes, adders, address generators). LEGO applies several LP/graph passes:
- Delay matching via LP. A linear program chooses output delays D_v to minimize the inserted pipeline registers, i.e., the sum over edges of (D_v − D_u − L_v) · bitwidth, meeting timing alignment with minimal storage (a toy sketch follows this list).
- Broadcast pin rewiring. A two-stage optimization (virtual cost shaping + MST-based rewiring among destinations) converts expensive broadcasts into forward chains, enabling register sharing and lower latency; a final LP re-balances delays.
- Reduction tree extraction + pin reuse. Sequential adder chains become balanced trees; a 0-1 ILP remaps reducer inputs across dataflows so fewer physical pins are required (mux instead of add). This reduces both logic depth and register count.
These passes focus on the datapath, which dominates resources (e.g., FU-array registers ≈ 40% area, 60% power), and produce ~35% area savings versus naïve generation.
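The delay-matching pass above is a small linear program, and scipy's linprog(method="highs") calls the same HiGHS solver the article mentions. The sketch below is a toy illustration (made-up 4-node graph, latencies, and bitwidths), not LEGO's implementation: minimize the register bits, the sum over edges of (D_v − D_u − L_v)·bitwidth, subject to D_v ≥ D_u + L_v.

import numpy as np
from scipy.optimize import linprog

nodes = ["a", "b", "c", "d"]
# (u, v, L_v, bitwidth): edge latency and datapath width -- toy values, purely illustrative
edges = [("a", "b", 1, 16), ("a", "c", 2, 16), ("b", "d", 1, 32), ("c", "d", 1, 32)]
idx = {n: i for i, n in enumerate(nodes)}

# Objective: sum_e bitwidth_e * (D_v - D_u); the constant -L_v * bitwidth_e term is dropped.
c = np.zeros(len(nodes))
for u, v, L, b in edges:
    c[idx[v]] += b
    c[idx[u]] -= b

# Constraints D_v - D_u >= L_v, rewritten as D_u - D_v <= -L_v for linprog's A_ub form.
A_ub, b_ub = [], []
for u, v, L, b in edges:
    row = np.zeros(len(nodes))
    row[idx[u]], row[idx[v]] = 1.0, -1.0
    A_ub.append(row)
    b_ub.append(-float(L))

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, None)] * len(nodes), method="highs")
print(dict(zip(nodes, res.x)))   # slack is pushed onto the narrower (16-bit) edge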
Outcome
Setup. LEGO is implemented in C++ with HiGHS as the LP solver and emits SpinalHDL → Verilog. Evaluation covers tensor kernels and end-to-end models (AlexNet, MobileNetV2, ResNet-50, EfficientNetV2, BERT, GPT-2, CoAtNet, DDPM, Stable Diffusion, LLaMA-7B). A single LEGO-MNICOC accelerator instance is used across models; a mapper picks per-layer tiling/dataflow. Gemmini is the main baseline under matched resources (256 MACs, 256 KB on-chip buffer, 128-bit bus @ 16 GB/s).
End-to-end speed/efficiency. LEGO achieves 3.2× speedup and 2.4× energy efficiency on average vs. Gemmini. Gains stem from: (i) a fast, accurate performance model guiding mapping; (ii) dynamic spatial dataflow switching enabled by generated interconnects (e.g., depthwise conv layers choose OH–OW–IC–OC). Both designs are bandwidth-bound on GPT-2.
Resource breakdown. Example SoC-style configuration shows FU array and NoC dominate area/power, with PPUs contributing ~2–5%. This supports the decision to aggressively optimize datapaths and control reuse.
Generative models. On a larger 1024-FU configuration, LEGO sustains >80% utilization for DDPM/Stable Diffusion; LLaMA-7B remains bandwidth-limited (expected for low operational intensity).

Importance for each segment
- For researchers: LEGO provides a mathematically grounded path from loop-nest specifications to spatial hardware with provable LP-based optimizations. It abstracts away low-level RTL and exposes meaningful levers (tiling, spatialization, reuse patterns) for systematic exploration.
- For practitioners: It is effectively hardware-as-code. You can target arbitrary dataflows and fuse them in one accelerator, letting a compiler derive interconnects, buffers, and controllers while shrinking mux/FIFO overheads. This improves energy and supports multi-op pipelines without manual template redesign.
- For product leaders: By lowering the barrier to custom silicon, LEGO enables task-tuned, power-efficient edge accelerators (wearables, IoT) that keep pace with fast-moving AI stacks—the silicon adapts to the model, not the other way around. End-to-end results against a state-of-the-art generator (Gemmini) quantify the upside.
How the "Compiler for AI Chips" Works—Step-by-Step
1. Deconstruct (Affine IR). Write the tensor op as loop nests; supply affine f_{I→D} (data mapping), f_{TS→I} (dataflow), and control flow vector c. This specifies what to compute and how it is spatialized, without templates.
2. Architect (Graph Synthesis). Solve reuse equations → FU interconnects (direct/delay) → MST/heuristics for minimal interconnect cost (edge cost = FIFO depth).

Top Computer Vision CV Blogs & News Websites (2025)

Computer vision moved fast in 2025: new multimodal backbones, larger open datasets, and tighter model–systems integration. Practitioners need sources that publish rigorously, link code and benchmarks, and track deployment patterns—not marketing posts. This list prioritizes primary research hubs, lab blogs, and production-oriented engineering outlets with consistent update cadence. Use it to monitor SOTA shifts, grab reproducible code paths, and translate papers into deployable pipelines.

Google Research (AI Blog)
Primary source for advances from Google/DeepMind teams, including vision architectures (e.g., V-MoE) and periodic research year-in-review posts across CV and multimodal. Posts typically include method summaries, figures, and links to papers/code.

Marktechpost
Consistent reporting on new computer-vision models, datasets, and benchmarks with links to papers, code, and demos. Dedicated CV category plus frequent deep-dives (e.g., DINOv3 releases and analysis). Useful for staying on top of weekly research drops without wading through raw feeds.

AI at Meta
High-signal posts with preprints and open-source drops. Recent examples include DINOv3—scaled self-supervised backbones with SOTA across dense prediction tasks—which provide technical detail and artifacts.

NVIDIA Technical Blog
Production-oriented content on VLM-powered analytics, optimized inference, and GPU pipelines. Category feed for Computer Vision includes blueprints, SDK usage, and performance guidance relevant to enterprise deployments.

arXiv cs.CV — raw research firehose
The canonical preprint feed for CV. Use the recent or new views for daily updates; the taxonomy confirms scope (image processing, pattern recognition, scene understanding). Best paired with RSS + custom filters.

CVF Open Access (CVPR/ICCV/ECCV)
Final versions of main-conference papers and workshops, searchable and citable. CVPR 2025 proceedings and workshop menus are already live, making this the authoritative archive post-acceptance.

BAIR Blog (UC Berkeley)
Occasional but deep posts on frontier topics (e.g., extremely large image modeling, robotics-vision crossovers). Good for conceptual clarity directly from authors.

Stanford Blog
Technical explainers and lab roundups (e.g., SAIL at CVPR 2025) with links to papers/talks. Useful to scan emerging directions across perception, generative models, and embodied vision.

Roboflow Blog
High-frequency, implementation-focused posts (labeling, training, deployment, apps, and trend reports). Strong for practitioners who need working pipelines and edge deployments.

Hugging Face Blog
Hands-on guides (VLMs, FiftyOne integrations) and ecosystem notes across Transformers, Diffusers, and timm; good for rapid prototyping and fine-tuning CV/VLM stacks.

PyTorch Blog
Change logs, APIs, and recipes affecting CV training/inference (Transforms V2, multi-weight support, FX feature extraction). Read when upgrading training stacks.

The post Top Computer Vision CV Blogs & News Websites (2025) appeared first on MarkTechPost.

Qwen3-ASR-Toolkit: An Advanced Open Source Python Command-Line Toolkit for Using the Qwen-ASR API Beyond the 3 Minutes/10 MB Limit

Qwen has released Qwen3-ASR-Toolkit, an MIT-licensed Python CLI that programmatically bypasses the Qwen3-ASR-Flash API's 3-minute/10 MB per-request limit by performing VAD-aware chunking, parallel API calls, and automatic resampling/format normalization via FFmpeg. The result is stable, hour-scale transcription pipelines with configurable concurrency, context injection, and clean text post-processing.

Python ≥3.8 is a prerequisite. Install with:

pip install qwen3-asr-toolkit

What the toolkit adds on top of the API
- Long-audio handling. The toolkit slices input using voice activity detection (VAD) at natural pauses, keeping each chunk under the API's hard duration/size caps, then merges outputs in order.
- Parallel throughput. A thread pool dispatches multiple chunks concurrently to DashScope endpoints, improving wall-clock latency for hour-long inputs. You control concurrency via -j/--num-threads.
- Format & rate normalization. Any common audio/video container (MP4/MOV/MKV/MP3/WAV/M4A, etc.) is converted to the API's required mono 16 kHz before submission. Requires FFmpeg installed on PATH.
- Text cleanup & context. The tool includes post-processing to reduce repetitions/hallucinations and supports context injection to bias recognition toward domain terms; the underlying API also exposes language detection and inverse text normalization (ITN) toggles.

The official Qwen3-ASR-Flash API is single-turn and enforces ≤3 min duration and ≤10 MB payloads per call. That is reasonable for interactive requests but awkward for long media. The toolkit operationalizes best practices—VAD-aware segmentation + concurrent calls—so teams can batch large archives or live capture dumps without writing orchestration from scratch.

Quick start

1. Install prerequisites

# System: FFmpeg must be available
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install -y ffmpeg

2. Install the CLI

pip install qwen3-asr-toolkit

3. Configure credentials

# International endpoint key
export DASHSCOPE_API_KEY="sk-…"

4. Run

# Basic: local video, default 4 threads
qwen3-asr -i "/path/to/lecture.mp4"

# Faster: raise parallelism and pass key explicitly (optional if env var set)
qwen3-asr -i "/path/to/podcast.wav" -j 8 -key "sk-…"

# Improve domain accuracy with context
qwen3-asr -i "/path/to/earnings_call.m4a" -c "tickers, CFO name, product names, Q3 revenue guidance"

Arguments you'll actually use: -i/--input-file (file path or http/https URL), -j/--num-threads, -c/--context, -key/--dashscope-api-key, -t/--tmp-dir, -s/--silence. Output is printed and saved as <input_basename>.txt.

Minimal pipeline architecture
1) Load local file or URL → 2) VAD to find silence boundaries → 3) Chunk under API caps → 4) Resample to 16 kHz mono → 5) Parallel submit to DashScope → 6) Aggregate segments in order → 7) Post-process text (dedupe, repetitions) → 8) Emit .txt transcript.
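For intuition, the chunking in steps 2–3 of the pipeline above can be approximated with a crude energy-based VAD. This is a hedged sketch under simplified assumptions (16 kHz mono float input, a fixed dB threshold, hypothetical load_mono_16k helper), not the toolkit's actual implementation.

import numpy as np

def chunk_at_silences(audio: np.ndarray, sr: int = 16000, max_chunk_s: float = 170.0,
                      frame_s: float = 0.03, silence_db: float = -40.0):
    # Crude energy-based "VAD": mark 30 ms frames whose RMS falls below a dB threshold.
    frame = int(sr * frame_s)
    n_frames = len(audio) // frame
    rms = np.array([np.sqrt(np.mean(audio[i * frame:(i + 1) * frame] ** 2) + 1e-12)
                    for i in range(n_frames)])
    silent_samples = np.flatnonzero(20 * np.log10(rms) < silence_db) * frame
    chunks, start, max_len = [], 0, int(max_chunk_s * sr)
    while len(audio) - start > max_len:
        window = silent_samples[(silent_samples > start) & (silent_samples <= start + max_len)]
        cut = int(window[-1]) if len(window) else start + max_len   # fall back to a hard cut
        chunks.append(audio[start:cut])
        start = cut
    chunks.append(audio[start:])
    return chunks   # every chunk stays under the per-request duration cap

# chunks = chunk_at_silences(load_mono_16k("talk.wav"))   # load_mono_16k is a hypothetical loader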
Summary
Qwen3-ASR-Toolkit turns Qwen3-ASR-Flash into a practical long-audio pipeline by combining VAD-based segmentation, FFmpeg normalization (mono/16 kHz), and parallel API dispatch under the 3-minute/10 MB caps. Teams get deterministic chunking, configurable throughput, and optional context/LID/ITN controls without custom orchestration. For production, pin the package version, verify region endpoints/keys, and tune thread count to your network and QPS—then pip install qwen3-asr-toolkit and ship.

Check out the GitHub Page for Codes. The post Qwen3-ASR-Toolkit: An Advanced Open Source Python Command-Line Toolkit for Using the Qwen-ASR API Beyond the 3 Minutes/10 MB Limit appeared first on MarkTechPost.

The Download: the CDC’s vaccine chaos

This is today's edition of The Download, our weekday newsletter that provides a daily dose of what's going on in the world of technology.

A pivotal meeting on vaccine guidance is underway—and former CDC leaders are alarmed
This week has been an eventful one for America's public health agency. Two former leaders of the US Centers for Disease Control and Prevention explained why they suddenly departed in a Senate hearing. They also described how CDC employees are being instructed to turn their backs on scientific evidence. They painted a picture of a health agency in turmoil—and at risk of harming the people it is meant to serve. And, just hours afterwards, a panel of CDC advisers voted to stop recommending the MMRV vaccine for children under four. Read the full story.
—Jessica Hamzelou
This article first appeared in The Checkup, MIT Technology Review's weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.
If you're interested in reading more about US vaccine policy, check out:
+ Read our profile of Jim O'Neill, the deputy health secretary and current acting CDC director.
+ Why US federal health agencies are abandoning mRNA vaccines. Read the full story.
+ Why childhood vaccines are a public health success story. No vaccine is perfect, but these medicines are still saving millions of lives. Read the full story.
+ The FDA plans to limit access to covid vaccines. Here's why that's not all bad.

Meet Sneha Goenka: our 2025 Innovator of the Year
Every year, MIT Technology Review selects one individual whose work we admire to recognize as Innovator of the Year. For 2025, we chose Sneha Goenka, who designed the computations behind the world's fastest whole-genome sequencing method. Thanks to her work, physicians can now sequence a patient's genome and diagnose a genetic condition in less than eight hours—an achievement that could transform medical care.
Register here to join an exclusive subscriber-only Roundtable conversation with Goenka, Leilani Battle, assistant professor at the University of Washington, and our editor in chief Mat Honan at 1pm ET next Tuesday September 23.

The must-reads
I've combed the internet to find you today's most fun/important/scary/fascinating stories about technology.
1 The CDC voted against giving some children a combined vaccine
If accepted, the agency will stop recommending the MMRV vaccine for children under 4. (CNN)
+ Its vote on hepatitis B vaccines for newborns is expected today too. (The Atlantic $)
+ RFK Jr.'s allies are closing ranks around him. (Politico)
2 Russia is using Charlie Kirk's murder to sow division in the US
It's using the momentum to push pro-Kremlin narratives and divide Americans. (WP $)
+ The complicated phenomenon of political violence. (Vox)
+ We don't know what being 'terminally online' means any more. (Wired $)
3 Nvidia will invest $5 billion in Intel
The partnership allows Intel to develop custom CPUs to work with Nvidia's chips. (WSJ $)
+ It's a much-needed financial shot in the arm for Intel. (WP $)
+ It's also great news for Intel's Asian suppliers. (Bloomberg $)
4 Medical AI tools downplay symptoms in women and ethnic minorities
Experts fear that LLM-powered tools could lead to worse health outcomes. (FT $)
+ Artificial intelligence is infiltrating health care. We shouldn't let it make all the decisions. (MIT Technology Review)
5 AI browsers have hit the mainstream
Where's the off switch? (Wired $)
+ AI means the end of internet search as we've known it. (MIT Technology Review)
6 China has entered the global brain interface race
Its ambitious government-backed startups are primed to challenge Neuralink. (Bloomberg $)
+ This patient's Neuralink brain implant gets a boost from generative AI. (MIT Technology Review)
7 What makes humans unique in the age of AI?
Defining the distinctions between us and machines isn't as easy as it used to be. (New Yorker $)
+ How AI can help supercharge creativity. (MIT Technology Review)
8 This ship helps to reconnect Africa's internet
AI needs high speed internet, which needs undersea cables. (Rest of World)
+ What Africa needs to do to become a major AI player. (MIT Technology Review)
9 Hundreds of people queued in Beijing to buy Apple's new iPhone
Desire for Apple products in the country appears to be alive and well. (Reuters)
10 San Francisco's idea of a great night out? A robot cage fight
It's certainly one way to have a good time. (NYT $)

Quote of the day
"Get off the iPad!"
—An irate air traffic controller tells the pilots of a Spirit Airlines flight to pay attention to avoid potentially colliding with Donald Trump's Air Force One aircraft, Ars Technica reports.

One more thing
We used to get excited about technology. What happened?
As a philosopher who studies AI and data, Shannon Vallor's Twitter feed is always filled with the latest tech news. Increasingly, she's realized that the constant stream of information is no longer inspiring joy, but a sense of resignation.
Joy is missing from our lives, and from our technology. Its absence is feeding a growing unease being voiced by many who work in tech or study it. Fixing it depends on understanding how and why the priorities in our tech ecosystem have changed. Read the full story.

We can still have nice things
A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet 'em at me.)
+ Would you go about your daily business with a soft toy on your shoulder? This intrepid reporter gave it a go.
+ How dying dinosaurs shaped the landscapes around us.
+ I can't believe I missed Pythagorean Theorem day earlier this week.
+ Inside the rise in popularity of the no-water yard.
