Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

arXiv:2508.15096v1 Announce Type: new Abstract: Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5 times more tokens than FineMath-4+, previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields gains of +4.8 to +12.6 on MATH and +4.6 to +14.3 on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.
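The two key stages named in the abstract, layout-aware rendering with lynx followed by an LLM cleaning pass, can be illustrated with a minimal sketch. This is not the released pipeline: the function names and the cleaning prompt are invented for illustration, and only the lynx invocation reflects a real CLI.

```python
# Illustrative sketch (not the authors' released code): render a page with
# lynx to keep its visual layout, then hand the dump to an LLM cleaning prompt.
import subprocess

def render_with_lynx(html_path: str) -> str:
    """Render HTML to layout-aware plain text; 'lynx -dump' preserves the
    visual structure of equations and code blocks better than tag stripping."""
    out = subprocess.run(
        ["lynx", "-dump", "-nolist", html_path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

# Hypothetical cleaning prompt; the paper's actual prompt is not public here.
CLEANING_PROMPT = (
    "Remove boilerplate, keep equations and code blocks intact, "
    "and rewrite all math in LaTeX notation:\n\n{text}"
)

def build_cleaning_request(rendered: str) -> str:
    # The abstract does not specify which LLM performs the cleaning, so this
    # only constructs the request that such a stage would receive.
    return CLEANING_PROMPT.format(text=rendered)
```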


Customizing Speech Recognition Model with Large Language Model Feedback

arXiv:2506.11091v2 Announce Type: replace Abstract: Automatic speech recognition (ASR) systems have achieved strong performance on general transcription tasks. However, they continue to struggle with recognizing rare named entities and adapting to domain mismatches. In contrast, large language models (LLMs), trained on massive internet-scale datasets, are often more effective across a wide range of domains. In this work, we propose a reinforcement learning based approach for unsupervised domain adaptation, leveraging unlabeled data to enhance transcription quality, particularly for the named entities affected by domain mismatch, through feedback from an LLM. Given contextual information, our framework employs an LLM as the reward model to score the hypotheses from the ASR model. These scores serve as reward signals to fine-tune the ASR model via reinforcement learning. Our method achieves a 21% improvement in entity word error rate over conventional self-training methods.
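A minimal sketch of the reward mechanism described above, under stated assumptions: a generic score_fn stands in for the LLM call, and a REINFORCE-style update with a mean baseline stands in for the paper's (unspecified) RL objective.

```python
# Hedged sketch, not the paper's code: an LLM scores each ASR hypothesis given
# domain context, and the scores act as rewards for fine-tuning the ASR model.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    log_prob: float  # ASR model score, used in the policy-gradient update

def llm_reward(hypothesis: str, context: str, score_fn) -> float:
    """score_fn is a placeholder for an LLM call returning a 0-1 plausibility
    score for the transcript given the contextual information."""
    prompt = (f"Context: {context}\n"
              f"Rate how plausible this transcript is (0 to 1): {hypothesis}")
    return float(score_fn(prompt))

def policy_gradient_loss(hyps, rewards):
    # REINFORCE with a mean-reward baseline: push up hypotheses the LLM
    # scored above average, push down the rest.
    baseline = sum(rewards) / len(rewards)
    return -sum(h.log_prob * (r - baseline) for h, r in zip(hyps, rewards))
```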


DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement

arXiv:2508.14391v1 Announce Type: new Abstract: Relation extraction enables the construction of structured knowledge for many downstream applications. While large language models (LLMs) have shown great promise in this domain, most existing methods concentrate on relation classification, which predicts the semantic relation type between a related entity pair. However, we observe that LLMs often struggle to reliably determine whether a relation exists, especially in cases involving complex sentence structures or intricate semantics, which leads to spurious predictions. Such hallucinations can introduce noisy edges in knowledge graphs, compromising the integrity of structured knowledge and downstream reliability. To address these challenges, we propose DEPTH, a framework that integrates Dependency-aware sEntence simPlification and Two-tiered Hierarchical refinement into the relation extraction pipeline. Given a sentence and its candidate entity pairs, DEPTH operates in two stages: (1) the Grounding module extracts relations for each pair by leveraging their shortest dependency path, distilling the sentence into a minimal yet coherent relational context that reduces syntactic noise while preserving key semantics; (2) the Refinement module aggregates all local predictions and revises them based on a holistic understanding of the sentence, correcting omissions and inconsistencies. We further introduce a causality-driven reward model that mitigates reward hacking by disentangling spurious correlations, enabling robust fine-tuning via reinforcement learning with human feedback. Experiments on six benchmarks demonstrate that DEPTH reduces the average hallucination rate to 7.0% while achieving a 17.2% improvement in average F1 score over state-of-the-art baselines.
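The Grounding module's first step, distilling a sentence to the shortest dependency path between two entities, is a standard technique; here is a hedged sketch using spaCy and networkx. This is not the authors' implementation, and the helper name is invented.

```python
# Illustration of shortest-dependency-path extraction (assumes the spaCy
# model en_core_web_sm is installed; DEPTH's own details are not public here).
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def shortest_dependency_path(sentence: str, e1: str, e2: str):
    doc = nlp(sentence)
    # Build an undirected graph over dependency edges, then find the path.
    g = nx.Graph((tok.i, child.i) for tok in doc for child in tok.children)
    idx = {tok.text: tok.i for tok in doc}
    path = nx.shortest_path(g, source=idx[e1], target=idx[e2])
    return [doc[i].text for i in path]

# The path drops the relative clause, leaving a minimal relational context.
print(shortest_dependency_path(
    "The company, which Smith founded in Boston, makes turbines.",
    "Smith", "turbines"))
```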


EmoTale: An Enacted Speech-emotion Dataset in Danish

arXiv:2508.14548v1 Announce Type: new Abstract: While multiple emotional speech corpora exist for commonly spoken languages, there is a lack of functional datasets for smaller (spoken) languages, such as Danish. To our knowledge, Danish Emotional Speech (DES), published in 1997, is the only other database of Danish emotional speech. We present EmoTale, a corpus comprising Danish and English speech recordings with their associated enacted emotion annotations. We demonstrate the validity of the dataset by investigating and presenting its predictive power using speech emotion recognition (SER) models. We develop SER models for EmoTale and the reference datasets using self-supervised speech model (SSLM) embeddings and the openSMILE feature extractor, and find the embeddings superior to the hand-crafted features. The best model achieves an unweighted average recall (UAR) of 64.1% on the EmoTale corpus using leave-one-speaker-out cross-validation, comparable to the performance on DES.
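The evaluation recipe named above (openSMILE functionals, leave-one-speaker-out cross-validation, UAR) can be sketched as follows; the SVM classifier and file/label layout are assumptions, not details from the abstract.

```python
# Hedged sketch of the hand-crafted-feature baseline: openSMILE eGeMAPS
# functionals per file, leave-one-speaker-out CV, UAR (macro recall) metric.
import opensmile
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.metrics import recall_score

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_features(wav_paths):
    # One fixed-length functional vector per utterance.
    return np.vstack([smile.process_file(p).to_numpy() for p in wav_paths])

def uar_loso(wav_paths, labels, speakers):
    X = extract_features(wav_paths)
    preds = cross_val_predict(
        SVC(), X, labels, groups=speakers, cv=LeaveOneGroupOut()
    )
    # Unweighted average recall = recall averaged equally over emotion classes.
    return recall_score(labels, preds, average="macro")
```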


Cognitive Surgery: The Awakening of Implicit Territorial Awareness in LLMs

arXiv:2508.14408v1 Announce Type: new Abstract: Large language models (LLMs) have been shown to possess a degree of self-recognition capability: the ability to identify whether a given text was generated by themselves. Prior work has demonstrated that this capability is reliably expressed under the Pair Presentation Paradigm (PPP), where the model is presented with two texts and asked to choose which one it authored. However, performance deteriorates sharply under the Individual Presentation Paradigm (IPP), where the model is given a single text to judge authorship. Although this phenomenon has been observed, its underlying causes have not been systematically analyzed. In this paper, we first replicate existing findings to confirm that LLMs struggle to distinguish self- from other-generated text under IPP. We then investigate the reasons for this failure and attribute it to a phenomenon we term Implicit Territorial Awareness (ITA): the model's latent ability to distinguish self- and other-texts in representational space, which remains unexpressed in its output behavior. To awaken the ITA of LLMs, we propose Cognitive Surgery (CoSur), a novel framework comprising four main modules: representation extraction, territory construction, authorship discrimination, and cognitive editing. Experimental results demonstrate that our proposed method improves the performance of three different LLMs in the IPP scenario, achieving average accuracies of 83.25%, 66.19%, and 88.01%, respectively.
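The latent-separability idea behind ITA can be illustrated with a linear probe over hidden-state embeddings. This hedged sketch, with invented helper names, covers only the representation-extraction and authorship-discrimination steps; the paper's territory construction and cognitive editing go further.

```python
# Toy probe: if self- and other-generated texts separate linearly in
# representation space while the model's IPP answers stay near chance,
# the awareness exists but is unexpressed -- the gap CoSur targets.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_authorship_probe(self_embs: np.ndarray, other_embs: np.ndarray):
    """self_embs/other_embs: hidden-state embeddings of self- vs
    other-generated texts, one row per text (extraction method assumed)."""
    X = np.vstack([self_embs, other_embs])
    y = np.array([1] * len(self_embs) + [0] * len(other_embs))
    return LogisticRegression(max_iter=1000).fit(X, y)
```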


The Prompting Brain: Neurocognitive Markers of Expertise in Guiding Large Language Models

arXiv:2508.14869v1 Announce Type: cross Abstract: Prompt engineering has rapidly emerged as a critical skill for effective interaction with large language models (LLMs). However, the cognitive and neural underpinnings of this expertise remain largely unexplored. This paper presents findings from a cross-sectional pilot fMRI study investigating differences in brain functional connectivity and network activity between experts and intermediate prompt engineers. Our results reveal distinct neural signatures associated with higher prompt engineering literacy, including increased functional connectivity in brain regions such as the left middle temporal gyrus and the left frontal pole, as well as altered power-frequency dynamics in key cognitive networks. These findings offer initial insights into the neurobiological basis of prompt engineering proficiency. We discuss the implications of these neurocognitive markers in Natural Language Processing (NLP). Understanding the neural basis of human expertise in interacting with LLMs can inform the design of more intuitive human-AI interfaces, contribute to cognitive models of LLM interaction, and potentially guide the development of AI systems that better align with human cognitive workflows. This interdisciplinary approach aims to bridge the gap between human cognition and machine intelligence, fostering a deeper understanding of how humans learn and adapt to complex AI systems.
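As background, functional connectivity of the kind reported here is commonly estimated as the pairwise correlation between regional time series; a toy numpy illustration, not the study's analysis pipeline:

```python
# Minimal functional-connectivity estimate on synthetic data: correlate
# each pair of regional BOLD time series to get a region-by-region matrix.
import numpy as np

rng = np.random.default_rng(0)
timeseries = rng.standard_normal((200, 4))  # 200 volumes x 4 regions (toy)
connectivity = np.corrcoef(timeseries.T)    # 4 x 4 correlation matrix
print(connectivity.round(2))
```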


EEG-MedRAG: Enhancing EEG-based Clinical Decision-Making via Hierarchical Hypergraph Retrieval-Augmented Generation

arXiv:2508.13735v1 Announce Type: new Abstract: With the widespread application of electroencephalography (EEG) in neuroscience and clinical practice, efficiently retrieving and semantically interpreting large-scale, multi-source, heterogeneous EEG data has become a pressing challenge. We propose EEG-MedRAG, a three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph, enabling joint semantic-temporal retrieval and causal-chain diagnostic generation. Concurrently, we introduce the first cross-disease, cross-role EEG clinical QA benchmark, spanning seven disorders and five authentic clinical perspectives. This benchmark allows systematic evaluation of disease-agnostic generalization and role-aware contextual understanding. Experiments show that EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval, highlighting its strong potential for real-world clinical decision support. Our data and code are publicly available at https://github.com/yi9206413-boop/EEG-MedRAG.
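A toy illustration of what "traversable n-ary relational hypergraph" means in practice: hyperedges that join any number of nodes, retrieved jointly. The class and example relation are invented; the paper's three-layer structure and semantic-temporal retrieval are far richer.

```python
# Minimal n-ary hypergraph with joint retrieval over query nodes.
from collections import defaultdict

class Hypergraph:
    def __init__(self):
        self.edges = []                    # each hyperedge joins n nodes
        self.incident = defaultdict(set)   # node -> ids of touching hyperedges

    def add_relation(self, relation: str, *nodes: str):
        eid = len(self.edges)
        self.edges.append((relation, nodes))
        for n in nodes:
            self.incident[n].add(eid)

    def retrieve(self, *query_nodes: str):
        """Return hyperedges touching all query nodes at once."""
        hits = set.intersection(*(self.incident[n] for n in query_nodes))
        return [self.edges[e] for e in hits]

g = Hypergraph()
g.add_relation("observed_in", "spike-wave discharge", "EEG", "absence epilepsy")
print(g.retrieve("EEG", "absence epilepsy"))
```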


A Coding Implementation to Build a Complete Self-Hosted LLM Workflow with Ollama, REST API, and Gradio Chat Interface

In this tutorial, we implement a fully functional Ollama environment inside Google Colab to replicate a self-hosted LLM workflow. We begin by installing Ollama directly on the Colab VM using the official Linux installer and then launch the Ollama server in the background to expose the HTTP API on localhost:11434. After verifying the service, we pull lightweight models such as qwen2.5:0.5b-instruct or llama3.2:1b, which balance resource constraints with usability in a CPU-only environment. To interact with these models programmatically, we use the /api/chat endpoint via Python's requests module with streaming enabled, which allows token-level output to be captured incrementally. Finally, we layer a Gradio-based UI on top of this client so we can issue prompts, maintain multi-turn history, configure parameters like temperature and context size, and view results in real time.

```python
import os, sys, subprocess, time, json, requests, textwrap
from pathlib import Path

def sh(cmd, check=True):
    """Run a shell command, streaming its output."""
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT, text=True)
    for line in p.stdout:
        print(line, end="")
    p.wait()
    if check and p.returncode != 0:
        raise RuntimeError(f"Command failed: {cmd}")

# Install Ollama via the official Linux installer if it is not present yet.
if not Path("/usr/local/bin/ollama").exists() and not Path("/usr/bin/ollama").exists():
    print("Installing Ollama ...")
    sh("curl -fsSL https://ollama.com/install.sh | sh")
else:
    print("Ollama already installed.")

# Make sure Gradio is available for the chat UI later on.
try:
    import gradio
except Exception:
    print("Installing Gradio ...")
    sh("pip -q install gradio==4.44.0")
```

We first check whether Ollama is already installed on the system and, if not, install it using the official script. At the same time, we ensure Gradio is available by importing it or installing the required version when it is missing. This prepares our Colab environment to run the chat interface smoothly.

```python
def start_ollama():
    """Start the Ollama server in the background and wait until it responds."""
    try:
        requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
        print("Ollama server already running.")
        return None
    except Exception:
        pass
    print("Starting Ollama server ...")
    proc = subprocess.Popen(["ollama", "serve"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    # Poll the health endpoint for up to 60 seconds.
    for _ in range(60):
        time.sleep(1)
        try:
            r = requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
            if r.ok:
                print("Ollama server is up.")
                break
        except Exception:
            pass
    else:
        raise RuntimeError("Ollama did not start in time.")
    return proc

server_proc = start_ollama()
```

We start the Ollama server in the background and keep polling its health endpoint until it responds successfully. This ensures the server is running and ready before we send any API requests.

```python
MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:0.5b-instruct")
print(f"Using model: {MODEL}")

# Check whether the model is already present on the server.
try:
    tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json()
    have = any(m.get("name") == MODEL for m in tags.get("models", []))
except Exception:
    have = False

if not have:
    print(f"Pulling model {MODEL} (first time only) ...")
    sh(f"ollama pull {MODEL}")
```

We define the default model, check whether it is already available on the Ollama server, and pull it automatically if it is not. This ensures the chosen model is ready before we start any chat sessions.
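Before wiring up streaming, a quick way to sanity-check the REST API is a single blocking request. This snippet is an optional addition, not part of the original listing: with "stream": False, Ollama's /api/chat returns one JSON object whose message.content field holds the full reply.

```python
# Optional sanity check: a one-shot, non-streaming chat request.
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": MODEL,  # MODEL was defined in the previous cell
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```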
```python
OLLAMA_URL = "http://127.0.0.1:11434/api/chat"

def ollama_chat_stream(messages, model=MODEL, temperature=0.2, num_ctx=None):
    """Yield streaming text chunks from the Ollama /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": True,
        "options": {"temperature": float(temperature)},
    }
    if num_ctx:
        payload["options"]["num_ctx"] = int(num_ctx)
    with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
        r.raise_for_status()
        # Each line of the response is a JSON object with a partial message.
        for line in r.iter_lines():
            if not line:
                continue
            data = json.loads(line.decode("utf-8"))
            if "message" in data and "content" in data["message"]:
                yield data["message"]["content"]
            if data.get("done"):
                break
```

We create a streaming client for the Ollama /api/chat endpoint: we send messages as a JSON payload and yield tokens as they arrive. This lets us handle responses incrementally, so we see the model's output in real time instead of waiting for the full completion.

```python
def smoke_test():
    print("\nSmoke test:")
    sys_msg = {"role": "system", "content": "You are concise. Use short bullets."}
    user_msg = {"role": "user", "content": "Give 3 quick tips to sleep better."}
    out = []
    for chunk in ollama_chat_stream([sys_msg, user_msg], temperature=0.3):
        print(chunk, end="")
        out.append(chunk)
    print("\nDone.\n")

try:
    smoke_test()
except Exception as e:
    print("Smoke test skipped:", e)
```

We run a quick smoke test by sending a simple prompt through the streaming client to confirm that the model responds correctly. This verifies that Ollama is installed, the server is running, and the chosen model works before we build the full chat UI.

```python
import gradio as gr

SYSTEM_PROMPT = "You are a helpful, crisp assistant. Prefer bullets when helpful."

def chat_fn(message, history, temperature, num_ctx):
    # Convert Gradio's [user, assistant] history into Ollama's message format.
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
    for u, a in history:
        if u:
            msgs.append({"role": "user", "content": u})
        if a:
            msgs.append({"role": "assistant", "content": a})
    msgs.append({"role": "user", "content": message})
    acc = ""
    try:
        for part in ollama_chat_stream(msgs, model=MODEL,
                                       temperature=temperature,
                                       num_ctx=num_ctx or None):
            acc += part
            yield acc
    except Exception as e:
        yield f"Error: {e}"

with gr.Blocks(title="Ollama Chat (Colab)", fill_height=True) as demo:
    gr.Markdown("# Ollama Chat (Colab)\nSmall local-ish LLM via Ollama + Gradio.\n")
    with gr.Row():
        temp = gr.Slider(0.0, 1.0, value=0.3, step=0.1, label="Temperature")
        num_ctx = gr.Slider(512, 8192, value=2048, step=256,
                            label="Context Tokens (num_ctx)")
    chat = gr.Chatbot(height=460)
    msg = gr.Textbox(label="Your message", placeholder="Ask anything...", lines=3)
    clear = gr.Button("Clear")

    def user_send(m, h):
        m = (m or "").strip()
        if not m:
            return "", h
        return "", h + [[m, None]]

    def bot_reply(h, temperature, num_ctx):
        u = h[-1][0]
        stream = chat_fn(u, h[:-1], temperature, int(num_ctx))
        acc = ""
        for partial in stream:
            acc = partial
            h[-1][1] = acc
            yield h

    msg.submit(user_send, [msg, chat], [msg, chat]).then(
        bot_reply, [chat, temp, num_ctx], [chat]
    )
    clear.click(lambda: None, None, chat)

print("Launching Gradio ...")
demo.launch(share=True)
```

We integrate Gradio to build an interactive chat UI on top of the Ollama server: user input and conversation history are converted into the message format Ollama expects, and the model's reply is streamed back into the chat window.
The sliders let us adjust parameters like temperature and context length, while the chat box and clear button provide a simple, real-time interface for testing different prompts. In conclusion, we establish a reproducible pipeline for running Ollama in Colab: installation, server startup, model management, API access, and user-interface integration. The system uses Ollama's REST API as the core interaction layer, providing both command-line and Python streaming access, while Gradio handles session persistence and chat rendering. This approach preserves the self-hosted design described in the original guide while adapting it to Colab's CPU-only, resource-constrained environment.
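Finally, a small optional addition not in the original walkthrough: if you want to shut the background server down when you are done, the Popen handle returned by start_ollama() can be terminated.

```python
# Optional cleanup (assumption: you are finished chatting; skip otherwise).
# start_ollama() returned None if the server was already running externally.
if server_proc is not None:
    server_proc.terminate()
    server_proc.wait(timeout=10)
    print("Ollama server stopped.")
```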


Ask Good Questions for Large Language Models

arXiv:2508.14025v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have significantly improved the performance of dialog systems, yet current approaches often fail to provide accurate topical guidance because they cannot discern user confusion about related concepts. To address this, we introduce the Ask-Good-Question (AGQ) framework, which features an improved Concept-Enhanced Item Response Theory (CEIRT) model to better identify users' knowledge levels. Our contributions include applying the CEIRT model together with LLMs to directly generate guiding questions based on the inspiring text, greatly improving information retrieval efficiency during the question-and-answer process. In comparisons with other baseline methods, our approach significantly enhances users' information retrieval experience.
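For background, CEIRT builds on Item Response Theory. In the standard two-parameter logistic (2PL) model, the probability that a user answers an item correctly depends on the user's ability and the item's discrimination and difficulty, which is what lets such a framework estimate knowledge levels from responses. A worked example of the classic model (the CEIRT extension itself is not specified in the abstract):

```python
# Classic 2PL IRT: P(correct) = 1 / (1 + exp(-a * (theta - b)))
# theta = user ability, a = item discrimination, b = item difficulty.
import math

def irt_2pl(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A below-average user facing a hard, discriminative item:
print(irt_2pl(theta=-0.5, a=1.5, b=1.0))  # ~0.10, so ask an easier question
```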


ReviewGraph: A Knowledge Graph Embedding Based Framework for Review Rating Prediction with Sentiment Features

arXiv:2508.13953v1 Announce Type: new Abstract: In the hospitality industry, understanding the factors that drive customer review ratings is critical for improving guest satisfaction and business performance. This work proposes ReviewGraph, a novel framework for Review Rating Prediction (RRP) that transforms textual customer reviews into knowledge graphs by extracting (subject, predicate, object) triples and associating sentiment scores. Using graph embeddings (Node2Vec) and sentiment features, the framework predicts review rating scores with machine learning classifiers. We compare ReviewGraph with traditional NLP baselines (such as Bag of Words, TF-IDF, and Word2Vec) and large language models (LLMs), evaluating them on the HotelRec dataset. Compared with the state-of-the-art literature, our proposed model performs similarly to the best-performing model but at lower computational cost (without ensembling). While ReviewGraph achieves predictive performance comparable to LLMs and outperforms baselines on agreement-based metrics such as Cohen's Kappa, it offers additional advantages in interpretability, visual exploration, and potential integration into Retrieval-Augmented Generation (RAG) systems. This work highlights the potential of graph-based representations for enhancing review analytics and lays the groundwork for future research integrating advanced graph neural networks and fine-tuned LLM-based extraction methods. We will release the ReviewGraph outputs and platform as open source on our GitHub page: https://github.com/aaronlifenghan/ReviewGraph
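A hedged sketch of the recipe the abstract describes: build a graph from (subject, predicate, object) triples, embed nodes with Node2Vec, append a sentiment score, and train a classifier. The libraries (networkx, the node2vec package, scikit-learn) and the toy features are assumptions; the paper's exact feature construction may differ.

```python
# Toy ReviewGraph-style pipeline: triples -> graph -> Node2Vec -> classifier.
import numpy as np
import networkx as nx
from node2vec import Node2Vec
from sklearn.ensemble import RandomForestClassifier

triples = [("room", "was", "clean"), ("staff", "ignored", "request")]
g = nx.Graph()
for s, p, o in triples:
    g.add_edge(s, o, predicate=p)  # predicate kept as an edge attribute

n2v = Node2Vec(g, dimensions=16, walk_length=10, num_walks=20, quiet=True)
emb = n2v.fit(window=5, min_count=1).wv

def review_vector(review_triples, sentiment: float):
    # Mean of entity embeddings, concatenated with a sentiment score.
    vecs = [emb[n] for s, _, o in review_triples for n in (s, o)]
    return np.concatenate([np.mean(vecs, axis=0), [sentiment]])

X = np.vstack([review_vector([triples[0]], 0.8),
               review_vector([triples[1]], -0.6)])
clf = RandomForestClassifier().fit(X, [5, 2])  # star ratings as labels
```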
