Uncategorized Archives - Page 20 of 101

CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning

admin NU / September 26, 2025

arXiv:2509.04027v2 Announce Type: replace-cross Abstract: Reinforcement Learning (RL) has become a pivotal approach for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists, as traditional token-level RL frameworks fail to align with the reasoning-level nature of complex, multi-step thought processes like Chain-of-Thought (CoT). To address this challenge, we introduce CoT-Space, a novel theoretical framework that recasts LLM reasoning from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. This shift in perspective serves as a conceptual bridge, revitalizing foundational principles from classical learning theory to analyze the unique dynamics of LLMs. By analyzing this process from both a noise perspective and a risk perspective, we demonstrate that the convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. Furthermore, extensive experiments provide strong empirical validation for our theoretical findings. Our framework not only provides a coherent explanation for empirical phenomena such as overthinking but also offers a solid theoretical foundation to guide the future development of more effective and generalizable reasoning agents. We open-source our code at https://github.com/ZyGan1999/CoT-Space.

AI, Committee, News, Uncategorized

Generative AI for FFRDCs

admin NU / September 26, 2025

arXiv:2509.21040v1 Announce Type: new Abstract: Federally funded research and development centers (FFRDCs) face text-heavy workloads, from policy documents to scientific and engineering papers, that are slow to analyze manually. We show how large language models can accelerate summarization, classification, extraction, and sense-making with only a few input-output examples. To enable use in sensitive government contexts, we apply OnPrem.LLM, an open-source framework for secure and flexible application of generative AI. Case studies on defense policy documents and scientific corpora, including the National Defense Authorization Act (NDAA) and National Science Foundation (NSF) Awards, demonstrate how this approach enhances oversight and strategic analysis while maintaining auditability and data sovereignty.

AI, Committee, News, Uncategorized

Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

admin NU / September 26, 2025

arXiv:2411.15993v2 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigate the factuality of long-form text generation across various large language models (LLMs), including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral. Our analysis reveals that factuality tends to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims. Furthermore, we explore the effectiveness of different evaluation settings to assess whether LLMs can accurately judge the correctness of their own outputs: Self-Known (the percentage of supported atomic claims, decomposed from LLM outputs, that the corresponding LLMs judge as correct) and Self-Unknown (the percentage of unsupported atomic claims that the corresponding LLMs judge as incorrect). Empirically, we observe a positive correlation between higher Self-Known scores and improved factuality, whereas higher Self-Unknown scores are associated with reduced factuality. Interestingly, the number of unsupported claims can increase even without significant changes in a model’s self-judgment scores (Self-Known and Self-Unknown), likely as a byproduct of long-form text generation. We also derive a mathematical framework linking Self-Known and Self-Unknown scores to factuality: $\textrm{Factuality}=\frac{1-\textrm{Self-Unknown}}{2-\textrm{Self-Unknown}-\textrm{Self-Known}}$, which aligns with our empirical observations. Additional Retrieval-Augmented Generation (RAG) experiments further highlight the limitations of current LLMs in long-form generation and underscore the need for continued research to improve factuality in long-form text.
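
As a quick numerical illustration of the relation stated in the abstract, the sketch below evaluates Factuality = (1 - Self-Unknown) / (2 - Self-Unknown - Self-Known) for a few score pairs. The function name `factuality_from_self_scores` and the sample values are illustrative assumptions, not taken from the paper, and the scores are assumed to be fractions in [0, 1] rather than percentages.

```python
def factuality_from_self_scores(self_known: float, self_unknown: float) -> float:
    """Evaluate the closed form quoted in the abstract.

    Both inputs are assumed to be fractions in [0, 1] (hypothetical convention).
    """
    if not (0.0 <= self_known <= 1.0 and 0.0 <= self_unknown <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    denominator = 2.0 - self_unknown - self_known
    if denominator == 0.0:
        # Both scores equal 1: the expression is 0/0 and therefore undefined.
        raise ValueError("formula is undefined when both scores equal 1")
    return (1.0 - self_unknown) / denominator


if __name__ == "__main__":
    # Made-up score pairs, chosen only to show how the value moves.
    for sk, su in [(0.5, 0.5), (0.9, 0.5), (0.5, 0.9)]:
        value = factuality_from_self_scores(sk, su)
        print(f"Self-Known={sk:.1f}, Self-Unknown={su:.1f} -> Factuality={value:.3f}")
```

With Self-Unknown held fixed, raising Self-Known increases the value, and with Self-Known held fixed, raising Self-Unknown decreases it, which matches the direction of the correlations the abstract reports.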

AI, Committee, News, Uncategorized

Analysis of instruction-based LLMs’ capabilities to score and judge text-input problems in an academic setting

admin NU / September 26, 2025

arXiv:2509.20982v1 Announce Type: new Abstract: Large language models (LLMs) can act as evaluators, a role studied by methods like LLM-as-a-Judge and fine-tuned judging LLMs. In the field of education, LLMs have been studied as assistant tools for students and teachers. Our research investigates LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics. We propose five evaluation systems that have been tested on a custom dataset of 110 answers about computer science from higher education students with three models: JudgeLM, Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B. The evaluation systems include: The JudgeLM evaluation, which uses the model’s single answer prompt to obtain a score; Reference Aided Evaluation, which uses a correct answer as a guide aside from the original context of the question; No Reference Evaluation, which omits the reference answer; Additive Evaluation, which uses atomic criteria; and Adaptive Evaluation, which is an evaluation done with generated criteria fitted to each question. All evaluation methods have been compared with the results of a human evaluator. Results show that the best method to automatically evaluate and score Text-Input Problems using LLMs is Reference Aided Evaluation. With the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214) when compared to human evaluation, Reference Aided Evaluation offers fair scoring as well as insightful and complete evaluations. Other methods such as Additive and Adaptive Evaluation fail to provide good results in concise answers, No Reference Evaluation lacks information needed to correctly assess questions and JudgeLM Evaluations have not provided good results due to the model’s limitations. As a result, we conclude that Artificial Intelligence-driven automatic evaluation systems, aided with proper methodologies, show potential to work as complementary tools to other academic resources.
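
The abstract ranks the evaluation systems by how closely their scores track a human evaluator. A minimal sketch of how such agreement figures could be computed follows, assuming both metrics are taken over per-answer differences between the LLM score and the human score; the abstract does not spell out the exact definitions, and the scores below are invented for illustration.

```python
import math
import statistics


def _paired_differences(llm_scores, human_scores):
    """Return llm - human for each answer; both lists must align one-to-one."""
    if len(llm_scores) != len(human_scores) or not llm_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    return [a - b for a, b in zip(llm_scores, human_scores)]


def median_absolute_deviation(llm_scores, human_scores):
    """Median of |llm - human| (assumed definition, see lead-in)."""
    diffs = _paired_differences(llm_scores, human_scores)
    return statistics.median(abs(d) for d in diffs)


def root_mean_square_deviation(llm_scores, human_scores):
    """Square root of the mean squared difference between the two raters."""
    diffs = _paired_differences(llm_scores, human_scores)
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


if __name__ == "__main__":
    # Hypothetical scores on a 0-10 scale for four answers.
    llm = [7.0, 5.5, 9.0, 3.0]
    human = [8.0, 5.0, 9.0, 5.0]
    print("median absolute deviation:", median_absolute_deviation(llm, human))
    print("root mean square deviation:", round(root_mean_square_deviation(llm, human), 3))
```

The median is insensitive to a few badly scored answers, while the root mean square deviation penalizes large disagreements more heavily, so reporting both gives a fuller picture of agreement.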

AI, Committee, News, Uncategorized

Reinforcement Learning on Pre-Training Data

admin NU / September 26, 2025

arXiv:2509.19249v2 Announce Type: replace Abstract: The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of $3.0$, $5.1$, $8.1$, $6.0$, $6.6$, and $5.3$ on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.

AI, Committee, News, Uncategorized

How to Build an End-to-End Data Science Workflow with Machine Learning, Interpretability, and Gemini AI Assistance?

admin NU / September 25, 2025

In this tutorial, we walk through an advanced end-to-end data science workflow where we combine traditional machine learning with the power of Gemini. We begin by preparing and modeling the diabetes dataset, then we dive into evaluation, feature importance, and partial dependence. Along the way, we bring in Gemini as our AI data scientist to explain results, answer exploratory questions, and highlight risks. By doing this, we build a predictive model while also enhancing our insights and decision-making through natural language interaction.

    !pip install -qU google-generativeai scikit-learn matplotlib pandas numpy

    from getpass import getpass
    import os, json, numpy as np, pandas as pd, matplotlib.pyplot as plt

    # Ask for the key only if it is not already set in the environment.
    if not os.environ.get("GOOGLE_API_KEY"):
        os.environ["GOOGLE_API_KEY"] = getpass("Enter your Gemini API key (hidden): ")

    import google.generativeai as genai
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    LLM = genai.GenerativeModel("gemini-1.5-flash")

    def ask_llm(prompt, sys=None):
        # Prepend an optional system message, then return the plain-text reply.
        p = prompt if sys is None else f"System:\n{sys}\n\nUser:\n{prompt}"
        r = LLM.generate_content(p)
        return (getattr(r, "text", "") or "").strip()

    from sklearn.datasets import load_diabetes
    raw = load_diabetes(as_frame=True)
    df = raw.frame.rename(columns={"target": "disease_progression"})
    print("Shape:", df.shape); display(df.head())

    from sklearn.model_selection import train_test_split, KFold, cross_val_score
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, QuantileTransformer
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import Pipeline

    X = df.drop(columns=["disease_progression"]); y = df["disease_progression"]
    num_cols = X.columns.tolist()

    # Feed the model both standardized and rank-transformed copies of every feature.
    pre = ColumnTransformer(
        [("scale", StandardScaler(), num_cols),
         ("rank", QuantileTransformer(n_quantiles=min(200, len(X)), output_distribution="normal"), num_cols)],
        remainder="drop", verbose_feature_names_out=False)

    model = HistGradientBoostingRegressor(max_depth=3, learning_rate=0.07,
                                          l2_regularization=0.0, max_iter=500,
                                          early_stopping=True, validation_fraction=0.15)
    pipe = Pipeline([("prep", pre), ("hgbt", model)])

    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.20, random_state=42)

    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    cv_mse = -cross_val_score(pipe, Xtr, ytr, scoring="neg_mean_squared_error", cv=cv).mean()
    cv_rmse = float(cv_mse ** 0.5)
    pipe.fit(Xtr, ytr)

We load the diabetes dataset, preprocess the features, and build a pipeline using scaling, quantile transformation, and gradient boosting. We split the data, perform cross-validation to estimate RMSE, and then fit the final model to see how well it generalizes.

    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    pred_tr = pipe.predict(Xtr); pred_te = pipe.predict(Xte)
    rmse_tr = mean_squared_error(ytr, pred_tr) ** 0.5
    rmse_te = mean_squared_error(yte, pred_te) ** 0.5
    mae_te = mean_absolute_error(yte, pred_te)
    r2_te = r2_score(yte, pred_te)
    print(f"CV RMSE={cv_rmse:.2f} | Train RMSE={rmse_tr:.2f} | Test RMSE={rmse_te:.2f} | Test MAE={mae_te:.2f} | R²={r2_te:.3f}")

    # Residuals against predictions on the held-out split.
    plt.figure(figsize=(5,4))
    plt.scatter(pred_te, yte - pred_te, s=12)
    plt.axhline(0, lw=1); plt.xlabel("Predicted"); plt.ylabel("Residual"); plt.title("Residuals (Test)")
    plt.show()

    from sklearn.inspection import permutation_importance
    imp = permutation_importance(pipe, Xte, yte, scoring="neg_mean_squared_error", n_repeats=10, random_state=0)
    imp_df = pd.DataFrame({"feature": X.columns, "importance": imp.importances_mean}).sort_values("importance", ascending=False)
    display(imp_df.head(10))

    plt.figure(figsize=(6,4))
    top10 = imp_df.head(10).iloc[::-1]
    plt.barh(top10["feature"], top10["importance"])
    plt.title("Permutation Importance (Top 10)"); plt.xlabel("Δ(MSE)"); plt.tight_layout(); plt.show()

We evaluate our model by computing train, test, and cross-validation metrics, and visualize residuals to check prediction errors. We then calculate permutation importance to identify which features drive the model most, and display the top contributors in a bar plot.

    def compute_pdp(pipe, Xref: pd.DataFrame, feat: str, grid=40):
        # Sweep one feature across its 5th-95th percentile range and average the predictions.
        xs = np.linspace(np.percentile(Xref[feat], 5), np.percentile(Xref[feat], 95), grid)
        Xtmp = Xref.copy()
        ys = []
        for v in xs:
            Xtmp[feat] = v
            ys.append(pipe.predict(Xtmp).mean())
        return xs, np.array(ys)

    top_feats = imp_df["feature"].head(3).tolist()
    plt.figure(figsize=(6,4))
    for f in top_feats:
        xs, ys = compute_pdp(pipe, Xte.copy(), f, grid=40)
        plt.plot(xs, ys, label=f)
    plt.legend(); plt.xlabel("Feature value"); plt.ylabel("Predicted target"); plt.title("Manual PDP (Top 3)")
    plt.tight_layout(); plt.show()

    report_obj = {
        "dataset": {"rows": int(df.shape[0]), "cols": int(df.shape[1]-1), "target": "disease_progression"},
        "metrics": {"cv_rmse": float(cv_rmse), "train_rmse": float(rmse_tr),
                    "test_rmse": float(rmse_te), "test_mae": float(mae_te), "r2": float(r2_te)},
        "top_importances": imp_df.head(10).to_dict(orient="records")
    }
    print(json.dumps(report_obj, indent=2))

    sys_msg = ("You are a senior data scientist. Return: (1) ≤120-word executive summary, "
               "(2) key risks/assumptions bullets, (3) 5 prioritized next experiments w/ rationale, "
               "(4) quick-win feature engineering ideas as Python pseudocode.")
    summary = ask_llm(f"Dataset + metrics + importances:\n{json.dumps(report_obj)}", sys=sys_msg)
    print("\nGemini Executive Brief\n" + "-"*80 + f"\n{summary}\n")

We compute the manual partial dependence for the top three features and visualize how changing each one affects the predictions. We then assemble a compact JSON report of dataset statistics, metrics, and importances, and ask Gemini to generate an executive brief that includes risks, next experiments, and quick-win feature engineering ideas.

    SAFE_GLOBALS = {"pd": pd, "np": np}

    def run_generated_pandas(code: str, df_local: pd.DataFrame):
        # Reject snippets containing any blocklisted substring before executing them.
        banned = ["__", "import", "open(", "exec(", "eval(", "os.", "sys.", "pd.read", "to_csv", "to_pickle", "to_sql"]
        if any(b in code for b in banned): raise ValueError("Unsafe code rejected.")
        loc = {"df": df_local.copy()}
        exec(code, SAFE_GLOBALS, loc)
        return {k: v for k, v in loc.items() if k not in ("df",)}

    def eda_qa(question: str):
        prompt = f"""You are a Python+Pandas analyst. DataFrame `df` columns:
    {list(df.columns)}. Write a SHORT pandas snippet (no comments/prints) that computes the answer to:
    "{question}". Use only pd/np/df; assign the final result to a variable named `answer`."""
        code = ask_llm(prompt, sys="Return only code. No prose.")
        try:
            out = run_generated_pandas(code, df)
            return code, out.get("answer", None)
        except Exception as e:
            return code, f"[Execution error: {e}]"

    questions = [
        "What is the Pearson correlation between BMI and disease_progression?",
        "Show mean target by tertiles of BMI (low/med/high).",
        "Which single feature correlates most with the target (absolute value)?"
    ]
    for q in questions:
        code, ans = eda_qa(q)
        print("\nQ:", q, "\nCode:\n", code, "\nAnswer:\n", ans)

We build a restricted sandbox to execute pandas code that Gemini generates for exploratory data analysis. We then ask natural language questions about correlations and feature relationships, let Gemini write the pandas snippets, and automatically run them to get direct answers from the dataset.

    critique = ask_llm(
        f"""Metrics: {report_obj['metrics']}
    Top importances: {report_obj['top_importances']}
    Identify risks around leakage, overfitting, calibration, OOD robustness, and fairness (even proxy-only).
    Propose quick checks (concise Python sketches)."""
    )
    print("\nGemini Risk & Robustness Review\n" + "-"*80 + f"\n{critique}\n")

    def what_if(pipe, Xref: pd.DataFrame, feat: str, delta: float = 0.05):
        # Compare predictions at the median row with and without a small bump to one feature.
        x0 = Xref.median(numeric_only=True).to_dict()
        x1, x2 = x0.copy(), x0.copy()
        if feat not in x1: return np.nan
        x2[feat] = x1[feat] + delta
        X1 = pd.DataFrame([x1], columns=X.columns)
        X2 = pd.DataFrame([x2], columns=X.columns)
        return float(pipe.predict(X2)[0] - pipe.predict(X1)[0])

    for f in top_feats:
        print(f"Estimated Δtarget if {f} increases by +0.05 ≈ {what_if(pipe, Xte, f, 0.05):.2f}")

    print("\nDone: Train → Explain → Query with Gemini → Review risks → What-if analysis. "
          "Swap the dataset or tweak model params to extend this notebook.")

We ask Gemini to review our model for risks like leakage, overfitting, and fairness, and get quick Python checks as suggestions. We then

AI, Committee, News, Uncategorized

Frustratingly Easy Data Augmentation for Low-Resource ASR

admin NU / September 25, 2025

arXiv:2509.15373v2 Announce Type: replace Abstract: This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). Our techniques first generate novel text–using gloss-based replacement, random replacement, or an LLM-based approach–and then apply Text-to-Speech (TTS) to produce synthetic audio. We apply these methods, which leverage only the original annotated data, to four languages with extremely limited resources (Vatlongos, Nashta, Shinekhen Buryat, and Kakabe). Fine-tuning a pretrained Wav2Vec2-XLSR-53 model on a combination of the original audio and generated synthetic data yields significant performance gains, including a 14.3% absolute WER reduction for Nashta. The methods prove effective across all four low-resource languages and also show utility for high-resource languages like English, demonstrating their broad applicability.
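
The gain quoted for Nashta is an absolute WER reduction, meaning a difference in percentage points rather than a relative change. The sketch below computes WER with the standard definition (word-level edit distance divided by the number of reference words) and shows both kinds of reduction. The sentences are invented, and the function `word_error_rate` is a hypothetical helper; the abstract does not say which evaluation tooling the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented transcripts standing in for a baseline and an augmented model.
    reference = "the cat sat on the mat"
    wer_before = word_error_rate(reference, "the cat sit on mat")
    wer_after = word_error_rate(reference, "the cat sat on mat")
    absolute = (wer_before - wer_after) * 100
    relative = (wer_before - wer_after) / wer_before * 100
    print(f"WER before={wer_before:.3f}, after={wer_after:.3f}")
    print(f"absolute reduction={absolute:.1f} points, relative reduction={relative:.1f}%")
```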

AI, Committee, News, Uncategorized

SINAI at eRisk@CLEF 2025: Transformer-Based and Conversational Strategies for Depression Detection

admin NU / September 25, 2025

arXiv:2509.19861v1 Announce Type: new Abstract: This paper describes the participation of the SINAI-UJA team in the eRisk@CLEF 2025 lab. Specifically, we addressed two of the proposed tasks: (i) Task 2: Contextualized Early Detection of Depression, and (ii) Pilot Task: Conversational Depression Detection via LLMs. Our approach for Task 2 combines an extensive preprocessing pipeline with the use of several transformer-based models, such as RoBERTa Base or MentalRoBERTA Large, to capture the contextual and sequential nature of multi-user conversations. For the Pilot Task, we designed a set of conversational strategies to interact with LLM-powered personas, focusing on maximizing information gain within a limited number of dialogue turns. In Task 2, our system ranked 8th out of 12 participating teams based on F1 score. However, a deeper analysis revealed that our models were among the fastest in issuing early predictions, which is a critical factor in real-world deployment scenarios. This highlights the trade-off between early detection and classification accuracy, suggesting potential avenues for optimizing both jointly in future work. In the Pilot Task, we achieved 1st place out of 5 teams, obtaining the best overall performance across all evaluation metrics: DCHR, ADODL and ASHR. Our success in this task demonstrates the effectiveness of structured conversational design when combined with powerful language models, reinforcing the feasibility of deploying LLMs in sensitive mental health assessment contexts.

AI, Committee, News, Uncategorized

Localized LoRA: A Structured Low-Rank Approximation for Efficient Fine-Tuning

admin NU / September 25, 2025

arXiv:2506.00236v2 Announce Type: replace-cross Abstract: Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, offer compact and effective alternatives to full model fine-tuning by introducing low-rank updates to pre-trained weights. However, most existing approaches rely on global low rank structures, which can overlook spatial patterns spread across the parameter space. In this work, we propose Localized LoRA, a generalized framework that models weight updates as a composition of low-rank matrices applied to structured blocks of the weight matrix. This formulation enables dense, localized updates throughout the parameter space without increasing the total number of trainable parameters. We provide a formal comparison between global, diagonal-local, and fully localized low-rank approximations, and show that our method consistently achieves lower approximation error under matched parameter budgets. Experiments on both synthetic and practical settings demonstrate that Localized LoRA offers a more expressive and adaptable alternative to existing methods, enabling efficient fine-tuning with improved performance.
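
To make the idea of a matched parameter budget concrete, the sketch below compares a global rank-r approximation of a square matrix with a block-wise approximation that gives each block its own low-rank factors. It is not the paper's method: it uses truncated SVD on a fixed matrix instead of training adapters, all function names are hypothetical, and the test matrix is deliberately built with block-local low-rank structure, so the outcome favors the block-wise scheme by construction.

```python
import numpy as np


def truncated_svd(matrix: np.ndarray, rank: int) -> np.ndarray:
    """Best rank-`rank` approximation in Frobenius norm (Eckart-Young)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def blockwise_low_rank(matrix: np.ndarray, blocks_per_side: int, block_rank: int) -> np.ndarray:
    """Approximate every square block independently with its own truncated SVD."""
    step = matrix.shape[0] // blocks_per_side
    approx = np.zeros_like(matrix)
    for i in range(0, matrix.shape[0], step):
        for j in range(0, matrix.shape[1], step):
            approx[i:i + step, j:j + step] = truncated_svd(matrix[i:i + step, j:j + step], block_rank)
    return approx


def make_block_structured_matrix(d: int, blocks_per_side: int, block_rank: int, seed: int = 0) -> np.ndarray:
    """Synthetic matrix whose blocks are independent random low-rank products."""
    rng = np.random.default_rng(seed)
    step = d // blocks_per_side
    w = np.zeros((d, d))
    for i in range(0, d, step):
        for j in range(0, d, step):
            w[i:i + step, j:j + step] = rng.standard_normal((step, block_rank)) @ rng.standard_normal((block_rank, step))
    return w


def compare_budgets(d: int = 64, blocks_per_side: int = 4, block_rank: int = 2) -> dict:
    """Return parameter counts and Frobenius errors for both schemes."""
    global_rank = blocks_per_side * block_rank  # chosen so the parameter counts match
    w = make_block_structured_matrix(d, blocks_per_side, block_rank)
    return {
        "global_params": 2 * d * global_rank,
        "block_params": blocks_per_side ** 2 * 2 * (d // blocks_per_side) * block_rank,
        "global_err": float(np.linalg.norm(w - truncated_svd(w, global_rank))),
        "block_err": float(np.linalg.norm(w - blockwise_low_rank(w, blocks_per_side, block_rank))),
    }


if __name__ == "__main__":
    for name, value in compare_budgets().items():
        print(f"{name}: {value}")
```

Both schemes store the same number of factor entries here, because a global rank of K times the per-block rank costs 2·d·K·r_b parameters, exactly what K² blocks of size d/K with rank r_b cost.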

AI, Committee, News, Uncategorized

Human-AI Narrative Synthesis to Foster Shared Understanding in Civic Decision-Making

admin NU / September 25, 2025

arXiv:2509.19643v1 Announce Type: cross Abstract: Community engagement processes in representative political contexts, like school districts, generate massive volumes of feedback that overwhelm traditional synthesis methods, creating barriers to shared understanding not only between civic leaders and constituents but also among community members. To address these barriers, we developed StoryBuilder, a human-AI collaborative pipeline that transforms community input into accessible first-person narratives. Using 2,480 community responses from an ongoing school rezoning process, we generated 124 composite stories and deployed them through a mobile-friendly StorySharer interface. Our mixed-methods evaluation combined a four-month field deployment, user studies with 21 community members, and a controlled experiment examining how narrative composition affects participant reactions. Field results demonstrate that narratives helped community members relate across diverse perspectives. In the experiment, experience-grounded narratives generated greater respect and trust than opinion-heavy narratives. We contribute a human-AI narrative synthesis system and insights on its varied acceptance and effectiveness in a real-world civic context.

Uncategorized
