How to Build an End-to-End Data Science Workflow with Machine Learning, Interpretability, and Gemini AI Assistance?

In this tutorial, we walk through an advanced end-to-end data science workflow that combines traditional machine learning with the power of Gemini. We begin by preparing and modeling the diabetes dataset, then dive into evaluation, feature importance, and partial dependence. Along the way, we bring in Gemini as our AI data scientist to explain results, answer exploratory questions, and highlight risks. In doing so, we build a predictive model while also enhancing our insights and decision-making through natural language interaction. Check out the FULL CODES here.

!pip -qU google-generativeai scikit-learn matplotlib pandas numpy

from getpass import getpass
import os, json, numpy as np, pandas as pd, matplotlib.pyplot as plt

# Prompt for the Gemini API key if it is not already set in the environment.
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Gemini API key (hidden): ")

import google.generativeai as genai
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
LLM = genai.GenerativeModel("gemini-1.5-flash")

def ask_llm(prompt, sys=None):
    # Prepend an optional system instruction to the user prompt.
    p = prompt if sys is None else f"System:\n{sys}\n\nUser:\n{prompt}"
    r = LLM.generate_content(p)
    return (getattr(r, "text", "") or "").strip()

from sklearn.datasets import load_diabetes

raw = load_diabetes(as_frame=True)
df = raw.frame.rename(columns={"target": "disease_progression"})
print("Shape:", df.shape); display(df.head())

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline

X = df.drop(columns=["disease_progression"]); y = df["disease_progression"]
num_cols = X.columns.tolist()
pre = ColumnTransformer(
    [("scale", StandardScaler(), num_cols),
     ("rank", QuantileTransformer(n_quantiles=min(200, len(X)),
                                  output_distribution="normal"), num_cols)],
    remainder="drop", verbose_feature_names_out=False)
model = HistGradientBoostingRegressor(max_depth=3, learning_rate=0.07,
                                      l2_regularization=0.0, max_iter=500,
                                      early_stopping=True, validation_fraction=0.15)
pipe = Pipeline([("prep", pre), ("hgbt", model)])

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.20, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_mse = -cross_val_score(pipe, Xtr, ytr, scoring="neg_mean_squared_error", cv=cv).mean()
cv_rmse = float(cv_mse ** 0.5)
pipe.fit(Xtr, ytr)

We load the diabetes dataset, preprocess the features, and build a robust pipeline using scaling, quantile transformation, and gradient boosting. We split the data, perform cross-validation to estimate RMSE, and then fit the final model to see how well it generalizes. Check out the FULL CODES here.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

pred_tr = pipe.predict(Xtr); pred_te = pipe.predict(Xte)
rmse_tr = mean_squared_error(ytr, pred_tr) ** 0.5
rmse_te = mean_squared_error(yte, pred_te) ** 0.5
mae_te = mean_absolute_error(yte, pred_te)
r2_te = r2_score(yte, pred_te)
print(f"CV RMSE={cv_rmse:.2f} | Train RMSE={rmse_tr:.2f} | Test RMSE={rmse_te:.2f} | "
      f"Test MAE={mae_te:.2f} | R²={r2_te:.3f}")

plt.figure(figsize=(5, 4))
plt.scatter(pred_te, yte - pred_te, s=12)
plt.axhline(0, lw=1); plt.xlabel("Predicted"); plt.ylabel("Residual"); plt.title("Residuals (Test)")
plt.show()

from sklearn.inspection import permutation_importance

imp = permutation_importance(pipe, Xte, yte, scoring="neg_mean_squared_error",
                             n_repeats=10, random_state=0)
imp_df = pd.DataFrame({"feature": X.columns, "importance": imp.importances_mean}) \
           .sort_values("importance", ascending=False)
display(imp_df.head(10))

plt.figure(figsize=(6, 4))
top10 = imp_df.head(10).iloc[::-1]
plt.barh(top10["feature"], top10["importance"])
plt.title("Permutation Importance (Top 10)"); plt.xlabel("Δ(MSE)")
plt.tight_layout(); plt.show()

We evaluate our model by computing train, test, and cross-validation metrics, and visualize residuals to check prediction errors. We then calculate permutation importance to identify which features drive the model most, and display the top contributors in a clear bar plot. Check out the FULL CODES here.

def compute_pdp(pipe, Xref: pd.DataFrame, feat: str, grid=40):
    # Sweep one feature across its 5th-95th percentile range while holding the
    # rest of the data fixed, and average the predictions at each grid value.
    xs = np.linspace(np.percentile(Xref[feat], 5), np.percentile(Xref[feat], 95), grid)
    Xtmp = Xref.copy()
    ys = []
    for v in xs:
        Xtmp[feat] = v
        ys.append(pipe.predict(Xtmp).mean())
    return xs, np.array(ys)

top_feats = imp_df["feature"].head(3).tolist()
plt.figure(figsize=(6, 4))
for f in top_feats:
    xs, ys = compute_pdp(pipe, Xte.copy(), f, grid=40)
    plt.plot(xs, ys, label=f)
plt.legend(); plt.xlabel("Feature value"); plt.ylabel("Predicted target")
plt.title("Manual PDP (Top 3)")
plt.tight_layout(); plt.show()

report_obj = {
    "dataset": {"rows": int(df.shape[0]), "cols": int(df.shape[1] - 1),
                "target": "disease_progression"},
    "metrics": {"cv_rmse": float(cv_rmse), "train_rmse": float(rmse_tr),
                "test_rmse": float(rmse_te), "test_mae": float(mae_te),
                "r2": float(r2_te)},
    "top_importances": imp_df.head(10).to_dict(orient="records"),
}
print(json.dumps(report_obj, indent=2))

sys_msg = ("You are a senior data scientist. Return: (1) ≤120-word executive summary, "
           "(2) key risks/assumptions bullets, (3) 5 prioritized next experiments w/ rationale, "
           "(4) quick-win feature engineering ideas as Python pseudocode.")
summary = ask_llm(f"Dataset + metrics + importances:\n{json.dumps(report_obj)}", sys=sys_msg)
print("\nGemini Executive Brief\n" + "-" * 80 + f"\n{summary}\n")

We compute the manual partial dependence for the top three features and visualize how changing each one affects the predictions. We then assemble a compact JSON report of dataset statistics, metrics, and importances, and ask Gemini to generate an executive brief that includes risks, next experiments, and quick-win feature engineering ideas. Check out the FULL CODES here.
SAFE_GLOBALS = {"pd": pd, "np": np}

def run_generated_pandas(code: str, df_local: pd.DataFrame):
    # Lightweight guard: reject generated code containing obviously unsafe
    # constructs before executing it against a copy of the DataFrame.
    banned = ["__", "import", "open(", "exec(", "eval(", "os.", "sys.",
              "pd.read", "to_csv", "to_pickle", "to_sql"]
    if any(b in code for b in banned):
        raise ValueError("Unsafe code rejected.")
    loc = {"df": df_local.copy()}
    exec(code, SAFE_GLOBALS, loc)
    return {k: v for k, v in loc.items() if k != "df"}

def eda_qa(question: str):
    prompt = f"""You are a Python+Pandas analyst. DataFrame `df` columns: {list(df.columns)}.
Write a SHORT pandas snippet (no comments/prints) that computes the answer to: "{question}".
Use only pd/np/df; assign the final result to a variable named `answer`."""
    code = ask_llm(prompt, sys="Return only code. No prose.")
    try:
        out = run_generated_pandas(code, df)
        return code, out.get("answer", None)
    except Exception as e:
        return code, f"[Execution error: {e}]"

questions = [
    "What is the Pearson correlation between BMI and disease_progression?",
    "Show mean target by tertiles of BMI (low/med/high).",
    "Which single feature correlates most with the target (absolute value)?",
]
for q in questions:
    code, ans = eda_qa(q)
    print("\nQ:", q, "\nCode:\n", code, "\nAnswer:\n", ans)

We build a safe sandbox to execute the pandas code that Gemini generates for exploratory data analysis. We then ask natural language questions about correlations and feature relationships, let Gemini write the pandas snippets, and automatically run them to get direct answers from the dataset. Check out the FULL CODES here.

critique = ask_llm(
    f"""Metrics: {report_obj['metrics']}
Top importances: {report_obj['top_importances']}
Identify risks around leakage, overfitting, calibration, OOD robustness, and fairness (even proxy-only).
Propose quick checks (concise Python sketches)."""
)
print("\nGemini Risk & Robustness Review\n" + "-" * 80 + f"\n{critique}\n")

def what_if(pipe, Xref: pd.DataFrame, feat: str, delta: float = 0.05):
    # Start from the median row, nudge one feature by `delta`, and report the
    # resulting change in the model's prediction.
    x0 = Xref.median(numeric_only=True).to_dict()
    x1, x2 = x0.copy(), x0.copy()
    if feat not in x1:
        return np.nan
    x2[feat] = x1[feat] + delta
    X1 = pd.DataFrame([x1], columns=X.columns)
    X2 = pd.DataFrame([x2], columns=X.columns)
    return float(pipe.predict(X2)[0] - pipe.predict(X1)[0])

for f in top_feats:
    print(f"Estimated Δtarget if {f} increases by +0.05 ≈ {what_if(pipe, Xte, f, 0.05):.2f}")

print("\nDone: Train → Explain → Query with Gemini → Review risks → What-if analysis. "
      "Swap the dataset or tweak model params to extend this notebook.")

We ask Gemini to review our model for risks such as leakage, overfitting, and fairness, and receive quick Python checks as suggestions. We then run a simple what-if analysis on the top features, nudging each by a small amount to estimate its effect on the predicted target, which completes the train, explain, query, review, and what-if workflow.

Frustratingly Easy Data Augmentation for Low-Resource ASR

arXiv:2509.15373v2 Announce Type: replace Abstract: This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). Our techniques first generate novel text, using gloss-based replacement, random replacement, or an LLM-based approach, and then apply Text-to-Speech (TTS) to produce synthetic audio. We apply these methods, which leverage only the original annotated data, to four languages with extremely limited resources (Vatlongos, Nashta, Shinekhen Buryat, and Kakabe). Fine-tuning a pretrained Wav2Vec2-XLSR-53 model on a combination of the original audio and generated synthetic data yields significant performance gains, including a 14.3% absolute WER reduction for Nashta. The methods prove effective across all four low-resource languages and also show utility for high-resource languages like English, demonstrating their broad applicability.
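As an illustration of how lightweight the recipe is, here is a minimal sketch of the random-replacement variant; the replacement rate, function name, and TTS stub are assumptions for exposition, not the authors' implementation.

import random

def random_replacement(sentences, rate=0.2, seed=0):
    # Generate novel text by randomly swapping words with other words drawn
    # from the same annotated corpus (an illustrative 20% replacement rate).
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s.split()})
    out = []
    for s in sentences:
        words = [rng.choice(vocab) if rng.random() < rate else w for w in s.split()]
        out.append(" ".join(words))
    return out

corpus = ["the cat sat on the mat", "a dog barked at the cat"]
novel_text = random_replacement(corpus)
# A TTS system would then synthesize audio for each novel sentence, and the
# synthetic pairs would be mixed with the original data to fine-tune
# Wav2Vec2-XLSR-53.
print(novel_text)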

Human-AI Narrative Synthesis to Foster Shared Understanding in Civic Decision-Making

arXiv:2509.19643v1 Announce Type: cross Abstract: Community engagement processes in representative political contexts, like school districts, generate massive volumes of feedback that overwhelm traditional synthesis methods, creating barriers to shared understanding not only between civic leaders and constituents but also among community members. To address these barriers, we developed StoryBuilder, a human-AI collaborative pipeline that transforms community input into accessible first-person narratives. Using 2,480 community responses from an ongoing school rezoning process, we generated 124 composite stories and deployed them through a mobile-friendly StorySharer interface. Our mixed-methods evaluation combined a four-month field deployment, user studies with 21 community members, and a controlled experiment examining how narrative composition affects participant reactions. Field results demonstrate that narratives helped community members relate across diverse perspectives. In the experiment, experience-grounded narratives generated greater respect and trust than opinion-heavy narratives. We contribute a human-AI narrative synthesis system and insights on its varied acceptance and effectiveness in a real-world civic context.
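A hedged sketch of the narrative synthesis step, under the assumption that clustered community responses are passed to an LLM with a grounding prompt; the prompt wording and the generate callable below are illustrative, not StoryBuilder's actual pipeline.

def compose_narrative(responses, generate):
    # Draft a composite first-person story from one cluster of community
    # responses; `generate` is any text-generation callable (e.g., an LLM API).
    prompt = ("Write a short first-person narrative that faithfully synthesizes "
              "the following community responses, grounded in the experiences "
              "they describe rather than bare opinions:\n- " + "\n- ".join(responses))
    return generate(prompt)

# Example usage with any LLM wrapper:
# story = compose_narrative(["Our commute doubled after the change.",
#                            "My kids lost their walking group."], some_llm_call)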

AutoSpec: An Agentic Framework for Automatically Drafting Patent Specification

arXiv:2509.19640v1 Announce Type: new Abstract: Patents play a critical role in driving technological innovation by granting inventors exclusive rights to their inventions. However, the process of drafting a patent application is often expensive and time-consuming, making it a prime candidate for automation. Despite recent advancements in language models, several challenges hinder the development of robust automated patent drafting systems. First, the information within a patent application is highly confidential, which often prevents the use of closed-source LLMs for automating this task. Second, the process of drafting a patent application is difficult for even the most advanced language models due to its long context, technical writing style, and specialized domain knowledge. To address these challenges, we introduce AutoSpec, a secure, agentic framework for Automatically drafting patent Specification. Our approach decomposes the drafting process into a sequence of manageable subtasks, each solvable by smaller, open-source language models enhanced with custom tools tailored for drafting patent specification. To assess our system, we design a novel evaluation protocol in collaboration with experienced patent attorneys. Our automatic and expert evaluations show that AutoSpec outperforms existing baselines on a patent drafting task.
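The abstract does not enumerate AutoSpec's subtasks, but the decomposition idea can be sketched as a sequential loop in which each section is drafted by a local open-source model with the earlier sections as context; the section list and function signature below are hypothetical.

SECTIONS = ["background", "summary", "detailed description", "claims support"]

def draft_specification(disclosure, generate):
    # `generate` is a call to a local, open-source LLM, keeping the
    # confidential invention disclosure on-premises.
    spec = {}
    for section in SECTIONS:
        context = "\n\n".join(f"{k}:\n{v}" for k, v in spec.items())
        prompt = (f"Invention disclosure:\n{disclosure}\n\n"
                  f"Draft so far:\n{context}\n\n"
                  f"Write the {section} section of the patent specification.")
        spec[section] = generate(prompt)
    return spec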

SINAI at eRisk@CLEF 2025: Transformer-Based and Conversational Strategies for Depression Detection

arXiv:2509.19861v1 Announce Type: new Abstract: This paper describes the participation of the SINAI-UJA team in the eRisk@CLEF 2025 lab. Specifically, we addressed two of the proposed tasks: (i) Task 2: Contextualized Early Detection of Depression, and (ii) Pilot Task: Conversational Depression Detection via LLMs. Our approach for Task 2 combines an extensive preprocessing pipeline with the use of several transformer-based models, such as RoBERTa Base or MentalRoBERTa Large, to capture the contextual and sequential nature of multi-user conversations. For the Pilot Task, we designed a set of conversational strategies to interact with LLM-powered personas, focusing on maximizing information gain within a limited number of dialogue turns. In Task 2, our system ranked 8th out of 12 participating teams based on F1 score. However, a deeper analysis revealed that our models were among the fastest in issuing early predictions, which is a critical factor in real-world deployment scenarios. This highlights the trade-off between early detection and classification accuracy, suggesting potential avenues for optimizing both jointly in future work. In the Pilot Task, we achieved 1st place out of 5 teams, obtaining the best overall performance across all evaluation metrics: DCHR, ADODL and ASHR. Our success in this task demonstrates the effectiveness of structured conversational design when combined with powerful language models, reinforcing the feasibility of deploying LLMs in sensitive mental health assessment contexts.
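For readers who want a starting point for the Task 2 setup, the sketch below shows a generic transformer-based risk scorer using Hugging Face transformers; roberta-base stands in for the team's checkpoints, the classification head is untrained here, and concatenating messages is a simplification of their preprocessing pipeline.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def risk_score(messages):
    # Score a user's messages so far; in an early-detection setting this is
    # re-run as each new message arrives, trading accuracy against speed.
    inputs = tokenizer(" ".join(messages), truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()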

Localized LoRA: A Structured Low-Rank Approximation for Efficient Fine-Tuning

arXiv:2506.00236v2 Announce Type: replace-cross Abstract: Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, offer compact and effective alternatives to full model fine-tuning by introducing low-rank updates to pre-trained weights. However, most existing approaches rely on global low-rank structures, which can overlook spatial patterns spread across the parameter space. In this work, we propose Localized LoRA, a generalized framework that models weight updates as a composition of low-rank matrices applied to structured blocks of the weight matrix. This formulation enables dense, localized updates throughout the parameter space without increasing the total number of trainable parameters. We provide a formal comparison between global, diagonal-local, and fully localized low-rank approximations, and show that our method consistently achieves lower approximation error under matched parameter budgets. Experiments in both synthetic and practical settings demonstrate that Localized LoRA offers a more expressive and adaptable alternative to existing methods, enabling efficient fine-tuning with improved performance.
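A minimal numpy sketch of the core idea, assuming a uniform grid of blocks with a shared rank: the update ΔW is assembled from independent low-rank factors per block, so updates are dense and localized while the trainable parameter count stays at n_blocks² · r · (bh + bw). The block layout, rank, and initialization scale are illustrative choices, not the paper's configuration.

import numpy as np

def localized_lora_update(d_out, d_in, n_blocks=2, rank=2, scale=0.01, rng=None):
    # Compose the weight update from an independent low-rank product A @ B
    # on each structured block of the weight matrix.
    rng = rng or np.random.default_rng(0)
    delta = np.zeros((d_out, d_in))
    bh, bw = d_out // n_blocks, d_in // n_blocks
    for i in range(n_blocks):
        for j in range(n_blocks):
            A = scale * rng.normal(size=(bh, rank))
            B = scale * rng.normal(size=(rank, bw))
            delta[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw] = A @ B
    return delta

W = np.eye(8)                      # stand-in for a pretrained weight matrix
W_adapted = W + localized_lora_update(8, 8)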

Expanding the WMT24++ Benchmark with Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader

arXiv:2509.03148v2 Announce Type: replace Abstract: The Romansh language, spoken in Switzerland, has limited resources for machine translation evaluation. In this paper, we present a benchmark for six varieties of Romansh: Rumantsch Grischun, a supra-regional variety, and five regional varieties: Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader. Our reference translations were created by human translators based on the WMT24++ benchmark, which ensures parallelism with more than 55 other languages. An automatic evaluation of existing MT systems and LLMs shows that translation out of Romansh into German is handled relatively well for all the varieties, but translation into Romansh is still challenging.
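The abstract does not name the automatic metrics used, but a typical evaluation loop over the six varieties could look like the sketch below, assuming sacrebleu for BLEU and chrF scoring; the system outputs and references are placeholders.

import sacrebleu

def evaluate_direction(hypotheses, references):
    # Score one translation direction, e.g. Rumantsch Grischun -> German,
    # against the human reference translations from the benchmark.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    return {"BLEU": bleu.score, "chrF": chrf.score}

hyps = ["Der Hund bellt laut."]          # placeholder system output
refs = ["Der Hund bellt sehr laut."]     # placeholder reference
print(evaluate_direction(hyps, refs))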

When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models

arXiv:2509.18762v1 Announce Type: new Abstract: Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.
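The hybrid training remedy can be sketched as a simple data-mixing step; the 50/50 ratio below is an assumption for illustration, since the abstract does not specify the mixture the authors used.

import random

def hybrid_sft_mix(short_examples, long_examples, long_fraction=0.5, seed=0):
    # Interleave short- and long-context SFT examples so the model balances
    # parametric knowledge (favored by short-context SFT) with contextual
    # knowledge (favored by long-context SFT).
    rng = random.Random(seed)
    n_long = int(len(short_examples) * long_fraction / (1.0 - long_fraction))
    mix = list(short_examples) + rng.sample(list(long_examples),
                                            min(n_long, len(long_examples)))
    rng.shuffle(mix)
    return mix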

Human-Annotated NER Dataset for the Kyrgyz Language

arXiv:2509.19109v1 Announce Type: new Abstract: We introduce KyrgyzNER, the first manually annotated named entity recognition dataset for the Kyrgyz language. Comprising 1,499 news articles from the 24.KG news portal, the dataset contains 10,900 sentences and 39,075 entity mentions across 27 named entity classes. We show our annotation scheme, discuss the challenges encountered in the annotation process, and present the descriptive statistics. We also evaluate several named entity recognition models, including traditional sequence labeling approaches based on conditional random fields and state-of-the-art multilingual transformer-based models fine-tuned on our dataset. While all models show difficulties with rare entity categories, models such as the multilingual RoBERTa variant pretrained on a large corpus across many languages achieve a promising balance between precision and recall. These findings emphasize both the challenges and opportunities of using multilingual pretrained models for processing languages with limited resources. Although the multilingual RoBERTa model performed best, other multilingual models yielded comparable results. This suggests that future work exploring more granular annotation schemes may offer deeper insights for Kyrgyz language processing pipelines evaluation.
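As a starting point for reproducing the transformer baselines, the sketch below sets up a multilingual model for token classification with Hugging Face transformers; xlm-roberta-base stands in for the multilingual RoBERTa variant, and the BIO label count (two tags per class plus O) is an assumption about the tagging scheme.

from transformers import AutoTokenizer, AutoModelForTokenClassification

NUM_CLASSES = 27                       # entity classes reported for KyrgyzNER
NUM_LABELS = 2 * NUM_CLASSES + 1       # assumed BIO scheme: B-/I- per class + O

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name,
                                                        num_labels=NUM_LABELS)
# Fine-tuning on the 10,900 annotated sentences would proceed with a standard
# token-classification training loop (omitted here).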

Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation

arXiv:2509.19231v1 Announce Type: cross Abstract: We present ChiReSSD, a speech reconstruction framework that preserves children speaker’s identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with particular emphasis on pitch and prosody. We evaluate our method on the STAR dataset and report substantial improvements in lexical accuracy and speaker identity preservation. Furthermore, we automatically predict the phonetic content in the original and reconstructed pairs, where the proportion of corrected consonants is comparable to the percentage of correct consonants (PCC), a clinical speech assessment metric. Our experiments show Pearson correlation of 0.63 between automatic and human expert annotations, highlighting the potential to reduce the manual transcription burden. In addition, experiments on the TORGO dataset demonstrate effective generalization for reconstructing adult dysarthric speech. Our results indicate that disentangled, style-based TTS reconstruction can provide identity-preserving speech across diverse clinical populations.
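To make the evaluation concrete, here is a small sketch of a PCC-style score and the correlation check between automatic and expert annotations, using scipy; the one-to-one consonant alignment and the score values are simplifications for illustration.

from scipy.stats import pearsonr

def percent_correct_consonants(target, produced):
    # Simplified PCC: fraction of target consonants realized correctly.
    # Clinical scoring aligns phone sequences; a 1:1 alignment is assumed here.
    correct = sum(t == p for t, p in zip(target, produced))
    return 100.0 * correct / len(target)

auto_scores = [62.0, 75.5, 88.0, 54.0]    # placeholder automatic PCC values
expert_scores = [60.0, 80.0, 85.0, 50.0]  # placeholder expert annotations
r, p = pearsonr(auto_scores, expert_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")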
