US investigators are using AI to detect child abuse images made by AI

Generative AI has enabled the production of child sexual abuse images to skyrocket. Now the leading investigator of child exploitation in the US is experimenting with using AI to distinguish AI-generated images from material depicting real victims, according to a new government filing.

The Department of Homeland Security’s Cyber Crimes Center, which investigates child exploitation across international borders, has awarded a $150,000 contract to San Francisco–based Hive AI for its software, which can identify whether a piece of content was AI-generated.

The filing, posted on September 19, is heavily redacted, and Hive cofounder and CEO Kevin Guo told MIT Technology Review that he could not discuss the details of the contract, but he confirmed that it involves use of the company’s AI detection algorithms for child sexual abuse material (CSAM).

The filing quotes data from the National Center for Missing and Exploited Children, which reported a 1,325% increase in incidents involving generative AI in 2024. “The sheer volume of digital content circulating online necessitates the use of automated tools to process and analyze data efficiently,” the filing reads.

The first priority of child exploitation investigators is to find and stop any abuse currently happening, but the flood of AI-generated CSAM has made it difficult for investigators to know whether images depict a real victim currently at risk. A tool that could successfully flag real victims would be a massive help as they try to prioritize cases. Identifying AI-generated images “ensures that investigative resources are focused on cases involving real victims, maximizing the program’s impact and safeguarding vulnerable individuals,” the filing reads.

Hive AI offers AI tools that create videos and images, as well as a range of content moderation tools that can flag violence, spam, and sexual material and even identify celebrities. In December, MIT Technology Review reported that the company was selling its deepfake-detection technology to the US military.

For detecting CSAM, Hive offers a tool created with Thorn, a child safety nonprofit, which companies can integrate into their platforms. This tool uses a “hashing” system, which assigns unique IDs to content known by investigators to be CSAM and blocks that material from being uploaded. This tool, and others like it, have become a standard line of defense for tech companies.

But these tools simply identify a piece of content as CSAM; they don’t detect whether it was generated by AI. Hive has created a separate tool that determines whether images in general were AI-generated. Though it is not trained specifically to work on CSAM, according to Guo, it doesn’t need to be. “There’s some underlying combination of pixels in this image that we can identify” as AI-generated, he says. “It can be generalizable.”

This tool, Guo says, is what the Cyber Crimes Center will be using to evaluate CSAM. He adds that Hive benchmarks its detection tools for each specific use case its customers have in mind.

The National Center for Missing and Exploited Children, which participates in efforts to stop the spread of CSAM, did not respond to requests for comment on the effectiveness of such detection models in time for publication.

In its filing, the government justifies awarding the contract to Hive without a competitive bidding process. Though parts of this justification are redacted, it primarily references two points also found in a Hive presentation slide deck. One involves a 2024 study from the University of Chicago, which found that Hive’s AI detection tool outranked four other detectors in identifying AI-generated art. The other is its contract with the Pentagon for identifying deepfakes. The trial will last three months.
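The hash-matching defense described in the article can be reduced to a very small pattern: compute a fingerprint for each upload and check it against a database of identifiers for known, verified material. The sketch below is a generic illustration of that pattern, not Hive's or Thorn's actual system; the blocklist entry is a made-up placeholder, and real deployments use perceptual hashes that tolerate re-encoding rather than a plain cryptographic hash.

import hashlib

# Hypothetical blocklist of identifiers for known, verified material.
KNOWN_HASHES = {
    "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
}

def fingerprint(image_bytes: bytes) -> str:
    # Return a stable identifier for an uploaded file.
    return hashlib.sha256(image_bytes).hexdigest()

def should_block(image_bytes: bytes) -> bool:
    # Block the upload if its fingerprint matches known material.
    return fingerprint(image_bytes) in KNOWN_HASHES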

The Download: shoplifter-chasing drones, and Trump’s TikTok deal

Shoplifters in the US could soon be chased down by drones

The news: Flock Safety, whose drones were once reserved for police departments, is now offering them for private-sector security, the company has announced. Potential customers include businesses trying to curb shoplifting.

How it works: If the security team at a store sees shoplifters leave, they can activate a camera-equipped drone. “The drone follows the people. The people get in a car. You click a button and you track the vehicle with the drone, and the drone just follows the car,” says Keith Kauffman, a former police chief who now directs Flock’s drone program. The video feed of that drone might go to the company’s security team, but it could also be automatically transmitted directly to police departments.

The response: Flock’s expansion into private-sector security is “a logical step, but in the wrong direction,” says Rebecca Williams, senior strategist for the ACLU’s privacy and data governance unit. Read the full story.

—James O’Donnell

Read more of our stories about the latest in drone tech:
+ Why you’re about to see a lot more drones over America’s skies.
+ Meet Serhii “Flash” Beskrestnov, the radio-obsessed civilian shaping Ukraine’s drone defense. His work could help to determine the future of Ukraine, and wars far beyond it.
+ We examined four big trends that show what’s next for drone technology.
+ The defense tech startup Epirus has developed a cutting-edge, cost-efficient drone zapper that’s sparking the interest of the US military. Read our story about how it could change the future of war.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 TikTok US is being valued at $14 billion by Trump’s deal
That’s shockingly low for a fast-growing social media company. (FT $)
+ The deal is basically just Trump giving TikTok to his friends. (Vox $)
+ Here’s what the sale means for you. (WP $)

2 Microsoft has stopped letting Israel use its technology for surveillance
The system was used to collect millions of Palestinian civilians’ phone calls every day. (The Guardian)

3 There are more robots working in China than the rest of the world combined
It’s a trend that’ll further cement its status as the world’s leading manufacturer. (NYT $)
+ China’s EV giants are betting big on humanoid robots. (MIT Technology Review)

4 The inside story of what happened when DOGE came to town
If anything, this is even more grim and chaotic than you might imagine. (Wired $)

5 Instagram’s teen safety features are flawed
Researchers tested 47 of these features, and found that only 8 were fully effective. (Reuters $)
+ There’s growing concern among lawmakers about the risks of kids forming bonds with chatbots. (MIT Technology Review)

6 Brazil’s judicial system is adopting AI with gusto
The trouble is that rather than reducing the amount of work for judges and lawyers, AI seems to be increasing it. (Rest of World)
+ Meet the early-adopter judges using AI. (MIT Technology Review)

7 Amazon is refunding $1.5 billion to Prime subscribers
The deal with the FTC lets it avoid a trial over claims it tricked consumers into signing up. (WP $)

8 These women are in love with AI
Like it or not, these sorts of romances are becoming more common. (Slate $)
+ It’s surprisingly easy to stumble into a relationship with an AI chatbot. (MIT Technology Review)

9 Scientists are improving how we measure nothing
Researchers are developing a vacuum-measurement tool that could unlock exciting new possibilities for science. (IEEE Spectrum)
+ This quantum radar could image buried objects. (MIT Technology Review)

10 Why does everything online feel so icky?
Most of us will go to extreme lengths to avoid awkwardness IRL. On social media, it’s another matter entirely… (Vox $)
+ China’s government has had enough of everyone being negative on its internet. (BBC)

Quote of the day

“AI machines—in quite a literal sense—appear to be saving the US economy right now. In the absence of tech-related spending, the US would be close to, or in, recession this year.”

—George Saravelos, global head of FX research at Deutsche Bank, warns that the AI boom is unsustainable in a note to clients, Fortune reports.

One more thing

The two people shaping the future of OpenAI’s research
—Will Douglas Heaven

For the past couple of years, OpenAI has felt like a one-man brand. With his showbiz style and fundraising glitz, CEO Sam Altman overshadows all other big names on the firm’s roster. But Altman is not the one building the technology on which its reputation rests.

That responsibility falls to OpenAI’s twin heads of research—chief research officer Mark Chen and chief scientist Jakub Pachocki. Between them, they share the role of making sure OpenAI stays one step ahead of powerhouse rivals like Google.

I recently sat down with Chen and Pachocki for an exclusive conversation which covered everything from how they manage the inherent tension between research and product, to what they really mean when they talk about AGI, and what happened to OpenAI’s superalignment team. Read the full story.

We can still have nice things

+ Wherever you are, this website helps you discover the most interesting bars nearby.
+ Take a tour of Norway’s lighthouses.
+ Inside London’s flourishing underground rave scene.
+ Meaningful changes rarely occur instantly. Here’s how they do happen.

Meet Qwen3Guard: The Qwen3-based Multilingual Safety Guardrail Models Built for Global, Real-Time AI Safety

Can safety keep up with real-time LLMs? Alibaba’s Qwen team thinks so, and it has shipped Qwen3Guard, a multilingual guardrail model family built to moderate prompts and streaming responses in real time. Qwen3Guard comes in two variants: Qwen3Guard-Gen (a generative classifier that reads the full prompt/response context) and Qwen3Guard-Stream (a token-level classifier that moderates text as it is generated). Both are released in 0.6B, 4B, and 8B parameter sizes and target global deployments with coverage for 119 languages and dialects. The models are open-sourced, with weights on Hugging Face and code at https://github.com/QwenLM/Qwen3Guard.

What’s new?

Streaming moderation head: Stream attaches two lightweight classification heads to the final transformer layer—one monitors the user prompt, the other scores each generated token in real time as Safe / Controversial / Unsafe. This enables policy enforcement while a reply is being produced, instead of post-hoc filtering.

Three-tier risk semantics: Beyond binary safe/unsafe labels, a Controversial tier supports adjustable strictness (binary tightening/loosening) across datasets and policies—useful when “borderline” content must be routed or escalated, not simply dropped.

Structured outputs for Gen: The generative variant emits a standard header—Safety: …, Categories: …, Refusal: …—that is trivial to parse for pipelines and RL reward functions (a minimal parsing sketch follows at the end of this piece). Categories include Violent, Non-violent Illegal Acts, Sexual Content, PII, Suicide & Self-Harm, Unethical Acts, Politically Sensitive Topics, Copyright Violation, and Jailbreak.

Benchmarks and safety RL

The Qwen research team reports state-of-the-art average F1 across English, Chinese, and multilingual safety benchmarks for both prompt and response classification, with results plotted for Qwen3Guard-Gen versus prior open models. While the team emphasizes relative gains rather than a single composite metric, the consistent lead across settings is the key point.

For training downstream assistants, the team tests safety-driven RL using Qwen3Guard-Gen as a reward signal. A Guard-only reward maximizes safety but spikes refusals and slightly dents the arena-hard-v2 win rate; a Hybrid reward (penalizing over-refusals, blending quality signals) lifts the WildGuard-measured safety score from ~60 to >97 without degrading reasoning tasks, and even nudges arena-hard-v2 upward. This is a practical recipe for teams whose prior reward shaping collapsed into “refuse-everything” behavior.

Where it fits?

Most open guard models only classify completed outputs. Qwen3Guard’s dual heads and token-time scoring align with production agents that stream responses, enabling early intervention (block, redact, or redirect) at lower latency cost than re-decoding. The Controversial tier also maps cleanly onto enterprise policy knobs (e.g., treat “Controversial” as unsafe in regulated contexts, but allow it with review in consumer chat).

Summary

Qwen3Guard is a practical guardrail stack: open weights (0.6B/4B/8B), two operating modes (full-context Gen, token-time Stream), tri-level risk labeling, and multilingual coverage (119 languages). For production teams, it is a credible baseline for replacing post-hoc filters with real-time moderation and for aligning assistants with safety rewards while monitoring refusal rates. Check out the Paper, the GitHub page, and the full collection on Hugging Face.
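Below is a minimal sketch of how a pipeline might parse the Gen variant’s structured header into fields. The exact output text of Qwen3Guard is not reproduced in this post, so the sample string here follows the “Safety: …, Categories: …, Refusal: …” pattern described above and should be treated as an assumption, not the model’s verbatim format.

import re

# Hypothetical example of a Qwen3Guard-Gen style header, following the
# Safety / Categories / Refusal pattern described in the post.
sample = "Safety: Unsafe\nCategories: Violent, Jailbreak\nRefusal: Yes"

def parse_guard_header(text: str) -> dict:
    # Pull the Safety / Categories / Refusal lines into a small dict.
    fields = {}
    for key in ("Safety", "Categories", "Refusal"):
        match = re.search(rf"^{key}:\s*(.+)$", text, flags=re.MULTILINE)
        fields[key.lower()] = match.group(1).strip() if match else None
    # Split the category list so downstream policy code can route on it.
    if fields.get("categories"):
        fields["categories"] = [c.strip() for c in fields["categories"].split(",")]
    return fields

print(parse_guard_header(sample))
# {'safety': 'Unsafe', 'categories': ['Violent', 'Jailbreak'], 'refusal': 'Yes'}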

CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning

arXiv:2509.04027v2 Announce Type: replace-cross Abstract: Reinforcement Learning (RL) has become a pivotal approach for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists, as traditional token-level RL frameworks fail to align with the reasoning-level nature of complex, multi-step thought processes like Chain-of-Thought (CoT). To address this challenge, we introduce CoT-Space, a novel theoretical framework that recasts LLM reasoning from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. This shift in perspective serves as a conceptual bridge, revitalizing foundational principles from classical learning theory to analyze the unique dynamics of LLMs. By analyzing this process from both a noise perspective and a risk perspective, we demonstrate that the convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. Furthermore, extensive experiments provide strong empirical validation for our theoretical findings. Our framework not only provides a coherent explanation for empirical phenomena such as overthinking but also offers a solid theoretical foundation to guide the future development of more effective and generalizable reasoning agents. We open-source our code at https://github.com/ZyGan1999/CoT-Space.

Generative AI for FFRDCs

arXiv:2509.21040v1 Announce Type: new Abstract: Federally funded research and development centers (FFRDCs) face text-heavy workloads, from policy documents to scientific and engineering papers, that are slow to analyze manually. We show how large language models can accelerate summarization, classification, extraction, and sense-making with only a few input-output examples. To enable use in sensitive government contexts, we apply OnPrem.LLM, an open-source framework for secure and flexible application of generative AI. Case studies on defense policy documents and scientific corpora, including the National Defense Authorization Act (NDAA) and National Science Foundation (NSF) Awards, demonstrate how this approach enhances oversight and strategic analysis while maintaining auditability and data sovereignty.
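As a concrete illustration of the few-shot pattern the abstract describes, the sketch below builds a prompt from a handful of input-output examples and asks a local LLM to classify a new document. It is a generic sketch, not OnPrem.LLM’s actual API; the example passages, labels, and the generate callable are hypothetical stand-ins for whatever local runtime is available.

# Generic few-shot document classification, in the spirit of the workflow
# described in the abstract. `generate` is a hypothetical stand-in for a
# local LLM interface (OnPrem.LLM or another on-premises runtime).

EXAMPLES = [
    ("Directs the Secretary of Defense to report on hypersonic programs.", "policy directive"),
    ("We propose a transformer-based model for radar signal classification.", "research abstract"),
]

def build_prompt(examples, new_text: str) -> str:
    lines = ["Classify each passage as 'policy directive' or 'research abstract'.\n"]
    for text, label in examples:
        lines.append(f"Passage: {text}\nLabel: {label}\n")
    lines.append(f"Passage: {new_text}\nLabel:")
    return "\n".join(lines)

def classify(new_text: str, generate) -> str:
    # `generate` takes a prompt string and returns the model's completion.
    return generate(build_prompt(EXAMPLES, new_text)).strip()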

Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

arXiv:2411.15993v2 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigate the factuality of long-form text generation across various large language models (LLMs), including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral. Our analysis reveals that factuality tends to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims. Furthermore, we explore the effectiveness of different evaluation settings to assess whether LLMs can accurately judge the correctness of their own outputs: Self-Known (the percentage of supported atomic claims, decomposed from LLM outputs, that the corresponding LLMs judge as correct) and Self-Unknown (the percentage of unsupported atomic claims that the corresponding LLMs judge as incorrect). Empirically, we observe a positive correlation between higher Self-Known scores and improved factuality, whereas higher Self-Unknown scores are associated with reduced factuality. Interestingly, the number of unsupported claims can increase even without significant changes in a model’s self-judgment scores (Self-Known and Self-Unknown), likely as a byproduct of long-form text generation. We also derive a mathematical framework linking Self-Known and Self-Unknown scores to factuality: $\textrm{Factuality}=\frac{1-\textrm{Self-Unknown}}{2-\textrm{Self-Unknown}-\textrm{Self-Known}}$, which aligns with our empirical observations. Additional Retrieval-Augmented Generation (RAG) experiments further highlight the limitations of current LLMs in long-form generation and underscore the need for continued research to improve factuality in long-form text.
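A quick numeric illustration of the formula above, using made-up Self-Known and Self-Unknown values (fractions in [0, 1]) to show how the two scores trade off against factuality:

def factuality(self_known: float, self_unknown: float) -> float:
    # Factuality = (1 - Self-Unknown) / (2 - Self-Unknown - Self-Known),
    # as given in the abstract.
    return (1 - self_unknown) / (2 - self_unknown - self_known)

# Illustrative (made-up) values: a higher Self-Known score raises the
# implied factuality, a higher Self-Unknown score lowers it.
print(factuality(self_known=0.8, self_unknown=0.1))  # ≈ 0.818
print(factuality(self_known=0.5, self_unknown=0.3))  # ≈ 0.583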

Analysis of instruction-based LLMs’ capabilities to score and judge text-input problems in an academic setting

arXiv:2509.20982v1 Announce Type: new Abstract: Large language models (LLMs) can act as evaluators, a role studied by methods like LLM-as-a-Judge and fine-tuned judging LLMs. In the field of education, LLMs have been studied as assistant tools for students and teachers. Our research investigates LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics. We propose five evaluation systems that have been tested on a custom dataset of 110 answers about computer science from higher education students with three models: JudgeLM, Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B. The evaluation systems include: the JudgeLM evaluation, which uses the model’s single answer prompt to obtain a score; Reference Aided Evaluation, which uses a correct answer as a guide aside from the original context of the question; No Reference Evaluation, which omits the reference answer; Additive Evaluation, which uses atomic criteria; and Adaptive Evaluation, which is an evaluation done with generated criteria fitted to each question. All evaluation methods have been compared with the results of a human evaluator. Results show that the best method to automatically evaluate and score Text-Input Problems using LLMs is Reference Aided Evaluation. With the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214) when compared to human evaluation, Reference Aided Evaluation offers fair scoring as well as insightful and complete evaluations. Other methods such as Additive and Adaptive Evaluation fail to provide good results on concise answers, No Reference Evaluation lacks the information needed to correctly assess questions, and JudgeLM Evaluations have not provided good results due to the model’s limitations. As a result, we conclude that AI-driven automatic evaluation systems, aided with proper methodologies, show potential to work as complementary tools to other academic resources.
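The two agreement metrics reported above can be reproduced with a short sketch, assuming parallel lists of LLM-assigned and human-assigned scores for the same answers (the arrays below are hypothetical placeholders, not data from the paper):

import numpy as np

# Hypothetical scores; in the paper these would be the LLM's and the human
# evaluator's marks for the same set of text-input answers.
llm_scores = np.array([7.0, 5.5, 9.0, 4.0, 8.0])
human_scores = np.array([8.0, 5.0, 9.5, 3.0, 7.5])

errors = llm_scores - human_scores
median_abs_dev = np.median(np.abs(errors))   # median absolute deviation vs. human
rmsd = np.sqrt(np.mean(errors ** 2))         # root mean square deviation vs. human

print(f"MAD={median_abs_dev:.3f}, RMSD={rmsd:.3f}")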

Reinforcement Learning on Pre-Training Data

arXiv:2509.19249v2 Announce Type: replace Abstract: The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of $3.0$, $5.1$, $8.1$, $6.0$, $6.6$, and $5.3$ on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.
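A minimal sketch of the next-segment reasoning objective described above: reward the policy for predicting the next segment of a pre-training document given the preceding context. The scoring function here is a crude token-overlap proxy chosen for illustration and is an assumption, not the paper’s actual reward model.

def next_segment_reward(predicted: str, reference: str) -> float:
    # Overlap-based proxy for how well the sampled continuation matches the
    # true next segment taken from the pre-training document.
    pred_tokens, ref_tokens = set(predicted.split()), set(reference.split())
    if not ref_tokens:
        return 0.0
    return len(pred_tokens & ref_tokens) / len(ref_tokens)

# Usage: split a document into (context, next_segment) pairs, sample a
# continuation from the policy for each context, and feed the reward to RL.
context = "Gradient descent updates parameters by"
reference = "taking a step in the direction of the negative gradient."
sampled = "moving parameters along the negative gradient direction."
print(next_segment_reward(sampled, reference))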

How to Build an End-to-End Data Science Workflow with Machine Learning, Interpretability, and Gemini AI Assistance?

In this tutorial, we walk through an advanced end-to-end data science workflow where we combine traditional machine learning with the power of Gemini. We begin by preparing and modeling the diabetes dataset, then we dive into evaluation, feature importance, and partial dependence. Along the way, we bring in Gemini as our AI data scientist to explain results, answer exploratory questions, and highlight risks. By doing this, we build a predictive model while also enhancing our insights and decision-making through natural language interaction. Check out the FULL CODES here.

!pip -qU google-generativeai scikit-learn matplotlib pandas numpy

from getpass import getpass
import os, json, numpy as np, pandas as pd, matplotlib.pyplot as plt

if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Gemini API key (hidden): ")

import google.generativeai as genai
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
LLM = genai.GenerativeModel("gemini-1.5-flash")

def ask_llm(prompt, sys=None):
    # Prepend an optional system-style instruction before the user prompt.
    p = prompt if sys is None else f"System:\n{sys}\n\nUser:\n{prompt}"
    r = LLM.generate_content(p)
    return (getattr(r, "text", "") or "").strip()

from sklearn.datasets import load_diabetes
raw = load_diabetes(as_frame=True)
df = raw.frame.rename(columns={"target": "disease_progression"})
print("Shape:", df.shape); display(df.head())

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline

X = df.drop(columns=["disease_progression"]); y = df["disease_progression"]
num_cols = X.columns.tolist()

pre = ColumnTransformer(
    [("scale", StandardScaler(), num_cols),
     ("rank", QuantileTransformer(n_quantiles=min(200, len(X)), output_distribution="normal"), num_cols)],
    remainder="drop", verbose_feature_names_out=False)

model = HistGradientBoostingRegressor(max_depth=3, learning_rate=0.07, l2_regularization=0.0,
                                      max_iter=500, early_stopping=True, validation_fraction=0.15)
pipe = Pipeline([("prep", pre), ("hgbt", model)])

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.20, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_mse = -cross_val_score(pipe, Xtr, ytr, scoring="neg_mean_squared_error", cv=cv).mean()
cv_rmse = float(cv_mse ** 0.5)
pipe.fit(Xtr, ytr)

We load the diabetes dataset, preprocess the features, and build a robust pipeline using scaling, quantile transformation, and gradient boosting. We split the data, perform cross-validation to estimate RMSE, and then fit the final model to see how well it generalizes. Check out the FULL CODES here.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score  # needed for the metrics below

pred_tr = pipe.predict(Xtr); pred_te = pipe.predict(Xte)
rmse_tr = mean_squared_error(ytr, pred_tr) ** 0.5
rmse_te = mean_squared_error(yte, pred_te) ** 0.5
mae_te = mean_absolute_error(yte, pred_te)
r2_te = r2_score(yte, pred_te)
print(f"CV RMSE={cv_rmse:.2f} | Train RMSE={rmse_tr:.2f} | Test RMSE={rmse_te:.2f} | Test MAE={mae_te:.2f} | R²={r2_te:.3f}")

plt.figure(figsize=(5,4))
plt.scatter(pred_te, yte - pred_te, s=12)
plt.axhline(0, lw=1); plt.xlabel("Predicted"); plt.ylabel("Residual"); plt.title("Residuals (Test)")
plt.show()

from sklearn.inspection import permutation_importance
imp = permutation_importance(pipe, Xte, yte, scoring="neg_mean_squared_error", n_repeats=10, random_state=0)
imp_df = pd.DataFrame({"feature": X.columns, "importance": imp.importances_mean}).sort_values("importance", ascending=False)
display(imp_df.head(10))

plt.figure(figsize=(6,4))
top10 = imp_df.head(10).iloc[::-1]
plt.barh(top10["feature"], top10["importance"])
plt.title("Permutation Importance (Top 10)"); plt.xlabel("Δ(MSE)"); plt.tight_layout(); plt.show()

We evaluate our model by computing train, test, and cross-validation metrics, and visualize residuals to check prediction errors. We then calculate permutation importance to identify which features drive the model most, and display the top contributors in a clear bar plot. Check out the FULL CODES here.

def compute_pdp(pipe, Xref: pd.DataFrame, feat: str, grid=40):
    # Manual partial dependence: sweep one feature over its 5th-95th percentile
    # range while holding the other columns fixed, and average the predictions.
    xs = np.linspace(np.percentile(Xref[feat], 5), np.percentile(Xref[feat], 95), grid)
    Xtmp = Xref.copy()
    ys = []
    for v in xs:
        Xtmp[feat] = v
        ys.append(pipe.predict(Xtmp).mean())
    return xs, np.array(ys)

top_feats = imp_df["feature"].head(3).tolist()
plt.figure(figsize=(6,4))
for f in top_feats:
    xs, ys = compute_pdp(pipe, Xte.copy(), f, grid=40)
    plt.plot(xs, ys, label=f)
plt.legend(); plt.xlabel("Feature value"); plt.ylabel("Predicted target"); plt.title("Manual PDP (Top 3)")
plt.tight_layout(); plt.show()

report_obj = {
    "dataset": {"rows": int(df.shape[0]), "cols": int(df.shape[1] - 1), "target": "disease_progression"},
    "metrics": {"cv_rmse": float(cv_rmse), "train_rmse": float(rmse_tr), "test_rmse": float(rmse_te),
                "test_mae": float(mae_te), "r2": float(r2_te)},
    "top_importances": imp_df.head(10).to_dict(orient="records")
}
print(json.dumps(report_obj, indent=2))

sys_msg = ("You are a senior data scientist. Return: (1) ≤120-word executive summary, "
           "(2) key risks/assumptions bullets, (3) 5 prioritized next experiments w/ rationale, "
           "(4) quick-win feature engineering ideas as Python pseudocode.")
summary = ask_llm(f"Dataset + metrics + importances:\n{json.dumps(report_obj)}", sys=sys_msg)
print("\nGemini Executive Brief\n" + "-" * 80 + f"\n{summary}\n")

We compute manual partial dependence for the top three features and visualize how changing each one affects the predictions. We then assemble a compact JSON report of dataset statistics, metrics, and importances, and ask Gemini to generate an executive brief that includes risks, next experiments, and quick-win feature engineering ideas. Check out the FULL CODES here.
SAFE_GLOBALS = {"pd": pd, "np": np}

def run_generated_pandas(code: str, df_local: pd.DataFrame):
    # Small sandbox: reject obviously unsafe constructs, then exec the snippet
    # against a copy of the DataFrame and return any new variables it defines.
    banned = ["__", "import", "open(", "exec(", "eval(", "os.", "sys.", "pd.read", "to_csv", "to_pickle", "to_sql"]
    if any(b in code for b in banned):
        raise ValueError("Unsafe code rejected.")
    loc = {"df": df_local.copy()}
    exec(code, SAFE_GLOBALS, loc)
    return {k: v for k, v in loc.items() if k not in ("df",)}

def eda_qa(question: str):
    prompt = f"""You are a Python+Pandas analyst. DataFrame `df` columns: {list(df.columns)}.
Write a SHORT pandas snippet (no comments/prints) that computes the answer to: "{question}".
Use only pd/np/df; assign the final result to a variable named `answer`."""
    code = ask_llm(prompt, sys="Return only code. No prose.")
    try:
        out = run_generated_pandas(code, df)
        return code, out.get("answer", None)
    except Exception as e:
        return code, f"[Execution error: {e}]"

questions = [
    "What is the Pearson correlation between BMI and disease_progression?",
    "Show mean target by tertiles of BMI (low/med/high).",
    "Which single feature correlates most with the target (absolute value)?"
]
for q in questions:
    code, ans = eda_qa(q)
    print("\nQ:", q, "\nCode:\n", code, "\nAnswer:\n", ans)

We build a safe sandbox to execute pandas code that Gemini generates for exploratory data analysis. We then ask natural-language questions about correlations and feature relationships, let Gemini write the pandas snippets, and automatically run them to get direct answers from the dataset. Check out the FULL CODES here.

critique = ask_llm(
    f"""Metrics: {report_obj['metrics']}
Top importances: {report_obj['top_importances']}
Identify risks around leakage, overfitting, calibration, OOD robustness, and fairness (even proxy-only).
Propose quick checks (concise Python sketches)."""
)
print("\nGemini Risk & Robustness Review\n" + "-" * 80 + f"\n{critique}\n")

def what_if(pipe, Xref: pd.DataFrame, feat: str, delta: float = 0.05):
    # Nudge one feature of the median row by `delta` and report the change in prediction.
    x0 = Xref.median(numeric_only=True).to_dict()
    x1, x2 = x0.copy(), x0.copy()
    if feat not in x1:
        return np.nan
    x2[feat] = x1[feat] + delta
    X1 = pd.DataFrame([x1], columns=X.columns)
    X2 = pd.DataFrame([x2], columns=X.columns)
    return float(pipe.predict(X2)[0] - pipe.predict(X1)[0])

for f in top_feats:
    print(f"Estimated Δtarget if {f} increases by +0.05 ≈ {what_if(pipe, Xte, f, 0.05):.2f}")

print("\nDone: Train → Explain → Query with Gemini → Review risks → What-if analysis. "
      "Swap the dataset or tweak model params to extend this notebook.")

We ask Gemini to review our model for risks like leakage, overfitting, and fairness, and get quick Python checks as suggestions. We then run simple what-if probes on the top features to see how a small change in each one shifts the predicted target, closing the loop from training to explanation to review.

Frustratingly Easy Data Augmentation for Low-Resource ASR

arXiv:2509.15373v2 Announce Type: replace Abstract: This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). Our techniques first generate novel text–using gloss-based replacement, random replacement, or an LLM-based approach–and then apply Text-to-Speech (TTS) to produce synthetic audio. We apply these methods, which leverage only the original annotated data, to four languages with extremely limited resources (Vatlongos, Nashta, Shinekhen Buryat, and Kakabe). Fine-tuning a pretrained Wav2Vec2-XLSR-53 model on a combination of the original audio and generated synthetic data yields significant performance gains, including a 14.3% absolute WER reduction for Nashta. The methods prove effective across all four low-resource languages and also show utility for high-resource languages like English, demonstrating their broad applicability.
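A minimal sketch of the random-replacement variant of this pipeline, assuming a list of annotated transcripts and a placeholder synthesize() TTS callable (both hypothetical; the real methods also include gloss-based and LLM-based text generation):

import random

def random_replacement(transcript: str, vocabulary: list[str], p: float = 0.15) -> str:
    # Replace each word with a random in-vocabulary word with probability p,
    # producing a novel transcript from the original annotations only.
    words = transcript.split()
    return " ".join(random.choice(vocabulary) if random.random() < p else w for w in words)

def augment(transcripts: list[str], synthesize, n_variants: int = 3):
    # `synthesize` is a hypothetical TTS callable: text -> waveform.
    vocabulary = sorted({w for t in transcripts for w in t.split()})
    pairs = []
    for t in transcripts:
        for _ in range(n_variants):
            new_text = random_replacement(t, vocabulary)
            pairs.append((new_text, synthesize(new_text)))
    return pairs  # synthetic (text, audio) pairs to mix with the original data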
