YouZum

News

AI, Committee, News, Uncategorized

How to Build a Complete End-to-End NLP Pipeline with Gensim: Topic Modeling, Word Embeddings, Semantic Search, and Advanced Text Analysis

In this tutorial, we present a complete end-to-end Natural Language Processing (NLP) pipeline built with Gensim and supporting libraries, designed to run seamlessly in Google Colab. It integrates multiple core techniques in modern NLP, including preprocessing, topic modeling with Latent Dirichlet Allocation (LDA), word embeddings with Word2Vec, TF-IDF-based similarity analysis, and semantic search. The pipeline not only demonstrates how to train and evaluate these models but also showcases practical visualizations, advanced topic analysis, and document classification workflows. By combining statistical methods with machine learning approaches, the tutorial provides a comprehensive framework for understanding and experimenting with text data at scale. Check out the FULL CODES here.

!pip install --upgrade scipy==1.11.4
!pip install gensim==4.3.2 nltk wordcloud matplotlib seaborn pandas numpy scikit-learn
!pip install --upgrade setuptools

print("Please restart runtime after installation!")
print("Go to Runtime > Restart runtime, then run the next cell")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')

from gensim import corpora, models, similarities
from gensim.models import Word2Vec, LdaModel, TfidfModel, CoherenceModel
from gensim.parsing.preprocessing import (
    preprocess_string, strip_tags, strip_punctuation, strip_multiple_whitespaces,
    strip_numeric, remove_stopwords, strip_short
)

import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

We install and upgrade the necessary libraries, such as SciPy, Gensim, NLTK, and visualization tools, to ensure compatibility. We then import all required modules for preprocessing, modeling, and analysis, and download the NLTK resources needed for tokenization and stopword handling, setting up the environment for our NLP pipeline. Check out the FULL CODES here.
class AdvancedGensimPipeline:
    def __init__(self):
        self.dictionary = None
        self.corpus = None
        self.lda_model = None
        self.word2vec_model = None
        self.tfidf_model = None
        self.similarity_index = None
        self.processed_docs = None

    def create_sample_corpus(self):
        """Create a diverse sample corpus for demonstration"""
        documents = [
            "Data science combines statistics, programming, and domain expertise to extract insights",
            "Big data analytics helps organizations make data-driven decisions at scale",
            "Cloud computing provides scalable infrastructure for modern applications and services",
            "Cybersecurity protects digital systems from threats and unauthorized access attempts",
            "Software engineering practices ensure reliable and maintainable code development",
            "Database management systems store and organize large amounts of structured information",
            "Python programming language is widely used for data analysis and machine learning",
            "Statistical modeling helps identify patterns and relationships in complex datasets",
            "Cross-validation techniques ensure robust model performance evaluation and selection",
            "Recommendation systems suggest relevant items based on user preferences and behavior",
            "Text mining extracts valuable insights from unstructured textual data sources",
            "Image classification assigns predefined categories to visual content automatically",
            "Reinforcement learning trains agents through interaction with dynamic environments"
        ]
        return documents

    def preprocess_documents(self, documents):
        """Advanced document preprocessing using Gensim filters"""
        print("Preprocessing documents...")
        CUSTOM_FILTERS = [
            strip_tags, strip_punctuation, strip_multiple_whitespaces,
            strip_numeric, remove_stopwords, strip_short, lambda x: x.lower()
        ]
        processed_docs = []
        for doc in documents:
            processed = preprocess_string(doc, CUSTOM_FILTERS)
            stop_words = set(stopwords.words('english'))
            processed = [word for word in processed if word not in stop_words and len(word) > 2]
            processed_docs.append(processed)
        self.processed_docs = processed_docs
        print(f"Processed {len(processed_docs)} documents")
        return processed_docs

    def create_dictionary_and_corpus(self):
        """Create Gensim dictionary and corpus"""
        print("Creating dictionary and corpus...")
        self.dictionary = corpora.Dictionary(self.processed_docs)
        self.dictionary.filter_extremes(no_below=2, no_above=0.8)
        self.corpus = [self.dictionary.doc2bow(doc) for doc in self.processed_docs]
        print(f"Dictionary size: {len(self.dictionary)}")
        print(f"Corpus size: {len(self.corpus)}")

    def train_word2vec_model(self):
        """Train Word2Vec model for word embeddings"""
        print("Training Word2Vec model...")
        self.word2vec_model = Word2Vec(
            sentences=self.processed_docs,
            vector_size=100,
            window=5,
            min_count=2,
            workers=4,
            epochs=50
        )
        print("Word2Vec model trained successfully")

    def analyze_word_similarities(self):
        """Analyze word similarities using Word2Vec"""
        print("\n=== Word2Vec Similarity Analysis ===")
        test_words = ['machine', 'data', 'learning', 'computer']
        for word in test_words:
            if word in self.word2vec_model.wv:
                similar_words = self.word2vec_model.wv.most_similar(word, topn=3)
                print(f"Words similar to '{word}': {similar_words}")
        try:
            if all(w in self.word2vec_model.wv for w in ['machine', 'computer', 'data']):
                analogy = self.word2vec_model.wv.most_similar(
                    positive=['computer', 'data'],
                    negative=['machine'],
                    topn=1
                )
                print(f"Analogy result: {analogy}")
        except:
            print("Not enough vocabulary for complex analogies")

    def train_lda_model(self, num_topics=5):
        """Train LDA topic model"""
        print(f"Training LDA model with {num_topics} topics...")
        self.lda_model = LdaModel(
            corpus=self.corpus,
            id2word=self.dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=10,
            alpha='auto',
            per_word_topics=True,
            eval_every=None
        )
        print("LDA model trained successfully")

    def evaluate_topic_coherence(self):
        """Evaluate topic model coherence"""
        print("Evaluating topic coherence...")
        coherence_model = CoherenceModel(
            model=self.lda_model,
            texts=self.processed_docs,
            dictionary=self.dictionary,
            coherence='c_v'
        )
        coherence_score = coherence_model.get_coherence()
        print(f"Topic Coherence Score: {coherence_score:.4f}")
        return coherence_score

    def display_topics(self):
        """Display discovered topics"""
        print("\n=== Discovered Topics ===")
        topics = self.lda_model.print_topics(num_words=8)
        for idx, topic in enumerate(topics):
            print(f"Topic {idx}: {topic[1]}")

    def create_tfidf_model(self):
        """Create TF-IDF model for document similarity"""
        print("Creating TF-IDF model...")
        self.tfidf_model = TfidfModel(self.corpus)
        corpus_tfidf = self.tfidf_model[self.corpus]
        self.similarity_index = similarities.MatrixSimilarity(corpus_tfidf)
        print("TF-IDF model and similarity index created")

    def find_similar_documents(self, query_doc_idx=0):
        """Find documents similar to a query document"""
        print(f"\n=== Document Similarity Analysis ===")
        query_doc_tfidf = self.tfidf_model[self.corpus[query_doc_idx]]
        similarities_scores = self.similarity_index[query_doc_tfidf]
        sorted_similarities = sorted(enumerate(similarities_scores), key=lambda x: x[1], reverse=True)
        print(f"Documents most similar to document {query_doc_idx}:")
        for doc_idx, similarity in sorted_similarities[:5]:
            print(f"Doc {doc_idx}: {similarity:.4f}")

    def visualize_topics(self):
        """Create visualizations for topic analysis"""
        print("Creating topic visualizations...")
        doc_topic_matrix = []
        for doc_bow in self.corpus:
            doc_topics = dict(self.lda_model.get_document_topics(doc_bow, minimum_probability=0))
            topic_vec = [doc_topics.get(i, 0) for i in range(self.lda_model.num_topics)]
            doc_topic_matrix.append(topic_vec)
        doc_topic_df = pd.DataFrame(doc_topic_matrix, columns=[f'Topic_{i}' for i in range(self.lda_model.num_topics)])
        plt.figure(figsize=(12, 8))
        sns.heatmap(doc_topic_df.T, annot=True, cmap='Blues', fmt='.2f')
        plt.title('Document-Topic Distribution Heatmap')
        plt.xlabel('Documents')
        plt.ylabel('Topics')
        plt.tight_layout()
        plt.show()
        fig, axes = plt.subplots(2, 3, figsize=(15, 10))
        axes = axes.flatten()
        for topic_id in range(min(6, self.lda_model.num_topics)):
            topic_words = dict(self.lda_model.show_topic(topic_id, topn=20))
            wordcloud = WordCloud(
                width=300, height=200,
                background_color='white',
                colormap='viridis'
            ).generate_from_frequencies(topic_words)
            axes[topic_id].imshow(wordcloud, interpolation='bilinear')
            axes[topic_id].set_title(f'Topic {topic_id}')
            axes[topic_id].axis('off')
        for i in range(self.lda_model.num_topics, 6):
            axes[i].axis('off')
        plt.tight_layout()
        plt.show()

    def advanced_topic_analysis(self):
        """Perform advanced topic analysis"""
        print("\n=== Advanced Topic Analysis ===")
        topic_distributions = []
        for i, doc_bow in enumerate(self.corpus):
            doc_topics = self.lda_model.get_document_topics(doc_bow)
            dominant_topic = max(doc_topics, key=lambda x: x[1]) if doc_topics else (0, 0)
            topic_distributions.append({
                'doc_id': i,
                'dominant_topic': dominant_topic[0],
                'topic_probability': dominant_topic[1]
            })
        topic_df = pd.DataFrame(topic_distributions)
        plt.figure(figsize=(10, 6))
        topic_counts = topic_df['dominant_topic'].value_counts().sort_index()
        plt.bar(range(len(topic_counts)), topic_counts.values)
        plt.xlabel('Topic ID')
        plt.ylabel('Number of Documents')
        plt.title('Distribution of Dominant Topics Across Documents')
        plt.xticks(range(len(topic_counts)), [f'Topic {i}' for i in topic_counts.index])
        plt.show()
        return topic_df

    def document_classification_demo(self, new_document):
        """Classify a new document using trained models"""
        print(f"\n=== Document Classification Demo ===")
        print(f"Classifying: '{new_document[:50]}...'")
        processed_new = preprocess_string(new_document, [
            strip_tags, strip_punctuation, strip_multiple_whitespaces,
            strip_numeric, remove_stopwords, strip_short, lambda x: x.lower()
        ])
        new_doc_bow = self.dictionary.doc2bow(processed_new)
        doc_topics = self.lda_model.get_document_topics(new_doc_bow)
        print("Topic probabilities:")
        for topic_id, prob in doc_topics:
            print(f"  Topic {topic_id}: {prob:.4f}")
        new_doc_tfidf = self.tfidf_model[new_doc_bow]
        similarities_scores = self.similarity_index[new_doc_tfidf]
        most_similar = np.argmax(similarities_scores)
        print(f"Most similar document: {most_similar} (similarity: {similarities_scores[most_similar]:.4f})")
        return doc_topics, most_similar

    def run_complete_pipeline(self):
        """Execute the complete NLP pipeline"""
        print("=== Advanced Gensim NLP Pipeline
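The excerpt above cuts off inside run_complete_pipeline, so the exact orchestration of that method is not shown here. As a rough guide, a minimal driver along the following lines exercises the methods defined above in a sensible order; the ordering and the sample query document are our assumptions, not the original article's code.

# Minimal driver for the pipeline above; the original run_complete_pipeline
# is truncated in this excerpt, so this step ordering is an assumption.
pipeline = AdvancedGensimPipeline()
docs = pipeline.create_sample_corpus()
pipeline.preprocess_documents(docs)
pipeline.create_dictionary_and_corpus()
pipeline.train_word2vec_model()
pipeline.analyze_word_similarities()
pipeline.train_lda_model(num_topics=5)
pipeline.evaluate_topic_coherence()
pipeline.display_topics()
pipeline.create_tfidf_model()
pipeline.find_similar_documents(query_doc_idx=0)
pipeline.visualize_topics()
topic_df = pipeline.advanced_topic_analysis()
pipeline.document_classification_demo(
    "Neural networks learn hierarchical representations from large text corpora"
)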

How to Build a Complete End-to-End NLP Pipeline with Gensim: Topic Modeling, Word Embeddings, Semantic Search, and Advanced Text Analysis Read the article »

AI, Committee, News, Uncategorized

Google AI Introduces Personal Health Agent (PHA): A Multi-Agent Framework that Enables Personalized Interactions to Address Individual Health Needs

What is a Personal Health Agent?

Large language models (LLMs) have demonstrated strong performance across various domains like clinical reasoning, decision support, and consumer health applications. However, most existing platforms are designed as single-purpose tools, such as symptom checkers, digital coaches, or health information assistants. These approaches often fail to address the complexity of real-world health needs, where individuals require integrated reasoning over wearable streams, personal health records, and laboratory test results. A team of researchers from Google has proposed a Personal Health Agent (PHA) framework (https://arxiv.org/abs/2508.20148v1). The PHA is designed as a multi-agent system that unifies complementary roles: data analysis, medical knowledge reasoning, and health coaching. Instead of returning isolated outputs from a single model, the PHA employs a central orchestrator to coordinate specialized sub-agents, iteratively synthesize their outputs, and deliver coherent, personalized guidance.

How does the PHA framework operate?

The Personal Health Agent (PHA) is built on top of the Gemini 2.0 model family. It follows a modular architecture consisting of three sub-agents and one orchestrator:

Data Science Agent (DS): The DS agent interprets and analyzes time-series data from wearables (e.g., step counts, heart rate variability, sleep metrics) and structured health records. It is capable of decomposing open-ended user questions into formal analysis plans, executing statistical reasoning, and comparing results against population-level reference data. For example, it can quantify whether physical activity in the past month is associated with improvements in sleep quality.

Domain Expert Agent (DE): The DE agent provides medically contextualized information. It integrates personal health records, demographic information, and wearable signals to generate explanations grounded in medical knowledge. Unlike general-purpose LLMs that may produce plausible but unreliable outputs, the DE agent follows an iterative reasoning-investigation-examination loop, combining authoritative medical resources with personal data. This allows it to provide evidence-based interpretations, such as whether a specific blood pressure measurement is within a safe range for an individual with a particular condition.

Health Coach Agent (HC): The HC agent addresses behavioral change and long-term goal setting. Drawing from established coaching strategies such as motivational interviewing, it conducts multi-turn conversations, identifies user goals, clarifies constraints, and generates structured, personalized plans. For example, it may guide a user through setting a weekly exercise schedule, adapting to individual barriers, and incorporating feedback from progress tracking.

Orchestrator: The orchestrator coordinates these three agents. When a query is received, it assigns a primary agent responsible for generating the main output and supporting agents to provide contextual data or domain knowledge. After collecting the results, the orchestrator runs an iterative reflection loop, checking outputs for coherence and accuracy before synthesizing them into a single response. This ensures that the final output is not merely an aggregation of agent responses but an integrated recommendation.

How was the PHA evaluated?

The research team conducted one of the most comprehensive evaluations of a health AI system to date. Their evaluation framework involved 10 benchmark tasks, 7,000+ human annotations, and 1,100 hours of assessment from health experts and end-users.

Evaluation of the Data Science Agent

The DS agent was assessed on its ability to generate structured analysis plans and produce correct, executable code. Compared to baseline Gemini models, it demonstrated:

A significant increase in analysis plan quality, improving mean expert-rated scores from 53.7% to 75.6%.
A reduction in critical data handling errors from 25.4% to 11.0%.
An improvement in code pass rates from 58.4% to 75.5% on first attempts, with further gains under iterative self-correction.

Evaluation of the Domain Expert Agent

The DE agent was benchmarked across four capabilities: factual accuracy, diagnostic reasoning, contextual personalization, and multimodal data synthesis. Results include:

Factual knowledge: On over 2,000 board-style exam questions across endocrinology, cardiology, sleep medicine, and fitness, the DE agent achieved 83.6% accuracy, outperforming baseline Gemini (81.8%).
Diagnostic reasoning: On 2,000 self-reported symptom cases, it achieved 46.1% top-1 diagnostic accuracy compared to 41.4% for a state-of-the-art Gemini baseline.
Personalization: In user studies, 72% of participants preferred DE agent responses to baseline outputs, citing higher trustworthiness and contextual relevance.
Multimodal synthesis: In expert clinician reviews of health summaries generated from wearable, lab, and survey data, the DE agent's outputs were rated more clinically significant, comprehensive, and trustworthy than baseline outputs.

Evaluation of the Health Coach Agent

The HC agent was designed and assessed through expert interviews and user studies. Experts emphasized the need for six coaching capabilities: goal identification, active listening, context clarification, empowerment, SMART (Specific, Measurable, Attainable, Relevant, Time-bound) recommendations, and iterative feedback incorporation. In evaluations, the HC agent demonstrated improved conversation flow and user engagement compared to baseline models. It avoided premature recommendations and instead balanced information gathering with actionable advice, producing outputs more consistent with expert coaching practices.

Evaluation of the Integrated PHA System

At the system level, the orchestrator and three agents were tested together in open-ended, multimodal conversations reflecting realistic health scenarios. Both experts and end-users rated the integrated Personal Health Agent (PHA) significantly higher than baseline Gemini systems across measures of accuracy, coherence, personalization, and trustworthiness.

How does the PHA contribute to health AI?

The introduction of a multi-agent PHA addresses several limitations of existing health AI systems:

Integration of heterogeneous data: Wearable signals, medical records, and lab test results are analyzed jointly rather than in isolation.
Division of labor: Each sub-agent specializes in a domain where single monolithic models often underperform, e.g., numerical reasoning for DS, clinical grounding for DE, and behavioral engagement for HC.
Iterative reflection: The orchestrator's review cycle reduces inconsistencies that often arise when multiple outputs are simply concatenated.
Systematic evaluation: Unlike most prior work, which relied on small-scale case studies, the Personal Health Agent (PHA) was validated with a large multimodal dataset (the WEAR-ME study).
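To make the orchestration pattern described above concrete, here is a minimal, hypothetical sketch of a primary-plus-supporting-agents loop with a reflection pass. All class and function names are illustrative; they are not taken from the paper or from any Google API.

# Hypothetical sketch of the orchestrator pattern described above:
# route a query to a primary sub-agent, gather supporting context,
# then run a reflection pass before returning one integrated answer.
from typing import Callable, Dict

SubAgent = Callable[[str, Dict[str, str]], str]  # (query, context) -> answer

class Orchestrator:
    def __init__(self, agents: Dict[str, SubAgent], reflect: Callable[[str, Dict[str, str]], str]):
        self.agents = agents      # e.g. {"DS": ..., "DE": ..., "HC": ...}
        self.reflect = reflect    # checks coherence/accuracy and rewrites if needed

    def answer(self, query: str, primary: str) -> str:
        # Supporting agents contribute context first.
        context = {
            name: agent(query, {}) for name, agent in self.agents.items() if name != primary
        }
        draft = self.agents[primary](query, context)
        # Iterative reflection loop (a single pass here for brevity).
        return self.reflect(draft, context)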

Google AI Introduces Personal Health Agent (PHA): A Multi-Agent Framework that Enables Personalized Interactions to Address Individual Health Needs Read the article »

AI, Committee, News, Uncategorized

FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response

arXiv:2502.18452v3 Announce Type: replace Abstract: During Human Robot Interactions in disaster relief scenarios, Large Language Models (LLMs) have the potential for substantial physical reasoning to assist in mission objectives. However, these reasoning capabilities are often found only in larger models, which are not currently reasonable to deploy on robotic systems due to size constraints. To meet our problem space requirements, we introduce a dataset and pipeline to create Field Reasoning and Instruction Decoding Agent (FRIDA) models. In our pipeline, domain experts and linguists combine their knowledge to make high-quality, few-shot prompts used to generate synthetic data for fine-tuning. We hand-curate datasets for this few-shot prompting and for evaluation to improve LLM reasoning on both general and disaster-specific objects. We concurrently run an ablation study to understand which kinds of synthetic data most affect performance. We fine-tune several small instruction-tuned models and find that ablated FRIDA models only trained on objects’ physical state and function data outperformed both the FRIDA models trained on all synthetic data and the base models in our evaluation. We demonstrate that the FRIDA pipeline is capable of instilling physical common sense with minimal data.
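As a rough illustration of the few-shot prompting step described in the abstract, the sketch below builds a prompt that seeds a generator model with expert-style examples about objects' physical state and function; the example Q/A pairs and wording are invented for illustration and are not from the FRIDA dataset.

# Hedged sketch of the few-shot synthetic-data idea: expert-written examples
# seed a prompt, and a generator LLM (left abstract here) produces new
# fine-tuning pairs in the same format.
FEW_SHOT_EXAMPLES = [
    ("Can a cinder block hold a door open?",
     "Yes. A cinder block is heavy and rigid, so it can brace a door in place."),
    ("Is a tarp useful for carrying rubble?",
     "Yes. A tarp is flexible and strong enough to drag moderate loads of debris."),
]

def build_prompt(new_object: str) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
    return (f"{shots}\n\nWrite a new question and answer about the physical state "
            f"and function of a {new_object} in a disaster-response setting.\nQ:")

print(build_prompt("fire extinguisher"))
# The generated Q/A pairs would then be collected into a fine-tuning dataset.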

FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response Read the article »

AI, Committee, News, Uncategorized

NER Retriever: Zero-Shot Named Entity Retrieval with Type-Aware Embeddings

arXiv:2509.04011v1 Announce Type: cross Abstract: We present NER Retriever, a zero-shot retrieval framework for ad-hoc Named Entity Retrieval, a variant of Named Entity Recognition (NER), where the types of interest are not provided in advance, and a user-defined type description is used to retrieve documents mentioning entities of that type. Instead of relying on fixed schemas or fine-tuned models, our method builds on internal representations of large language models (LLMs) to embed both entity mentions and user-provided open-ended type descriptions into a shared semantic space. We show that internal representations, specifically the value vectors from mid-layer transformer blocks, encode fine-grained type information more effectively than commonly used top-layer embeddings. To refine these representations, we train a lightweight contrastive projection network that aligns type-compatible entities while separating unrelated types. The resulting entity embeddings are compact, type-aware, and well-suited for nearest-neighbor search. Evaluated on three benchmarks, NER Retriever significantly outperforms both lexical and dense sentence-level retrieval baselines. Our findings provide empirical support for representation selection within LLMs and demonstrate a practical solution for scalable, schema-free entity retrieval. The NER Retriever Codebase is publicly available at https://github.com/ShacharOr100/ner_retriever
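A minimal sketch of the retrieval idea, under simplifying assumptions: the paper uses value vectors from mid-layer attention blocks plus a learned contrastive projection, whereas the toy code below uses mid-layer hidden states of a small off-the-shelf encoder as a stand-in and ranks entity mentions against a type description by cosine similarity. The model choice and layer index are arbitrary.

# Hedged sketch: embed mentions and a user-provided type description with a
# mid-layer representation, then retrieve by cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative choice, not the paper's
tok = AutoTokenizer.from_pretrained(MODEL)
enc = AutoModel.from_pretrained(MODEL, output_hidden_states=True)

def embed(texts, layer=3):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).hidden_states[layer]   # (batch, seq, dim) at a mid layer
    mask = batch["attention_mask"].unsqueeze(-1)
    vecs = (hidden * mask).sum(1) / mask.sum(1)      # mean-pool over tokens
    return torch.nn.functional.normalize(vecs, dim=-1)

mention_texts = ["malaria", "Paris", "tuberculosis", "guitar"]
type_query = embed(["an infectious disease"])
mentions = embed(mention_texts)
scores = (mentions @ type_query.T).squeeze()         # cosine similarity of unit vectors
print(sorted(zip(mention_texts, scores.tolist()), key=lambda p: -p[1]))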

NER Retriever: Zero-Shot Named Entity Retrieval with Type-Aware Embeddings Read the article »

AI, Committee, News, Uncategorized

False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

arXiv:2509.03888v1 Announce Type: new Abstract: Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs’ internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
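For readers unfamiliar with the probing setup being re-examined, the sketch below shows the basic recipe: fit a linear probe on hidden-state features of benign versus malicious prompts, then compare in-distribution and out-of-distribution accuracy. Feature extraction from an actual LLM is elided; the random arrays are placeholders.

# Hedged sketch of a probing-based detector: a linear probe over hidden-state
# features, evaluated in- and out-of-distribution.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
d = 256                                  # stand-in for a hidden-state dimension
X_train = rng.normal(size=(400, d))      # placeholder features for benign + malicious prompts
y_train = rng.integers(0, 2, size=400)   # 1 = malicious, 0 = benign (labels assumed)
X_ood = rng.normal(size=(100, d))        # prompts with different surface patterns
y_ood = rng.integers(0, 2, size=100)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("in-distribution acc:", accuracy_score(y_train, probe.predict(X_train)))
print("out-of-distribution acc:", accuracy_score(y_ood, probe.predict(X_ood)))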

False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize Read the article »

AI, Committee, News, Uncategorized

ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory

arXiv:2509.04439v1 Announce Type: cross Abstract: While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g. exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. On the challenging ARC-AGI benchmark, our method yields a 7.5% relative gain over a strong no-memory baseline with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, we confirm that dynamically updating memory during test-time outperforms an otherwise identical fixed memory setting with additional attempts, supporting the hypothesis that solving more problems and abstracting more patterns to memory enables further solutions in a form of self-improvement. Code available at https://github.com/matt-seb-ho/arc_memo.
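A minimal sketch of what concept-level memory could look like in code; the class name, the token-overlap retrieval, and the example concepts are illustrative assumptions, not the paper's implementation.

# Hedged sketch of a concept-level memory: store short, reusable abstractions
# distilled from past solution traces and retrieve the most relevant ones to
# prepend to a new query's prompt.
from dataclasses import dataclass, field

@dataclass
class ConceptMemory:
    concepts: list = field(default_factory=list)

    def add(self, concept: str) -> None:
        """Store an abstraction distilled from a solved problem's trace."""
        self.concepts.append(concept)

    def retrieve(self, query: str, k: int = 3) -> list:
        """Rank stored concepts by simple token overlap with the query."""
        q = set(query.lower().split())
        scored = sorted(self.concepts, key=lambda c: -len(q & set(c.lower().split())))
        return scored[:k]

memory = ConceptMemory()
memory.add("If a grid transformation repeats, look for the smallest tiling unit.")
memory.add("Count connected components before and after to detect object merging.")

prompt_prefix = "\n".join(memory.retrieve("grid puzzle with repeating tiles"))
# prompt_prefix would be prepended to the new query before calling the LLM.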

ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory Read the article »

AI, Committee, News, Uncategorized

Biomni-R0: New Agentic LLMs Trained End-to-End with Multi-Turn Reinforcement Learning for Expert-Level Intelligence in Biomedical Research

The Growing Role of AI in Biomedical Research

The field of biomedical artificial intelligence is evolving rapidly, with increasing demand for agents capable of performing tasks that span genomics, clinical diagnostics, and molecular biology. These agents aren’t merely designed to retrieve facts; they are expected to reason through complex biological problems, interpret patient data, and extract meaningful insights from vast biomedical databases. Unlike general-purpose AI models, biomedical agents must interface with domain-specific tools, comprehend biological hierarchies, and simulate workflows similar to those of researchers to effectively support modern biomedical research.

The Core Challenge: Matching Expert-Level Reasoning

Achieving expert-level performance in these tasks is far from trivial. Most large language models fall short when dealing with the nuance and depth of biomedical reasoning. They may succeed on surface-level retrieval or pattern recognition tasks, but often fail when challenged with multi-step reasoning, rare disease diagnosis, or gene prioritization, areas that require not just data access but contextual understanding and domain-specific judgment. This limitation has created a clear gap: how to train biomedical AI agents that can think and act like domain experts.

Why Traditional Approaches Fall Short

While some solutions leverage supervised learning on curated biomedical datasets or retrieval-augmented generation to ground responses in literature or databases, these approaches have drawbacks. They often rely on static prompts and pre-defined behaviors that lack adaptability. Furthermore, many of these agents struggle to execute external tools effectively, and their reasoning chains collapse when faced with unfamiliar biomedical structures. This fragility makes them ill-suited for dynamic or high-stakes environments, where interpretability and accuracy are non-negotiable.

Biomni-R0: A New Paradigm Using Reinforcement Learning

Researchers from Stanford University and UC Berkeley introduced a new family of models called Biomni-R0, built by applying reinforcement learning (RL) to a biomedical agent foundation. These models, Biomni-R0-8B and Biomni-R0-32B, were trained in an RL environment specifically tailored for biomedical reasoning, using both expert-annotated tasks and a novel reward structure. The collaboration combines Stanford’s Biomni agent and environment platform with UC Berkeley’s SkyRL reinforcement learning infrastructure, aiming to push biomedical agents past human-level capabilities.

Training Strategy and System Design

The research introduced a two-phase training process. First, the team used supervised fine-tuning (SFT) on high-quality trajectories sampled from Claude-4 Sonnet via rejection sampling, effectively bootstrapping the agent’s ability to follow structured reasoning formats. Next, they fine-tuned the models using reinforcement learning, optimizing for two kinds of rewards: one for correctness (e.g., selecting the right gene or diagnosis) and another for response formatting (e.g., using structured <think> and <answer> tags correctly). To ensure computational efficiency, the team developed asynchronous rollout scheduling that minimized bottlenecks caused by external tool delays. They also expanded the context length to 64k tokens, allowing the agent to manage long multi-step reasoning conversations effectively.

Results That Outperform Frontier Models

The performance gains were significant. Biomni-R0-32B achieved a score of 0.669, a jump from the base model’s 0.346. Even Biomni-R0-8B, the smaller version, scored 0.588, outperforming general-purpose models like Claude 4 Sonnet and GPT-5, which are both much larger. On a task-by-task basis, Biomni-R0-32B scored highest on 7 out of 10 tasks, while GPT-5 led in 2 and Claude 4 in just 1. One of the most striking results was in rare disease diagnosis, where Biomni-R0-32B reached 0.67, compared to Qwen-32B’s 0.03, a more than 20× improvement. Similarly, in GWAS variant prioritization, the model’s score increased from 0.16 to 0.74, demonstrating the value of domain-specific reasoning.

Designing for Scalability and Precision

Training large biomedical agents requires dealing with resource-heavy rollouts involving external tool execution, database queries, and code evaluation. To manage this, the system decoupled environment execution from model inference, allowing more flexible scaling and reducing idle GPU time. This design ensured efficient use of resources, even with tools that had varying execution latencies. Longer reasoning sequences also proved beneficial: the RL-trained models consistently produced lengthier, structured responses, which strongly correlated with better performance, highlighting that depth and structure in reasoning are key indicators of expert-level understanding in biomedicine.

Key Takeaways from the research include:

Biomedical agents must perform deep reasoning, not just retrieval, across genomics, diagnostics, and molecular biology.
The central problem is achieving expert-level task performance, especially in complex areas such as rare diseases and gene prioritization.
Traditional methods, including supervised fine-tuning and retrieval-based models, often fall short in robustness and adaptability.
Biomni-R0, developed by Stanford and UC Berkeley, uses reinforcement learning with expert-based rewards and structured output formatting.
The two-phase training pipeline, SFT followed by RL, proved highly effective in optimizing performance and reasoning quality.
Biomni-R0-8B delivers strong results with a smaller architecture, while Biomni-R0-32B sets new benchmarks, outperforming Claude 4 and GPT-5 on 7 of 10 tasks.
Reinforcement learning enabled the agent to generate longer, more coherent reasoning traces, a key trait of expert behavior.
This work lays the foundation for super-expert biomedical agents, capable of automating complex research workflows with precision.

Check out the Technical details.
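A minimal sketch of the two-part reward described above, combining a term for the structured <think>/<answer> format with a term for answer correctness; the weights and the exact-match rule are assumptions for illustration, not the reward actually used in Biomni-R0.

# Hedged sketch of a format + correctness reward for RL fine-tuning.
import re

def reward(response: str, gold_answer: str, w_format: float = 0.2, w_correct: float = 0.8) -> float:
    # Format term: both structured tags must be present and well-formed.
    has_format = bool(
        re.search(r"<think>.*?</think>", response, re.S)
        and re.search(r"<answer>.*?</answer>", response, re.S)
    )
    # Correctness term: exact match on the extracted answer (matching rule assumed).
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    predicted = match.group(1).strip().lower() if match else ""
    is_correct = predicted == gold_answer.strip().lower()
    return w_format * has_format + w_correct * is_correct

print(reward("<think>BRCA1 is implicated...</think><answer>BRCA1</answer>", "BRCA1"))  # 1.0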

Biomni-R0: New Agentic LLMs Trained End-to-End with Multi-Turn Reinforcement Learning for Expert-Level Intelligence in Biomedical Research Read the article »

AI, Committee, News, Uncategorized

PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation

arXiv:2411.05085v2 Announce Type: replace-cross Abstract: Radiology report generation (RRG) aims to create free-text radiology reports from clinical imaging. Grounded radiology report generation (GRRG) extends RRG by including the localisation of individual findings on the image. Currently, there are no manually annotated chest X-ray (CXR) datasets to train GRRG models. In this work, we present a dataset called PadChest-GR (Grounded-Reporting) derived from PadChest aimed at training GRRG models for CXR images. We curate a public bi-lingual dataset of 4,555 CXR studies with grounded reports (3,099 abnormal and 1,456 normal), each containing complete lists of sentences describing individual present (positive) and absent (negative) findings in English and Spanish. In total, PadChest-GR contains 7,037 positive and 3,422 negative finding sentences. Every positive finding sentence is associated with up to two independent sets of bounding boxes labelled by different readers and has categorical labels for finding type, locations, and progression. To the best of our knowledge, PadChest-GR is the first manually curated dataset designed to train GRRG models for understanding and interpreting radiological images and generated text. By including detailed localization and comprehensive annotations of all clinically relevant findings, it provides a valuable resource for developing and evaluating GRRG models from CXR images. PadChest-GR can be downloaded under request from https://bimcv.cipf.es/bimcv-projects/padchest-gr/
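A hypothetical record layout for one grounded finding sentence, reflecting the fields described in the abstract (bilingual sentences, up to two independent sets of reader-annotated boxes, and categorical labels); the actual PadChest-GR schema and field names may differ.

# Hypothetical schema for a grounded finding sentence; illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed (x, y, width, height) in image coordinates

@dataclass
class GroundedFinding:
    study_id: str
    sentence_en: str
    sentence_es: str
    positive: bool                         # present (positive) vs. absent (negative) finding
    boxes_reader1: List[Box] = field(default_factory=list)
    boxes_reader2: List[Box] = field(default_factory=list)  # second independent reader, may be empty
    finding_type: str = ""
    locations: List[str] = field(default_factory=list)
    progression: Optional[str] = None

finding = GroundedFinding(
    study_id="example-0001", sentence_en="Cardiomegaly.", sentence_es="Cardiomegalia.",
    positive=True, boxes_reader1=[(0.30, 0.40, 0.45, 0.35)],
    finding_type="cardiomegaly", locations=["cardiac silhouette"],
)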

PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation Read the article »

AI, Committee, News, Uncategorized

FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

arXiv:2502.11128v2 Announce Type: replace Abstract: To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model’s output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
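For orientation, the sketch below shows a single generic conditional flow-matching training step for a continuous-valued mel-spectrogram frame: interpolate between a prior sample and the target, then regress the velocity. FELLE's step-dependent prior and coarse-to-fine hierarchy are simplified away, and the tiny network is a placeholder.

# Hedged sketch of a generic conditional flow-matching training step.
import torch

def flow_matching_step(model, x1, cond):
    """model(x_t, t, cond) predicts a velocity field; x1 is the target mel frame."""
    x0 = torch.randn_like(x1)            # prior sample (FELLE adapts the prior per step)
    t = torch.rand(x1.shape[0], 1)       # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # straight-line interpolation path
    v_target = x1 - x0                   # constant velocity along that path
    v_pred = model(x_t, t, cond)
    return torch.mean((v_pred - v_target) ** 2)

# Toy usage with a stand-in network: 80-dim mel frame, 16-dim conditioning vector.
class TinyVelocityNet(torch.nn.Module):
    def __init__(self, dim=80, cond_dim=16):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(dim + 1 + cond_dim, 256),
                                       torch.nn.ReLU(), torch.nn.Linear(256, dim))
    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, t, cond], dim=-1))

model = TinyVelocityNet()
loss = flow_matching_step(model, torch.randn(4, 80), torch.randn(4, 16))
loss.backward()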

FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching Read the article »

AI, Committee, News, Uncategorized

Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

arXiv:2503.23768v3 Announce Type: replace Abstract: Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance and being easily affected by the stroop effect introduced by textual information. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.

Texture or Semantics? Vision-Language Models Get Lost in Font Recognition Read the article »
