YouZum


AI, Committee, News, Uncategorized

A new CRISPR startup is betting regulators will ease up on gene-editing

Here at MIT Technology Review we’ve been writing about the gene-editing technology CRISPR since 2013, calling it the biggest biotech breakthrough of the century. Yet so far, there’s been only one gene-editing drug approved. It’s been used commercially on only about 40 patients, all with sickle-cell disease. It’s becoming clear that the impact of CRISPR isn’t as big as we all hoped. In fact, there’s a pall of discouragement over the entire field—with some journalists saying the gene-editing revolution has “lost its mojo.”

So what will it take for CRISPR to help more people? A new startup says the answer could be an “umbrella approach” to testing and commercializing treatments. Aurora Therapeutics, which has $16 million from Menlo Ventures and counts CRISPR co-inventor Jennifer Doudna as an advisor, essentially hopes to win approval for gene-editing drugs that can be slightly adjusted, or personalized, without requiring costly new trials or approvals for every new version.

The need to change regulations around gene-editing treatments was endorsed in November by the head of the US Food and Drug Administration, Martin Makary, who said the agency would open a “new” regulatory pathway for “bespoke, personalized therapies” that can’t easily be tested in conventional ways.

Aurora’s first target, the rare inherited disease phenylketonuria, also known as PKU, is a case in point. People with PKU lack a working version of an enzyme needed to use up the amino acid phenylalanine, a component of pretty much all meat and protein. If the amino acid builds up, it causes brain damage. So patients usually go on an onerous “diet for life” of special formula drinks and vegetables.

In theory, gene editing can fix PKU. In mice, scientists have already restored the gene for the enzyme by rewriting DNA in liver cells, which both make the enzyme and are some of the easiest to reach with a gene-editing drug.
The problem is that in human patients, many different mutations can affect the critical gene. According to Cory Harding, a researcher at Oregon Health & Science University, scientists know about 1,600 different DNA mutations that cause PKU. There’s no way anyone will develop 1,600 different gene-editing drugs. Instead, Aurora’s goal is to eventually win approval for a single gene editor that, with minor adjustments, could be used to correct several of the most common mutations, including one that’s responsible for about 10% of the estimated 20,000 PKU cases in the US. “We can’t have a separate clinical trial for each mutation,” says Edward Kaye, the CEO of Aurora. “The way the FDA approves gene editing has to change, and I think they’ve been very understanding that is the case.”

A gene editor is a special protein that can zero in on a specific location in the genome and change it. To prepare one, Aurora will put genetic code for the editor into a nanoparticle along with a targeting molecule. In total, it will involve about 5,000 gene letters. But only 20 of them need to change in order to redirect the treatment to repair a different mutation. “Over 99% of the drug stays the same,” says Johnny Hu, a partner at Menlo Ventures, which put up the funding for the startup.

The new company came together after Hu met over pizza with Fyodor Urnov, an outspoken gene-editing scientist at the University of California, Berkeley, who is Aurora’s cofounder and sits on its board. In 2022, Urnov had written a New York Times editorial bemoaning the “chasm” between what editing technology can do and the “legal, financial, and organizational” realities preventing researchers from curing people. “I went to Fyodor and said, ‘Hey, we’re getting all these great results in the clinic with CRISPR, but why hasn’t it scaled?’” says Hu.
Part of the reason is that most gene-editing companies are chasing the same few conditions, such as sickle-cell, where (as luck would have it) a single edit works for all patients. But that leaves around 400 million people who have 7,000 other inherited conditions without much hope of getting their DNA fixed, Urnov estimated in his editorial.

Then, last May, came the dramatic demonstration of the first fully “personalized” gene-editing treatment. A team in Philadelphia, assisted by Urnov and others, succeeded in correcting the DNA of a baby, named KJ Muldoon, who had an entirely unique mutation that caused a metabolic disease. Though it didn’t target PKU, the project showed that gene editing could theoretically fix some inherited diseases “on demand.”

It also underscored a big problem. Treating a single child required a large team and cost millions in time, effort, and materials—all to create a drug that would never be used again.

That’s exactly the sort of situation the new “umbrella” trials are supposed to address. Kiran Musunuru, who co-led the team at the University of Pennsylvania, says he’s been in discussions with the FDA to open a study of bespoke gene editors this year focusing on diseases of the type Baby KJ had, called urea cycle disorders. Each time a new patient appears, he says, they’ll try to quickly put together a variant of their gene-editing drug that’s tuned to fix that child’s particular genetic problem.

Musunuru, who isn’t involved with Aurora, does not think the company’s plans for PKU count as fully personalized editors. “These corporate PKU efforts have nothing whatsoever to do with Baby KJ,” he says.
He says his center continues to focus on mutations “so ultra-rare that we don’t see any scenario where a for-profit gene-editing company would find that indication to be commercially viable.” Instead, what’s occurring in PKU, says Musunuru, is that researchers have realized they can assemble “a bunch” of the most frequent mutations “into a large enough group of patients to make a platform PKU therapy commercially viable.”

While that would still leave out many patients with extra-rare gene errors, Musunuru says any gene-editing treatment at all would still be “a big improvement over the status quo, which is zero genetic therapies for PKU.”



America’s new dietary guidelines ignore decades of scientific research

The new year has barely begun, but the first days of 2026 have brought big news for health. On Monday, the US’s federal health agency upended its recommendations for routine childhood vaccinations—a move that health associations worry puts children at unnecessary risk of preventable disease. There was more news from the federal government on Wednesday, when health secretary Robert F. Kennedy Jr. and his colleagues at the Departments of Health and Human Services and Agriculture unveiled new dietary guidelines for Americans. And they are causing a bit of a stir.

That’s partly because they recommend products like red meat, butter, and beef tallow—foods that have been linked to cardiovascular disease, and that nutrition experts have been recommending people limit in their diets. These guidelines are a big deal—they influence food assistance programs and school lunches, for example. So this week let’s look at the good, the bad, and the ugly advice being dished up to Americans by their government.

The government dietary guidelines have been around since the 1980s. They are updated every five years, in a process that typically involves a team of nutrition scientists who have combed over scientific research for years. That team will first publish its findings in a scientific report, and, around a year later, the finalized Dietary Guidelines for Americans are published. The last guidelines covered the period 2020 to 2025, and new guidelines were expected in the summer of 2025. Work had already been underway for years; the scientific report intended to inform them was published back in 2024. But the publication of the guidelines was delayed by last year’s government shutdown, Kennedy said. They were finally published yesterday.

Nutrition experts had been waiting with bated breath. Nutrition science has evolved slightly over the last five years, and some were expecting to see new recommendations.
Research now suggests, for example, that there is no “safe” level of alcohol consumption. We are also beginning to learn more about health risks associated with some ultraprocessed foods (although we still don’t have a good understanding of what those risks might be, or even what counts as “ultraprocessed”). And some scientists were expecting to see the new guidelines factor in environmental sustainability, says Gabby Headrick, the associate director of food and nutrition policy at George Washington University’s Institute for Food Safety & Nutrition Security in Washington, DC. They didn’t.

Many of the recommendations are sensible. The guidelines recommend a diet rich in whole foods, particularly fresh fruits and vegetables. They recommend avoiding highly processed foods and added sugars. They also highlight the importance of dietary protein, whole grains, and “healthy” fats. But not all of them are sensible, says Headrick.

The guidelines open with a “new pyramid” of foods. This inverted triangle is topped with “protein, dairy, and healthy fats” on one side and “vegetables and fruits” on the other.

USDA

There are a few problems with this image. For starters, its shape—nutrition scientists have long moved on from the food pyramids of the 1990s, says Headrick. They’re confusing and make it difficult for people to understand what the contents of their plate should look like. That’s why scientists now use an image of a plate to depict a healthy diet. “We’ve been using MyPlate to describe the dietary guidelines in a very consumer-friendly, nutrition-education-friendly way for over the last decade now,” says Headrick. (The UK’s National Health Service takes a similar approach.)

And then there’s the content of that food pyramid. It puts a significant focus on meat and whole-fat dairy produce. The top left image—the one most viewers will probably see first—is of a steak. Smack in the middle of the pyramid is a stick of butter. That’s new. And it’s not a good thing.
While both red meat and whole-fat dairy can certainly form part of a healthy diet, nutrition scientists have long been recommending that most people try to limit their consumption of these foods. Both can be high in saturated fat, which can increase the risk of cardiovascular disease—the leading cause of death in the US. In 2015, on the basis of limited evidence, the World Health Organization classified red meat as “probably carcinogenic to humans.”

Also concerning is the document’s definition of “healthy fats,” which includes butter and beef tallow (a MAHA favorite). Neither food is generally considered to be as healthy as olive oil, for example. While olive oil contains around two grams of saturated fat per tablespoon, a tablespoon of beef tallow has around six grams of saturated fat, and the same amount of butter contains around seven grams of saturated fat, says Headrick. “I think these are pretty harmful dietary recommendations to be making when we have established that those specific foods likely do not have health-promoting benefits,” she adds. Red meat is not exactly a sustainable food, and neither are dairy products.

And the advice on alcohol is relatively vague, recommending that people “consume less alcohol for better overall health” (which might leave you wondering: Less than what?).

There are other questionable recommendations in the guidelines. Americans are advised to include more protein in their diets—at levels between 1.2 and 1.6 grams daily per kilo of body weight, 50% to 100% more than recommended in previous guidelines. There’s a risk that increasing protein consumption to such levels could raise a person’s intake of both calories and saturated fats to unhealthy levels, says José Ordovás, a senior nutrition scientist at Tufts University. “I would err on the low side,” he says.

Some nutrition scientists are questioning why these changes have been made. It’s not as though the new recommendations were in the 2024 scientific report.
And the evidence on red meat and saturated fat hasn’t changed, says Headrick. In reporting this piece, I contacted many contributors to the previous guidelines, and some who had led research for 2024’s scientific report. None of them agreed to comment on the new guidelines on the record. Some seemed disgruntled. One merely told me that the process by which the new



The Download: the case for AI slop, and helping CRISPR fulfill its promise

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

How I learned to stop worrying and love AI slop

—Caiwei Chen

If I were to locate the moment AI slop broke through into popular consciousness, I’d pick the video of rabbits bouncing on a trampoline that went viral last summer. For many savvy internet users, myself included, it was the first time we were fooled by an AI video, and it ended up spawning a wave of almost identical generated clips.

My first reaction was that, broadly speaking, all of this sucked. That’s become a familiar refrain, in think pieces and at dinner parties. Everything online is slop now—the internet “enshittified,” with AI taking much of the blame. Initially, I largely agreed. But then friends started sharing AI clips in group chats that were compellingly weird, or funny. Some even had a grain of brilliance.

I had to admit I didn’t fully understand what I was rejecting—what I found so objectionable. To try to get to the bottom of how I felt (and why), I spoke to the people making the videos, a company creating bespoke tools for creators, and experts who study how new media becomes culture. What I found convinced me that maybe generative AI will not end up ruining everything after all. Read the full story.

A new CRISPR startup is betting regulators will ease up on gene-editing

Here at MIT Technology Review we’ve been writing about the gene-editing technology CRISPR since 2013, calling it the biggest biotech breakthrough of the century. Yet so far, there’s been only one gene-editing drug approved, and it’s been used commercially on only about 40 patients, all with sickle-cell disease. It’s becoming clear that the impact of CRISPR isn’t as big as we all hoped. In fact, there’s a pall of discouragement over the entire field—with some journalists saying the gene-editing revolution has “lost its mojo.” So what will it take for CRISPR to help more people?
A new startup says the answer could be an “umbrella approach” to testing and commercializing treatments, which could avoid costly new trials or approvals for every new version. Read the full story.

—Antonio Regalado

America’s new dietary guidelines ignore decades of scientific research

The first days of 2026 have brought big news for health. On Wednesday, health secretary Robert F. Kennedy Jr. and his colleagues at the Departments of Health and Human Services and Agriculture unveiled new dietary guidelines for Americans. And they are causing a bit of a stir. That’s partly because they recommend products like red meat, butter, and beef tallow—foods that have been linked to cardiovascular disease, and that nutrition experts have been recommending people limit in their diets. These guidelines are a big deal—they influence food assistance programs and school lunches, for example. Let’s take a look at the good, the bad, and the ugly advice being dished up to Americans by their government.

—Jessica Hamzelou

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 Grok has switched off its image-generating function for most users
Following a global backlash to its sexualized pictures of women and children. (The Guardian)
+ Elon Musk has previously lamented the “guardrails” around the chatbot. (CNN)
+ XAI has been burning through cash lately. (Bloomberg $)

2 Online sleuths tried to use AI to unmask the ICE agent who killed a woman
The problem is, its results are far from reliable. (WP $)
+ The Trump administration is pushing videos of the incident filmed from a specific angle. (The Verge)
+ Minneapolis is struggling to make sense of the shooting of Renee Nicole Good. (WSJ $)

3 Smartphones and PCs are about to get more expensive
You can thank the memory chip shortage sparked by the AI data center boom. (FT $)
+ Expect delays alongside those price rises, too. (Economist $)

4 NASA is bringing four of the seven ISS crew members back to Earth
It’s not clear exactly why, but it said one of them experienced a “medical situation” earlier this week. (Ars Technica)

5 The vast majority of humanoid robots shipped last year were from China
The country is dominating early supply for the bipedal machines. (Bloomberg $)
+ Why a Chinese robot vacuum firm is moving into EVs. (Wired $)
+ China’s EV giants are betting big on humanoid robots. (MIT Technology Review)

6 New Jersey has banned students’ phones in schools
It’s the latest in a long line of states to restrict devices during school hours. (NYT $)

7 Are AI coding assistants getting worse?
This data scientist certainly seems to think so. (IEEE Spectrum)
+ AI coding is now everywhere. But not everyone is convinced. (MIT Technology Review)

8 How to save wine from wildfires
Smoke leaves the alcohol with an ashy taste, but a group of scientists are working on a solution. (New Yorker $)

9 Celebrity Letterboxd accounts are good fun
Unsurprisingly, a subset of web users have chosen to hound them. (NY Mag $)

10 Craigslist refuses to die
The old-school classifieds corner of the web still has a legion of diehard fans. (Wired $)

Quote of the day

“Tools like Grok now risk bringing sexual AI imagery of children into the mainstream. The harms are rippling out.”

—Ngaire Alexander, head of the Internet Watch Foundation’s reporting hotline, explains the dangers around low-moderation AI tools like Grok to the Wall Street Journal.

One more thing

How to measure the returns on R&D spending

Given the draconian cuts to US federal funding for science, it’s worth asking some hard-nosed money questions: How much should we be spending on R&D? How much value do we get out of such investments, anyway?
To answer that, in several recent papers, economists have approached this issue in clever new ways.  And, though they ask slightly different questions,



How to Build Portable, In-Database Feature Engineering Pipelines with Ibis Using Lazy Python APIs and DuckDB Execution

In this tutorial, we demonstrate how we use Ibis to build a portable, in-database feature engineering pipeline that looks and feels like Pandas but executes entirely inside the database. We show how we connect to DuckDB, register data safely inside the backend, and define complex transformations using window functions and aggregations without ever pulling raw data into local memory. By keeping all transformations lazy and backend-agnostic, we demonstrate how to write analytics code once in Python and rely on Ibis to translate it into efficient SQL. Check out the FULL CODES here.

!pip -q install "ibis-framework[duckdb,examples]" duckdb pyarrow pandas

import ibis
from ibis import _

print("Ibis version:", ibis.__version__)

con = ibis.duckdb.connect()
ibis.options.interactive = True

We install the required libraries and initialize the Ibis environment. We establish a DuckDB connection and enable interactive execution so that all subsequent operations remain lazy and backend-driven.

try:
    base_expr = ibis.examples.penguins.fetch(backend=con)
except TypeError:
    base_expr = ibis.examples.penguins.fetch()

if "penguins" not in con.list_tables():
    try:
        con.create_table("penguins", base_expr, overwrite=True)
    except Exception:
        con.create_table("penguins", base_expr.execute(), overwrite=True)

t = con.table("penguins")
print(t.schema())

We load the Penguins dataset and explicitly register it inside the DuckDB catalog to ensure it is available for SQL execution. We verify the table schema and confirm that the data now lives inside the database rather than in local memory.
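Before building the full pipeline, the lazy, compile-to-SQL idea behind Ibis can be illustrated with a toy deferred expression. This sketch is purely illustrative and does not reflect Ibis internals; the `Deferred` class and its methods are assumptions made for the example:

```python
class Deferred:
    """Toy deferred column expression: records operations instead of
    executing them, then renders a SQL fragment on demand."""

    def __init__(self, sql):
        self.sql = sql

    def __truediv__(self, other):
        # Dividing two deferred columns builds a new deferred expression;
        # no arithmetic happens here.
        return Deferred(f"({self.sql} / {other.sql})")

    def mean(self):
        return Deferred(f"AVG({self.sql})")

    def compile(self):
        # Nothing was computed until now; we only assembled a SQL string
        # that a backend such as DuckDB could execute.
        return self.sql

ratio = Deferred("bill_length_mm") / Deferred("bill_depth_mm")
print(ratio.mean().compile())  # AVG((bill_length_mm / bill_depth_mm))
```

Real Ibis expressions work analogously: operations on table columns accumulate into an expression tree, and the backend translates the whole tree into a single SQL query only when results are requested.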
def penguin_feature_pipeline(penguins):
    base = penguins.mutate(
        bill_ratio=_.bill_length_mm / _.bill_depth_mm,
        is_male=(_.sex == "male").ifelse(1, 0),
    )
    cleaned = base.filter(
        _.bill_length_mm.notnull()
        & _.bill_depth_mm.notnull()
        & _.body_mass_g.notnull()
        & _.flipper_length_mm.notnull()
        & _.species.notnull()
        & _.island.notnull()
        & _.year.notnull()
    )
    w_species = ibis.window(group_by=[cleaned.species])
    w_island_year = ibis.window(
        group_by=[cleaned.island],
        order_by=[cleaned.year],
        preceding=2,
        following=0,
    )
    feat = cleaned.mutate(
        species_avg_mass=cleaned.body_mass_g.mean().over(w_species),
        species_std_mass=cleaned.body_mass_g.std().over(w_species),
        mass_z=(
            cleaned.body_mass_g - cleaned.body_mass_g.mean().over(w_species)
        ) / cleaned.body_mass_g.std().over(w_species),
        island_mass_rank=cleaned.body_mass_g.rank().over(
            ibis.window(group_by=[cleaned.island])
        ),
        rolling_3yr_island_avg_mass=cleaned.body_mass_g.mean().over(w_island_year),
    )
    return (
        feat.group_by(["species", "island", "year"])
        .agg(
            n=feat.count(),
            avg_mass=feat.body_mass_g.mean(),
            avg_flipper=feat.flipper_length_mm.mean(),
            avg_bill_ratio=feat.bill_ratio.mean(),
            avg_mass_z=feat.mass_z.mean(),
            avg_rolling_3yr_mass=feat.rolling_3yr_island_avg_mass.mean(),
            pct_male=feat.is_male.mean(),
        )
        .order_by(["species", "island", "year"])
    )

We define a reusable feature engineering pipeline using pure Ibis expressions. We compute derived features, apply data cleaning, and use window functions and grouped aggregations to build advanced, database-native features while keeping the entire pipeline lazy.

features = penguin_feature_pipeline(t)
print(con.compile(features))

try:
    df = features.to_pandas()
except Exception:
    df = features.execute()

display(df.head())

We invoke the feature pipeline and compile it into DuckDB SQL to validate that all transformations are pushed down to the database.
We then run the pipeline and return only the final aggregated results for inspection.

con.create_table("penguin_features", features, overwrite=True)
feat_tbl = con.table("penguin_features")

try:
    preview = feat_tbl.limit(10).to_pandas()
except Exception:
    preview = feat_tbl.limit(10).execute()

display(preview)

out_path = "/content/penguin_features.parquet"
con.raw_sql(f"COPY penguin_features TO '{out_path}' (FORMAT PARQUET);")
print(out_path)

We materialize the engineered features as a table directly inside DuckDB and query it lazily for verification. We also export the results to a Parquet file, demonstrating how we can hand off database-computed features to downstream analytics or machine learning workflows.

In conclusion, we constructed, compiled, and executed an advanced feature engineering workflow fully inside DuckDB using Ibis. We demonstrated how to inspect the generated SQL, materialize results directly in the database, and export them for downstream use while preserving portability across analytical backends. This approach reinforces the core idea behind Ibis: we keep computation close to the data, minimize unnecessary data movement, and maintain a single, reusable Python codebase that scales from local experimentation to production databases.

Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post How to Build Portable, In-Database Feature Engineering Pipelines with Ibis Using Lazy Python APIs and DuckDB Execution appeared first on MarkTechPost.



Meta and Harvard Researchers Introduce the Confucius Code Agent (CCA): A Software Engineering Agent that can Operate at Large-Scale Codebases

How far can a mid-sized language model go if the real innovation moves from the backbone into the agent scaffold and tool stack? Meta and Harvard researchers have released the Confucius Code Agent, an open-sourced AI software engineer built on the Confucius SDK that is designed for industrial-scale software repositories and long-running sessions. The system targets real GitHub projects, complex test toolchains at evaluation time, and reproducible results on benchmarks such as SWE-Bench Pro and SWE-Bench Verified, while exposing the full scaffold for developers.

https://arxiv.org/pdf/2512.10398

Confucius SDK, scaffolding around the model

The Confucius SDK is an agent development platform that treats scaffolding as a primary design problem rather than a thin wrapper around a language model. It is organized around 3 axes: Agent Experience, User Experience, and Developer Experience. Agent Experience controls what the model sees, including context layout, working memory, and tool results. User Experience focuses on readable traces, code diffs, and safeguards for human engineers. Developer Experience focuses on observability, configuration, and debugging of the agent itself.

The SDK introduces 3 core mechanisms: a unified orchestrator with hierarchical working memory, a persistent note-taking system, and a modular extension interface for tools. A meta agent then automates synthesis and refinement of agent configurations through a build, test, improve loop. The Confucius Code Agent is one concrete instantiation of this scaffold for software engineering.

Hierarchical working memory for long-horizon coding

Real software tasks on SWE-Bench Pro often require reasoning over dozens of files and many interaction steps. The orchestrator in Confucius SDK maintains hierarchical working memory, which partitions a trajectory into scopes, summarizes past steps, and keeps compressed context for later turns.
This design helps keep prompts within model context limits while preserving important artifacts such as patches, error logs, and design decisions. The key point is that effective tool-based coding agents need an explicit memory architecture, not just a sliding window of previous messages.

Persistent note-taking for cross-session learning

The second mechanism is a note-taking system that uses a dedicated agent to write structured Markdown notes from execution traces. These notes capture task-specific strategies, repository conventions, and common failure modes, and they are stored as long-term memory that can be reused across sessions. The research team ran Confucius Code Agent twice on 151 SWE-Bench Pro instances with Claude 4.5 Sonnet. On the first run the agent solves tasks from scratch and generates notes. On the second run the agent reads these notes. In this setting, average turns drop from 64 to 61, token usage drops from about 104k to 93k, and Resolve@1 improves from 53.0 to 54.4. This shows that notes are not just logs; they function as effective cross-session memory.

Modular extensions and tool-use sophistication

Confucius SDK exposes tools as extensions, for example file editing, command execution, test runners, and code search. Each extension can maintain its own state and prompt wiring. The research team studies the impact of tool-use sophistication using an ablation on a 100-example subset of SWE-Bench Pro. With Claude 4 Sonnet, moving from a configuration without advanced context features to one with advanced context raises Resolve@1 from 42.0 to 48.6. With Claude 4.5 Sonnet, a simple tool-use configuration reaches 44.0, while richer tool handling reaches 51.6, with 51.0 for an intermediate variant. These numbers indicate that how the agent chooses and sequences tools matters almost as much as the backbone model choice.
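The persistent note-taking mechanism can be sketched as a small store that keeps structured Markdown notes between sessions. Everything in this sketch (the `NoteStore` class, the JSON layout, the file name) is an illustrative assumption, not the paper's actual implementation:

```python
import json
from pathlib import Path

class NoteStore:
    """Illustrative cross-session memory: structured Markdown notes
    keyed by repository, persisted to disk between agent runs."""

    def __init__(self, path="agent_notes.json"):
        self.path = Path(path)
        # Load any notes written by earlier sessions.
        self.notes = json.loads(self.path.read_text()) if self.path.exists() else {}

    def add(self, repo, note_md):
        # In the paper, a dedicated note-taking agent distills these from
        # execution traces; here we simply append a Markdown string.
        self.notes.setdefault(repo, []).append(note_md)
        self.path.write_text(json.dumps(self.notes, indent=2))

    def context_for(self, repo):
        # A later session would prepend the stored notes to the prompt.
        return "\n\n".join(self.notes.get(repo, []))

store = NoteStore()
store.add("acme/webapp", "## Convention\nTests live in tests/ and run with pytest.")
print(store.context_for("acme/webapp"))
```

The point of the sketch is the lifecycle: notes written in one run survive the process and are injected into the context of the next run, which is what lets the second pass over the same benchmark instances use fewer turns and tokens.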
Meta agent for automatic agent design

On top of these mechanisms, the Confucius SDK includes a meta agent that takes a natural-language specification of an agent and iteratively proposes configurations, prompts, and extension sets. It then runs the candidate agent on tasks, inspects traces and metrics, and edits the configuration in a build, test, improve loop. The Confucius Code Agent that the research team evaluates is produced with the help of this meta agent, rather than only hand-tuned. This approach turns some of the agent engineering process itself into an LLM-guided optimization problem.

Results on SWE-Bench Pro and SWE-Bench Verified

The main evaluation uses SWE-Bench Pro, which has 731 GitHub issues that require modifying real repositories until tests pass. All compared systems share the same repositories, tool environment, and evaluation harness, so differences come from the scaffolds and models. On SWE-Bench Pro, the reported Resolve@1 scores are:

Claude 4 Sonnet with SWE-Agent: 42.7
Claude 4 Sonnet with Confucius Code Agent: 45.5
Claude 4.5 Sonnet with SWE-Agent: 43.6
Claude 4.5 Sonnet with Live SWE-Agent: 45.8
Claude 4.5 Sonnet with Confucius Code Agent: 52.7
Claude 4.5 Opus with Anthropic system card scaffold: 52.0
Claude 4.5 Opus with Confucius Code Agent: 54.3

These results show that a strong scaffold with a mid-tier model, Claude 4.5 Sonnet with Confucius Code Agent at 52.7, can outperform a stronger model with a weaker scaffold, Claude 4.5 Opus with 52.0. On SWE-Bench Verified, Confucius Code Agent with Claude 4 Sonnet reaches Resolve@1 74.6, compared to 66.6 for SWE-Agent and 72.8 for OpenHands. A mini-SWE-Agent variant with Claude 4.5 Sonnet reaches 70.6, which is also below Confucius Code Agent with Claude 4 Sonnet. The research team also reports performance as a function of edited file count.
For tasks editing 1 to 2 files, Confucius Code Agent reaches 57.8 Resolve@1; for 3 to 4 files it reaches 49.2; for 5 to 6 files, 44.1; for 7 to 10 files, 52.6; and for more than 10 files, 44.4. This indicates stable behavior on multi-file changes in large codebases.

Key Takeaways

Scaffolding can outweigh model size: Confucius Code Agent shows that with strong scaffolding, Claude 4.5 Sonnet reaches 52.7 Resolve@1 on SWE-Bench Pro, surpassing Claude 4.5 Opus with a weaker scaffold at 52.0.

Hierarchical working memory is essential for long-horizon coding: The Confucius SDK orchestrator uses hierarchical working memory and context compression to manage long trajectories



LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation

arXiv:2511.06346v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) perform well on standard reasoning and question-answering benchmarks, yet such evaluations often fail to capture their ability to handle long-tail, expertise-intensive knowledge in real-world professional scenarios. We introduce LPFQA, a long-tail knowledge benchmark derived from authentic professional forum discussions, covering 7 academic and industrial domains with 430 curated tasks grounded in practical expertise. LPFQA evaluates specialized reasoning, domain-specific terminology understanding, and contextual interpretation, and adopts a hierarchical difficulty structure to ensure semantic clarity and uniquely identifiable answers. Experiments on multiple mainstream LLMs reveal substantial performance gaps, particularly on tasks requiring deep domain reasoning, exposing limitations overlooked by existing benchmarks. Overall, LPFQA provides an authentic and discriminative evaluation framework that complements prior benchmarks and informs future LLM development.

LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation Read Post »

AI, Committee, News, Uncategorized

Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval

arXiv:2601.04742v1 Announce Type: new Abstract: Large Language Models (LLMs) suffer from hallucinations and factual inaccuracies, especially in complex reasoning and fact verification tasks. Multi-Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue, promoting diverse reasoning and mutual verification. However, existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations. While MADKE introduces external evidence to mitigate this, its one-time retrieval mechanism limits adaptability to new arguments or emerging information during the debate. To address these limitations, we propose Tool-MAD, a multi-agent debate framework that enhances factual verification by assigning each agent a distinct external tool, such as a search API or RAG module. Tool-MAD introduces three key innovations: (1) a multi-agent debate framework where agents leverage heterogeneous external tools, encouraging diverse perspectives, (2) an adaptive query formulation mechanism that iteratively refines evidence retrieval based on the flow of the debate, and (3) the integration of Faithfulness and Answer Relevance scores into the final decision process, allowing the Judge agent to quantitatively assess the coherence and question alignment of each response and effectively detect hallucinations. Experimental results on four fact verification benchmarks demonstrate that Tool-MAD consistently outperforms state-of-the-art MAD frameworks, achieving up to 5.5% accuracy improvement. Furthermore, in medically specialized domains, Tool-MAD exhibits strong robustness and adaptability across various tool configurations and domain conditions, confirming its potential for broader real-world fact-checking applications.
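The debate loop described in the abstract (heterogeneous tools per agent, queries refined round by round, a judge scoring the transcript) can be sketched as follows. Everything here is a stub: real agents would be LLM calls, the tools would be a live search API and a RAG module, and the judge would compute actual Faithfulness and Answer Relevance scores.

```python
# Minimal sketch of a tool-augmented multi-agent debate round, loosely
# following the Tool-MAD description; names and logic are illustrative.
def debate(claim, agents, judge, rounds=2):
    transcript = []
    for _ in range(rounds):
        for name, tool in agents:
            # Adaptive query formulation: refine the query using the
            # latest argument on the transcript, if any.
            query = claim if not transcript else f"{claim} | rebut: {transcript[-1]}"
            evidence = tool(query)
            transcript.append(f"{name}: {evidence}")
    return judge(transcript)

# Stub tools standing in for a search API and a RAG module.
search = lambda q: "search says: supported"
rag    = lambda q: "corpus says: supported"

# Stub judge: majority vote over "supported" mentions, standing in for
# faithfulness / answer-relevance scoring of each response.
judge = lambda t: "SUPPORTED" if sum("supported" in m for m in t) > len(t) / 2 else "REFUTED"

verdict = debate("The Eiffel Tower is in Paris", [("A", search), ("B", rag)], judge)
print(verdict)  # SUPPORTED
```

The structural point is that retrieval happens inside the loop, so each round can query for evidence about the opponent's latest argument, unlike a one-time retrieval before the debate starts.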

Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval Read Post »

AI, Committee, News, Uncategorized

SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation

arXiv:2601.04638v1 Announce Type: new Abstract: Medical consultations are intrinsically speech-centric. However, most prior works focus on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more natural speech-based interaction, yet the scarcity of medical speech data and the inefficiency of directly fine-tuning on speech data jointly hinder the adoption of SpeechLMs in medical consultation. In this paper, we propose SpeechMedAssist, a SpeechLM natively capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm consisting of (1) Knowledge & Capability Injection via Text and (2) Modality Re-alignment with Limited Speech Data, thereby reducing the requirement for medical speech data to only 10k synthesized samples. To evaluate SpeechLMs for medical consultation scenarios, we design a benchmark comprising both single-turn question answering and multi-turn simulated interactions. Experimental results show that our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.

SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation Read Post »

AI, Committee, News, Uncategorized

Rate or Fate? RLVεR: Reinforcement Learning with Verifiable Noisy Rewards

arXiv:2601.04411v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) is a simple but powerful paradigm for training LLMs: sample a completion, verify it, and update. In practice, however, the verifier is almost never clean: unit tests probe only limited corner cases; human and synthetic labels are imperfect; and LLM judges (e.g., RLAIF) are noisy and can be exploited. This problem worsens on harder domains (especially coding) where tests are sparse and increasingly model-generated. We ask a pragmatic question: does verification noise merely slow down learning (rate), or can it flip the outcome (fate)? To address this, we develop an analytically tractable multi-armed bandit view of RLVR dynamics, instantiated with GRPO and validated in controlled experiments. Modeling false positives and false negatives and grouping completions into recurring reasoning modes yields a replicator-style (natural-selection) flow on the probability simplex. The dynamics decouple into within-correct-mode competition and a one-dimensional evolution for the mass on incorrect modes, whose drift is determined solely by Youden’s index J = TPR − FPR. This yields a sharp phase transition: when J > 0, the incorrect mass is driven toward extinction (learning); when J = 0, the process is neutral; and when J > 0, noise primarily rescales convergence time (“rate, not fate”). Experiments on verifiable programming tasks under synthetic noise reproduce the predicted J = 0 boundary. Beyond noise, the framework offers a general lens for analyzing RLVR stability, convergence, and algorithmic interventions.
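The phase transition the abstract describes is easy to see numerically. Below is a toy discretization of the replicator-style flow for the incorrect-mode mass q, where the drift depends only on Youden's index J = TPR − FPR; the update rule and step size are illustrative choices, not the paper's exact model:

```python
# Toy replicator dynamics for the incorrect-mode mass q under a noisy
# verifier: incorrect modes earn expected reward FPR, correct modes TPR,
# and the flow pushes mass toward the higher-reward side.
def evolve(q0, tpr, fpr, steps=2000, lr=0.05):
    j = tpr - fpr                 # Youden's index J = TPR - FPR
    q = q0
    for _ in range(steps):
        # Replicator-style update: dq ∝ q(1-q)(FPR - TPR) = -q(1-q)J.
        q += lr * q * (1 - q) * (-j)
    return q

print(evolve(0.5, tpr=0.9, fpr=0.3))  # J > 0: incorrect mass driven toward 0
print(evolve(0.5, tpr=0.6, fpr=0.6))  # J = 0: neutral, mass stays at 0.5
print(evolve(0.5, tpr=0.3, fpr=0.9))  # J < 0: incorrect mass takes over
```

In the J > 0 regime, changing TPR and FPR while keeping J fixed only rescales how fast q decays, which is the "rate, not fate" behavior: noise slows learning but does not flip the outcome.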

Rate or Fate? RLVεR: Reinforcement Learning with Verifiable Noisy Rewards Read Post »

AI, Committee, News, Uncategorized

Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

arXiv:2512.21002v2 Announce Type: replace Abstract: Distilling the capabilities from a large reasoning model (LRM) to a smaller student model often involves training on substantial amounts of reasoning data. However, knowledge distillation (KD) over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) sections makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different sections (P, CoT, A) affects student performance. Our analysis shows that selective KD over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that beyond a specific length, longer training sequences provide marginal returns for downstream performance but require substantially higher memory and FLOPs. To this end, training on only the first 50% of tokens of every training sequence can retain, on average, approximately 91% of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about 50% each. Codes are available at https://github.com/weiruichen01/distilling-the-essence.
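The two ingredients described in the abstract, truncating each training sequence to its first half and supervising only the CoT tokens, amount to a simple preprocessing step. The token IDs, section tags, and function name below are invented for illustration and are not the paper's released code:

```python
# Illustrative preprocessing in the spirit of the paper: keep only the
# first keep_frac of each training sequence and build a loss mask that
# supervises only the CoT tokens.
def truncate_for_kd(tokens, sections, keep_frac=0.5, supervise="cot"):
    """tokens: list of token ids; sections: parallel list of
    'prompt'/'cot'/'answer' tags. Returns (kept_tokens, loss_mask)."""
    n = max(1, int(len(tokens) * keep_frac))
    kept, tags = tokens[:n], sections[:n]
    loss_mask = [1 if tag == supervise else 0 for tag in tags]
    return kept, loss_mask

tokens   = [101, 7, 8, 9, 10, 11, 12, 102]
sections = ["prompt", "prompt", "cot", "cot", "cot", "cot", "answer", "answer"]
kept, mask = truncate_for_kd(tokens, sections)
print(kept, mask)  # → [101, 7, 8, 9] [0, 0, 1, 1]
```

Because sequence length enters attention cost roughly quadratically and activation memory linearly, halving every sequence is where the reported savings in time, memory, and FLOPs come from.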

Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation Read Post »
