News Archives - 第4页共97页

DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

admin NU / 10 月 21, 2025

arXiv:2510.09211v2 Announce Type: replace Abstract: When performing reasoning tasks with user-specific requirements, such as strict output formats, large language models (LLMs) often prioritize reasoning over adherence to detailed instructions. Fine-tuning LLMs on supervised datasets to address this is impractical due to high computational costs and limited parameter access. To tackle this, we propose DICE, a lightweight framework that guides small language models (SLMs) to refine LLMs’ outputs through chain-of-thought (CoT) correction. DICE decouples the process by first prompting LLMs to generate natural language responses, then using trained SLMs to analyze and refine these outputs to meet structured output specifications. This framework preserves LLMs’ broad knowledge and reasoning capabilities while ensuring the outputs conform to user demands. Specifically, DICE first constructs structured CoT adaptation datasets via a two-stage method and subsequently applies a dual-tuning strategy to fine-tune SLMs for generating structured outputs in an analyze-then-answer pattern. Experiments demonstrate that DICE improves the average format accuracy and content correctness of LLM outputs by 35.4% and 29.4%, respectively, achieving state-of-the-art (SOTA) performance over other competitive baselines.

DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction Read Post »

AI, Committee, 新闻, Uncategorized

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

admin NU / 10 月 21, 2025

arXiv:2507.00432v2 Announce Type: replace-cross Abstract: Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning Read Post »

AI, Committee, 新闻, Uncategorized

Google AI Research Releases DeepSomatic: A New AI Model that Identifies Cancer Cell Genetic Variants

admin NU / 10 月 21, 2025

A team of researchers from Google Research and UC Santa Cruz released DeepSomatic, an AI model that identifies cancer cell genetic variants. In research with Children’s Mercy, it found 10 variants in pediatric leukemia cells missed by other tools. DeepSomatic has a somatic small variant caller for cancer genomes that works across Illumina short reads, PacBio HiFi long reads, and Oxford Nanopore long reads. The method extends DeepVariant, detects single nucleotide variants and small insertions and deletions in whole genome and whole exome data, and supports tumor normal and tumor only workflows, including FFPE models. https://research.google/blog/using-ai-to-identify-genetic-variants-in-tumors-with-deepsomatic/?utm_source=twitter&utm_medium=social&utm_campaign=social_post&utm_content=gr-acct How It Works? DeepSomatic converts aligned reads into image like tensors that encode pileups, base qualities, and alignment context. A convolutional neural network classifies candidate sites as somatic or not and the pipeline emits VCF or gVCF. This design is platform agnostic because the tensor summarizes local haplotype and error patterns across technologies. Google researchers describe the approach and its focus on distinguishing inherited and acquired variants including difficult samples such as glioblastoma and pediatric leukemia. Datasets and Benchmarking Training and evaluation use CASTLE, Cancer Standards Long read Evaluation. CASTLE contains 6 matched tumor and normal cell line pairs that were whole genome sequenced on Illumina, PacBio HiFi, and Oxford Nanopore. The research team releases benchmark sets and accessions for reuse. This fills a gap in multi technology somatic training and testing resources. https://research.google/blog/using-ai-to-identify-genetic-variants-in-tumors-with-deepsomatic/?utm_source=twitter&utm_medium=social&utm_campaign=social_post&utm_content=gr-acct Reported Results The research team report consistent gains over widely used methods in both single nucleotide variants and indels. On Illumina indels, the next best method is about 80 percent F1, DeepSomatic is about 90 percent. On PacBio indels, the next best method is under 50 percent, DeepSomatic is above 80 percent. Baselines include SomaticSniper, MuTect2, and Strelka2 for short reads and ClairS for long reads. The study reports 329,011 somatic variants across the reference lines and an additional preserved sample. Google research team reports that DeepSomatic outperforms current methods with particular strength on indels. https://research.google/blog/using-ai-to-identify-genetic-variants-in-tumors-with-deepsomatic/?utm_source=twitter&utm_medium=social&utm_campaign=social_post&utm_content=gr-acct Generalization to Real Samples The research team evaluates transfer to cancers beyond the training set. A glioblastoma sample shows recovery of known drivers. Pediatric leukemia samples test the tumor only mode where a clean normal is not available. The tool recovers known calls and reports additional variants in that cohort. These studies indicate the representation and training scheme generalize to new disease contexts and to settings without matched normals. Key Takeaways DeepSomatic detects somatic SNVs (single nucleotide variants) and indels across Illumina, PacBio HiFi, and Oxford Nanopore, and builds on the DeepVariant methodology. The pipeline supports tumor normal and tumor only workflows, includes FFPE WGS and WES models, and is released on GitHub. It encodes read pileups as image like tensors and uses a convolutional neural network to classify somatic sites and emit VCF or gVCF. Training and evaluation use the CASTLE dataset with 6 matched tumor normal cell line pairs sequenced on three platforms, with benchmarks and accessions provided. Reported results show about 90 percent indel F1 on Illumina and above 80 percent on PacBio, outperforming common baselines, with 329,011 somatic variants identified across reference samples. Editorial Comments DeepSomatic is a pragmatic step for somatic variant calling across sequencing platforms, the model keeps DeepVariant’s image tensor representation and a convolutional neural network, so the same architecture scales from Illumina to PacBio HiFi to Oxford Nanopore with consistent preprocessing and outputs. The CASTLE dataset is the right move, it supplies matched tumor and normal cell lines across 3 technologies, which strengthens training and benchmarking and aids reproducibility. Reported results emphasize indel accuracy, about 90% F1 on Illumina and more than 80% on PacBio against lower baselines, which addresses a long running weakness in indel detection. The pipeline supports WGS and WES, tumor normal and tumor only, and FFPE, which matches real laboratory constraints. Check out the Technical Paper, Technical details, Dataset and GitHub Repo. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. The post Google AI Research Releases DeepSomatic: A New AI Model that Identifies Cancer Cell Genetic Variants appeared first on MarkTechPost.

Google AI Research Releases DeepSomatic: A New AI Model that Identifies Cancer Cell Genetic Variants Read Post »

AI, Committee, 新闻, Uncategorized

Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics

admin NU / 10 月 20, 2025

arXiv:2510.15345v1 Announce Type: new Abstract: Automatic readability assessment plays a key role in ensuring effective and accessible written communication. Despite significant progress, the field is hindered by inconsistent definitions of readability and measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 897 judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Our results show that four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 8.6. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.

Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics Read Post »

AI, Committee, 新闻, Uncategorized

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

admin NU / 10 月 20, 2025

arXiv:2510.15545v1 Announce Type: new Abstract: Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the herd of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining and modification. We conduct comprehensive experiments on various tasks, demonstrating 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs Read Post »

AI, Committee, 新闻, Uncategorized

Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

admin NU / 10 月 20, 2025

arXiv:2510.15313v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit “echo chamber” effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and limitations of current capabilities of LLMs as proxy for literacy generation and the limited evaluation practices, thereby demonstrating the continued need of hybrid validation from both humans and models in culturally and technically complex creative tasks.

Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry Read Post »

AI, Committee, 新闻, Uncategorized

Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation

admin NU / 10 月 20, 2025

arXiv:2510.15624v1 Announce Type: cross Abstract: The automation of scientific discovery represents a critical milestone in Artificial Intelligence (AI) research. However, existing agentic systems for science suffer from two fundamental limitations: rigid, pre-programmed workflows that cannot adapt to intermediate findings, and inadequate context management that hinders long-horizon research. We present texttt{freephdlabor}, an open-source multiagent framework featuring textit{fully dynamic workflows} determined by real-time agent reasoning and a coloremph{textit{modular architecture}} enabling seamless customization — users can modify, add, or remove agents to address domain-specific requirements. The framework provides comprehensive infrastructure including textit{automatic context compaction}, textit{workspace-based communication} to prevent information degradation, textit{memory persistence} across sessions, and textit{non-blocking human intervention} mechanisms. These features collectively transform automated research from isolated, single-run attempts into textit{continual research programs} that build systematically on prior explorations and incorporate human feedback. By providing both the architectural principles and practical implementation for building customizable co-scientist systems, this work aims to facilitate broader adoption of automated research across scientific domains, enabling practitioners to deploy interactive multiagent systems that autonomously conduct end-to-end research — from ideation through experimentation to publication-ready manuscripts.

Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation Read Post »

AI, Committee, 新闻, Uncategorized

Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

admin NU / 10 月 20, 2025

arXiv:2504.20094v2 Announce Type: replace-cross Abstract: Conversational recommender systems (CRS) have advanced with large language models, showing strong results in domains like movies. These domains typically involve fixed content and passive consumption, where user preferences can be matched by genre or theme. In contrast, games present distinct challenges: fast-evolving catalogs, interaction-driven preferences (e.g., skill level, mechanics, hardware), and increased risk of unsafe responses in open-ended conversation. We propose MATCHA, a multi-agent framework for CRS that assigns specialized agents for intent parsing, tool-augmented retrieval, multi-LLM ranking with reflection, explanation, and risk control which enabling finer personalization, long-tail coverage, and stronger safety. Evaluated on real user request dataset, MATCHA outperforms six baselines across eight metrics, improving Hit@5 by 20%, reducing popularity bias by 24%, and achieving 97.9% adversarial defense. Human and virtual-judge evaluations confirm improved explanation quality and user alignment.

Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition Read Post »

AI, Committee, 新闻, Uncategorized

AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters

admin NU / 10 月 19, 2025

Are your LLM code benchmarks actually rejecting wrong-complexity solutions and interactive-protocol violations, or are they passing under-specified unit tests? A team of researchers from UCSD, NYU, University of Washington, Princeton University, Canyon Crest Academy, OpenAI, UC Berkeley, MIT, University of Waterloo, and Sentient Labs introduce AutoCode, a new AI framework that lets LLMs create and verify competitive programming problems, mirroring the workflow of human problem setters. AutoCode reframes evaluation for code-reasoning models by treating problem setting (not only problem solving) as the target task. The system trains LLMs to produce competition-grade statements, test data, and verdict logic that match official online judges at high rates. On a 7,538-problem benchmark built from prior datasets, AutoCode achieves 91.1% consistency with official judgments (FPR 3.7%, FNR 14.1%). On a separate, more difficult 720 recent Codeforces problems (including interactive tasks), the full framework reports 98.7% consistency, 1.3% FPR, 1.2% FNR. https://arxiv.org/pdf/2510.12803 Why problem setting matters for evaluation? Public code benchmarks often rely on under-specified tests that let wrong-complexity or shortcut solutions pass. That inflates scores and pollutes reinforcement signals (rewarding fragile tactics). AutoCode’s validator-first approach and adversarial test generation aim to reduce false positives (FPR)—incorrect programs that pass—and false negatives (FNR)—correct programs rejected due to malformed inputs. https://arxiv.org/pdf/2510.12803 The core loop: Validator → Generator → Checker AutoCode runs a closed loop that mirrors human contest workflows, but each step is selected from LLM-generated candidates using targeted in-framework tests. 1) Validator (minimize FNR by enforcing input legality) The system first asks an LLM to synthesize 40 evaluation inputs—10 valid and 30 near-valid illegal (e.g., off-by-one boundary violations). It then prompts the LLM for three candidate validator programs and selects the one that best classifies these cases. This prevents “correct” solutions from crashing on malformed data. https://arxiv.org/pdf/2510.12803 2) Generator (reduce FPR by adversarial coverage) Three complementary strategies produce test cases:• Small-data exhaustion for boundary coverage,• Randomized + extreme cases (overflows, precision, hash-collisions),• TLE-inducing structures to break wrong-complexity solutions. Invalid cases are filtered by the selected validator; then cases are deduplicated and bucket-balanced before sampling. https://arxiv.org/pdf/2510.12803 3) Checker (verdict logic) The checker compares contestant outputs with the reference solution under complex rules. AutoCode again generates 40 checker scenarios and three candidate checker programs, keeps only scenarios with validator-approved inputs, and selects the best checker by accuracy against the 40 labeled scenarios. https://arxiv.org/pdf/2510.12803 4) Interactor (for interactive problems) For tasks that require dialogue with the judge, AutoCode introduces a mutant-based interactor: it makes small logical edits (“mutants”) to the reference solution, selects interactors that accept the true solution but reject the mutants, maximizing discrimination. This addresses a gap in earlier public datasets that avoided interactives. https://arxiv.org/pdf/2510.12803 Dual verification enables new problems (not just tests for existing ones) AutoCode can generate novel problem variants starting from a random “seed” Codeforces problem (<2200 Elo). The LLM drafts a new statement and two solutions: an efficient reference and a simpler brute-force baseline. A problem is accepted only if the reference output matches brute force across the generated test suite (the brute force may TLE on large cases but serves as ground truth on small/exhaustive cases). This dual-verification protocol filters ~27% of error-prone items, lifting reference-solution correctness from 86% → 94% before human review. Human experts then grade the survivors on solvability, solution correctness, quality, novelty, difficulty. After filtering, 61.6% are usable for model training, 76.3% for human training, and 3.2% are ICPC/IOI-level problems. Difficulty typically increases relative to the seed, and difficulty gain correlates with perceived quality. https://arxiv.org/pdf/2510.12803 Understanding the results Existing problems (7,538 total; 195,988 human submissions). AutoCode: 91.1% consistency, 3.7% FPR, 14.1% FNR, vs 72.9–81.0% consistency for prior generators (CodeContests, CodeContests+, TACO, HardTests). Recent Codeforces problems (720, unfiltered; includes interactives). AutoCode: 98.7% consistency, 1.3% FPR, 1.2% FNR. Ablations show all three generator strategies and prompt optimization contribute: removing prompt optimization drops consistency to 98.0% and more than doubles FNR to 2.9%. https://arxiv.org/pdf/2510.12803 Key Takeaways AutoCode couples a Validator–Generator–Checker (+Interactor) loop with dual verification (reference vs. brute-force) to build contest-grade test suites and new problems. On held-out problems, AutoCode’s test suites reach ~99% consistency with official judges, surpassing prior generators like HardTests (<81%). For recent Codeforces tasks (including interactives), the full framework reports ~98.7% consistency with ~1.3% FPR and ~1.2% FNR. The mutant-based interactor reliably accepts the true solution while rejecting mutated variants, improving evaluation for interactive problems. Human experts rate a sizable fraction of AutoCode-generated items as training-usable and a non-trivial share as contest-quality, aligning with the LiveCodeBench Pro program’s aims. Editorial Comments AutoCode is a practical fix for current code benchmarks. It centers problem setting and uses a closed-loop Validator–Generator–Checker (+Interactor) pipeline with dual verification (reference vs. brute-force). This structure reduces false positives/negatives and yields judge-aligned consistency (≈99% on held-out problems; 98.7% on recent Codeforces, including interactives). The approach standardizes constraint legality, adversarial coverage, and protocol-aware judging, which makes downstream RL reward signals cleaner. Its placement under LiveCodeBench Pro fits a hallucination-resistant evaluation program that emphasizes expert-checked rigor. Check out the Paper and Project. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. The post AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters appeared first on MarkTechPost.

AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters Read Post »

AI, Committee, 新闻, Uncategorized

Abstract or die: Why AI enterprises can’t afford rigid vector stacks

admin NU / 10 月 19, 2025

Vector databases (DBs), once specialist research instruments, have become widely used infrastructure in just a few years. They power today’s semantic search, recommendation engines, anti-fraud measures and gen AI applications across industries. There are a deluge of options: PostgreSQL with pgvector, MySQL HeatWave, DuckDB VSS, SQLite VSS, Pinecone, Weaviate, Milvus and several others. The riches of choices sound like a boon to companies. But just beneath, a growing problem looms: Stack instability. New vector DBs appear each quarter, with disparate APIs, indexing schemes and performance trade-offs. Today’s ideal choice may look dated or limiting tomorrow. To business AI teams, volatility translates into lock-in risks and migration hell. Most projects begin life with lightweight engines like DuckDB or SQLite for prototyping, then move to Postgres, MySQL or a cloud-native service in production. Each switch involves rewriting queries, reshaping pipelines, and slowing down deployments. This re-engineering merry-go-round undermines the very speed and agility that AI adoption is supposed to bring. Why portability matters now Companies have a tricky balancing act: Experiment quickly with minimal overhead, in hopes of trying and getting early value; Scale safely on stable, production-quality infrastructure without months of refactoring; Be nimble in a world where new and better backends arrive nearly every month. Without portability, organizations stagnate. They have technical debt from recursive code paths, are hesitant to adopt new technology and cannot move prototypes to production at pace. In effect, the database is a bottleneck rather than an accelerator. Portability, or the ability to move underlying infrastructure without re-encoding the application, is ever more a strategic requirement for enterprises rolling out AI at scale. Abstraction as infrastructure The solution is not to pick the “perfect” vector database (there isn’t one), but to change how enterprises think about the problem. In software engineering, the adapter pattern provides a stable interface while hiding underlying complexity. Historically, we’ve seen how this principle reshaped entire industries: ODBC/JDBC gave enterprises a single way to query relational databases, reducing the risk of being tied to Oracle, MySQL or SQL Server; Apache Arrow standardized columnar data formats, so data systems could play nice together; ONNX created a vendor-agnostic format for machine learning (ML) models, bringing TensorFlow, PyTorch, etc. together; Kubernetes abstracted infrastructure details, so workloads could run the same everywhere on clouds; any-llm (Mozilla AI) now makes it possible to have one API across lots of large language model (LLM) vendors, so playing with AI is safer. All these abstractions led to adoption by lowering switching costs. They turned broken ecosystems into solid, enterprise-level infrastructure. Vector databases are also at the same tipping point. The adapter approach to vectors Instead of having application code directly bound to some specific vector backend, companies can compile against an abstraction layer that normalizes operations like inserts, queries and filtering. This doesn’t necessarily eliminate the need to choose a backend; it makes that choice less rigid. Development teams can start with DuckDB or SQLite in the lab, then scale up to Postgres or MySQL for production and ultimately adopt a special-purpose cloud vector DB without having to re-architect the application. Open source efforts like Vectorwrap are early examples of this approach, presenting a single Python API to Postgres, MySQL, DuckDB and SQLite. They demonstrate the power of abstraction to accelerate prototyping, reduce lock-in risk and support hybrid architectures employing numerous backends. Why businesses should care For leaders of data infrastructure and decision-makers for AI, abstraction offers three benefits: Speed from prototype to production Teams are able to prototype on lightweight local environments and scale without expensive rewrites. Reduced vendor risk Organizations can adopt new backends as they emerge without long migration projects by decoupling app code from specific databases. Hybrid flexibility Companies can mix transactional, analytical and specialized vector DBs under one architecture, all behind an aggregated interface. The result is data layer agility, and that’s more and more the difference between fast and slow companies. A broader movement in open source What’s happening in the vector space is one example of a bigger trend: Open-source abstractions as critical infrastructure. In data formats: Apache Arrow In ML models: ONNX In orchestration: Kubernetes In AI APIs: Any-LLM and other such frameworks These projects succeed, not by adding new capability, but by removing friction. They enable enterprises to move more quickly, hedge bets and evolve along with the ecosystem. Vector DB adapters continue this legacy, transforming a high-speed, fragmented space into infrastructure that enterprises can truly depend on. The future of vector DB portability The landscape of vector DBs will not converge anytime soon. Instead, the number of options will grow, and every vendor will tune for different use cases, scale, latency, hybrid search, compliance or cloud platform integration. Abstraction becomes strategy in this case. Companies adopting portable approaches will be capable of: Prototyping boldly Deploying in a flexible manner Scaling rapidly to new tech It’s possible we’ll eventually see a “JDBC for vectors,” a universal standard that codifies queries and operations across backends. Until then, open-source abstractions are laying the groundwork. Conclusion Enterprises adopting AI cannot afford to be slowed by database lock-in. As the vector ecosystem evolves, the winners will be those who treat abstraction as infrastructure, building against portable interfaces rather than binding themselves to any single backend. The decades-long lesson of software engineering is simple: Standards and abstractions lead to adoption. For vector DBs, that revolution has already begun. Mihir Ahuja is an AI/ML engineer and open-source contributor based in San Francisco.

Abstract or die: Why AI enterprises can’t afford rigid vector stacks Read Post »

新闻

DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Google AI Research Releases DeepSomatic: A New AI Model that Identifies Cancer Cell Genetic Variants

Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation

Toward Safe and Human-Aligned Game Conversational Recommendation via Multi-Agent Decomposition

AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters

Abstract or die: Why AI enterprises can’t afford rigid vector stacks

我们的服务

首页

工作原理

新闻

定价

支持

幫助中心

报告问题

提供反馈

隱私權政策

用户账户

关注我们