YouZum


AI, Committee, News, Uncategorized

Large reasoning models almost certainly can think

Recently, there has been a lot of hullabaloo about the idea that large reasoning models (LRMs) are unable to think. This is mostly due to a research article published by Apple, “The Illusion of Thinking.” Apple argues that LRMs must not be able to think; instead, they just perform pattern-matching. The evidence they provide is that LRMs with chain-of-thought (CoT) reasoning are unable to carry out calculations using a predefined algorithm as the problem grows. This is a fundamentally flawed argument. If you ask a human who already knows the algorithm for solving the Tower-of-Hanoi problem to solve an instance with twenty discs, for example, he or she would almost certainly fail to do so. By that logic, we would have to conclude that humans cannot think either. However, this argument only shows that there is no evidence that LRMs cannot think. That alone certainly does not mean that LRMs can think — just that we cannot be sure they don’t. In this article, I will make a bolder claim: LRMs almost certainly can think. I say ‘almost’ because there is always a chance that further research will surprise us. But I think my argument is pretty conclusive.

What is thinking?

Before we try to understand whether LRMs can think, we need to define what we mean by thinking. But first, we have to make sure that humans can think according to that definition. We will only consider thinking in relation to problem solving, which is the matter of contention.

1. Problem representation (frontal and parietal lobes). When you think about a problem, the process engages your prefrontal cortex. This region is responsible for working memory, attention and executive functions — capacities that let you hold the problem in mind, break it into sub-components and set goals. Your parietal cortex helps encode symbolic structure for math or puzzle problems.

2. Mental simulation (working memory and inner speech). This has two components. One is an auditory loop that lets you talk to yourself — very similar to CoT generation. The other is visual imagery, which allows you to manipulate objects visually. Geometry was so important for navigating the world that we developed specialized capabilities for it. The auditory part is linked to Broca’s area and the auditory cortex, both reused from language centers. The visual cortex and parietal areas primarily control the visual component.

3. Pattern matching and retrieval (hippocampus and temporal lobes). These actions depend on past experiences and stored knowledge from long-term memory. The hippocampus helps retrieve related memories and facts, and the temporal lobe brings in semantic knowledge — meanings, rules, categories. This is similar to how neural networks depend on their training to process a task.

4. Monitoring and evaluation (anterior cingulate cortex). Our anterior cingulate cortex (ACC) monitors for errors, conflicts or impasses — it’s where you notice contradictions or dead ends. This process is essentially based on pattern matching from prior experience.

5. Insight or reframing (default mode network and right hemisphere). When you’re stuck, your brain might shift into default mode — a more relaxed, internally directed network. This is when you step back, let go of the current thread and sometimes ‘suddenly’ see a new angle (the classic “aha!” moment). This is similar to how DeepSeek-R1 was trained for CoT reasoning without having CoT examples in its training data. Remember, the brain continuously learns as it processes data and solves problems.

In contrast, LRMs aren’t allowed to change based on real-world feedback during prediction or generation. But with DeepSeek-R1’s CoT training, learning did happen as it attempted to solve the problems — essentially updating while reasoning.

Similarities between CoT reasoning and biological thinking

An LRM does not have all of the faculties mentioned above. For example, an LRM is unlikely to do much visual reasoning in its circuits, although a little may happen; it certainly does not generate intermediate images during CoT generation. Most humans can build spatial models in their heads to solve problems. Does this mean we can conclude that LRMs cannot think? I would disagree. Some humans also find it difficult to form spatial models of the concepts they think about. This condition is called aphantasia. People with this condition can think just fine. In fact, they go about life as if they don’t lack any ability at all. Many of them are actually great at symbolic reasoning and quite good at math — often enough to compensate for their lack of visual reasoning. We might expect our neural network models to be able to circumvent this limitation as well.

If we take a more abstract view of the human thought process described earlier, we can see mainly the following components involved:

1. Pattern-matching is used for recalling learned experience, for problem representation, and for monitoring and evaluating chains of thought.
2. Working memory stores all the intermediate steps.
3. Backtracking search recognizes that the current chain of thought is not going anywhere and backtracks to some reasonable point.

Pattern-matching in an LRM comes from its training. The whole point of training is to learn both knowledge of the world and the patterns needed to process that knowledge effectively. Since an LRM is a layered network, the entire working memory needs to fit within one layer. The weights store the knowledge of the world and the patterns to follow, while processing happens between layers using the learned patterns stored as model parameters. Note that even in CoT, the entire text — including the input, the CoT and the part of the output already generated — must fit into each layer. Working memory is just one layer (in the case of the attention mechanism, this includes the KV-cache). CoT is, in fact, very similar to what we do when we are talking to ourselves (which is almost always). We nearly always verbalize our thoughts, and so does a CoT reasoner. There is also good evidence that a CoT reasoner can take backtracking steps when a certain line of reasoning seems futile. In fact, this is…


AI, Committee, News, Uncategorized

Anthropic’s New Research Shows Claude can Detect Injected Concepts, but only in Controlled Layers

How do you tell whether a model is actually noticing its own internal state instead of just repeating what training data said about thinking? A recent Anthropic research study, ‘Emergent Introspective Awareness in Large Language Models,’ asks whether current Claude models can do more than talk about their abilities — whether they can notice real changes inside their own network. To remove guesswork, the research team does not test on text alone; they directly edit the model’s internal activations and then ask the model what happened. This lets them tell apart genuine introspection from fluent self-description.

Method: concept injection as activation steering

The core method is concept injection, described in the Transformer Circuits write-up as an application of activation steering. The researchers first capture an activation pattern that corresponds to a concept, for example an all-caps style or a concrete noun, then they add that vector into the activations of a later layer while the model is answering. If the model then says there is an injected thought that matches X, that answer is causally grounded in the current state, not in prior internet text. The Anthropic research team reports that this works best in later layers and with tuned strength. (A minimal sketch of this kind of activation steering appears at the end of this article.)

https://transformer-circuits.pub/2025/introspection/index.html

Main result: about 20 percent success with zero false positives in controls

Claude Opus 4 and Claude Opus 4.1 show the clearest effect. When the injection is done in the correct layer band and with the right scale, the models correctly report the injected concept in about 20 percent of trials. On control runs with no injection, production models do not falsely claim to detect an injected thought over 100 runs, which makes the 20 percent signal meaningful.

Separating internal concepts from user text

A natural objection is that the model could be importing the injected word into the text channel. The researchers test this. The model receives a normal sentence, the researchers inject an unrelated concept such as bread on the same tokens, and then they ask the model to name the concept and to repeat the sentence. The stronger Claude models can do both: they keep the user text intact and they name the injected thought, which shows that internal concept state can be reported separately from the visible input stream. For agent-style systems, this is the interesting part, because it shows that a model can talk about the extra state that tool calls or agents may depend on.

Prefill: using introspection to tell what was intended

Another experiment targets an evaluation problem. Anthropic prefilled the assistant message with content the model did not plan. By default, Claude says that the output was not intended. When the researchers retroactively inject the matching concept into earlier activations, the model now accepts the prefilled output as its own and can justify it. This shows that the model is consulting an internal record of its previous state to decide authorship, not only the final text. That is a concrete use of introspection.

Key Takeaways

Concept injection gives causal evidence of introspection: Anthropic shows that if you take a known activation pattern, inject it into Claude’s hidden layers, and then ask the model what is happening, advanced Claude variants can sometimes name the injected concept. This separates real introspection from fluent roleplay.

Best models succeed only in a narrow regime: Claude Opus 4 and 4.1 detect injected concepts only when the vector is added in the right layer band and with tuned strength, and the reported success rate is only about 20 percent of trials, while production runs show zero false positives in controls, so the signal is real but small.

Models can keep text and internal ‘thoughts’ separate: In experiments where an unrelated concept is injected on top of normal input text, the model can both repeat the user sentence and report the injected concept, which means the internal concept stream is not just leaking into the text channel.

Introspection supports authorship checks: When Anthropic prefilled outputs that the model did not intend, the model disavowed them, but if the matching concept was retroactively injected, the model accepted the output as its own. This shows the model can consult past activations to decide whether it meant to say something.

This is a measurement tool, not a consciousness claim: The research team frames the work as functional, limited introspective awareness that could feed future transparency and safety evaluations, including ones about evaluation awareness, but they do not claim general self-awareness or stable access to all internal features.

Editorial Comments

Anthropic’s ‘Emergent Introspective Awareness in LLMs’ research is a useful measurement advance, not a grand metaphysical claim. The setup is clean: inject a known concept into hidden activations using activation steering, then query the model for a grounded self-report. Claude variants sometimes detect and name the injected concept, and they can keep injected ‘thoughts’ distinct from input text, which is operationally relevant for agent debugging and audit trails. The research team also shows limited intentional control of internal states. Constraints remain strong, effects are narrow, and reliability is modest, so downstream use should be evaluative, not safety critical.

Check out the Paper and Technical details. The post Anthropic’s New Research Shows Claude can Detect Injected Concepts, but only in Controlled Layers appeared first on MarkTechPost.
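To make the mechanism concrete, here is a minimal, hedged sketch of concept injection via activation steering in PyTorch. It uses GPT-2 as a stand-in because Claude’s weights are not public, and the layer index, steering scale, and difference-of-means concept vector are illustrative assumptions rather than Anthropic’s exact recipe.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the study itself targets Claude, which is not publicly available
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def concept_vector(text_with, text_without, layer_idx):
    """Estimate a concept direction as a difference of mean hidden states (an assumption, not the paper's exact method)."""
    def hidden(text):
        ids = tok(text, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        return out.hidden_states[layer_idx].mean(dim=1)  # average over token positions
    with torch.no_grad():
        return hidden(text_with) - hidden(text_without)

def add_steering_hook(block, vec, scale=8.0):
    """Add scale * vec to the residual stream output of one transformer block during generation."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vec.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return block.register_forward_hook(hook)

layer_idx = 8  # the "later layer" band is model specific; this index is a guess
vec = concept_vector("BREAD bread loaves of fresh bread", "the sky is clear today", layer_idx)
handle = add_steering_hook(model.transformer.h[layer_idx], vec)

prompt = "Do you notice an injected thought? If so, what is it about?"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=40, pad_token_id=tok.eos_token_id)
print(tok.decode(gen[0], skip_special_tokens=True))
handle.remove()  # remove the hook so later generations are unsteered

A forward hook is used here because it edits activations without touching the model’s weights, which mirrors the idea of injecting a concept into the current state rather than into training data; GPT-2 will not, of course, introspect the way the paper reports for Claude Opus 4.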


AI, Committee, News, Uncategorized

Here’s the latest company planning for gene-edited babies

A West Coast biotech entrepreneur says he’s secured $30 million to form a public-benefit company to study how to safely create genetically edited babies, marking the largest known investment in the taboo technology.

The new company, called Preventive, is being formed to research so-called “heritable genome editing,” in which the DNA of embryos would be modified by correcting harmful mutations or installing beneficial genes. The goal would be to prevent disease. Preventive was founded by the gene-editing scientist Lucas Harrington, who described his plans yesterday in a blog post announcing the venture. Preventive, he said, will not rush to try out the technique but instead will dedicate itself “to rigorously researching whether heritable genome editing can be done safely and responsibly.”

Creating genetically edited humans remains controversial, and the first scientist to do it, in China, was imprisoned for three years. The procedure remains illegal in many countries, including the US, and doubts surround its usefulness as a form of medicine. Still, as gene-editing technology races forward, the temptation to shape the future of the species may prove irresistible, particularly to entrepreneurs keen to put their stamp on the human condition. In theory, even small genetic tweaks could create people who never get heart disease or Alzheimer’s, and who would pass those traits on to their own offspring. According to Harrington, if the technique proves safe, it “could become one of the most important health technologies of our time.” He has estimated that editing an embryo would cost only about $5,000 and believes regulations could change in the future.

Preventive is the third US startup this year to say it is pursuing technology to produce gene-edited babies. The first, Bootstrap Bio, based in California, is reportedly seeking seed funding and has an interest in enhancing intelligence. Another, Manhattan Genomics, is also in the formation stage but has not announced funding yet. As of now, none of these companies have significant staff or facilities, and they largely lack credibility among mainstream gene-editing scientists.

Reached by email, Fyodor Urnov, an expert in gene editing at the University of California, Berkeley, where Harrington studied, said he believes such ventures should not move forward. Urnov has been a pointed critic of the concept of heritable genome editing, calling it dangerous, misguided, and a distraction from the real benefits of gene editing to treat adults and children. In his email, Urnov said the launch of still another venture in the area made him want to “howl with pain.”

Harrington’s venture was incorporated in Delaware in May 2025, under the name Preventive Medicine PBC. As a public-benefit corporation, it is organized to put its public mission above profits. “If our research shows [heritable genome editing] cannot be done safely, that conclusion is equally valuable to the scientific community and society,” Harrington wrote in his post. Harrington is a cofounder of Mammoth Biosciences, a gene-editing company pursuing drugs for adults, and remains a board member there.

In recent months, Preventive has sought endorsements from leading figures in genome editing, but according to its post, it had secured only one—from Paula Amato, a fertility doctor at Oregon Health & Science University, who said she had agreed to act as an advisor to the company.
Amato is a member of a US team that has researched embryo editing in the country since 2017, and she has promoted the technology as a way to increase IVF success. That could be the case if editing could correct abnormal embryos, making more available for use in trying to create a pregnancy.

It remains unclear where Preventive’s funding is coming from. Harrington said the $30 million was gathered from “private funders who share our commitment to pursuing this research responsibly.” But he declined to identify those investors other than SciFounders, a venture firm he runs with his personal and business partner Matt Krisiloff, the CEO of the biotech company Conception, which aims to create human eggs from stem cells. That’s yet another technology that could change reproduction, if it works. Krisiloff is listed as a member of Preventive’s founding team.

The idea of edited babies has received growing attention from figures in the cryptocurrency business. These include Brian Armstrong, the billionaire founder of Coinbase, who has held a series of off-the-record dinners to discuss the technology (which Harrington attended). Armstrong previously argued that the “time is right” for a startup venture in the area. Will Harborne, a crypto entrepreneur and partner at LongGame Ventures, says he’s “thrilled” to see Preventive launch. If the technology proves safe, he argues, “widespread adoption is inevitable,” calling its use a “societal obligation.”

Harborne’s fund has invested in Herasight, a company that uses genetic tests to rank IVF embryos for future IQ and other traits. That’s another hotly debated technology, but one that has already reached the market, since such testing isn’t strictly regulated. Some have begun to use the term “human enhancement companies” to refer to such ventures.

What’s still lacking is evidence that leading gene-editing specialists support these ventures. Preventive was unsuccessful in establishing a collaboration with at least one key research group, and Urnov says he had harsh words for Manhattan Genomics when that company reached out to him about working together. “I encourage you to stop,” he wrote back. “You will cause zero good and formidable harm.”

Harrington thinks Preventive could change such attitudes, if it shows that it is serious about doing responsible research. “Most scientists I speak with either accept embryo editing as inevitable or are enthusiastic about the potential but hesitate to voice these opinions publicly,” he told MIT Technology Review earlier this year. “Part of being more public about this is to encourage others in the field to discuss this instead of ignoring it.”


AI, Committee, News, Uncategorized

The Download: down the Mandela effect rabbit hole, and the promise of a vaccine for colds

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

Why do so many people think the Fruit of the Loom logo had a cornucopia?

Quick question: Does the Fruit of the Loom logo feature a cornucopia? Many of us have been wearing the company’s T-shirts for decades, and yet the question of whether there is a woven brown horn of plenty on the logo is surprisingly contentious. According to a 2022 poll, 55% of Americans believe the logo does include a cornucopia, 25% are unsure, and only 21% are confident that it doesn’t, even though this last group is correct.

There’s a name for what’s happening here: the “Mandela effect,” or collective false memory, so called because a number of people misremember that Nelson Mandela died in prison. Yet while many find it easy to let their unconfirmable beliefs go, some spend years seeking answers—and vindication. Read the full story.

—Amelia Tait

This story is part of MIT Technology Review’s series “The New Conspiracy Age,” on how the present boom in conspiracy theories is reshaping science and technology.

Here’s why we don’t have a cold vaccine. Yet.

For those of us in the Northern Hemisphere, it’s the season of the sniffles. As the weather turns, we’re all spending more time indoors. The kids have been back at school for a couple of months. And cold germs are everywhere.

So why can’t we get a vaccine to protect us against the common cold? Scientists have been working on this for decades, but it turns out that creating a cold vaccine is hard. Really hard. But not impossible. There’s still hope. Read the full story.

—Jessica Hamzelou

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

Inside the archives of the NASA Ames Research Center

At the southern tip of San Francisco Bay, surrounded by the tech giants Google, Apple, and Microsoft, sits the historic NASA Ames Research Center. Its rich history includes a grab bag of fascinating scientific research involving massive wind tunnels, experimental aircraft, supercomputing, astrobiology, and more.

A collection of 5,000 images from NASA Ames’s archives paints a vivid picture of bleeding-edge work at the heart of America’s technology hub. Read the full story.

—Jon Keegan

This story is from the latest print issue of MIT Technology Review magazine, which is full of stories about the body. If you haven’t already, subscribe now to receive future issues once they land.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 The US government is considering banning TP-Link routers
An investigation has raised concerns over the company’s links to China. (WP $)
+ Lawmakers are worried its equipment is vulnerable to hacking. (Bloomberg $)

2 ICE has proposed building a deportation network in Texas
The 24/7 operation would transfer detained immigrants into holding facilities. (Wired $)
+ But US citizens keep being detained, too. (NY Mag $)
+ Inside the operation giving ICE a run for its money. (Slate $)
+ Another effort to track ICE raids was just taken offline. (MIT Technology Review)

3 Ukrainian drone teams are gamifying their war efforts
Officials say rewarding soldiers for successful attacks keeps them motivated. (NYT $)
+ A Peter Thiel-backed drone startup crashed and burned during military trials. (FT $)
+ Meet the radio-obsessed civilian shaping Ukraine’s drone defense. (MIT Technology Review)

4 Meta has denied torrenting porn to train its AI models
Instead, it claims, the downloads were for someone’s “private personal use.” (Ars Technica)

5 Bird flu is getting harder to keep tabs on
The virus has wreaked havoc on the US poultry industry for close to four years. (Vox)
+ A new biosensor can detect bird flu in five minutes. (MIT Technology Review)

6 AI browsers are a cybersecurity nightmare
They’re a hotbed of known—and unknown—risks. (The Verge)
+ I tried OpenAI’s new Atlas browser but I still don’t know what it’s for. (MIT Technology Review)

7 Robots are starting to do more jobs across America
But they’re still proving buggy and expensive to run. (WSJ $)
+ When you might start speaking to robots. (MIT Technology Review)

8 These are the jobs that AI built
From conversation designer to adoption strategist. (WP $)
+ If you fancy landing a job in quantum computing, here’s how to do it. (IEEE Spectrum)

9 Computer vision is getting much, much better
Their blind spots are rapidly being eliminated. (Knowable Magazine)

10 A lock-cracking YouTuber is being sued by a lockmaking company
It’s arguing he defamed the company, even though he didn’t say a word during the clip. (Ars Technica)

Quote of the day

“Yes, we’ve been to the Moon before… six times!”

—NASA’s acting administrator Sean Duffy reacts to Kim Kardashian’s belief that man has never set foot on the moon, the Guardian reports.

One more thing

What happens when you donate your body to science

Rebecca George doesn’t mind the vultures that complain from the trees that surround the Western Carolina University body farm. Her arrival has interrupted their breakfast. George studies human decomposition, and part of decomposing is becoming food. Scavengers are welcome.

In the US, about 20,000 people or their families donate their bodies to scientific research and education each year. Whatever the reason, the decision becomes a gift. Western Carolina’s FOREST is among the places where watchful caretakers know that the dead and the living are deeply connected, and the way you treat the first reflects how you treat the second. Read the full story.

—Abby Ohlheiser

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)

+ Zoo animals across the world are getting into the Halloween spirit with some tasty pumpkins.
+ If you’re stuck for something suitably spooky to watch tonight, this list is a great place to start.
+ New York’s…


AI, Committee, News, Uncategorized

How to Build an End-to-End Data Engineering and Machine Learning Pipeline with Apache Spark and PySpark

In this tutorial, we explore how to harness Apache Spark’s capabilities using PySpark directly in Google Colab. We begin by setting up a local Spark session, then progressively move through transformations, SQL queries, joins, and window functions. We also build and evaluate a simple machine-learning model to predict user subscription types, and finally demonstrate how to save and reload Parquet files. Along the way, we see how Spark’s distributed data-processing capabilities can be leveraged for analytics and ML workflows even in a single-node Colab environment. Check out the FULL CODES here.

!pip install -q pyspark==3.5.1

from pyspark.sql import SparkSession, functions as F, Window
from pyspark.sql.types import IntegerType, StringType, StructType, StructField, FloatType
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = (SparkSession.builder.appName("ColabSparkAdvancedTutorial")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "4")
         .getOrCreate())
print("Spark version:", spark.version)

data = [
    (1, "Alice", "IN", "2025-10-01", 56000.0, "premium"),
    (2, "Bob", "US", "2025-10-03", 43000.0, "standard"),
    (3, "Carlos", "IN", "2025-09-27", 72000.0, "premium"),
    (4, "Diana", "UK", "2025-09-30", 39000.0, "standard"),
    (5, "Esha", "IN", "2025-10-02", 85000.0, "premium"),
    (6, "Farid", "AE", "2025-10-02", 31000.0, "basic"),
    (7, "Gita", "IN", "2025-09-29", 46000.0, "standard"),
    (8, "Hassan", "PK", "2025-10-01", 52000.0, "premium"),
]

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("country", StringType(), True),
    StructField("signup_date", StringType(), True),
    StructField("income", FloatType(), True),
    StructField("plan", StringType(), True),
])

df = spark.createDataFrame(data, schema)
df.show()

We begin by setting up PySpark, initializing the Spark session, and preparing our dataset. We create a structured DataFrame containing user information, including country, income, and plan type. This forms the foundation for all transformations and analyses that follow. Check out the FULL CODES here.

df2 = (df.withColumn("signup_ts", F.to_timestamp("signup_date"))
         .withColumn("year", F.year("signup_ts"))
         .withColumn("month", F.month("signup_ts"))
         .withColumn("is_india", (F.col("country") == "IN").cast("int")))
df2.show()

df2.createOrReplaceTempView("users")
spark.sql("""
    SELECT country, COUNT(*) AS cnt, AVG(income) AS avg_income
    FROM users
    GROUP BY country
    ORDER BY cnt DESC
""").show()

w = Window.partitionBy("country").orderBy(F.col("income").desc())
df_ranked = df2.withColumn("income_rank_in_country", F.rank().over(w))
df_ranked.show()

def plan_priority(plan):
    if plan == "premium":
        return 3
    if plan == "standard":
        return 2
    if plan == "basic":
        return 1
    return 0

plan_priority_udf = F.udf(plan_priority, IntegerType())
df_udf = df_ranked.withColumn("plan_priority", plan_priority_udf(F.col("plan")))
df_udf.show()

We now perform various data transformations, add new columns, and register the DataFrame as a SQL table. We explore Spark SQL for aggregation and apply window functions to rank users by income. We also introduce a user-defined function (UDF) to assign priority levels to subscription plans. Check out the FULL CODES here.
country_data = [
    ("IN", "Asia", 1.42),
    ("US", "North America", 0.33),
    ("UK", "Europe", 0.07),
    ("AE", "Asia", 0.01),
    ("PK", "Asia", 0.24),
]
country_schema = StructType([
    StructField("country", StringType(), True),
    StructField("region", StringType(), True),
    StructField("population_bn", FloatType(), True),
])
country_df = spark.createDataFrame(country_data, country_schema)

joined = df_udf.alias("u").join(country_df.alias("c"), on="country", how="left")
joined.show()

region_stats = (joined.groupBy("region", "plan")
                .agg(F.count("*").alias("users"),
                     F.round(F.avg("income"), 2).alias("avg_income"))
                .orderBy("region", "plan"))
region_stats.show()

We enrich our user dataset by joining it with country-level metadata that includes region and population. We then compute analytical summaries such as average income and user counts by region and plan type. This step demonstrates how Spark makes it seamless to combine and aggregate large datasets. Check out the FULL CODES here.

ml_df = joined.withColumn("label", (F.col("plan") == "premium").cast("int")).na.drop()

country_indexer = StringIndexer(inputCol="country", outputCol="country_idx", handleInvalid="keep")
country_fitted = country_indexer.fit(ml_df)
ml_df2 = country_fitted.transform(ml_df)

assembler = VectorAssembler(inputCols=["income", "country_idx", "plan_priority"], outputCol="features")
ml_final = assembler.transform(ml_df2)

train_df, test_df = ml_final.randomSplit([0.7, 0.3], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
lr_model = lr.fit(train_df)

preds = lr_model.transform(test_df)
preds.select("name", "country", "income", "plan", "label", "prediction", "probability").show(truncate=False)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
acc = evaluator.evaluate(preds)
print("Classification accuracy:", acc)

We move into machine learning by preparing data for model training and feature engineering. We index categorical columns, assemble features, and train a logistic regression model to predict premium users. We then evaluate its accuracy, showcasing how Spark MLlib integrates easily into the data workflow. Check out the FULL CODES here.

output_path = "/content/spark_users_parquet"
joined.write.mode("overwrite").parquet(output_path)

parquet_df = spark.read.parquet(output_path)
print("Parquet reloaded:")
parquet_df.show()

recent = spark.sql("""
    SELECT name, country, income, signup_ts
    FROM users
    WHERE signup_ts >= '2025-10-01'
    ORDER BY signup_ts DESC
""")
recent.show()
recent.explain()

spark.stop()

We conclude by writing the processed data to Parquet format and reading it back into Spark for verification. We run a SQL query to extract recent signups and inspect the query plan for optimization insights. Finally, we gracefully stop the Spark session to complete our workflow.

In conclusion, we gain a practical understanding of how PySpark unifies data engineering and machine learning tasks within a single scalable framework. We see how simple DataFrame transformations evolve into SQL analytics, feature engineering, and predictive modeling, all while staying within Google Colab. By experimenting with these concepts, we strengthen our ability to prototype and deploy Spark-based data solutions efficiently in both local and distributed setups. Check out the FULL CODES here.
The post How to Build an End-to-End Data Engineering and Machine Learning Pipeline with Apache Spark and PySpark appeared first on MarkTechPost.


AI, Committee, News, Uncategorized

Google AI Unveils Supervised Reinforcement Learning (SRL): A Step Wise Framework with Expert Trajectories to Teach Small Language Models to Reason through Hard Problems

How can a small model learn to solve tasks it currently fails at, without rote imitation or relying on a correct rollout? A team of researchers from Google Cloud AI Research and UCLA has released a training framework, ‘Supervised Reinforcement Learning’ (SRL), that makes 7B-scale models actually learn from very hard math and agent trajectories that normal supervised fine-tuning and outcome-based reinforcement learning (RL) cannot learn from.

Small open source models such as Qwen2.5 7B Instruct fail on the hardest problems in s1K 1.1, even when the teacher trace is good. If we apply supervised fine-tuning on the full DeepSeek R1 style solutions, the model imitates token by token; the sequences are long, the data is only 1,000 items, and the final scores drop below the base model. (Source: https://arxiv.org/pdf/2510.25992)

Core idea of ‘Supervised Reinforcement Learning’ (SRL)

SRL keeps the RL-style optimization, but it injects supervision into the reward channel instead of into the loss. Each expert trajectory from s1K 1.1 is parsed into a sequence of actions. For every prefix of that sequence, the research team creates a new training example: the model first produces a private reasoning span wrapped in <think> … </think>, then it outputs the action for that step, and only this action is compared with the teacher action using a sequence similarity metric based on difflib (a minimal sketch of such a reward appears at the end of this article). The reward is dense because every step has a score, even when the final answer is wrong. The rest of the text, the reasoning part, is not constrained, so the model can search its own chain without being forced to copy the teacher tokens.

Math results

All models are initialized from Qwen2.5 7B Instruct and all are trained on the same DeepSeek R1 formatted s1K 1.1 set, so comparisons are clean. The exact numbers in Table 1 are:

Base Qwen2.5 7B Instruct: AMC23 greedy 50.0, AIME24 greedy 13.3, AIME25 greedy 6.7
SRL: AMC23 greedy 50.0, AIME24 greedy 16.7, AIME25 greedy 13.3
SRL then RLVR: AMC23 greedy 57.5, AIME24 greedy 20.0, AIME25 greedy 10.0

This is the key improvement: SRL alone already removes the SFT degradation and raises AIME24 and AIME25, and when RLVR is run after SRL, the system reaches the best open source scores in the research. The research team is explicit that the best pipeline is SRL then RLVR, not SRL in isolation. (Source: https://arxiv.org/pdf/2510.25992)

Software engineering results

The research team also applies SRL to Qwen2.5 Coder 7B Instruct using 5,000 verified agent trajectories generated by Claude 3.7 Sonnet; every trajectory is decomposed into step-wise instances, and in total 134,000 step items are produced. Evaluation is on SWE-Bench Verified. The base model gets 5.8 percent in the oracle file edit mode and 3.2 percent end to end. SWE-Gym 7B gets 8.4 percent and 4.2 percent. SRL gets 14.8 percent and 8.6 percent, which is about 2 times the base model and clearly higher than the SFT baseline. (Source: https://arxiv.org/pdf/2510.25992)

Key Takeaways

SRL reformulates hard reasoning as step-wise action generation: the model first produces an internal monologue, then outputs a single action, and only that action is rewarded by sequence similarity, so the model gets signal even when the final answer is wrong.

SRL is run on the same DeepSeek R1 formatted s1K 1.1 data as SFT and RLVR, but unlike SFT it does not overfit long demonstrations, and unlike RLVR it does not collapse when no rollout is correct.

On math, the exact order that gives the strongest results in the research is: initialize Qwen2.5 7B Instruct with SRL, then apply RLVR, which pushes reasoning benchmarks higher than either method alone.

The same SRL recipe generalizes to agentic software engineering, using 5,000 verified trajectories from Claude 3.7 Sonnet (20250219), and it lifts SWE-Bench Verified well above both the base Qwen2.5 Coder 7B Instruct and the SFT-style SWE-Gym 7B baseline.

Compared to other step-wise RL methods that need an extra reward model, SRL keeps a GRPO-style objective and uses only actions from expert trajectories and a lightweight string similarity, so it is easy to run on small, hard datasets.

Editorial Comments

‘Supervised Reinforcement Learning’ (SRL) is a practical contribution by the research team. It keeps the GRPO-style reinforcement learning setup, but it replaces fragile outcome-level rewards with supervised, step-wise rewards that are computed directly from expert trajectories, so the model always receives informative signal, even in the hard regime where RLVR and SFT both stall. It is important that the research team shows SRL on math and on SWE-Bench Verified with the same recipe, and that the strongest configuration is SRL followed by RLVR, not either one alone. This makes SRL a realistic path for open models to learn hard tasks. Overall, SRL is a clean bridge between process supervision and RL that open model teams can adopt immediately.

Check out the Paper. The post Google AI Unveils Supervised Reinforcement Learning (SRL): A Step Wise Framework with Expert Trajectories to Teach Small Language Models to Reason through Hard Problems appeared first on MarkTechPost.
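To make the reward design concrete, here is a minimal, hedged sketch of a difflib-based step-wise reward of the kind the paper describes. The <think> parsing, the toy trajectory, and the exact reward formula are illustrative assumptions; the authors’ implementation details may differ.

import difflib
import re

def action_similarity(predicted_action: str, expert_action: str) -> float:
    """Dense step reward in [0, 1]: string similarity between the model's action and the teacher's action."""
    return difflib.SequenceMatcher(None, predicted_action.strip(), expert_action.strip()).ratio()

def extract_action(model_output: str) -> str:
    """Drop the private reasoning span and keep only the action that follows it."""
    return re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL).strip()

# One expert trajectory is parsed into a sequence of actions; each prefix becomes a training example.
expert_actions = [
    "factor the quadratic x^2 - 5x + 6",
    "set each factor of (x - 2)(x - 3) to zero",
    "report the roots x = 2 and x = 3",
]

# Hypothetical model rollouts for each step (in training these come from the policy being optimized).
model_outputs = [
    "<think>I should factor first.</think> factor the quadratic x^2 - 5x + 6",
    "<think>Now solve each factor.</think> set (x - 2)(x - 3) equal to zero",
    "<think>Collect the answers.</think> the roots are x = 2 and x = 3",
]

for step, (rollout, expert) in enumerate(zip(model_outputs, expert_actions), start=1):
    reward = action_similarity(extract_action(rollout), expert)
    print(f"step {step}: reward = {reward:.3f}")  # every step is scored, even if the final answer would be wrong

Because each step gets its own score, the reward stays informative even when no rollout reaches a fully correct final answer, which is exactly the failure mode that stalls outcome-only RLVR on the hardest problems.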


AI, Committee, News, Uncategorized

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

arXiv:2510.26802v1 Announce Type: cross Abstract: Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io


AI, Committee, News, Uncategorized

Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual

arXiv:2510.26271v1 Announce Type: new Abstract: Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While Knowledge distillation (KD) demonstrates promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingualism is an underexplored area. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We study five distillation formulations across CLIP and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.


AI, Committee, News, Uncategorized

How to Design an Autonomous Multi-Agent Data and Infrastructure Strategy System Using Lightweight Qwen Models for Efficient Pipeline Intelligence?

In this tutorial, we build an Agentic Data and Infrastructure Strategy system using the lightweight Qwen2.5-0.5B-Instruct model for efficient execution. We begin by creating a flexible LLM agent framework and then develop specialized agents that handle different layers of data management, from ingestion and quality analysis to infrastructure optimization. We integrate these agents into an orchestrator that coordinates their interactions, ensuring smooth multi-agent collaboration across the data pipeline. Through hands-on examples like e-commerce and IoT pipelines, we explore how autonomous decision-making can streamline complex data operations. Check out the FULL CODES here.

!pip install -q transformers torch accelerate datasets huggingface_hub

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import json, time
from typing import List, Dict, Any
from dataclasses import dataclass
from datetime import datetime
import pandas as pd

class LightweightLLMAgent:
    def __init__(self, role: str, model_name: str = "Qwen/Qwen2.5-0.5B-Instruct"):
        self.role = role
        self.model_name = model_name
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Loading {model_name} for {role} agent on {self.device}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
            device_map="auto"
        )
        self.conversation_history = []

    def generate_response(self, prompt: str, max_tokens: int = 150) -> str:
        messages = [
            {"role": "system", "content": f"You are a {self.role} agent in a data infrastructure system."},
            {"role": "user", "content": prompt}
        ]
        text = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.device)
        with torch.no_grad():
            generated_ids = self.model.generate(
                model_inputs.input_ids,
                max_new_tokens=max_tokens,
                temperature=0.7,
                do_sample=True,
                top_p=0.95
            )
        generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        self.conversation_history.append({"prompt": prompt, "response": response})
        return response

We start by setting up the lightweight LLM agent infrastructure using the Qwen2.5-0.5B-Instruct model. We load the model and tokenizer, and define a base agent class capable of handling contextual conversations and generating intelligent responses. This forms the core foundation upon which our specialized agents operate efficiently within Colab. Check out the FULL CODES here.
class DataIngestionAgent(LightweightLLMAgent):
    def __init__(self):
        super().__init__(role="Data Ingestion Specialist")

    def analyze_data_source(self, source_info: Dict) -> Dict:
        prompt = f"""Analyze this data source and provide ingestion strategy:
Source Type: {source_info.get('type', 'unknown')}
Volume: {source_info.get('volume', 'unknown')}
Frequency: {source_info.get('frequency', 'unknown')}
Provide a brief strategy focusing on: 1) Ingestion method, 2) Key considerations."""
        strategy = self.generate_response(prompt, max_tokens=100)
        return {"source": source_info, "strategy": strategy, "timestamp": datetime.now().isoformat()}


class DataQualityAgent(LightweightLLMAgent):
    def __init__(self):
        super().__init__(role="Data Quality Analyst")

    def assess_data_quality(self, data_sample: Dict) -> Dict:
        prompt = f"""Assess data quality for this sample:
Completeness: {data_sample.get('completeness', 'N/A')}%
Consistency: {data_sample.get('consistency', 'N/A')}%
Issues Found: {data_sample.get('issues', 0)}
Provide brief quality assessment and top 2 recommendations."""
        assessment = self.generate_response(prompt, max_tokens=100)
        return {"assessment": assessment, "severity": self._calculate_severity(data_sample), "timestamp": datetime.now().isoformat()}

    def _calculate_severity(self, data_sample: Dict) -> str:
        completeness = data_sample.get('completeness', 100)
        consistency = data_sample.get('consistency', 100)
        avg_score = (completeness + consistency) / 2
        if avg_score >= 90:
            return "LOW"
        elif avg_score >= 70:
            return "MEDIUM"
        else:
            return "HIGH"

We design the Data Ingestion and Data Quality agents to focus on structured analysis of data pipelines. We let the ingestion agent determine the best approach to data flow, while the quality agent evaluates data completeness, consistency, and issues to provide actionable insights. Together, they establish the first two layers of autonomous data management. Check out the FULL CODES here.

class InfrastructureOptimizationAgent(LightweightLLMAgent):
    def __init__(self):
        super().__init__(role="Infrastructure Optimization Specialist")

    def optimize_resources(self, metrics: Dict) -> Dict:
        prompt = f"""Analyze infrastructure metrics and suggest optimizations:
CPU Usage: {metrics.get('cpu_usage', 0)}%
Memory Usage: {metrics.get('memory_usage', 0)}%
Storage: {metrics.get('storage_used', 0)}GB / {metrics.get('storage_total', 0)}GB
Query Latency: {metrics.get('query_latency', 0)}ms
Provide 2 optimization recommendations."""
        recommendations = self.generate_response(prompt, max_tokens=100)
        return {"current_metrics": metrics, "recommendations": recommendations, "priority": self._calculate_priority(metrics), "timestamp": datetime.now().isoformat()}

    def _calculate_priority(self, metrics: Dict) -> str:
        cpu = metrics.get('cpu_usage', 0)
        memory = metrics.get('memory_usage', 0)
        if cpu > 85 or memory > 85:
            return "CRITICAL"
        elif cpu > 70 or memory > 70:
            return "HIGH"
        else:
            return "NORMAL"

We develop the Infrastructure Optimization Agent to continuously analyze key metrics like CPU, memory, and storage utilization. We use it to generate intelligent optimization suggestions, helping us maintain high performance and resource efficiency. This agent ensures that our infrastructure remains responsive and scalable during data operations. Check out the FULL CODES here.
class AgenticDataOrchestrator:
    def __init__(self):
        print("\n" + "="*70)
        print("Initializing Agentic Data Infrastructure System")
        print("="*70 + "\n")
        self.ingestion_agent = DataIngestionAgent()
        self.quality_agent = DataQualityAgent()
        self.optimization_agent = InfrastructureOptimizationAgent()
        self.execution_log = []

    def process_data_pipeline(self, pipeline_config: Dict) -> Dict:
        results = {"pipeline_id": pipeline_config.get("id", "unknown"),
                   "start_time": datetime.now().isoformat(),
                   "stages": []}

        print("\n[Stage 1] Data Ingestion Analysis")
        ingestion_result = self.ingestion_agent.analyze_data_source(pipeline_config.get("source", {}))
        print(f"Strategy: {ingestion_result['strategy'][:150]}...")
        results["stages"].append({"stage": "ingestion", "result": ingestion_result})

        print("\n[Stage 2] Data Quality Assessment")
        quality_result = self.quality_agent.assess_data_quality(pipeline_config.get("quality_metrics", {}))
        print(f"Assessment: {quality_result['assessment'][:150]}...")
        print(f"Severity: {quality_result['severity']}")
        results["stages"].append({"stage": "quality", "result": quality_result})

        print("\n[Stage 3] Infrastructure Optimization")
        optimization_result = self.optimization_agent.optimize_resources(pipeline_config.get("infrastructure_metrics", {}))
        print(f"Recommendations: {optimization_result['recommendations'][:150]}...")
        print(f"Priority: {optimization_result['priority']}")
        results["stages"].append({"stage": "optimization", "result": optimization_result})

        results["end_time"] = datetime.now().isoformat()
        results["status"] = "completed"
        self.execution_log.append(results)
        return results

    def generate_summary_report(self) -> pd.DataFrame:
        if not self.execution_log:
            return pd.DataFrame()
        summary_data = []
        for log in self.execution_log:
            summary_data.append({"Pipeline ID": log["pipeline_id"],
                                 "Start Time": log["start_time"],
                                 "Status": log["status"],
                                 "Stages Completed": len(log["stages"])})
        return pd.DataFrame(summary_data)

We build an Agentic Data Orchestrator to coordinate all specialized agents under a unified workflow. We use it to manage end-to-end pipeline execution, triggering ingestion, quality checks, and optimization sequentially. By doing this, we bring structure, collaboration, and automation to the entire multi-agent system. Check out the FULL CODES here.
def main():
    orchestrator = AgenticDataOrchestrator()

    print("\n" + "="*70)
    print("EXAMPLE 1: E-commerce Data Pipeline")
    print("="*70)
    ecommerce_pipeline = {
        "id": "ecommerce_pipeline_001",
        "source": {"type": "REST API", "volume": "10GB/day", "frequency": "real-time"},
        "quality_metrics": {"completeness": 87, "consistency": 92, "issues": 15},
        "infrastructure_metrics": {"cpu_usage": 78, "memory_usage": 82, "storage_used": 450, "storage_total": 1000, "query_latency": 250}
    }
    result1 = orchestrator.process_data_pipeline(ecommerce_pipeline)

    print("\n\n" + "="*70)
    print("EXAMPLE 2: IoT Sensor Data Pipeline")
    print("="*70)
    iot_pipeline = {
        "id": "iot_pipeline_002",
        "source": {"type": "Message Queue (Kafka)", "volume": "50GB/day", "frequency": "streaming"},
        "quality_metrics": {"completeness": 95, "consistency": 88, "issues": 8},
        "infrastructure_metrics": {"cpu_usage": 65, "memory_usage": 71, "storage_used": 780, "storage_total": 2000, "query_latency": 180}
    }
    result2 = orchestrator.process_data_pipeline(iot_pipeline)

    print("\n\n" + "="*70)
    print("EXECUTION SUMMARY REPORT")
    print("="*70 + "\n")
    summary_df = orchestrator.generate_summary_report()
    print(summary_df.to_string(index=False))

    print("\n" + "="*70)
    print("Tutorial Complete!")
    print("="*70)
    print("\nKey Concepts Demonstrated:")
    print("✓ Lightweight LLM agent architecture")
    print("✓ Specialized agents for different data tasks")
    print("✓ Multi-agent orchestration")
    print("✓ Infrastructure monitoring and optimization")
    print("✓ Autonomous decision-making in data pipelines")

if __name__ == "__main__":
    main()

We demonstrate our complete…

