AI, Committee, News, Uncategorized

How do AI models generate videos?

MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.

It’s been a big year for video generation. In the last nine months OpenAI made Sora public, Google DeepMind launched Veo 3, and the video startup Runway launched Gen-4. All can produce video clips that are (almost) impossible to distinguish from actual filmed footage or CGI animation. This year also saw Netflix debut an AI visual effect in its show The Eternaut, the first time video generation has been used to make mass-market TV.

Sure, the clips you see in demo reels are cherry-picked to showcase a company’s models at the top of their game. But with the technology in the hands of more users than ever before—Sora and Veo 3 are available in the ChatGPT and Gemini apps for paying subscribers—even the most casual filmmaker can now knock out something remarkable.

The downside is that creators are competing with AI slop, and social media feeds are filling up with faked news footage. Video generation also uses up a huge amount of energy, many times more than text or image generation.

With AI-generated videos everywhere, let’s take a moment to talk about the tech that makes them work.

How do you generate a video?

Let’s assume you’re a casual user. There are now a range of high-end tools that allow pro video makers to insert video generation models into their workflows. But most people will use this technology in an app or via a website. You know the drill: “Hey, Gemini, make me a video of a unicorn eating spaghetti. Now make its horn take off like a rocket.” What you get back will be hit or miss, and you’ll typically need to ask the model to take another pass or 10 before you get more or less what you wanted.

So what’s going on under the hood? Why is it hit or miss—and why does it take so much energy? The latest wave of video generation models are what’s known as latent diffusion transformers. Yes, that’s quite a mouthful. Let’s unpack each part in turn, starting with diffusion.

What’s a diffusion model?

Imagine taking an image and adding a random spattering of pixels to it. Take that pixel-spattered image and spatter it again and then again. Do that enough times and you will have turned the initial image into a random mess of pixels, like static on an old TV set.

A diffusion model is a neural network trained to reverse that process, turning random static into images. During training, it gets shown millions of images in various stages of pixelation. It learns how those images change each time new pixels are thrown at them and, thus, how to undo those changes.

The upshot is that when you ask a diffusion model to generate an image, it will start off with a random mess of pixels and step by step turn that mess into an image that is more or less similar to images in its training set.

But you don’t want any image—you want the image you specified, typically with a text prompt. And so the diffusion model is paired with a second model—such as a large language model (LLM) trained to match images with text descriptions—that guides each step of the cleanup process, pushing the diffusion model toward images that the large language model considers a good match to the prompt.

An aside: This LLM isn’t pulling the links between text and images out of thin air. Most text-to-image and text-to-video models today are trained on large data sets that contain billions of pairings of text and images or text and video scraped from the internet (a practice many creators are very unhappy about). This means that what you get from such models is a distillation of the world as it’s represented online, distorted by prejudice (and pornography).

It’s easiest to imagine diffusion models working with images. But the technique can be used with many kinds of data, including audio and video. To generate movie clips, a diffusion model must clean up sequences of images—the consecutive frames of a video—instead of just one image.

What’s a latent diffusion model?

All this takes a huge amount of compute (read: energy). That’s why most diffusion models used for video generation use a technique called latent diffusion. Instead of processing raw data—the millions of pixels in each video frame—the model works in what’s known as a latent space, in which the video frames (and text prompt) are compressed into a mathematical code that captures just the essential features of the data and throws out the rest.

A similar thing happens whenever you stream a video over the internet: A video is sent from a server to your screen in a compressed format to make it get to you faster, and when it arrives, your computer or TV will convert it back into a watchable video.

And so the final step is to decompress what the latent diffusion process has come up with. Once the compressed frames of random static have been turned into the compressed frames of a video that the LLM guide considers a good match for the user’s prompt, the compressed video gets converted into something you can watch.

With latent diffusion, the diffusion process works more or less the way it would for an image. The difference is that the pixelated video frames are now mathematical encodings of those frames rather than the frames themselves. This makes latent diffusion far more efficient than a typical diffusion model. (Even so, video generation still uses more energy than image or text generation. There’s just an eye-popping amount of computation involved.)

What’s a latent diffusion transformer?

Still with me? There’s one more piece to the puzzle—and that’s how to make sure the diffusion process produces a sequence of frames that are consistent, maintaining objects and lighting and so on from one frame to the next.
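To make the denoising idea concrete, here is a toy numerical sketch of the reverse process described above. It is not how Sora or Veo work internally: the "denoiser" below is a stand-in for a trained neural network, the tiny array stands in for a compressed latent frame, and the text guidance is reduced to a single target vector.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained denoiser: nudges a noisy "frame" a little closer to a
# guidance target on every step. In a real latent diffusion transformer this
# would be a large neural network operating on compressed (latent) video
# frames, conditioned on a text embedding rather than a fixed target array.
def toy_denoise_step(noisy_frame, guidance_target, strength=0.1):
    return noisy_frame + strength * (guidance_target - noisy_frame)

frame_shape = (8, 8)                      # a tiny stand-in for one latent frame
target = np.full(frame_shape, 0.5)        # pretend this encodes the prompt
frame = rng.normal(size=frame_shape)      # start from pure random "static"

for step in range(50):                    # iterative cleanup, step by step
    frame = toy_denoise_step(frame, target)

print("distance from target after denoising:", np.abs(frame - target).mean())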


AI, Committee, News, Uncategorized

The Download: America’s gun crisis, and how AI video models work

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

We can’t “make American children healthy again” without tackling the gun crisis

This week, the Trump administration released a strategy for improving the health and well-being of American children. The report was titled—you guessed it—Make Our Children Healthy Again. It suggests American children should be eating more healthily. And they should be getting more exercise. But there’s a glaring omission. The leading cause of death for American children and teenagers isn’t ultraprocessed food or exposure to some chemical. It’s gun violence.

This week’s news of yet more high-profile shootings at schools in the US throws this disconnect into even sharper relief. Experts believe it is time to treat gun violence in the US as what it is: a public health crisis. Read the full story.

—Jessica Hamzelou

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

How do AI models generate videos?

It’s been a big year for video generation. In the last nine months OpenAI made Sora public, Google DeepMind launched Veo 3, and the video startup Runway launched Gen-4. All can produce video clips that are (almost) impossible to distinguish from actual filmed footage or CGI animation.

The downside is that creators are competing with AI slop, and social media feeds are filling up with faked news footage. Video generation also uses up a huge amount of energy, many times more than text or image generation.

With AI-generated videos everywhere, let’s take a moment to talk about the tech that makes them work. Read the full story.

—Will Douglas Heaven

This article is part of MIT Technology Review Explains, our series untangling the complex, messy world of technology to help you understand what’s coming next. You can read more from the series here.

Meet our 2025 Innovator of the Year: Sneha Goenka

Up to a quarter of children entering intensive care have undiagnosed genetic conditions. To be treated properly, they must first get diagnoses—which means having their genomes sequenced. This process typically takes up to seven weeks. Sadly, that’s often too slow to save a critically ill child.

Hospitals may soon have a faster option, thanks to a groundbreaking system built in part by Sneha Goenka, an assistant professor of electrical and computer engineering at Princeton—and MIT Technology Review’s 2025 Innovator of the Year. Read all about Goenka and her work in this profile.

—Helen Thomson

As well as our Innovator of the Year, Goenka is one of the biotech honorees on our 35 Innovators Under 35 list for 2025. Meet the rest of our biotech and materials science innovators, and the full list here.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 OpenAI and Microsoft have agreed a revised deal
But haven’t actually revealed any details of said deal. (Axios)
+ The news comes as OpenAI keeps pursuing its for-profit pivot. (Ars Technica)
+ The world’s largest startup is going to need more paying users soon. (WSJ $)

2 A child has died from a measles complication in Los Angeles
They had contracted the virus before they were old enough to be vaccinated. (Ars Technica)
+ Infants are best protected by community immunity. (LA Times $)
+ They’d originally recovered from measles before developing the condition. (CNN)
+ Why childhood vaccines are a public health success story. (MIT Technology Review)

3 Ukrainian drone attacks triggered internet blackouts in Russia
The Kremlin cut internet access in a bid to thwart the mobile-guided drones. (FT $)
+ The UK is poised to mass-produce drones to aid Ukraine. (Sky News)
+ On the ground in Ukraine’s largest Starlink repair shop. (MIT Technology Review)

4 Demis Hassabis says AI may slash drug discovery time to under a year
Or perhaps even faster. (Bloomberg $)
+ But there’s good reason to be skeptical of that claim. (FT $)
+ An AI-driven “factory of drugs” claims to have hit a big milestone. (MIT Technology Review)

5 How chatbots alter how we think
We shouldn’t outsource our critical thinking to them. (Undark)
+ AI companies have stopped warning you that their chatbots aren’t doctors. (MIT Technology Review)

6 Fraudsters are threatening small businesses with one-star reviews
Online reviews can make or break fledgling enterprises, and scammers know it. (NYT $)

7 Why humanoid robots aren’t taking off any time soon
The industry has a major hype problem. (IEEE Spectrum)
+ Chinese tech giant Ant Group showed off its own humanoid machine. (The Verge)
+ Why the humanoid workforce is running late. (MIT Technology Review)

8 Encyclopedia Britannica and Merriam-Webster are suing Perplexity
In yet another case of alleged copyright infringement. (Reuters)
+ What comes next for AI copyright lawsuits? (MIT Technology Review)

9 Where we’re most likely to find extraterrestrial life in the next decade
Warning: Hollywood may have given us unrealistic expectations. (BBC)

10 Want to build a trillion-dollar company?
Then kiss your social life goodbye. (WSJ $)

Quote of the day

“Nooooo I’m going to have to use my brain again and write 100% of my code like a caveman from December 2024.”

—A Hacker News commenter jokes about a service outage that left Anthropic users unable to access its AI coding tools, Ars Technica reports.

One more thing

What Africa needs to do to become a major AI player

Africa is still early in the process of adopting AI technologies. But researchers say the continent is uniquely hospitable to it for several reasons, including a relatively young and increasingly well-educated population, a rapidly growing ecosystem of AI startups, and lots of potential consumers. However, ambitious efforts to develop AI tools that answer the needs of Africans face numerous hurdles. Read our story to learn what they are, and how they could be overcome.

—Abdullahi Tsanni

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet


AI, Committee, News, Uncategorized

How to Build a Multilingual OCR AI Agent in Python with EasyOCR and OpenCV

In this tutorial, we build an Advanced OCR AI Agent in Google Colab using EasyOCR, OpenCV, and Pillow, running fully offline with GPU acceleration. The agent includes a preprocessing pipeline with contrast enhancement (CLAHE), denoising, sharpening, and adaptive thresholding to improve recognition accuracy. Beyond basic OCR, we filter results by confidence, generate text statistics, and perform pattern detection (emails, URLs, dates, phone numbers) along with simple language hints. The design also supports batch processing, visualization with bounding boxes, and structured exports for flexible usage. Check out the FULL CODES here.

!pip install easyocr opencv-python pillow matplotlib

import easyocr
import cv2
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter
import matplotlib.pyplot as plt
import os
import json
from typing import List, Dict, Tuple, Optional
import re
from google.colab import files
import io

We start by installing the required libraries, EasyOCR, OpenCV, Pillow, and Matplotlib, to set up our environment. We then import all necessary modules so we can handle image preprocessing, OCR, visualization, and file operations seamlessly. Check out the FULL CODES here.

class AdvancedOCRAgent:
    """
    Advanced OCR AI Agent with preprocessing, multi-language support,
    and intelligent text extraction capabilities.
    """

    def __init__(self, languages: List[str] = ['en'], gpu: bool = True):
        """Initialize OCR agent with specified languages."""
        print(" Initializing Advanced OCR Agent...")
        self.languages = languages
        self.reader = easyocr.Reader(languages, gpu=gpu)
        self.confidence_threshold = 0.5
        print(f" OCR Agent ready! Languages: {languages}")

    def upload_image(self) -> Optional[str]:
        """Upload image file through Colab interface."""
        print(" Upload your image file:")
        uploaded = files.upload()
        if uploaded:
            filename = list(uploaded.keys())[0]
            print(f" Uploaded: {filename}")
            return filename
        return None

    def preprocess_image(self, image: np.ndarray, enhance: bool = True) -> np.ndarray:
        """Advanced image preprocessing for better OCR accuracy."""
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image.copy()
        if enhance:
            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
            gray = clahe.apply(gray)
            gray = cv2.fastNlMeansDenoising(gray)
            kernel = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])
            gray = cv2.filter2D(gray, -1, kernel)
        binary = cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
        )
        return binary

    def extract_text(self, image_path: str, preprocess: bool = True) -> Dict:
        """Extract text from image with advanced processing."""
        print(f" Processing image: {image_path}")
        image = cv2.imread(image_path)
        if image is None:
            raise ValueError(f"Could not load image: {image_path}")
        if preprocess:
            processed_image = self.preprocess_image(image)
        else:
            processed_image = image
        results = self.reader.readtext(processed_image)
        extracted_data = {
            'raw_results': results,
            'filtered_results': [],
            'full_text': '',
            'confidence_stats': {},
            'word_count': 0,
            'line_count': 0
        }
        high_confidence_text = []
        confidences = []
        for (bbox, text, confidence) in results:
            if confidence >= self.confidence_threshold:
                extracted_data['filtered_results'].append({
                    'text': text,
                    'confidence': confidence,
                    'bbox': bbox
                })
                high_confidence_text.append(text)
                confidences.append(confidence)
        extracted_data['full_text'] = ' '.join(high_confidence_text)
        extracted_data['word_count'] = len(extracted_data['full_text'].split())
        extracted_data['line_count'] = len(high_confidence_text)
        if confidences:
            extracted_data['confidence_stats'] = {
                'mean': np.mean(confidences),
                'min': np.min(confidences),
                'max': np.max(confidences),
                'std': np.std(confidences)
            }
        return extracted_data

    def visualize_results(self, image_path: str, results: Dict, show_bbox: bool = True):
        """Visualize OCR results with bounding boxes."""
        image = cv2.imread(image_path)
        image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        plt.figure(figsize=(15, 10))
        if show_bbox:
            plt.subplot(2, 2, 1)
            img_with_boxes = image_rgb.copy()
            for item in results['filtered_results']:
                bbox = np.array(item['bbox']).astype(int)
                cv2.polylines(img_with_boxes, [bbox], True, (255, 0, 0), 2)
                x, y = bbox[0]
                cv2.putText(img_with_boxes, f"{item['confidence']:.2f}",
                            (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)
            plt.imshow(img_with_boxes)
            plt.title("OCR Results with Bounding Boxes")
            plt.axis('off')
        plt.subplot(2, 2, 2)
        processed = self.preprocess_image(image)
        plt.imshow(processed, cmap='gray')
        plt.title("Preprocessed Image")
        plt.axis('off')
        plt.subplot(2, 2, 3)
        confidences = [item['confidence'] for item in results['filtered_results']]
        if confidences:
            plt.hist(confidences, bins=20, alpha=0.7, color='blue')
            plt.xlabel('Confidence Score')
            plt.ylabel('Frequency')
            plt.title('Confidence Score Distribution')
            plt.axvline(self.confidence_threshold, color='red', linestyle='--',
                        label=f'Threshold: {self.confidence_threshold}')
            plt.legend()
        plt.subplot(2, 2, 4)
        stats = results['confidence_stats']
        if stats:
            labels = ['Mean', 'Min', 'Max']
            values = [stats['mean'], stats['min'], stats['max']]
            plt.bar(labels, values, color=['green', 'red', 'blue'])
            plt.ylabel('Confidence Score')
            plt.title('Confidence Statistics')
            plt.ylim(0, 1)
        plt.tight_layout()
        plt.show()

    def smart_text_analysis(self, text: str) -> Dict:
        """Perform intelligent analysis of extracted text."""
        analysis = {
            'language_detection': 'unknown',
            'text_type': 'unknown',
            'key_info': {},
            'patterns': []
        }
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
        phone_pattern = r'(\+\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
        url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
        date_pattern = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b'
        patterns = {
            'emails': re.findall(email_pattern, text, re.IGNORECASE),
            'phones': re.findall(phone_pattern, text),
            'urls': re.findall(url_pattern, text, re.IGNORECASE),
            'dates': re.findall(date_pattern, text)
        }
        analysis['patterns'] = {k: v for k, v in patterns.items() if v}
        if any(patterns.values()):
            if patterns.get('emails') or patterns.get('phones'):
                analysis['text_type'] = 'contact_info'
            elif patterns.get('urls'):
                analysis['text_type'] = 'web_content'
            elif patterns.get('dates'):
                analysis['text_type'] = 'document_with_dates'
        if re.search(r'[а-яё]', text.lower()):
            analysis['language_detection'] = 'russian'
        elif re.search(r'[àáâãäåæçèéêëìíîïñòóôõöøùúûüý]', text.lower()):
            analysis['language_detection'] = 'romance_language'
        elif re.search(r'[一-龯]', text):
            analysis['language_detection'] = 'chinese'
        elif re.search(r'[ひらがなカタカナ]', text):
            analysis['language_detection'] = 'japanese'
        elif re.search(r'[a-zA-Z]', text):
            analysis['language_detection'] = 'latin_based'
        return analysis

    def process_batch(self, image_folder: str) -> List[Dict]:
        """Process multiple images in batch."""
        results = []
        supported_formats = ('.png', '.jpg', '.jpeg', '.bmp', '.tiff')
        for filename in os.listdir(image_folder):
            if filename.lower().endswith(supported_formats):
                image_path = os.path.join(image_folder, filename)
                try:
                    result = self.extract_text(image_path)
                    result['filename'] = filename
                    results.append(result)
                    print(f" Processed: {filename}")
                except Exception as e:
                    print(f" Error processing {filename}: {str(e)}")
        return results

    def export_results(self, results: Dict, format: str = 'json') -> str:
        """Export results in specified format."""
        if format.lower() == 'json':
            output = json.dumps(results, indent=2, ensure_ascii=False)
            filename = 'ocr_results.json'
        elif format.lower() == 'txt':
            output = results['full_text']
            filename = 'extracted_text.txt'
        else:
            raise ValueError("Supported formats: 'json', 'txt'")
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(output)
        print(f" Results exported to: {filename}")
        return filename

We define an AdvancedOCRAgent that we initialize with multilingual EasyOCR and a GPU, and we set a confidence threshold to control output quality. We preprocess images (CLAHE, denoise, sharpen, adaptive threshold), extract text, visualize bounding boxes and confidence, run smart pattern/language analysis, support batch folders, and export results as JSON or TXT. Check out the FULL CODES here.

def demo_ocr_agent():
    """Demonstrate the OCR agent capabilities."""
    print(" Advanced OCR AI Agent Demo")
    print("=" * 50)
    ocr = AdvancedOCRAgent(languages=['en'], gpu=True)
    image_path = ocr.upload_image()
    if image_path:
        try:
            results = ocr.extract_text(image_path, preprocess=True)
            print("\n OCR Results:")
            print(f"Words detected: {results['word_count']}")
            print(f"Lines detected: {results['line_count']}")
            print(f"Average confidence: {results['confidence_stats'].get('mean', 0):.2f}")
            print("\n Extracted Text:")
            print("-" * 30)
            print(results['full_text'])
            print("-" * 30)
            analysis = ocr.smart_text_analysis(results['full_text'])
            print(f"\n Smart Analysis:")
            print(f"Detected text type: {analysis['text_type']}")
            print(f"Language hints: {analysis['language_detection']}")
            if analysis['patterns']:
                print(f"Found patterns: {list(analysis['patterns'].keys())}")
            ocr.visualize_results(image_path, results)
            ocr.export_results(results, 'json')
        except Exception as e:
            print(f" Error: {str(e)}")
    else:
        print("No image uploaded. Please try again.")


if __name__ == "__main__":
    demo_ocr_agent()

We create a demo function that walks us through the full OCR workflow: we initialize the agent with English and GPU support, upload an image, preprocess it, and extract text with confidence stats. We then display the extracted text, run the smart analysis, visualize the results, and export them to JSON.
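If your images are already on disk (for example, mounted from Google Drive), you can skip the Colab upload step entirely. The short sketch below reuses the class defined above; the folder name sample_scans is illustrative, and the language list and threshold are just one possible configuration.

# Minimal sketch: batch OCR over a local folder, reusing AdvancedOCRAgent from above.
# "sample_scans" is a hypothetical folder name; point it at your own images.
agent = AdvancedOCRAgent(languages=['en', 'fr'], gpu=False)  # CPU also works, just slower
agent.confidence_threshold = 0.6                             # tighten filtering for clean scans

batch_results = agent.process_batch('sample_scans')
for item in batch_results:
    print(item['filename'], item['word_count'],
          item['confidence_stats'].get('mean', 0))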


AI, Committee, News, Uncategorized

IBM AI Research Releases Two English Granite Embedding Models, Both Based on the ModernBERT Architecture

IBM has quietly built a strong presence in the open-source AI ecosystem, and its latest release shows why it shouldn’t be overlooked. The company has introduced two new embedding models—granite-embedding-english-r2 and granite-embedding-small-english-r2—designed specifically for high-performance retrieval and RAG (retrieval-augmented generation) systems. These models are not only compact and efficient but also licensed under Apache 2.0, making them ready for commercial deployment.

What Models Did IBM Release?

The two models target different compute budgets. The larger granite-embedding-english-r2 has 149 million parameters with an embedding size of 768, built on a 22-layer ModernBERT encoder. Its smaller counterpart, granite-embedding-small-english-r2, comes in at just 47 million parameters with an embedding size of 384, using a 12-layer ModernBERT encoder. Despite their differences in size, both support a maximum context length of 8192 tokens, a major upgrade from the first-generation Granite embeddings. This long-context capability makes them highly suitable for enterprise workloads involving long documents and complex retrieval tasks.

(Source: https://arxiv.org/abs/2508.21085)

What’s Inside the Architecture?

Both models are built on the ModernBERT backbone, which introduces several optimizations:

- Alternating global and local attention to balance efficiency with long-range dependencies.
- Rotary positional embeddings (RoPE) tuned for positional interpolation, enabling longer context windows.
- FlashAttention 2 to improve memory usage and throughput at inference time.

IBM also trained these models with a multi-stage pipeline. The process started with masked language pretraining on a two-trillion-token dataset sourced from web, Wikipedia, PubMed, BookCorpus, and internal IBM technical documents. This was followed by context extension from 1k to 8k tokens, contrastive learning with distillation from Mistral-7B, and domain-specific tuning for conversational, tabular, and code retrieval tasks.

How Do They Perform on Benchmarks?

The Granite R2 models deliver strong results across widely used retrieval benchmarks. On MTEB-v2 and BEIR, the larger granite-embedding-english-r2 outperforms similarly sized models like BGE Base, E5, and Arctic Embed. The smaller model, granite-embedding-small-english-r2, achieves accuracy close to models two to three times larger, making it particularly attractive for latency-sensitive workloads.

(Source: https://arxiv.org/abs/2508.21085)

Both models also perform well in specialized domains:

- Long-document retrieval (MLDR, LongEmbed), where 8k context support is critical.
- Table retrieval tasks (OTT-QA, FinQA, OpenWikiTables), where structured reasoning is required.
- Code retrieval (CoIR), handling both text-to-code and code-to-text queries.

Are They Fast Enough for Large-Scale Use?

Efficiency is one of the standout aspects of these models. On an Nvidia H100 GPU, the granite-embedding-small-english-r2 encodes nearly 200 documents per second, which is significantly faster than BGE Small and E5 Small. The larger granite-embedding-english-r2 also reaches 144 documents per second, outperforming many ModernBERT-based alternatives. Crucially, these models remain practical even on CPUs, allowing enterprises to run them in less GPU-intensive environments. This balance of speed, compact size, and retrieval accuracy makes them highly adaptable for real-world deployment.

What Does This Mean for Retrieval in Practice?

IBM’s Granite Embedding R2 models demonstrate that embedding systems don’t need massive parameter counts to be effective. They combine long-context support, benchmark-leading accuracy, and high throughput in compact architectures. For companies building retrieval pipelines, knowledge management systems, or RAG workflows, Granite R2 provides a production-ready, commercially viable alternative to existing open-source options.

(Source: https://arxiv.org/abs/2508.21085)

Summary

In short, IBM’s Granite Embedding R2 models strike an effective balance between compact design, long-context capability, and strong retrieval performance. With throughput optimized for both GPU and CPU environments, and an Apache 2.0 license that enables unrestricted commercial use, they present a practical alternative to bulkier open-source embeddings. For enterprises deploying RAG, search, or large-scale knowledge systems, Granite R2 stands out as an efficient and production-ready option.

Check out the Paper, granite-embedding-small-english-r2 and granite-embedding-english-r2. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post IBM AI Research Releases Two English Granite Embedding Models, Both Based on the ModernBERT Architecture appeared first on MarkTechPost.
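As a rough usage sketch, embedding models like these can be loaded with the sentence-transformers library. The repository id below is an assumption based on the model name in the announcement (IBM publishes Granite checkpoints under the ibm-granite organization on Hugging Face); adjust it if the published id differs.

# Hedged sketch: retrieval-style scoring with granite-embedding-english-r2.
# The repo id is assumed from the model name, not confirmed by this article.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-english-r2")

query = "How long a context window do the Granite R2 embedding models support?"
passages = [
    "Both Granite R2 embedding models support a maximum context length of 8192 tokens.",
    "The smaller model has 47 million parameters and an embedding size of 384.",
]

query_emb = model.encode(query, normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

scores = util.cos_sim(query_emb, passage_embs)  # cosine similarity; higher = better match
print(scores)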


AI, Committee, News, Uncategorized

DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech

arXiv:2509.09631v1 Announce Type: cross Abstract: Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample, requiring not only speaker adaptation but also accurate modeling of prosodic attributes. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. Discrete codec representations have been widely adopted for speech synthesis, and recent works have begun to explore diffusion models in purely discrete settings, suggesting the potential of discrete generative modeling for speech synthesis. However, existing flow-matching methods typically embed these discrete tokens into a continuous space and apply continuous flow matching, which may not fully leverage the advantages of discrete representations. To address these challenges, we introduce DiFlow-TTS, which, to the best of our knowledge, is the first model to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS explicitly models factorized speech attributes within a compact and unified architecture. It leverages in-context learning by conditioning on textual content, along with prosodic and acoustic attributes extracted from a reference speech, enabling effective attribute cloning in a zero-shot setting. In addition, the model employs a factorized flow prediction mechanism with distinct heads for prosody and acoustic details, allowing it to learn aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS achieves promising performance in several key metrics, including naturalness, prosody, preservation of speaker style, and energy control. It also maintains a compact model size and achieves low-latency inference, generating speech up to 25.8 times faster than the latest existing baselines.


AI, Committee, News, Uncategorized

Optimizing Length Compression in Large Reasoning Models

arXiv:2506.14755v2 Announce Type: replace-cross Abstract: Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as “invalid thinking” — models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.
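As a loose illustration of the two reward signals described in the abstract (not the paper's exact formulation), a GRPO-style trainer could combine a length term with a "compress" term that discounts tokens generated after the answer is first derived. The weights, token budget, and scoring below are hypothetical.

# Illustrative sketch only, in the spirit of LC-R1's Length and Compress rewards.
# `first_correct_idx` marks where the correct answer first appears in the chain;
# tokens after it are treated as "invalid thinking". All constants are made up.
def lc_style_reward(is_correct: bool, total_tokens: int, first_correct_idx: int,
                    max_tokens: int = 4096, w_len: float = 0.5, w_compress: float = 0.5) -> float:
    if not is_correct:
        return -1.0                                         # Sufficiency: wrong answers are penalized
    length_reward = 1.0 - (total_tokens / max_tokens)       # Brevity: shorter correct chains score higher
    invalid_fraction = (total_tokens - first_correct_idx) / max(total_tokens, 1)
    compress_reward = 1.0 - invalid_fraction                # discourage post-answer double-checking
    return w_len * length_reward + w_compress * compress_reward

print(lc_style_reward(True, total_tokens=1200, first_correct_idx=900))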


AI, Committee, News, Uncategorized

Thinking with Many Minds: Using Large Language Models for Multi-Perspective Problem-Solving

arXiv:2501.02348v2 Announce Type: replace Abstract: Complex problem-solving requires cognitive flexibility–the capacity to entertain multiple perspectives while preserving their distinctiveness. This flexibility replicates the “wisdom of crowds” within a single individual, allowing them to “think with many minds.” While mental simulation enables imagined deliberation, cognitive constraints limit its effectiveness. We propose synthetic deliberation, a Large Language Model (LLM)-based method that simulates discourse between agents embodying diverse perspectives, as a solution. Using a custom GPT-based model, we showcase its benefits: concurrent processing of multiple viewpoints without cognitive degradation, parallel exploration of perspectives, and precise control over viewpoint synthesis. By externalizing the deliberative process and distributing cognitive labor between parallel search and integration, synthetic deliberation transcends mental simulation’s limitations. This approach shows promise for strategic planning, policymaking, and conflict resolution.
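As a minimal sketch of what synthetic deliberation could look like in code, the loop below prompts a model with several persona framings in parallel and then asks for an integrating synthesis. The generate function is a placeholder for whatever LLM client is available, and the personas in the commented example are invented; this is not the authors' implementation.

# Minimal sketch of synthetic deliberation; `generate` is a placeholder for a real LLM call.
from typing import List

def generate(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., a chat-completion endpoint)."""
    raise NotImplementedError("plug in your LLM client here")

def synthetic_deliberation(question: str, perspectives: List[str]) -> str:
    # Each perspective deliberates independently, preserving its distinctiveness.
    viewpoints = [
        generate(f"You are {persona}. Give your view on: {question}")
        for persona in perspectives
    ]
    # A final pass integrates the parallel viewpoints into one synthesis.
    joined = "\n\n".join(viewpoints)
    return generate(f"Synthesize the following viewpoints into a balanced recommendation:\n{joined}")

# Example (runs once `generate` is implemented):
# synthetic_deliberation("Should the city pedestrianize its center?",
#                        ["an urban planner", "a small-business owner", "a transit advocate"])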


AI, Committee, News, Uncategorized

DeMeVa at LeWiDi-2025: Modeling Perspectives with In-Context Learning and Label Distribution Learning

arXiv:2509.09524v1 Announce Type: new Abstract: This system paper presents the DeMeVa team’s approaches to the third edition of the Learning with Disagreements shared task (LeWiDi 2025; Leonardelli et al., 2025). We explore two directions: in-context learning (ICL) with large language models, where we compare example sampling strategies; and label distribution learning (LDL) methods with RoBERTa (Liu et al., 2019b), where we evaluate several fine-tuning methods. Our contributions are twofold: (1) we show that ICL can effectively predict annotator-specific annotations (perspectivist annotations), and that aggregating these predictions into soft labels yields competitive performance; and (2) we argue that LDL methods are promising for soft label predictions and merit further exploration by the perspectivist community.
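As a small illustration of the aggregation step mentioned in the abstract, per-annotator predictions produced by ICL can be averaged into a soft label. The annotator names and label set below are invented for the example; the shared task's actual labels differ per dataset.

from collections import Counter

# Hypothetical per-annotator (perspectivist) predictions for one item, e.g. from ICL prompting.
predictions = {"annotator_1": "offensive", "annotator_2": "not_offensive", "annotator_3": "offensive"}
labels = ["offensive", "not_offensive"]

counts = Counter(predictions.values())
soft_label = {label: counts[label] / len(predictions) for label in labels}
print(soft_label)  # e.g. {'offensive': 0.67, 'not_offensive': 0.33}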


AI, Committee, News, Uncategorized

Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition

arXiv:2509.09196v1 Announce Type: new Abstract: Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is Trie-based biasing, which gives “bonus scores” to partial hypothesis (e.g. “Bon”) that may lead to the generation of the rare word (e.g. “Bonham”). If the full word (“Bonham”) isn’t ultimately recognized, the system revokes those earlier bonuses. This revocation is limited to beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.
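To make the baseline that the paper improves on concrete, here is a toy sketch of Trie-based biasing: a partial hypothesis that stays on a path toward a rare word earns a provisional bonus, which a classic beam-search implementation would later have to revoke if the full word never materializes. The word list and bonus value are illustrative, not from the paper.

# Toy sketch of Trie-based contextual biasing (illustrative values only).
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word_end = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word_end = True
    return root

def biasing_bonus(hypothesis: str, root: TrieNode, bonus_per_char: float = 0.5) -> float:
    """Bonus for a trailing fragment of the hypothesis that is still a prefix of a rare word."""
    for start in range(len(hypothesis)):
        node, matched = root, 0
        for ch in hypothesis[start:]:
            if ch not in node.children:
                matched = 0
                break
            node = node.children[ch]
            matched += 1
        if matched:
            return matched * bonus_per_char  # revoked later if the full word is never generated
    return 0.0

trie = build_trie(["Bonham"])
print(biasing_bonus("...Bon", trie))  # the partial hypothesis "Bon" earns a provisional bonus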


AI, Committee, News, Uncategorized

Meet mmBERT: An Encoder-only Language Model Pretrained on 3T Tokens of Multilingual Text in over 1800 Languages and 2–4× Faster than Previous Models

Table of contents

- Why was a new multilingual encoder needed?
- Understanding the architecture of mmBERT
- What training data and phases were used?
- What new training strategies were introduced?
- How does mmBERT perform on benchmarks?
- How does mmBERT handle low-resource languages?
- What efficiency gains does mmBERT achieve?
- Summary

Why was a new multilingual encoder needed?

XLM-RoBERTa (XLM-R) has dominated multilingual NLP for more than 5 years, an unusually long reign in AI research. While encoder-only models like BERT and RoBERTa were central to early progress, most research energy shifted toward decoder-based generative models. Encoders, however, remain more efficient and often outperform decoders on embedding, retrieval, and classification tasks. Despite this, multilingual encoder development stalled. A team of researchers from Johns Hopkins University propose mmBERT, which addresses this gap by delivering a modern encoder that surpasses XLM-R and rivals recent large-scale models such as OpenAI’s o3 and Google’s Gemini 2.5 Pro.

Understanding the architecture of mmBERT

mmBERT comes in two main configurations:

- Base model: 22 transformer layers, 1152 hidden dimension, ~307M parameters (110M non-embedding).
- Small model: ~140M parameters (42M non-embedding).

It adopts the Gemma 2 tokenizer with a 256k vocabulary, rotary position embeddings (RoPE), and FlashAttention2 for efficiency. Sequence length is extended from 1024 to 8192 tokens, using unpadded embeddings and sliding-window attention. This allows mmBERT to process contexts nearly an order of magnitude longer than XLM-R while maintaining faster inference.

What training data and phases were used?

mmBERT was trained on 3 trillion tokens spanning 1,833 languages. Data sources include FineWeb2, Dolma, MegaWika v2, ProLong, StarCoder, and others. English makes up only ~10–34% of the corpus depending on the phase. Training was done in three stages:

- Pre-training: 2.3T tokens across 60 languages and code.
- Mid-training: 600B tokens across 110 languages, focused on higher-quality sources.
- Decay phase: 100B tokens covering 1,833 languages, emphasizing low-resource adaptation.

What new training strategies were introduced?

Three main innovations drive mmBERT’s performance:

- Annealed Language Learning (ALL): Languages are introduced gradually (60 → 110 → 1,833). Sampling distributions are annealed from high-resource to uniform, ensuring low-resource languages gain influence during later stages without overfitting limited data.
- Inverse Masking Schedule: The masking ratio starts at 30% and decays to 5%, encouraging coarse-grained learning early and fine-grained refinements later.
- Model Merging Across Decay Variants: Multiple decay-phase models (English-heavy, 110-language, and 1,833-language) are combined via TIES merging, leveraging complementary strengths without retraining from scratch.

How does mmBERT perform on benchmarks?

- English NLU (GLUE): mmBERT base achieves 86.3, surpassing XLM-R (83.3) and nearly matching ModernBERT (87.4), despite allocating >75% of training to non-English data.
- Multilingual NLU (XTREME): mmBERT base scores 72.8 vs. XLM-R’s 70.4, with gains in classification and QA tasks.
- Embedding tasks (MTEB v2): mmBERT base ties ModernBERT in English (53.9 vs. 53.8) and leads in multilingual (54.1 vs. 52.4 for XLM-R).
- Code retrieval (CoIR): mmBERT outperforms XLM-R by ~9 points, though EuroBERT remains stronger on proprietary data.

How does mmBERT handle low-resource languages?

The annealed learning schedule ensures that low-resource languages benefit during later training. On benchmarks like Faroese FoQA and Tigrinya TiQuAD, mmBERT significantly outperforms both o3 and Gemini 2.5 Pro. These results demonstrate that encoder models, if trained carefully, can generalize effectively even in extreme low-resource scenarios.

What efficiency gains does mmBERT achieve?

mmBERT is 2–4× faster than XLM-R and MiniLM while supporting 8192-token inputs. Notably, it remains faster at 8192 tokens than older encoders were at 512 tokens. This speed boost derives from the ModernBERT training recipe, efficient attention mechanisms, and optimized embeddings.

Summary

mmBERT comes as the long-overdue replacement for XLM-R, redefining what a multilingual encoder can deliver. It runs 2–4× faster, handles sequences up to 8K tokens, and outperforms prior models on both high-resource benchmarks and low-resource languages that were underserved in the past. Its training recipe—3 trillion tokens paired with annealed language learning, inverse masking, and model merging—shows how careful design can unlock broad generalization without excessive redundancy. The result is an open, efficient, and scalable encoder that not only fills the six-year gap since XLM-R but also provides a robust foundation for the next generation of multilingual NLP systems.

Check out the Paper, Model on Hugging Face, GitHub and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post Meet mmBERT: An Encoder-only Language Model Pretrained on 3T Tokens of Multilingual Text in over 1800 Languages and 2–4× Faster than Previous Models appeared first on MarkTechPost.
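As a rough usage sketch, an encoder-only model like mmBERT can be used through the transformers library, with mean pooling as one common way to turn token states into a sentence embedding. The checkpoint name below is an assumption based on the release description (a Johns Hopkins CLSP repo on Hugging Face); substitute the actual repo id.

# Hedged sketch: encoding multilingual sentences with an mmBERT checkpoint.
# "jhu-clsp/mmBERT-base" is an assumed repo id, not confirmed by this article.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["Encoders are efficient.", "Los codificadores son eficientes."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, hidden)

mask = batch["attention_mask"].unsqueeze(-1)          # ignore padding when averaging
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)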

