YouZum


DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model

arXiv:2509.09724v1 Announce Type: new Abstract: Technology opportunities are critical information that serves as a foundation for advancements in technology, industry, and innovation. This paper proposes a framework based on the temporal relationships between technologies to identify emerging technology opportunities. The proposed framework begins by extracting text from a patent dataset, followed by mapping text-based topics to discover inter-technology relationships. Technology opportunities are then identified by tracking changes in these topics over time. To enhance efficiency, the framework leverages a large language model to extract topics and employs a prompt for a chat-based language model to support the discovery of technology opportunities. The framework was evaluated using an artificial intelligence patent dataset provided by the United States Patent and Trademark Office. The experimental results suggest that artificial intelligence technology is evolving into forms that facilitate everyday accessibility. This approach demonstrates the potential of the proposed framework to identify future technology opportunities.
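To make the pipeline in the abstract easier to picture, here is a minimal, hypothetical Python sketch of the topic-tracking idea: extract topics per time period from patent text and flag topics that newly appear in a later period. The llm_extract_topics helper is a placeholder (naive keyword matching here), not the authors' implementation, and the yearly granularity is an assumption.

from collections import defaultdict

def llm_extract_topics(patent_text: str) -> set[str]:
    """Placeholder for the LLM topic-extraction step; here just keyword matching."""
    keywords = {"speech recognition", "image generation", "federated learning"}
    return {k for k in keywords if k in patent_text.lower()}

def topics_by_period(patents):
    """patents: iterable of (year, text) pairs."""
    periods = defaultdict(set)
    for year, text in patents:
        periods[year].update(llm_extract_topics(text))
    return periods

def emerging_topics(periods, earlier: int, later: int) -> set[str]:
    """Topics present in the later period but absent from the earlier one."""
    return periods[later] - periods[earlier]

patents = [
    (2019, "A method for federated learning across mobile devices."),
    (2023, "A system for image generation from natural language prompts."),
]
periods = topics_by_period(patents)
print(emerging_topics(periods, earlier=2019, later=2023))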


Unsupervised Hallucination Detection by Inspecting Reasoning Processes

arXiv:2509.10004v1 Announce Type: new Abstract: Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework that leverages internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement and obtains its contextualized embedding as an informative feature for training. Meanwhile, the uncertainty of each response is considered a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally low-cost, and works well even with little training data, making it suitable for real-time detection.
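As a rough illustration of the recipe in the abstract (not the authors' code), the sketch below prompts a small stand-in model to verify a statement, takes a hidden-state embedding as the feature, derives a crude confidence score as a soft pseudolabel, and fits a tiny linear probe. The model choice (gpt2), prompt wording, and uncertainty estimate are all assumptions made for illustration.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"   # stand-in model for illustration only
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)

def embed(statement: str) -> torch.Tensor:
    """Contextualized embedding of the last token of a verification prompt."""
    prompt = f"Is the following statement true? Think carefully.\nStatement: {statement}"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

def soft_label(statement: str) -> float:
    """Placeholder uncertainty signal: probability of ' True' vs ' False' as the next token."""
    prompt = f"Is the following statement true? Answer True or False.\nStatement: {statement}\nAnswer:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = lm(**inputs).logits[0, -1]
    t_id = tok(" True", add_special_tokens=False).input_ids[0]
    f_id = tok(" False", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[[t_id, f_id]], dim=0)[0].item()

# Train a tiny linear probe on (embedding, soft pseudolabel) pairs.
statements = ["Paris is the capital of France.", "The Moon is made of cheese."]
X = torch.stack([embed(s) for s in statements])
y = torch.tensor([soft_label(s) for s in statements])
probe = torch.nn.Linear(X.shape[1], 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    loss = torch.nn.functional.binary_cross_entropy_with_logits(probe(X).squeeze(-1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()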


Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes

arXiv:2507.13335v2 Announce Type: replace Abstract: Humour, as a complex language form, is derived from myriad aspects of life. Whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular form. We compare models’ joke explanation abilities from simple puns to complex topical humour that requires esoteric knowledge of real-world entities and events. To this end, we curate a dataset of 600 jokes across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (including reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most existing works on overly simple joke forms.
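The abstract does not give the evaluation prompt, but a generic zero-shot joke-explanation prompt might look like the following sketch; the wording is purely illustrative and is not taken from the paper.

def joke_explanation_prompt(joke: str) -> str:
    # Hypothetical zero-shot prompt for eliciting a joke explanation from an LLM.
    return (
        "Explain why the following joke is funny. Identify any wordplay, "
        "cultural references, or real-world knowledge the joke relies on.\n\n"
        f"Joke: {joke}\n\nExplanation:"
    )

print(joke_explanation_prompt("I used to be a banker, but I lost interest."))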


Google AI Releases VaultGemma: The Largest and Most Capable Open Model (1B-parameters) Trained from Scratch with Differential Privacy

Google AI Research and DeepMind have released VaultGemma 1B, the largest open-weight large language model trained entirely with differential privacy (DP). This development is a major step toward building AI models that are both powerful and privacy-preserving.

Why Do We Need Differential Privacy in LLMs?

Large language models trained on vast web-scale datasets are prone to memorization attacks, where sensitive or personally identifiable information can be extracted from the model. Studies have shown that verbatim training data can resurface, especially in open-weight releases. Differential Privacy offers a mathematical guarantee that prevents any single training example from significantly influencing the model. Unlike approaches that apply DP only during fine-tuning, VaultGemma enforces full private pretraining, ensuring that privacy protection begins at the foundational level.

https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf

What Is the Architecture of VaultGemma?

VaultGemma is architecturally similar to earlier Gemma models, but optimized for private training.

- Model size: 1B parameters, 26 layers.
- Transformer type: Decoder-only.
- Activations: GeGLU with feedforward dimension of 13,824.
- Attention: Multi-Query Attention (MQA) with global span of 1024 tokens.
- Normalization: RMSNorm in pre-norm configuration.
- Tokenizer: SentencePiece with a 256K vocabulary.

A notable change is the reduction of sequence length to 1024 tokens, which lowers compute costs and enables larger batch sizes under DP constraints.

What Data Was Used for Training?

VaultGemma was trained on the same 13 trillion-token dataset as Gemma 2, composed primarily of English text from web documents, code, and scientific articles. The dataset underwent several filtering stages to:

- Remove unsafe or sensitive content.
- Reduce personal information exposure.
- Prevent evaluation data contamination.

This ensures both safety and fairness in benchmarking.

How Was Differential Privacy Applied?

VaultGemma used DP-SGD (Differentially Private Stochastic Gradient Descent) with gradient clipping and Gaussian noise addition. Implementation was built on JAX Privacy and introduced optimizations for scalability:

- Vectorized per-example clipping for parallel efficiency.
- Gradient accumulation to simulate large batches.
- Truncated Poisson Subsampling integrated into the data loader for efficient on-the-fly sampling.

The model achieved a formal DP guarantee of (ε ≤ 2.0, δ ≤ 1.1e−10) at the sequence level (1024 tokens).

How Do Scaling Laws Work for Private Training?

Training large models under DP constraints requires new scaling strategies. The VaultGemma team developed DP-specific scaling laws with three innovations:

- Optimal learning rate modeling using quadratic fits across training runs.
- Parametric extrapolation of loss values to reduce reliance on intermediate checkpoints.
- Semi-parametric fits to generalize across model size, training steps, and noise-batch ratios.

This methodology enabled precise prediction of achievable loss and efficient resource use on the TPUv6e training cluster.

What Were the Training Configurations?

VaultGemma was trained on 2048 TPUv6e chips using GSPMD partitioning and MegaScale XLA compilation.

- Batch size: ~518K tokens.
- Training iterations: 100,000.
- Noise multiplier: 0.614.

The achieved loss was within 1% of predictions from the DP scaling law, validating the approach.
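Before looking at benchmark numbers, here is a minimal NumPy sketch of the DP-SGD mechanics described above: clip each example's gradient, then add Gaussian noise to the clipped sum. It is an illustration on a toy linear model, not Google's JAX Privacy implementation; only the noise multiplier value is taken from the article, everything else is made up.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))                              # toy features
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=256)    # toy targets
w = np.zeros(10)

clip_norm = 1.0            # C: max per-example gradient L2 norm (assumed)
noise_multiplier = 0.614   # sigma relative to C (value quoted in the article)
lr = 0.1
batch_size = 64

for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    # Per-example gradients of squared error for a linear model.
    residuals = X[idx] @ w - y[idx]                  # shape (B,)
    per_example_grads = residuals[:, None] * X[idx]  # shape (B, 10)
    # Clip each example's gradient to norm <= clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    per_example_grads *= np.minimum(1.0, clip_norm / (norms + 1e-12))
    # Sum, add Gaussian noise scaled to the clip norm, then average and step.
    noisy_sum = per_example_grads.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=w.shape
    )
    w -= lr * noisy_sum / batch_size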
How Does VaultGemma Perform Compared to Non-Private Models?

On academic benchmarks, VaultGemma trails its non-private counterparts but shows strong utility:

- ARC-C: 26.45 vs. 38.31 (Gemma-3 1B).
- PIQA: 68.0 vs. 70.51 (GPT-2 1.5B).
- TriviaQA (5-shot): 11.24 vs. 39.75 (Gemma-3 1B).

These results suggest that DP-trained models are currently comparable to non-private models from about five years ago. Importantly, memorization tests confirmed that no training data leakage was detectable in VaultGemma, unlike in non-private Gemma models.

https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf

Summary

In summary, VaultGemma 1B proves that large-scale language models can be trained with rigorous differential privacy guarantees without making them impractical to use. While a utility gap remains compared to non-private counterparts, the release of both the model and its training methodology provides the community with a strong foundation for advancing private AI. This work signals a shift toward building models that are not only capable but also inherently safe, transparent, and privacy-preserving.

Check out the Paper, Model on Hugging Face and Technical Details. The post Google AI Releases VaultGemma: The Largest and Most Capable Open Model (1B-parameters) Trained from Scratch with Differential Privacy appeared first on MarkTechPost.


BentoML Released llm-optimizer: An Open-Source AI Tool for Benchmarking and Optimizing LLM Inference

BentoML has recently released llm-optimizer, an open-source framework designed to streamline the benchmarking and performance tuning of self-hosted large language models (LLMs). The tool addresses a common challenge in LLM deployment: finding optimal configurations for latency, throughput, and cost without relying on manual trial-and-error.

Why is tuning LLM performance difficult?

Tuning LLM inference is a balancing act across many moving parts: batch size, framework choice (vLLM, SGLang, etc.), tensor parallelism, sequence lengths, and how well the hardware is utilized. Each of these factors can shift performance in different ways, which makes finding the right combination for speed, efficiency, and cost far from straightforward. Most teams still rely on repetitive trial-and-error testing, a process that is slow, inconsistent, and often inconclusive. For self-hosted deployments, the cost of getting it wrong is high: poorly tuned configurations can quickly translate into higher latency and wasted GPU resources.

How is llm-optimizer different?

llm-optimizer provides a structured way to explore the LLM performance landscape. It eliminates repetitive guesswork by enabling systematic benchmarking and automated search across possible configurations. Core capabilities include:

- Running standardized tests across inference frameworks such as vLLM and SGLang.
- Applying constraint-driven tuning, e.g., surfacing only configurations where time-to-first-token is below 200 ms.
- Automating parameter sweeps to identify optimal settings.
- Visualizing tradeoffs with dashboards for latency, throughput, and GPU utilization.

The framework is open source and available on GitHub.

How can developers explore results without running benchmarks locally?

Alongside the optimizer, BentoML released the LLM Performance Explorer, a browser-based interface powered by llm-optimizer. It provides pre-computed benchmark data for popular open-source models and lets users:

- Compare frameworks and configurations side by side.
- Filter by latency, throughput, or resource thresholds.
- Browse tradeoffs interactively without provisioning hardware.

How does llm-optimizer impact LLM deployment practices?

As the use of LLMs grows, getting the most out of deployments comes down to how well inference parameters are tuned. llm-optimizer lowers the complexity of this process, giving smaller teams access to optimization techniques that once required large-scale infrastructure and deep expertise. By providing standardized benchmarks and reproducible results, the framework adds much-needed transparency to the LLM space. It makes comparisons across models and frameworks more consistent, closing a long-standing gap in the community. Ultimately, BentoML's llm-optimizer brings a constraint-driven, benchmark-focused method to self-hosted LLM optimization, replacing ad-hoc trial and error with a systematic and repeatable workflow.

Check out the GitHub Page. The post BentoML Released llm-optimizer: An Open-Source AI Tool for Benchmarking and Optimizing LLM Inference appeared first on MarkTechPost.
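To illustrate what constraint-driven tuning means in practice, here is a hypothetical sketch of filtering a parameter sweep by a latency target and ranking the survivors by throughput. It does not use the real llm-optimizer API; the configuration fields and the fake metrics are invented for the example.

from itertools import product

def run_benchmark(framework, batch_size, tensor_parallel):
    """Stand-in for a real benchmark run; returns fabricated metrics."""
    ttft_ms = 50 + 2.0 * batch_size / tensor_parallel       # fake time to first token
    throughput = 100 * batch_size * 0.9 ** (batch_size / 16) * tensor_parallel
    return {"framework": framework, "batch_size": batch_size,
            "tensor_parallel": tensor_parallel,
            "ttft_ms": ttft_ms, "tokens_per_s": throughput}

grid = product(["vllm", "sglang"], [8, 16, 32, 64], [1, 2, 4])
results = [run_benchmark(fw, bs, tp) for fw, bs, tp in grid]

# Constraint-driven tuning: keep only configs that meet the latency SLO,
# then rank the survivors by throughput.
feasible = [r for r in results if r["ttft_ms"] < 200]
best = max(feasible, key=lambda r: r["tokens_per_s"])
print(best)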


How do AI models generate videos?

MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what's coming next. You can read more from the series here.

It's been a big year for video generation. In the last nine months OpenAI made Sora public, Google DeepMind launched Veo 3, and the video startup Runway launched Gen-4. All can produce video clips that are (almost) impossible to distinguish from actual filmed footage or CGI animation. This year also saw Netflix debut an AI visual effect in its show The Eternaut, the first time video generation has been used to make mass-market TV.

Sure, the clips you see in demo reels are cherry-picked to showcase a company's models at the top of their game. But with the technology in the hands of more users than ever before—Sora and Veo 3 are available in the ChatGPT and Gemini apps for paying subscribers—even the most casual filmmaker can now knock out something remarkable.

The downside is that creators are competing with AI slop, and social media feeds are filling up with faked news footage. Video generation also uses up a huge amount of energy, many times more than text or image generation.

With AI-generated videos everywhere, let's take a moment to talk about the tech that makes them work.

How do you generate a video?

Let's assume you're a casual user. There are now a range of high-end tools that allow pro video makers to insert video generation models into their workflows. But most people will use this technology in an app or via a website. You know the drill: "Hey, Gemini, make me a video of a unicorn eating spaghetti. Now make its horn take off like a rocket." What you get back will be hit or miss, and you'll typically need to ask the model to take another pass or 10 before you get more or less what you wanted.

So what's going on under the hood? Why is it hit or miss—and why does it take so much energy? The latest wave of video generation models are what's known as latent diffusion transformers. Yes, that's quite a mouthful. Let's unpack each part in turn, starting with diffusion.

What's a diffusion model?

Imagine taking an image and adding a random spattering of pixels to it. Take that pixel-spattered image and spatter it again and then again. Do that enough times and you will have turned the initial image into a random mess of pixels, like static on an old TV set.

A diffusion model is a neural network trained to reverse that process, turning random static into images. During training, it gets shown millions of images in various stages of pixelation. It learns how those images change each time new pixels are thrown at them and, thus, how to undo those changes.

The upshot is that when you ask a diffusion model to generate an image, it will start off with a random mess of pixels and step by step turn that mess into an image that is more or less similar to images in its training set.

But you don't want any image—you want the image you specified, typically with a text prompt. And so the diffusion model is paired with a second model—such as a large language model (LLM) trained to match images with text descriptions—that guides each step of the cleanup process, pushing the diffusion model toward images that the large language model considers a good match to the prompt.

An aside: This LLM isn't pulling the links between text and images out of thin air.
Most text-to-image and text-to-video models today are trained on large data sets that contain billions of pairings of text and images or text and video scraped from the internet (a practice many creators are very unhappy about). This means that what you get from such models is a distillation of the world as it's represented online, distorted by prejudice (and pornography).

It's easiest to imagine diffusion models working with images. But the technique can be used with many kinds of data, including audio and video. To generate movie clips, a diffusion model must clean up sequences of images—the consecutive frames of a video—instead of just one image.

What's a latent diffusion model?

All this takes a huge amount of compute (read: energy). That's why most diffusion models used for video generation use a technique called latent diffusion. Instead of processing raw data—the millions of pixels in each video frame—the model works in what's known as a latent space, in which the video frames (and text prompt) are compressed into a mathematical code that captures just the essential features of the data and throws out the rest.

A similar thing happens whenever you stream a video over the internet: A video is sent from a server to your screen in a compressed format to make it get to you faster, and when it arrives, your computer or TV will convert it back into a watchable video.

And so the final step is to decompress what the latent diffusion process has come up with. Once the compressed frames of random static have been turned into the compressed frames of a video that the LLM guide considers a good match for the user's prompt, the compressed video gets converted into something you can watch.

With latent diffusion, the diffusion process works more or less the way it would for an image. The difference is that the pixelated video frames are now mathematical encodings of those frames rather than the frames themselves. This makes latent diffusion far more efficient than a typical diffusion model. (Even so, video generation still uses more energy than image or text generation. There's just an eye-popping amount of computation involved.)

What's a latent diffusion transformer?

Still with me? There's one more piece to the puzzle—and that's how to make sure the diffusion process produces a sequence of frames that are consistent, maintaining objects and lighting and so on from one frame
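For readers who want to see the noising-and-denoising loop spelled out, here is a toy NumPy sketch of the idea described above. The "denoiser" is a placeholder function standing in for the trained neural network, so the output is meaningless; the point is only the structure of the forward (add noise) and reverse (step back from static) processes.

import numpy as np

rng = np.random.default_rng(0)
T = 50                                      # number of noising steps
betas = np.linspace(1e-4, 0.05, T)          # noise added at each step
alphas_bar = np.cumprod(1.0 - betas)        # cumulative signal retention

def add_noise(x0, t):
    """Forward process: blend the clean data x0 with Gaussian noise at step t."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * noise, noise

def toy_denoiser(x_t, t):
    """Stand-in for the trained network: guesses the noise in x_t.
    A real model would be a neural net conditioned on t (and a text prompt)."""
    return x_t - np.tanh(x_t)   # arbitrary placeholder prediction

def sample(shape):
    """Reverse process: start from static and step toward a generated 'frame'."""
    x = rng.normal(size=shape)
    for t in reversed(range(T)):
        eps_hat = toy_denoiser(x, t)
        # Estimate the clean signal implied by the predicted noise, then
        # re-noise it to the previous (less noisy) step.
        x0_hat = (x - np.sqrt(1 - alphas_bar[t]) * eps_hat) / np.sqrt(alphas_bar[t])
        if t > 0:
            x, _ = add_noise(x0_hat, t - 1)
        else:
            x = x0_hat
    return x

frame = sample((8, 8))   # a tiny 8x8 "frame" of generated values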


The Download: America’s gun crisis, and how AI video models work

This is today's edition of The Download, our weekday newsletter that provides a daily dose of what's going on in the world of technology.

We can't "make American children healthy again" without tackling the gun crisis

This week, the Trump administration released a strategy for improving the health and well-being of American children. The report was titled—you guessed it—Make Our Children Healthy Again. It suggests American children should be eating more healthily. And they should be getting more exercise. But there's a glaring omission. The leading cause of death for American children and teenagers isn't ultraprocessed food or exposure to some chemical. It's gun violence.

This week's news of yet more high-profile shootings at schools in the US throws this disconnect into even sharper relief. Experts believe it is time to treat gun violence in the US as what it is: a public health crisis. Read the full story.

—Jessica Hamzelou

This article first appeared in The Checkup, MIT Technology Review's weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

How do AI models generate videos?

It's been a big year for video generation. In the last nine months OpenAI made Sora public, Google DeepMind launched Veo 3, and the video startup Runway launched Gen-4. All can produce video clips that are (almost) impossible to distinguish from actual filmed footage or CGI animation.

The downside is that creators are competing with AI slop, and social media feeds are filling up with faked news footage. Video generation also uses up a huge amount of energy, many times more than text or image generation.

With AI-generated videos everywhere, let's take a moment to talk about the tech that makes them work. Read the full story.

—Will Douglas Heaven

This article is part of MIT Technology Review Explains, our series untangling the complex, messy world of technology to help you understand what's coming next. You can read more from the series here.

Meet our 2025 Innovator of the Year: Sneha Goenka

Up to a quarter of children entering intensive care have undiagnosed genetic conditions. To be treated properly, they must first get diagnoses—which means having their genomes sequenced. This process typically takes up to seven weeks. Sadly, that's often too slow to save a critically ill child.

Hospitals may soon have a faster option, thanks to a groundbreaking system built in part by Sneha Goenka, an assistant professor of electrical and computer engineering at Princeton—and MIT Technology Review's 2025 Innovator of the Year. Read all about Goenka and her work in this profile.

—Helen Thomson

As well as our Innovator of the Year, Goenka is one of the biotech honorees on our 35 Innovators Under 35 list for 2025. Meet the rest of our biotech and materials science innovators, and the full list here.

The must-reads

I've combed the internet to find you today's most fun/important/scary/fascinating stories about technology.

1 OpenAI and Microsoft have agreed a revised deal
But haven't actually revealed any details of said deal. (Axios)
+ The news comes as OpenAI keeps pursuing its for-profit pivot. (Ars Technica)
+ The world's largest startup is going to need more paying users soon. (WSJ $)

2 A child has died from a measles complication in Los Angeles
They had contracted the virus before they were old enough to be vaccinated. (Ars Technica)
+ Infants are best protected by community immunity. (LA Times $)
+ They'd originally recovered from measles before developing the condition. (CNN)
+ Why childhood vaccines are a public health success story. (MIT Technology Review)

3 Ukrainian drone attacks triggered internet blackouts in Russia
The Kremlin cut internet access in a bid to thwart the mobile-guided drones. (FT $)
+ The UK is poised to mass-produce drones to aid Ukraine. (Sky News)
+ On the ground in Ukraine's largest Starlink repair shop. (MIT Technology Review)

4 Demis Hassabis says AI may slash drug discovery time to under a year
Or perhaps even faster. (Bloomberg $)
+ But there's good reason to be skeptical of that claim. (FT $)
+ An AI-driven "factory of drugs" claims to have hit a big milestone. (MIT Technology Review)

5 How chatbots alter how we think
We shouldn't outsource our critical thinking to them. (Undark)
+ AI companies have stopped warning you that their chatbots aren't doctors. (MIT Technology Review)

6 Fraudsters are threatening small businesses with one-star reviews
Online reviews can make or break fledgling enterprises, and scammers know it. (NYT $)

7 Why humanoid robots aren't taking off any time soon
The industry has a major hype problem. (IEEE Spectrum)
+ Chinese tech giant Ant Group showed off its own humanoid machine. (The Verge)
+ Why the humanoid workforce is running late. (MIT Technology Review)

8 Encyclopedia Britannica and Merriam-Webster are suing Perplexity
In yet another case of alleged copyright infringement. (Reuters)
+ What comes next for AI copyright lawsuits? (MIT Technology Review)

9 Where we're most likely to find extraterrestrial life in the next decade
Warning: Hollywood may have given us unrealistic expectations. (BBC)

10 Want to build a trillion-dollar company?
Then kiss your social life goodbye. (WSJ $)

Quote of the day

"Nooooo I'm going to have to use my brain again and write 100% of my code like a caveman from December 2024."

—A Hacker News commenter jokes about a service outage that left Anthropic users unable to access its AI coding tools, Ars Technica reports.

One more thing

What Africa needs to do to become a major AI player

Africa is still early in the process of adopting AI technologies. But researchers say the continent is uniquely hospitable to it for several reasons, including a relatively young and increasingly well-educated population, a rapidly growing ecosystem of AI startups, and lots of potential consumers. However, ambitious efforts to develop AI tools that answer the needs of Africans face numerous hurdles. Read our story to learn what they are, and how they could be overcome.

—Abdullahi Tsanni

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet


How to Build a Multilingual OCR AI Agent in Python with EasyOCR and OpenCV

In this tutorial, we build an Advanced OCR AI Agent in Google Colab using EasyOCR, OpenCV, and Pillow, running fully offline with GPU acceleration. The agent includes a preprocessing pipeline with contrast enhancement (CLAHE), denoising, sharpening, and adaptive thresholding to improve recognition accuracy. Beyond basic OCR, we filter results by confidence, generate text statistics, and perform pattern detection (emails, URLs, dates, phone numbers) along with simple language hints. The design also supports batch processing, visualization with bounding boxes, and structured exports for flexible usage. Check out the FULL CODES here.

!pip install easyocr opencv-python pillow matplotlib

import easyocr
import cv2
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter
import matplotlib.pyplot as plt
import os
import json
from typing import List, Dict, Tuple, Optional
import re
from google.colab import files
import io

We start by installing the required libraries, EasyOCR, OpenCV, Pillow, and Matplotlib, to set up our environment. We then import all necessary modules so we can handle image preprocessing, OCR, visualization, and file operations seamlessly. Check out the FULL CODES here.

class AdvancedOCRAgent:
    """
    Advanced OCR AI Agent with preprocessing, multi-language support,
    and intelligent text extraction capabilities.
    """

    def __init__(self, languages: List[str] = ['en'], gpu: bool = True):
        """Initialize OCR agent with specified languages."""
        print("Initializing Advanced OCR Agent...")
        self.languages = languages
        self.reader = easyocr.Reader(languages, gpu=gpu)
        self.confidence_threshold = 0.5
        print(f"OCR Agent ready! Languages: {languages}")

    def upload_image(self) -> Optional[str]:
        """Upload image file through Colab interface."""
        print("Upload your image file:")
        uploaded = files.upload()
        if uploaded:
            filename = list(uploaded.keys())[0]
            print(f"Uploaded: {filename}")
            return filename
        return None

    def preprocess_image(self, image: np.ndarray, enhance: bool = True) -> np.ndarray:
        """Advanced image preprocessing for better OCR accuracy."""
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image.copy()
        if enhance:
            # Contrast enhancement, denoising, and sharpening before binarization.
            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
            gray = clahe.apply(gray)
            gray = cv2.fastNlMeansDenoising(gray)
            kernel = np.array([[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]])
            gray = cv2.filter2D(gray, -1, kernel)
        binary = cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY, 11, 2
        )
        return binary

    def extract_text(self, image_path: str, preprocess: bool = True) -> Dict:
        """Extract text from image with advanced processing."""
        print(f"Processing image: {image_path}")
        image = cv2.imread(image_path)
        if image is None:
            raise ValueError(f"Could not load image: {image_path}")
        if preprocess:
            processed_image = self.preprocess_image(image)
        else:
            processed_image = image
        results = self.reader.readtext(processed_image)
        extracted_data = {
            'raw_results': results,
            'filtered_results': [],
            'full_text': '',
            'confidence_stats': {},
            'word_count': 0,
            'line_count': 0
        }
        high_confidence_text = []
        confidences = []
        for (bbox, text, confidence) in results:
            if confidence >= self.confidence_threshold:
                extracted_data['filtered_results'].append({
                    'text': text,
                    'confidence': confidence,
                    'bbox': bbox
                })
                high_confidence_text.append(text)
                confidences.append(confidence)
        extracted_data['full_text'] = ' '.join(high_confidence_text)
        extracted_data['word_count'] = len(extracted_data['full_text'].split())
        extracted_data['line_count'] = len(high_confidence_text)
        if confidences:
            extracted_data['confidence_stats'] = {
                'mean': np.mean(confidences),
                'min': np.min(confidences),
                'max': np.max(confidences),
                'std': np.std(confidences)
            }
        return extracted_data

    def visualize_results(self, image_path: str, results: Dict, show_bbox: bool = True):
        """Visualize OCR results with bounding boxes."""
        image = cv2.imread(image_path)
        image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        plt.figure(figsize=(15, 10))
        if show_bbox:
            plt.subplot(2, 2, 1)
            img_with_boxes = image_rgb.copy()
            for item in results['filtered_results']:
                bbox = np.array(item['bbox']).astype(int)
                cv2.polylines(img_with_boxes, [bbox], True, (255, 0, 0), 2)
                x, y = bbox[0]
                cv2.putText(img_with_boxes, f"{item['confidence']:.2f}", (x, y - 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 1)
            plt.imshow(img_with_boxes)
            plt.title("OCR Results with Bounding Boxes")
            plt.axis('off')
        plt.subplot(2, 2, 2)
        processed = self.preprocess_image(image)
        plt.imshow(processed, cmap='gray')
        plt.title("Preprocessed Image")
        plt.axis('off')
        plt.subplot(2, 2, 3)
        confidences = [item['confidence'] for item in results['filtered_results']]
        if confidences:
            plt.hist(confidences, bins=20, alpha=0.7, color='blue')
            plt.xlabel('Confidence Score')
            plt.ylabel('Frequency')
            plt.title('Confidence Score Distribution')
            plt.axvline(self.confidence_threshold, color='red', linestyle='--',
                        label=f'Threshold: {self.confidence_threshold}')
            plt.legend()
        plt.subplot(2, 2, 4)
        stats = results['confidence_stats']
        if stats:
            labels = ['Mean', 'Min', 'Max']
            values = [stats['mean'], stats['min'], stats['max']]
            plt.bar(labels, values, color=['green', 'red', 'blue'])
            plt.ylabel('Confidence Score')
            plt.title('Confidence Statistics')
            plt.ylim(0, 1)
        plt.tight_layout()
        plt.show()

    def smart_text_analysis(self, text: str) -> Dict:
        """Perform intelligent analysis of extracted text."""
        analysis = {
            'language_detection': 'unknown',
            'text_type': 'unknown',
            'key_info': {},
            'patterns': []
        }
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
        phone_pattern = r'(\+\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
        url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
        date_pattern = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b'
        patterns = {
            'emails': re.findall(email_pattern, text, re.IGNORECASE),
            'phones': re.findall(phone_pattern, text),
            'urls': re.findall(url_pattern, text, re.IGNORECASE),
            'dates': re.findall(date_pattern, text)
        }
        analysis['patterns'] = {k: v for k, v in patterns.items() if v}
        if any(patterns.values()):
            if patterns.get('emails') or patterns.get('phones'):
                analysis['text_type'] = 'contact_info'
            elif patterns.get('urls'):
                analysis['text_type'] = 'web_content'
            elif patterns.get('dates'):
                analysis['text_type'] = 'document_with_dates'
        if re.search(r'[а-яё]', text.lower()):
            analysis['language_detection'] = 'russian'
        elif re.search(r'[àáâãäåæçèéêëìíîïñòóôõöøùúûüý]', text.lower()):
            analysis['language_detection'] = 'romance_language'
        elif re.search(r'[一-龯]', text):
            analysis['language_detection'] = 'chinese'
        elif re.search(r'[ひらがなカタカナ]', text):
            analysis['language_detection'] = 'japanese'
        elif re.search(r'[a-zA-Z]', text):
            analysis['language_detection'] = 'latin_based'
        return analysis

    def process_batch(self, image_folder: str) -> List[Dict]:
        """Process multiple images in batch."""
        results = []
        supported_formats = ('.png', '.jpg', '.jpeg', '.bmp', '.tiff')
        for filename in os.listdir(image_folder):
            if filename.lower().endswith(supported_formats):
                image_path = os.path.join(image_folder, filename)
                try:
                    result = self.extract_text(image_path)
                    result['filename'] = filename
                    results.append(result)
                    print(f"Processed: {filename}")
                except Exception as e:
                    print(f"Error processing {filename}: {str(e)}")
        return results

    def export_results(self, results: Dict, format: str = 'json') -> str:
        """Export results in specified format."""
        if format.lower() == 'json':
            output = json.dumps(results, indent=2, ensure_ascii=False)
            filename = 'ocr_results.json'
        elif format.lower() == 'txt':
            output = results['full_text']
            filename = 'extracted_text.txt'
        else:
            raise ValueError("Supported formats: 'json', 'txt'")
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(output)
        print(f"Results exported to: {filename}")
        return filename

We define an AdvancedOCRAgent that we initialize with multilingual EasyOCR and a GPU, and we set a confidence threshold to control output quality. We preprocess images (CLAHE, denoise, sharpen, adaptive threshold), extract text, visualize bounding boxes and confidence, run smart pattern/language analysis, support batch folders, and export results as JSON or TXT. Check out the FULL CODES here.

def demo_ocr_agent():
    """Demonstrate the OCR agent capabilities."""
    print("Advanced OCR AI Agent Demo")
    print("=" * 50)
    ocr = AdvancedOCRAgent(languages=['en'], gpu=True)
    image_path = ocr.upload_image()
    if image_path:
        try:
            results = ocr.extract_text(image_path, preprocess=True)
            print("\nOCR Results:")
            print(f"Words detected: {results['word_count']}")
            print(f"Lines detected: {results['line_count']}")
            print(f"Average confidence: {results['confidence_stats'].get('mean', 0):.2f}")
            print("\nExtracted Text:")
            print("-" * 30)
            print(results['full_text'])
            print("-" * 30)
            analysis = ocr.smart_text_analysis(results['full_text'])
            print("\nSmart Analysis:")
            print(f"Detected text type: {analysis['text_type']}")
            print(f"Language hints: {analysis['language_detection']}")
            if analysis['patterns']:
                print(f"Found patterns: {list(analysis['patterns'].keys())}")
            ocr.visualize_results(image_path, results)
            ocr.export_results(results, 'json')
        except Exception as e:
            print(f"Error: {str(e)}")
    else:
        print("No image uploaded. Please try again.")

if __name__ == "__main__":
    demo_ocr_agent()

We create a demo function that walks us through the full OCR workflow: we initialize the agent with English and GPU support, upload an image, preprocess it, and extract text with confidence stats. We then display


IBM AI Research Releases Two English Granite Embedding Models, Both Based on the ModernBERT Architecture

IBM has quietly built a strong presence in the open-source AI ecosystem, and its latest release shows why it shouldn't be overlooked. The company has introduced two new embedding models—granite-embedding-english-r2 and granite-embedding-small-english-r2—designed specifically for high-performance retrieval and RAG (retrieval-augmented generation) systems. These models are not only compact and efficient but also licensed under Apache 2.0, making them ready for commercial deployment.

What Models Did IBM Release?

The two models target different compute budgets. The larger granite-embedding-english-r2 has 149 million parameters with an embedding size of 768, built on a 22-layer ModernBERT encoder. Its smaller counterpart, granite-embedding-small-english-r2, comes in at just 47 million parameters with an embedding size of 384, using a 12-layer ModernBERT encoder. Despite their differences in size, both support a maximum context length of 8192 tokens, a major upgrade from the first-generation Granite embeddings. This long-context capability makes them highly suitable for enterprise workloads involving long documents and complex retrieval tasks.

https://arxiv.org/abs/2508.21085

What's Inside the Architecture?

Both models are built on the ModernBERT backbone, which introduces several optimizations:

- Alternating global and local attention to balance efficiency with long-range dependencies.
- Rotary positional embeddings (RoPE) tuned for positional interpolation, enabling longer context windows.
- FlashAttention 2 to improve memory usage and throughput at inference time.

IBM also trained these models with a multi-stage pipeline. The process started with masked language pretraining on a two-trillion-token dataset sourced from web, Wikipedia, PubMed, BookCorpus, and internal IBM technical documents. This was followed by context extension from 1k to 8k tokens, contrastive learning with distillation from Mistral-7B, and domain-specific tuning for conversational, tabular, and code retrieval tasks.

How Do They Perform on Benchmarks?

The Granite R2 models deliver strong results across widely used retrieval benchmarks. On MTEB-v2 and BEIR, the larger granite-embedding-english-r2 outperforms similarly sized models like BGE Base, E5, and Arctic Embed. The smaller model, granite-embedding-small-english-r2, achieves accuracy close to models two to three times larger, making it particularly attractive for latency-sensitive workloads.

https://arxiv.org/abs/2508.21085

Both models also perform well in specialized domains:

- Long-document retrieval (MLDR, LongEmbed), where 8k context support is critical.
- Table retrieval tasks (OTT-QA, FinQA, OpenWikiTables), where structured reasoning is required.
- Code retrieval (CoIR), handling both text-to-code and code-to-text queries.

Are They Fast Enough for Large-Scale Use?

Efficiency is one of the standout aspects of these models. On an Nvidia H100 GPU, the granite-embedding-small-english-r2 encodes nearly 200 documents per second, which is significantly faster than BGE Small and E5 Small. The larger granite-embedding-english-r2 also reaches 144 documents per second, outperforming many ModernBERT-based alternatives. Crucially, these models remain practical even on CPUs, allowing enterprises to run them in less GPU-intensive environments. This balance of speed, compact size, and retrieval accuracy makes them highly adaptable for real-world deployment.

What Does This Mean for Retrieval in Practice?
IBM's Granite Embedding R2 models demonstrate that embedding systems don't need massive parameter counts to be effective. They combine long-context support, benchmark-leading accuracy, and high throughput in compact architectures. For companies building retrieval pipelines, knowledge management systems, or RAG workflows, Granite R2 provides a production-ready, commercially viable alternative to existing open-source options.

https://arxiv.org/abs/2508.21085

Summary

In short, IBM's Granite Embedding R2 models strike an effective balance between compact design, long-context capability, and strong retrieval performance. With throughput optimized for both GPU and CPU environments, and an Apache 2.0 license that enables unrestricted commercial use, they present a practical alternative to bulkier open-source embeddings. For enterprises deploying RAG, search, or large-scale knowledge systems, Granite R2 stands out as an efficient and production-ready option.

Check out the Paper, granite-embedding-small-english-r2 and granite-embedding-english-r2. The post IBM AI Research Releases Two English Granite Embedding Models, Both Based on the ModernBERT Architecture appeared first on MarkTechPost.
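For reference, a retrieval call with these models should look roughly like the sentence-transformers sketch below. The Hugging Face model ID is an assumption based on the model name in the article; check the ibm-granite organization on Hugging Face for the exact repository.

from sentence_transformers import SentenceTransformer
import numpy as np

# Assumed repository name; verify against the ibm-granite org on Hugging Face.
model = SentenceTransformer("ibm-granite/granite-embedding-english-r2")

docs = [
    "Granite Embedding R2 supports a context length of 8192 tokens.",
    "DP-SGD adds Gaussian noise to clipped per-example gradients.",
]
query = "How long a context do the Granite R2 embedding models support?"

doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)

scores = np.dot(doc_vecs, query_vec[0])   # cosine similarity (vectors are normalized)
print(docs[int(np.argmax(scores))])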


DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech

arXiv:2509.09631v1 Announce Type: cross Abstract: Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample, requiring not only speaker adaptation but also accurate modeling of prosodic attributes. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. Discrete codec representations have been widely adopted for speech synthesis, and recent works have begun to explore diffusion models in purely discrete settings, suggesting the potential of discrete generative modeling for speech synthesis. However, existing flow-matching methods typically embed these discrete tokens into a continuous space and apply continuous flow matching, which may not fully leverage the advantages of discrete representations. To address these challenges, we introduce DiFlow-TTS, which, to the best of our knowledge, is the first model to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS explicitly models factorized speech attributes within a compact and unified architecture. It leverages in-context learning by conditioning on textual content, along with prosodic and acoustic attributes extracted from a reference speech, enabling effective attribute cloning in a zero-shot setting. In addition, the model employs a factorized flow prediction mechanism with distinct heads for prosody and acoustic details, allowing it to learn aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS achieves promising performance in several key metrics, including naturalness, prosody, preservation of speaker style, and energy control. It also maintains a compact model size and achieves low-latency inference, generating speech up to 25.8 times faster than the latest existing baselines.
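As a loose illustration of two ideas mentioned in the abstract, mask-style discrete corruption and factorized prediction heads for prosody and acoustic tokens, here is a toy PyTorch sketch. It is not the DiFlow-TTS architecture; vocabulary sizes, dimensions, the corruption schedule, and the conditioning on text and reference speech (omitted here) are all assumptions.

import torch
import torch.nn as nn

PROSODY_VOCAB, ACOUSTIC_VOCAB, MASK = 256, 1024, 0   # token 0 reserved as [MASK]
D = 128

class FactorizedHeadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.prosody_emb = nn.Embedding(PROSODY_VOCAB, D)
        self.acoustic_emb = nn.Embedding(ACOUSTIC_VOCAB, D)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.prosody_head = nn.Linear(D, PROSODY_VOCAB)    # aspect-specific heads
        self.acoustic_head = nn.Linear(D, ACOUSTIC_VOCAB)

    def forward(self, prosody_tokens, acoustic_tokens):
        h = self.backbone(self.prosody_emb(prosody_tokens) +
                          self.acoustic_emb(acoustic_tokens))
        return self.prosody_head(h), self.acoustic_head(h)

def training_step(model, prosody, acoustic):
    # Sample a "time" t and mask each token with probability (1 - t),
    # then train the model to recover the clean tokens with cross-entropy.
    t = torch.rand(prosody.shape[0], 1)
    keep = torch.rand_like(prosody, dtype=torch.float) < t
    prosody_in = torch.where(keep, prosody, torch.full_like(prosody, MASK))
    acoustic_in = torch.where(keep, acoustic, torch.full_like(acoustic, MASK))
    p_logits, a_logits = model(prosody_in, acoustic_in)
    loss = (nn.functional.cross_entropy(p_logits.transpose(1, 2), prosody) +
            nn.functional.cross_entropy(a_logits.transpose(1, 2), acoustic))
    return loss

model = FactorizedHeadModel()
prosody = torch.randint(1, PROSODY_VOCAB, (2, 50))     # fake token sequences
acoustic = torch.randint(1, ACOUSTIC_VOCAB, (2, 50))
print(training_step(model, prosody, acoustic).item())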
