A Coding Implementation to Build a Complete Self-Hosted LLM Workflow with Ollama, REST API, and Gradio Chat Interface

In this tutorial, we implement a fully functional Ollama environment inside Google Colab to replicate a self-hosted LLM workflow. We begin by installing Ollama directly on the Colab VM using the official Linux installer and then launch the Ollama server in the background to expose the HTTP API on localhost:11434. After verifying the service, we pull lightweight models such as qwen2.5:0.5b-instruct or llama3.2:1b, which balance resource constraints with usability in a CPU-only environment. To interact with these models programmatically, we use the /api/chat endpoint via Python's requests module with streaming enabled, which allows token-level output to be captured incrementally. Finally, we layer a Gradio-based UI on top of this client so we can issue prompts, maintain multi-turn history, configure parameters like temperature and context size, and view results in real time. Check out the Full Codes here.

import os, sys, subprocess, time, json, requests, textwrap
from pathlib import Path

def sh(cmd, check=True):
    """Run a shell command, stream output."""
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT, text=True)
    for line in p.stdout:
        print(line, end="")
    p.wait()
    if check and p.returncode != 0:
        raise RuntimeError(f"Command failed: {cmd}")

if not Path("/usr/local/bin/ollama").exists() and not Path("/usr/bin/ollama").exists():
    print("Installing Ollama ...")
    sh("curl -fsSL https://ollama.com/install.sh | sh")
else:
    print("Ollama already installed.")

try:
    import gradio
except Exception:
    print("Installing Gradio ...")
    sh("pip -q install gradio==4.44.0")

We first check if Ollama is already installed on the system, and if not, we install it using the official script. At the same time, we ensure Gradio is available by importing it or installing the required version when missing. This way, we prepare our Colab environment for running the chat interface smoothly. Check out the Full Codes here.

def start_ollama():
    try:
        requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
        print("Ollama server already running.")
        return None
    except Exception:
        pass
    print("Starting Ollama server ...")
    proc = subprocess.Popen(["ollama", "serve"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for _ in range(60):
        time.sleep(1)
        try:
            r = requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
            if r.ok:
                print("Ollama server is up.")
                break
        except Exception:
            pass
    else:
        raise RuntimeError("Ollama did not start in time.")
    return proc

server_proc = start_ollama()

We start the Ollama server in the background and keep checking its health endpoint until it responds successfully. By doing this, we ensure the server is running and ready before sending any API requests. Check out the Full Codes here.

MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:0.5b-instruct")
print(f"Using model: {MODEL}")

try:
    tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json()
    have = any(m.get("name") == MODEL for m in tags.get("models", []))
except Exception:
    have = False

if not have:
    print(f"Pulling model {MODEL} (first time only) ...")
    sh(f"ollama pull {MODEL}")

We define the default model to use, check if it is already available on the Ollama server, and if not, we automatically pull it. This ensures that the chosen model is ready before we start running any chat sessions. Check out the Full Codes here.
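Before we build the streaming client below, note that the same endpoint can also be called with a single, non-streaming request. The following minimal sketch is not part of the original tutorial; it only illustrates the request and response shape, assuming the server started above is running and MODEL has already been pulled.

import requests

# Minimal non-streaming call to the local Ollama chat endpoint (illustrative sketch;
# assumes the server launched above is up and MODEL is already pulled).
resp = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "stream": False,  # ask for one JSON object instead of a token stream
        "options": {"temperature": 0.2},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])  # the assistant reply as plain text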
OLLAMA_URL = "http://127.0.0.1:11434/api/chat"

def ollama_chat_stream(messages, model=MODEL, temperature=0.2, num_ctx=None):
    """Yield streaming text chunks from Ollama /api/chat."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": True,
        "options": {"temperature": float(temperature)}
    }
    if num_ctx:
        payload["options"]["num_ctx"] = int(num_ctx)
    with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            data = json.loads(line.decode("utf-8"))
            if "message" in data and "content" in data["message"]:
                yield data["message"]["content"]
            if data.get("done"):
                break

We create a streaming client for the Ollama /api/chat endpoint, where we send messages as JSON payloads and yield tokens as they arrive. This lets us handle responses incrementally, so we see the model's output in real time instead of waiting for the full completion. Check out the Full Codes here.

def smoke_test():
    print("\nSmoke test:")
    sys_msg = {"role": "system", "content": "You are concise. Use short bullets."}
    user_msg = {"role": "user", "content": "Give 3 quick tips to sleep better."}
    out = []
    for chunk in ollama_chat_stream([sys_msg, user_msg], temperature=0.3):
        print(chunk, end="")
        out.append(chunk)
    print("\nDone.\n")

try:
    smoke_test()
except Exception as e:
    print("Smoke test skipped:", e)

We run a quick smoke test by sending a simple prompt through our streaming client to confirm that the model responds correctly. This helps us verify that Ollama is installed, the server is running, and the chosen model is working before we build the full chat UI. Check out the Full Codes here.

import gradio as gr

SYSTEM_PROMPT = "You are a helpful, crisp assistant. Prefer bullets when helpful."

def chat_fn(message, history, temperature, num_ctx):
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
    for u, a in history:
        if u:
            msgs.append({"role": "user", "content": u})
        if a:
            msgs.append({"role": "assistant", "content": a})
    msgs.append({"role": "user", "content": message})
    acc = ""
    try:
        for part in ollama_chat_stream(msgs, model=MODEL,
                                       temperature=temperature,
                                       num_ctx=num_ctx or None):
            acc += part
            yield acc
    except Exception as e:
        yield f"Error: {e}"

with gr.Blocks(title="Ollama Chat (Colab)", fill_height=True) as demo:
    gr.Markdown("# Ollama Chat (Colab)\nSmall local-ish LLM via Ollama + Gradio.\n")
    with gr.Row():
        temp = gr.Slider(0.0, 1.0, value=0.3, step=0.1, label="Temperature")
        num_ctx = gr.Slider(512, 8192, value=2048, step=256, label="Context Tokens (num_ctx)")
    chat = gr.Chatbot(height=460)
    msg = gr.Textbox(label="Your message", placeholder="Ask anything...", lines=3)
    clear = gr.Button("Clear")

    def user_send(m, h):
        m = (m or "").strip()
        if not m:
            return "", h
        return "", h + [[m, None]]

    def bot_reply(h, temperature, num_ctx):
        u = h[-1][0]
        stream = chat_fn(u, h[:-1], temperature, int(num_ctx))
        acc = ""
        for partial in stream:
            acc = partial
            h[-1][1] = acc
            yield h

    msg.submit(user_send, [msg, chat], [msg, chat]).then(
        bot_reply, [chat, temp, num_ctx], [chat]
    )
    clear.click(lambda: None, None, chat)

print("Launching Gradio ...")
demo.launch(share=True)

We integrate Gradio to build an interactive chat UI on top of the Ollama server, where user input and conversation history are converted into the correct message format and streamed back as model responses.
The sliders let us adjust parameters like temperature and context length, while the chat box and clear button provide a simple, real-time interface for testing different prompts.

In conclusion, we establish a reproducible pipeline for running Ollama in Colab: installation, server startup, model management, API access, and user interface integration. The system uses Ollama's REST API as the core interaction layer, providing both command-line and Python streaming access, while Gradio handles session persistence and chat rendering. This approach preserves the "self-hosted" design described in the original guide while adapting it to Colab's hosted, CPU-only environment.
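When we are done experimenting, we can also shut down the background server from the same notebook. A minimal cleanup sketch, assuming server_proc is the handle returned by start_ollama() above (it is None when a server was already running):

import subprocess

# Stop the background "ollama serve" process we launched, if we own it.
if server_proc is not None:
    server_proc.terminate()  # ask the server to exit
    try:
        server_proc.wait(timeout=10)
    except subprocess.TimeoutExpired:
        server_proc.kill()  # force-stop if it does not exit in time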

A Coding Implementation to Build a Complete Self-Hosted LLM Workflow with Ollama, REST API, and Gradio Chat Interface Read more »


What do Speech Foundation Models Learn? Analysis and Applications

arXiv:2508.12255v1 Announce Type: new Abstract: Speech foundation models (SFMs) are designed to serve as general-purpose representations for a wide range of speech-processing tasks. The last five years have seen an influx of increasingly successful self-supervised and supervised pre-trained models with impressive performance on various downstream tasks. Although the zoo of SFMs continues to grow, our understanding of the knowledge they acquire lags behind. This thesis presents a lightweight analysis framework using statistical tools and training-free tasks to investigate the acoustic and linguistic knowledge encoded in SFM layers. We conduct a comparative study across multiple SFMs and statistical tools. Our study also shows that the analytical insights have concrete implications for downstream task performance. The effectiveness of an SFM is ultimately determined by its performance on speech applications. Yet it remains unclear whether the benefits extend to spoken language understanding (SLU) tasks that require a deeper understanding than widely studied ones, such as speech recognition. The limited exploration of SLU is primarily due to a lack of relevant datasets. To alleviate that, this thesis contributes tasks, specifically spoken named entity recognition (NER) and named entity localization (NEL), to the Spoken Language Understanding Evaluation benchmark. We develop SFM-based approaches for NER and NEL, and find that end-to-end (E2E) models leveraging SFMs can surpass traditional cascaded (speech recognition followed by a text model) approaches. Further, we evaluate E2E SLU models across SFMs and adaptation strategies to assess the impact on task performance. Collectively, this thesis tackles previously unanswered questions about SFMs, providing tools and datasets to further our understanding and to enable the community to make informed design choices for future model development and adoption.

What do Speech Foundation Models Learn? Analysis and Applications Read more »


Generative Medical Event Models Improve with Scale

arXiv:2508.12104v1 Announce Type: cross Abstract: Realizing personalized medicine at scale calls for methods that distill insights from longitudinal patient journeys, which can be viewed as a sequence of medical events. Foundation models pretrained on large-scale medical event data represent a promising direction for scaling real-world evidence generation and generalizing to diverse downstream tasks. Using Epic Cosmos, a dataset with medical events from de-identified longitudinal health records for 16.3 billion encounters over 300 million unique patient records from 310 health systems, we introduce the Cosmos Medical Event Transformer (CoMET) models, a family of decoder-only transformer models pretrained on 118 million patients representing 115 billion discrete medical events (151 billion tokens). We present the largest scaling-law study for medical event data, establishing a methodology for pretraining and revealing power-law scaling relationships for compute, tokens, and model size. Based on this, we pretrained a series of compute-optimal models with up to 1 billion parameters. Conditioned on a patient's real-world history, CoMET autoregressively generates the next medical event, simulating patient health timelines. We studied 78 real-world tasks, including diagnosis prediction, disease prognosis, and healthcare operations. Remarkably for a foundation model with generic pretraining and simulation-based inference, CoMET generally outperformed or matched task-specific supervised models on these tasks, without requiring task-specific fine-tuning or few-shot examples. CoMET's predictive power consistently improves as the model and pretraining scale. Our results show that CoMET, a generative medical event foundation model, can effectively capture complex clinical dynamics, providing an extensible and generalizable framework to support clinical decision-making, streamline healthcare operations, and improve patient outcomes.
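The abstract reports power-law scaling relationships for compute, tokens, and model size without restating the fitted form. For orientation only, scaling-law studies of this kind typically fit a parametric loss curve such as

L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

where N is the number of model parameters, D the number of training tokens, and E, A, B, \alpha, \beta are constants estimated from a sweep of training runs; this generic form is a reference point, not the paper's fitted coefficients.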

Generative Medical Event Models Improve with Scale Read more »


Improving Detection of Watermarked Language Models

arXiv:2508.13131v1 Announce Type: new Abstract: Watermarking has recently emerged as an effective strategy for detecting the generations of large language models (LLMs). The strength of a watermark typically depends strongly on the entropy afforded by the language model and the set of input prompts. However, entropy can be quite limited in practice, especially for models that are post-trained, for example via instruction tuning or reinforcement learning from human feedback (RLHF), which makes detection based on watermarking alone challenging. In this work, we investigate whether detection can be improved by combining watermark detectors with non-watermark ones. We explore a number of hybrid schemes that combine the two, observing performance gains over either class of detector under a wide range of experimental conditions.
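The abstract does not spell out the hybrid schemes. As a purely illustrative sketch of the general idea, a watermark detector and a non-watermark (e.g., classifier-based) detector can be combined at the score level, for instance by merging their p-values with Fisher's method. The detector functions below are hypothetical placeholders, not interfaces from the paper.

import math
from scipy.stats import chi2

def fisher_combine(p_values):
    """Combine independent p-values via Fisher's method."""
    stat = -2.0 * sum(math.log(max(p, 1e-300)) for p in p_values)
    return chi2.sf(stat, df=2 * len(p_values))

def hybrid_detect(text, watermark_pvalue, classifier_pvalue, alpha=0.01):
    """Flag text as model-generated when the combined evidence is strong enough.

    watermark_pvalue and classifier_pvalue are assumed to map text -> p-value
    under the null hypothesis that the text is human-written (hypothetical APIs).
    """
    combined = fisher_combine([watermark_pvalue(text), classifier_pvalue(text)])
    return combined < alpha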

Improving Detection of Watermarked Language Models Read more »


Fast, Slow, and Tool-augmented Thinking for LLMs: A Review

arXiv:2508.12265v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in reasoning across diverse domains. However, effective reasoning in real-world tasks requires adapting the reasoning strategy to the demands of the problem, ranging from fast, intuitive responses to deliberate, step-by-step reasoning and tool-augmented thinking. Drawing inspiration from cognitive psychology, we propose a novel taxonomy of LLM reasoning strategies along two knowledge boundaries: a fast/slow boundary separating intuitive from deliberative processes, and an internal/external boundary distinguishing reasoning grounded in the model’s parameters from reasoning augmented by external tools. We systematically survey recent work on adaptive reasoning in LLMs and categorize methods based on key decision factors. We conclude by highlighting open challenges and future directions toward more adaptive, efficient, and reliable LLMs.

Fast, Slow, and Tool-augmented Thinking for LLMs: A Review Read more »


Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning

arXiv:2508.12591v1 Announce Type: new Abstract: Traditional Automated Speaking Assessment (ASA) systems exhibit inherent modality limitations: text-based approaches lack acoustic information while audio-based methods miss semantic context. Multimodal Large Language Models (MLLM) offer unprecedented opportunities for comprehensive ASA by simultaneously processing audio and text within unified frameworks. This paper presents a first systematic study of MLLM for comprehensive ASA, demonstrating the superior performance of MLLM across the aspects of content and language use. However, assessment on the delivery aspect reveals unique challenges, which is deemed to require specialized training strategies. We thus propose Speech-First Multimodal Training (SFMT), leveraging a curriculum learning principle to establish more robust modeling foundations of speech before cross-modal synergetic fusion. A series of experiments on a benchmark dataset show MLLM-based systems can elevate the holistic assessment performance from a PCC value of 0.783 to 0.846. In particular, SFMT excels in the evaluation of the delivery aspect, achieving an absolute accuracy improvement of 4% over conventional training approaches, which also paves a new avenue for ASA.

Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning Read more »


Model Interpretability and Rationale Extraction by Input Mask Optimization

arXiv:2508.11388v1 Announce Type: new Abstract: Concurrent to the rapid progress in the development of neural-network based models in areas like natural language processing and computer vision, the need for creating explanations for the predictions of these black-box models has risen steadily. We propose a new method to generate extractive explanations for predictions made by neural networks, that is based on masking parts of the input which the model does not consider to be indicative of the respective class. The masking is done using gradient-based optimization combined with a new regularization scheme that enforces sufficiency, comprehensiveness and compactness of the generated explanation, three properties that are known to be desirable from the related field of rationale extraction in natural language processing. In this way, we bridge the gap between model interpretability and rationale extraction, thereby proving that the latter of which can be performed without training a specialized model, only on the basis of a trained classifier. We further apply the same method to image inputs and obtain high quality explanations for image classifications, which indicates that the conditions proposed for rationale extraction in natural language processing are more broadly applicable to different input types.
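As a rough illustration of the masking idea described above, a per-token mask over input embeddings can be optimized with gradient descent while trading off sufficiency, comprehensiveness, and compactness. This is a sketch under assumed interfaces (a classifier that maps embeddings to logits), not the authors' implementation.

import torch
import torch.nn.functional as F

def optimize_input_mask(model, embeddings, target_class, steps=200, lr=0.1,
                        lam_sparse=1e-2, lam_comp=1.0):
    """Hypothetical sketch of rationale extraction via input-mask optimization.

    model:        maps a (1, seq_len, dim) embedding tensor to class logits (assumed interface)
    embeddings:   (1, seq_len, dim) input embeddings for one example
    target_class: LongTensor of shape (1,) holding the predicted class id
    """
    mask_logits = torch.zeros(embeddings.shape[1], requires_grad=True)  # one scalar per token
    optimizer = torch.optim.Adam([mask_logits], lr=lr)

    for _ in range(steps):
        mask = torch.sigmoid(mask_logits).view(1, -1, 1)
        kept = model(embeddings * mask)            # sufficiency: kept tokens alone should predict the class
        dropped = model(embeddings * (1 - mask))   # comprehensiveness: the complement should not

        loss = (
            F.cross_entropy(kept, target_class)
            - lam_comp * F.cross_entropy(dropped, target_class)
            + lam_sparse * mask.mean()             # compactness: prefer small rationales
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return torch.sigmoid(mask_logits).detach()     # per-token relevance scores in [0, 1]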

Model Interpretability and Rationale Extraction by Input Mask Optimization Read more »


Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning

arXiv:2508.11364v1 Announce Type: new Abstract: Automated feedback generation has the potential to enhance students’ learning progress by providing timely and targeted feedback. Moreover, it can assist teachers in optimizing their time, allowing them to focus on more strategic and personalized aspects of teaching. To generate high-quality, information-rich formative feedback, it is essential first to extract relevant indicators, as these serve as the foundation upon which the feedback is constructed. Teachers often employ feedback criteria grids composed of various indicators that they evaluate systematically. This study examines the initial phase of extracting such indicators from students’ submissions of a language learning course using the large language model Llama 3.1. Accordingly, the alignment between indicators generated by the LLM and human ratings across various feedback criteria is investigated. The findings demonstrate statistically significant strong correlations, even in cases involving unanticipated combinations of indicators and criteria. The methodology employed in this paper offers a promising foundation for extracting indicators from students’ submissions using LLMs. Such indicators can potentially be utilized to auto-generate explainable and transparent formative feedback in future research.

Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning Read more »


Retrieval-augmented reasoning with lean language models

arXiv:2508.11386v1 Announce Type: new Abstract: This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.
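The abstract describes a dense retriever paired with fine-tuned Qwen2.5-Instruct models, but not the concrete components. A minimal retrieve-then-generate sketch under assumed stand-ins (the embedding model name, example documents, and generate_fn below are placeholders, not the paper's setup) might look like this:

import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in dense retriever and toy corpus (placeholders for the paper's retriever and NHS pages).
retriever = SentenceTransformer("all-MiniLM-L6-v2")
documents = ["Asthma: symptoms include wheezing and breathlessness ...",
             "Migraine: a moderate or severe headache felt as a throbbing pain ..."]
doc_vecs = retriever.encode(documents, normalize_embeddings=True)

def answer(question, generate_fn, top_k=2):
    """Retrieve the top_k most similar documents and hand them to a reasoning model.

    generate_fn is assumed to wrap the fine-tuned reasoning model
    (e.g., a Qwen2.5-Instruct checkpoint) and map a prompt string to an answer.
    """
    q_vec = retriever.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec                                  # cosine similarity (unit vectors)
    context = "\n\n".join(documents[i] for i in np.argsort(-scores)[:top_k])
    prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
    return generate_fn(prompt)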

Retrieval-augmented reasoning with lean language models Read more »


ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection

arXiv:2508.11281v1 Announce Type: new Abstract: Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, large-scale datasets. In this work, we introduce TOXIFRENCH, a new public benchmark of 53,622 French online comments, constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification. Then, we benchmark a broad range of models and uncover a counterintuitive insight: Small Language Models (SLMs) outperform many larger models in robustness and generalization under the toxicity detection task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a dynamic weighted loss that progressively emphasizes the model's final decision, significantly improving faithfulness. Our fine-tuned 4B model achieves state-of-the-art performance, improving its F1 score by 13% over its baseline and outperforming LLMs such as GPT-4o and Gemini-2.5. Further evaluation on a cross-lingual toxicity benchmark demonstrates strong multilingual ability, suggesting that our methodology can be effectively extended to other languages and safety-critical classification tasks.
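The abstract mentions a dynamically weighted loss that progressively emphasizes the final decision but does not give its exact form. The sketch below is one plausible reading (a hypothetical formulation, not the paper's), in which the weight on decision-span tokens ramps up linearly over training:

import torch
import torch.nn.functional as F

def weighted_cot_loss(logits, labels, decision_mask, step, total_steps, w_max=4.0):
    """Hypothetical dynamically weighted CoT loss (illustrative, not the paper's exact formulation).

    logits:        (batch, seq_len, vocab) next-token logits
    labels:        (batch, seq_len) target token ids, -100 for ignored positions
    decision_mask: (batch, seq_len) 1.0 where the token belongs to the final decision span
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    )                                                    # (batch, seq_len)
    ramp = step / max(total_steps, 1)                    # 0 -> 1 over the course of training
    weights = 1.0 + (w_max - 1.0) * ramp * decision_mask # decision tokens weighted up over time
    valid = (labels != -100).float()
    return (per_token * weights * valid).sum() / (weights * valid).sum()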

ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection Read more »
