{"id":84562,"date":"2026-04-19T15:19:17","date_gmt":"2026-04-19T15:19:17","guid":{"rendered":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/"},"modified":"2026-04-19T15:19:17","modified_gmt":"2026-04-19T15:19:17","slug":"a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag","status":"publish","type":"post","link":"https:\/\/youzum.net\/ja\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/","title":{"rendered":"A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG"},"content":{"rendered":"<p>In this tutorial, we show how to run the <a href=\"https:\/\/github.com\/PrismML-Eng\/Bonsai-demo\/\"><strong>Bonsai<\/strong><\/a> 1-bit large language model efficiently using GPU acceleration and PrismML\u2019s optimized GGUF deployment stack. We set up the environment, install the required dependencies, download the prebuilt llama.cpp binaries, and load the Bonsai-1.7B model for fast inference on CUDA. As we progress, we examine how 1-bit quantization works under the hood, why the Q1_0_g128 format needs only about 1.125 bits per weight, and how this makes Bonsai practical for lightweight yet capable language model deployment. 
We also test core inference, benchmarking, multi-turn chat, structured JSON generation, code generation, OpenAI-compatible server mode, and a small retrieval-augmented generation workflow, giving us a hands-on view of how Bonsai behaves across common deployment tasks.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">import os, sys, subprocess, time, json, urllib.request, tarfile, textwrap\n\n\ntry:\n   import google.colab\n   IN_COLAB = True\nexcept ImportError:\n   IN_COLAB = False\n\n\ndef section(title):\n   bar = \"\u2550\" * 60\n   print(f\"\\n{bar}\\n  {title}\\n{bar}\")\n\n\nsection(\"1 \u00b7 Environment &amp; GPU Check\")\n\n\ndef run(cmd, capture=False, check=True, **kw):\n   return subprocess.run(\n       cmd, shell=True, capture_output=capture,\n       text=True, check=check, **kw\n   )\n\n\ngpu_info = run(\"nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader\",\n              capture=True, check=False)\nif gpu_info.returncode == 0:\n   print(\"\u2705 GPU detected:\", gpu_info.stdout.strip())\nelse:\n   print(\"\u26a0  No GPU found \u2014 inference will run on CPU (much slower).\")\n\n\ncuda_check = 
run(\"nvcc --version\", capture=True, check=False)\nif cuda_check.returncode == 0:\n   for line in cuda_check.stdout.splitlines():\n       if \"release\" in line:\n           print(\"   CUDA:\", line.strip())\n           break\n\n\nprint(f\"   Python {sys.version.split()[0]}  |  Platform: Linux ({'Colab' if IN_COLAB else 'local'})\")\n\n\nsection(\"2 \u00b7 Installing Python Dependencies\")\n\n\nrun(\"pip install -q huggingface_hub requests tqdm openai\")\nprint(\"\u2705 huggingface_hub, requests, tqdm, openai installed\")\n\n\nfrom huggingface_hub import hf_hub_download<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We begin by importing the core Python modules that we need for system operations, downloads, timing, and JSON handling. We check whether we are running inside Google Colab, define a reusable section printer, and create a helper function to run shell commands cleanly from Python. We then verify the GPU and CUDA environment, print the Python runtime details, install the required Python dependencies, and prepare the Hugging Face download utility for the next stages.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">section(\"3 \u00b7 Downloading PrismML llama.cpp Prebuilt Binaries\")\n\n\nRELEASE_TAG = \"prism-b8194-1179bfc\"\nBASE_URL    = 
f\"https:\/\/github.com\/PrismML-Eng\/llama.cpp\/releases\/download\/{RELEASE_TAG}\"\nBIN_DIR     = \"\/content\/bonsai_bin\"\nos.makedirs(BIN_DIR, exist_ok=True)\n\n\ndef detect_cuda_build():\n   r = run(\"nvcc --version\", capture=True, check=False)\n   for line in r.stdout.splitlines():\n       if \"release\" in line:\n           try:\n               ver = float(line.split(\"release\")[-1].strip().split(\",\")[0].strip())\n               if ver &gt;= 13.0: return \"13.1\"\n               if ver &gt;= 12.6: return \"12.8\"\n               return \"12.4\"\n           except ValueError:\n               pass\n   return \"12.4\"\n\n\ncuda_build = detect_cuda_build()\nprint(f\"   Detected CUDA build slot: {cuda_build}\")\n\n\nTAR_NAME = f\"llama-{RELEASE_TAG}-bin-linux-cuda-{cuda_build}-x64.tar.gz\"\nTAR_URL  = f\"{BASE_URL}\/{TAR_NAME}\"\ntar_path = f\"\/tmp\/{TAR_NAME}\"\n\n\nif not os.path.exists(f\"{BIN_DIR}\/llama-cli\"):\n   print(f\"   Downloading: {TAR_URL}\")\n   urllib.request.urlretrieve(TAR_URL, tar_path)\n   print(\"   Extracting \u2026\")\n   with tarfile.open(tar_path, \"r:gz\") as t:\n       t.extractall(BIN_DIR)\n   for fname in os.listdir(BIN_DIR):\n       fp = os.path.join(BIN_DIR, fname)\n       if os.path.isfile(fp):\n           os.chmod(fp, 0o755)\n   print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> Binaries extracted to {BIN_DIR}\")\n   bins = sorted(f for f in os.listdir(BIN_DIR) if os.path.isfile(os.path.join(BIN_DIR, f)))\n   print(\"   Available:\", \", \".join(bins))\nelse:\n   print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> Binaries already present at {BIN_DIR}\")\n\n\nLLAMA_CLI    = f\"{BIN_DIR}\/llama-cli\"\nLLAMA_SERVER = f\"{BIN_DIR}\/llama-server\"\n\n\ntest = run(f\"{LLAMA_CLI} --version\", capture=True, check=False)\nif test.returncode == 0:\n   
print(f\"   llama-cli version: {test.stdout.strip()[:80]}\")\nelse:\n   print(f\"\u26a0  llama-cli test failed: {test.stderr.strip()[:200]}\")\n\n\nsection(\"4 \u00b7 Downloading Bonsai-1.7B GGUF Model\")\n\n\nMODEL_REPO    = \"prism-ml\/Bonsai-1.7B-gguf\"\nMODEL_DIR     = \"\/content\/bonsai_models\"\nGGUF_FILENAME = \"Bonsai-1.7B.gguf\"\nos.makedirs(MODEL_DIR, exist_ok=True)\nMODEL_PATH = os.path.join(MODEL_DIR, GGUF_FILENAME)\n\n\nif not os.path.exists(MODEL_PATH):\n   print(f\"   Downloading {GGUF_FILENAME} (~248 MB) from HuggingFace \u2026\")\n   MODEL_PATH = hf_hub_download(\n       repo_id=MODEL_REPO,\n       filename=GGUF_FILENAME,\n       local_dir=MODEL_DIR,\n   )\n   print(f\"\u2705 Model saved to: {MODEL_PATH}\")\nelse:\n   print(f\"\u2705 Model already cached: {MODEL_PATH}\")\n\n\nsize_mb = os.path.getsize(MODEL_PATH) \/ 1e6\nprint(f\"   File size on disk: {size_mb:.1f} MB\")\n\n\nsection(\"5 \u00b7 Core Inference Helpers\")\n\n\nDEFAULT_GEN_ARGS = dict(\n   temp=0.5,\n   top_p=0.85,\n   top_k=20,\n   repeat_penalty=1.0,\n   n_predict=256,\n   n_gpu_layers=99,\n   ctx_size=4096,\n)\n\n\ndef build_llama_cmd(prompt, system_prompt=\"You are a helpful assistant.\", **overrides):\n   args = {**DEFAULT_GEN_ARGS, **overrides}\n   # ChatML-style template: each turn is wrapped in im_start \/ im_end markers\n   formatted = (\n       f\"&lt;|im_start|&gt;system\\n{system_prompt}&lt;|im_end|&gt;\\n\"\n       f\"&lt;|im_start|&gt;user\\n{prompt}&lt;|im_end|&gt;\\n\"\n       f\"&lt;|im_start|&gt;assistant\\n\"\n   )\n   # Escape double quotes so the prompt survives the double-quoted shell argument\n   safe_prompt = formatted.replace('\"', '\\\\\"')\n   return (\n       f'{LLAMA_CLI} -m \"{MODEL_PATH}\"'\n       f' -p \"{safe_prompt}\"'\n       f' -n 
{args[\"n_predict\"]}'\n       f' --temp {args[\"temp\"]}'\n       f' --top-p {args[\"top_p\"]}'\n       f' --top-k {args[\"top_k\"]}'\n       f' --repeat-penalty {args[\"repeat_penalty\"]}'\n       f' -ngl {args[\"n_gpu_layers\"]}'\n       f' -c {args[\"ctx_size\"]}'\n       f' --no-display-prompt'\n       f' -e'\n   )\n\n\ndef infer(prompt, system_prompt=\"You are a helpful assistant.\", verbose=True, **overrides):\n   cmd = build_llama_cmd(prompt, system_prompt, **overrides)\n   t0 = time.time()\n   result = run(cmd, capture=True, check=False)\n   elapsed = time.time() - t0\n   output = result.stdout.strip()\n   if verbose:\n       print(f\"\\n{'\u2500'*50}\")\n       print(f\"Prompt : {prompt[:100]}{'\u2026' if len(prompt) &gt; 100 else ''}\")\n       print(f\"{'\u2500'*50}\")\n       print(output)\n       print(f\"{'\u2500'*50}\")\n       print(f\"\u23f1  {elapsed:.2f}s  |  ~{len(output.split())} words\")\n   return output, elapsed\n\n\nprint(\"\u2705 Inference helpers ready.\")\n\n\nsection(\"6 \u00b7 Basic Inference \u2014 Hello, Bonsai!\")\n\n\ninfer(\"What makes 1-bit language models special compared to standard models?\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We download and prepare the PrismML prebuilt llama.cpp CUDA binaries that power local inference for the Bonsai model. We detect the available CUDA version, choose the matching binary build, extract the downloaded archive, make the files executable, and confirm that the llama-cli binary runs by printing its version. 
After that, we download the Bonsai-1.7B GGUF model from Hugging Face, set up the model path, define the default generation settings, and build the core helper functions that format prompts and run inference.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">section(\"7 \u00b7 Q1_0_g128 Quantization \u2014 What's Happening Under the Hood\")\n\n\nprint(textwrap.dedent(\"\"\"\n\u2554\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2557\n\u2551           Bonsai Q1_0_g128 Weight Representation            \u2551\n\u2560\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2563\n\u2551  Each weight = 1 bit:  0  \u2192  \u2212scale                         \u2551\n\u2551                        1  \u2192  +scale                         \u2551\n\u2551  Every 128 
weights share one FP16 scale factor.             \u2551\n\u2551                                                              \u2551\n\u2551  Effective bits per weight:                                  \u2551\n\u2551    1 bit (sign) + 16\/128 bits (shared scale) = 1.125 bpw    \u2551\n\u2551                                                              \u2551\n\u2551  Memory comparison for Bonsai-1.7B:                         \u2551\n\u2551    FP16:            3.44 GB  (1.0\u00d7  baseline)               \u2551\n\u2551    Q1_0_g128:       0.24 GB  (14.2\u00d7 smaller!)               \u2551\n\u2551    MLX 1-bit g128:  0.27 GB  (12.8\u00d7 smaller)                \u2551\n\u255a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u255d\n\"\"\"))\n\n\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4d0.png\" alt=\"\ud83d\udcd0\" class=\"wp-smiley\" \/> Python demo of Q1_0_g128 quantization logic:n\")\nimport random\nrandom.seed(42)\nGROUP_SIZE   = 128\nweights_fp16 = [random.gauss(0, 0.1) for _ in range(GROUP_SIZE)]\nscale        = max(abs(w) for w in weights_fp16)\nquantized    = [1 if w &gt;= 0 else 0 for w in weights_fp16]\ndequantized  = [scale if b == 1 else -scale for b in quantized]\nmse          = sum((a - b) ** 2 for a, b in zip(weights_fp16, dequantized)) \/ GROUP_SIZE\n\n\nprint(f\"  FP16 weights (first 8): {[f'{w:.4f}' for w in weights_fp16[:8]]}\")\nprint(f\"  1-bit repr  (first 8): {quantized[:8]}\")\nprint(f\"  Shared scale:          {scale:.4f}\")\nprint(f\"  Dequantized (first 8): {[f'{w:.4f}' for w in dequantized[:8]]}\")\nprint(f\"  MSE of reconstruction: {mse:.6f}\")\nmemory_fp16 = 
GROUP_SIZE * 2\nmemory_1bit = GROUP_SIZE \/ 8 + 2\nprint(f\"n  Memory: FP16={memory_fp16}B  vs  Q1_0_g128={memory_1bit:.1f}B  \"\n     f\"({memory_fp16\/memory_1bit:.1f}\u00d7 reduction)\")\n\n\nsection(\"8 \u00b7 Performance Benchmark \u2014 Tokens per Second\")\n\n\ndef benchmark(prompt, n_tokens=128, n_runs=3, **kw):\n   timings = []\n   for i in range(n_runs):\n       print(f\"   Run {i+1}\/{n_runs} \u2026\", end=\" \", flush=True)\n       _, elapsed = infer(prompt, verbose=False, n_predict=n_tokens, **kw)\n       tps = n_tokens \/ elapsed\n       timings.append(tps)\n       print(f\"{tps:.1f} tok\/s\")\n   avg = sum(timings) \/ len(timings)\n   print(f\"n  <img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> Average: {avg:.1f} tok\/s  (over {n_runs} runs, {n_tokens} tokens each)\")\n   return avg\n\n\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4ca.png\" alt=\"\ud83d\udcca\" class=\"wp-smiley\" \/> Benchmarking Bonsai-1.7B on your GPU \u2026\")\ntps = benchmark(\n   \"Explain the concept of neural network backpropagation step by step.\",\n   n_tokens=128, n_runs=3,\n)\n\n\nprint(\"n  Published reference throughputs (from whitepaper):\")\nprint(\"  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\")\nprint(\"  \u2502 Platform             \u2502 Backend \u2502 TG128 tok\/s  \u2502\")\nprint(\"  
\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\")\nprint(\"  \u2502 RTX 4090             \u2502 CUDA    \u2502     674      \u2502\")\nprint(\"  \u2502 M4 Pro 48 GB         \u2502 Metal   \u2502     250      \u2502\")\nprint(f\"  \u2502 Your GPU (measured)  \u2502 CUDA    \u2502  {tps:&gt;7.1f}     \u2502\")\nprint(\"  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\")\n\n\nsection(\"9 \u00b7 Multi-Turn Chat with Context Accumulation\")\n\n\ndef chat(user_msg, system=\"You are a helpful assistant.\", history=None, **kw):\n   if history is None:\n       history = []\n   history.append((\"user\", user_msg))\n   full = f\"&lt;|im_start|&gt;system\\n{system}&lt;|im_end|&gt;\\n\"\n   for role, msg in history:\n       full += f\"&lt;|im_start|&gt;{role}\\n{msg}&lt;|im_end|&gt;\\n\"\n   full += \"&lt;|im_start|&gt;assistant\\n\"\n   # Escape quotes for the shell and turn real newlines into literal backslash-n, which -e decodes\n   safe = full.replace('\"', '\\\\\"').replace('\\n', '\\\\n')\n   cmd = (\n       f'{LLAMA_CLI} -m \"{MODEL_PATH}\"'\n       f' -p \"{safe}\" -e'\n       f' -n 200 --temp 0.5 --top-p 0.85 --top-k 20'\n       f' -ngl 99 -c 4096 --no-display-prompt'\n   )\n   result = run(cmd, capture=True, check=False)\n   reply = result.stdout.strip()\n   history.append((\"assistant\", reply))\n   return reply, history\n\n\nprint(\"\ud83d\udde3  Starting a 3-turn conversation about 1-bit models \u2026\\n\")\nhistory = []\nturns = [\n   \"What is a 1-bit language 
model?\",\n   \"What are the main trade-offs compared to 4-bit or 8-bit quantization?\",\n   \"How does Bonsai specifically address those trade-offs?\",\n]\nfor i, msg in enumerate(turns, 1):\n   print(f\"\ud83d\udc64 Turn {i}: {msg}\")\n   reply, history = chat(msg, history=history)\n   print(f\"\ud83e\udd16 Bonsai: {reply}\\n\")\n   time.sleep(0.5)\n\n\nsection(\"10 \u00b7 Sampling Parameter Exploration\")\n\n\ncreative_prompt = \"Write a one-sentence description of a futuristic city powered entirely by 1-bit AI.\"\nconfigs = [\n   (\"Precise \/ Focused\",  dict(temp=0.1, top_k=10,  top_p=0.70)),\n   (\"Balanced (default)\", dict(temp=0.5, top_k=20,  top_p=0.85)),\n   (\"Creative \/ Varied\",  dict(temp=0.9, top_k=50,  top_p=0.95)),\n   (\"High entropy\",       dict(temp=1.2, top_k=100, top_p=0.98)),\n]\n\n\nprint(f'Prompt: \"{creative_prompt}\"\\n')\nfor label, params in configs:\n   out, _ = infer(creative_prompt, verbose=False, n_predict=80, **params)\n   print(f\"  [{label}]\")\n   print(f\"    temp={params['temp']}, top_k={params['top_k']}, top_p={params['top_p']}\")\n   print(f\"    \u2192 {out[:200]}\\n\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We move from setup into experimentation by first running a basic inference call to confirm that the model is functioning properly. We then explain the Q1_0_g128 quantization format through a visual text block and a small Python demo that shows how 1-bit signs and a shared per-group scale reconstruct weights at roughly 14 times less memory than FP16. 
After that, we benchmark token generation speed, simulate a multi-turn conversation with accumulated history, and compare how different sampling settings affect the style and diversity of the model\u2019s outputs.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">section(\"11 \u00b7 Context Window \u2014 Long-Document Summarisation\")\n\n\nlong_doc = (\n   \"The transformer architecture, introduced in 'Attention is All You Need' (Vaswani et al., 2017), \"\n   \"replaced recurrent and convolutional networks with self-attention mechanisms. The key insight was \"\n   \"that attention weights could be computed in parallel across the entire sequence, unlike RNNs which \"\n   \"stacked identical layers with multi-head self-attention and feed-forward sub-layers. Positional \"\n   \"encodings inject sequence-order information since attention is permutation-invariant. Subsequent \"\n   \"work removed the encoder (GPT family) or decoder (BERT family) to specialise for generation or \"\n   \"understanding tasks respectively. Scaling laws (Kaplan et al., 2020) showed that loss decreases \"\n   \"predictably with more compute, parameters, and data. This motivated the emergence of large language \"\n   \"these models became prohibitive for edge and on-device deployment. 
Quantisation research sought to \"\n   \"reduce the bit-width of weights from FP16\/BF16 down to INT8, INT4, and eventually binary (1-bit). \"\n   \"BitNet (Wang et al., 2023) was among the first to demonstrate that training with 1-bit weights from \"\n   \"scratch could approach the quality of higher-precision models at scale. Bonsai (Prism ML, 2026) \"\n   \"extended this to an end-to-end 1-bit deployment pipeline across CUDA, Metal, and mobile runtimes, \"\n   \"achieving 14x memory reduction with the Q1_0_g128 GGUF format.\"\n)\n\n\nsummarize_prompt = f\"Summarize the following technical text in 3 bullet points:\\n\\n{long_doc}\"\nprint(f\"   Input length: ~{len(long_doc.split())} words\")\nout, elapsed = infer(summarize_prompt, n_predict=200, ctx_size=2048, verbose=False)\nprint(\"\ud83d\udcdd Summary:\")\nfor line in out.splitlines():\n   print(f\"   {line}\")\nprint(f\"\\n\u23f1  {elapsed:.2f}s\")\n\n\nsection(\"12 \u00b7 Structured Output \u2014 Forcing JSON Responses\")\n\n\njson_system = (\n   \"You are a JSON API. Respond ONLY with valid JSON, no markdown, no explanation. \"\n   \"Never include ```json fences.\"\n)\njson_prompt = (\n   \"Return a JSON object with keys: model_name, parameter_count, \"\n   \"bits_per_weight, memory_gb, top_use_cases (array of 3 strings). 
\"\n   \"Fill in values for Bonsai-1.7B.\"\n)\n\n\nraw, _ = infer(json_prompt, system_prompt=json_system, temp=0.1, n_predict=300, verbose=False)\nprint(\"Raw model output:\")\nprint(raw)\nprint()\n\n\ntry:\n   # removeprefix\/removesuffix (Python 3.9+) drop the exact fence text; lstrip\/rstrip would strip character sets\n   clean = raw.strip().removeprefix(\"```json\").removeprefix(\"```\").removesuffix(\"```\").strip()\n   data  = json.loads(clean)\n   print(\"\u2705 Parsed JSON:\")\n   for k, v in data.items():\n       print(f\"   {k}: {v}\")\nexcept json.JSONDecodeError as e:\n   print(f\"\u26a0  JSON parse error: {e} \u2014 raw output shown above.\")\n\n\nsection(\"13 \u00b7 Code Generation\")\n\n\ncode_prompt = (\n   \"Write a Python function called `quantize_weights` that takes a list of float \"\n   \"weights and a group_size, applies 1-bit Q1_0_g128-style quantization (sign bit + \"\n   \"per-group FP16 scale), and returns the quantized bits and scale list. \"\n   \"Include a docstring and a short usage example.\"\n)\ncode_system = \"You are an expert Python programmer. 
Return clean, well-commented Python code only.\"\n\n\ncode_out, _ = infer(code_prompt, system_prompt=code_system,\n                   temp=0.2, n_predict=400, verbose=False)\nprint(code_out)\n\n\n# Caution: exec() runs unreviewed model output; only do this in a disposable environment such as a Colab VM\nexec_ns = {}\ntry:\n   exec(code_out, exec_ns)\n   if \"quantize_weights\" in exec_ns:\n       import random as _r\n       test_w = [_r.gauss(0, 0.1) for _ in range(256)]\n       bits, scales = exec_ns[\"quantize_weights\"](test_w, 128)\n       print(f\"\\n\u2705 Function executed successfully!\")\n       print(f\"   Input  : {len(test_w)} weights\")\n       print(f\"   Output : {len(bits)} bits, {len(scales)} scale values\")\nexcept Exception as e:\n   print(f\"\\n\u26a0  Exec note: {e} (model output may need minor tweaks)\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We test the model on longer-context and structured tasks to better understand its practical capabilities. We pass a technical passage to Bonsai with a summarization prompt, ask it to return strict JSON output, and then push it further by generating Python code that we immediately execute in the notebook. 
This helps us evaluate not only whether Bonsai can answer questions, but also whether it can follow formatting rules, generate usable structured responses, and produce code that works in real execution.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">section(\"14 \u00b7 OpenAI-Compatible Server Mode\")\n\n\nSERVER_PORT = 8088\nSERVER_URL  = f\"http:\/\/localhost:{SERVER_PORT}\"\nserver_proc = None\n\n\ndef start_server():\n   global server_proc\n   if server_proc and server_proc.poll() is None:\n       print(\"   Server already running.\")\n       return\n   cmd = (\n       f\"{LLAMA_SERVER} -m {MODEL_PATH} \"\n       f\"--host 0.0.0.0 --port {SERVER_PORT} \"\n       f\"-ngl 99 -c 4096 --no-display-prompt --log-disable 2&gt;\/dev\/null\"\n   )\n   server_proc = subprocess.Popen(cmd, shell=True,\n                                  stdout=subprocess.DEVNULL,\n                                  stderr=subprocess.DEVNULL)\n   for _ in range(30):\n       try:\n           urllib.request.urlopen(f\"{SERVER_URL}\/health\", timeout=1)\n           print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> llama-server running at {SERVER_URL}\")\n           return\n       except Exception:\n           time.sleep(1)\n   print(\"<img decoding=\"async\" 
src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\" alt=\"\u26a0\" class=\"wp-smiley\" \/>  Server may still be starting up \u2026\")\n\n\ndef stop_server():\n   global server_proc\n   if server_proc:\n       server_proc.terminate()\n       server_proc.wait()\n       print(\"   Server stopped.\")\n\n\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f680.png\" alt=\"\ud83d\ude80\" class=\"wp-smiley\" \/> Starting llama-server \u2026\")\nstart_server()\ntime.sleep(2)\n\n\ntry:\n   from openai import OpenAI\n   client   = OpenAI(base_url=f\"{SERVER_URL}\/v1\", api_key=\"no-key-needed\")\n   print(\"n   Sending request via OpenAI client \u2026\")\n   response = client.chat.completions.create(\n       model=\"bonsai\",\n       messages=[\n           {\"role\": \"user\",   \"content\": \"What are three key advantages of 1-bit LLMs for mobile devices?\"},\n       ],\n       max_tokens=200,\n       temperature=0.5,\n   )\n   reply = response.choices[0].message.content\n   print(f\"n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f916.png\" alt=\"\ud83e\udd16\" class=\"wp-smiley\" \/> Server response:n{reply}\")\n   usage = response.usage\n   print(f\"n   Prompt tokens    : {usage.prompt_tokens}\")\n   print(f\"   Completion tokens: {usage.completion_tokens}\")\n   print(f\"   Total tokens     : {usage.total_tokens}\")\nexcept Exception as e:\n   print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\" alt=\"\u26a0\" class=\"wp-smiley\" \/>  OpenAI client error: {e}\")\n\n\nsection(\"15 \u00b7 Mini-RAG \u2014 Grounded Q&amp;A with Context Injection\")\n\n\nKB = {\n   \"bonsai_1.7b\": (\n       \"Bonsai-1.7B uses Q1_0_g128 quantization. 
It has 1.7B parameters, \"\n       \"deployed size 0.24 GB, context length 32,768 tokens, and is based on \"\n       \"the Qwen3-1.7B dense architecture with GQA attention.\"\n   ),\n   \"bonsai_8b\": (\n       \"Bonsai-8B uses Q1_0_g128 quantization. It supports up to 65,536 tokens \"\n       \"of context. It achieves 3.0x faster token generation than FP16 on RTX 4090.\"\n   ),\n   \"quantization\": (\n       \"Q1_0_g128 packs each weight as a single sign bit (0=-scale, 1=+scale). \"\n       \"Each group of 128 weights shares one FP16 scale factor, giving 1.125 bpw.\"\n   ),\n}\n\n\ndef rag_query(question):\n   # Keyword-based retrieval: pick KB entries whose trigger words appear in the question\n   q = question.lower()\n   relevant = []\n   if \"1.7\" in q or \"small\" in q:  relevant.append(KB[\"bonsai_1.7b\"])\n   if \"8b\" in q or \"context\" in q: relevant.append(KB[\"bonsai_8b\"])\n   if \"quant\" in q or \"bit\" in q:  relevant.append(KB[\"quantization\"])\n   if not relevant:                 relevant = list(KB.values())\n   context    = \"\\n\".join(f\"- {c}\" for c in relevant)\n   rag_prompt = (\n       \"If the answer is not in the context, say so.\\n\\n\"\n       f\"Context:\\n{context}\\n\\nQuestion: {question}\"\n   )\n   ans, _ = infer(rag_prompt, n_predict=150, temp=0.1, verbose=False)\n   print(f\"\u2753 {question}\")\n   print(f\"\ud83d\udca1 {ans}\\n\")\n\n\nprint(\"Running RAG queries \u2026\\n\")\nrag_query(\"What is the deployed file size of the 1.7B model?\")\nrag_query(\"How does Q1_0_g128 quantization work?\")\nrag_query(\"What context length does the 8B model support?\")\n\n\nsection(\"16 \u00b7 Model Family 
Comparison\")\n\n\nprint(\"\"\"\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Model           \u2502 Params   \u2502 GGUF Size  \u2502 Context Len    \u2502 FP16 Size    \u2502 Compression  \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502 Bonsai-1.7B     \u2502  1.7 B   \u2502  0.25 GB   \u2502 32,768 tokens  \u2502   3.44 GB    \u2502    14.2\u00d7     \u2502\n\u2502 Bonsai-4B       \u2502  4.0 B   \u2502  ~0.6 GB   \u2502 32,768 tokens  \u2502   ~8.0  GB   \u2502    ~13\u00d7      \u2502\n\u2502 Bonsai-8B       \u2502  8.0 B   \u2502  ~0.9 GB   \u2502 65,536 tokens  \u2502  ~16.0  GB   \u2502    ~13.9\u00d7    
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n\n\nThroughput (from whitepaper):\n RTX 4090  \u2014 Bonsai-1.7B:  674 tok\/s (TG128) vs FP16 224 tok\/s  \u2192  3.0\u00d7 faster\n M4 Pro    \u2014 Bonsai-1.7B:  250 tok\/s (TG128) vs FP16  65 tok\/s  \u2192  3.8\u00d7 faster\n\"\"\")\n\n\nsection(\"17 \u00b7 Cleanup\")\n\n\nstop_server()\nprint(\"\u2705 Tutorial complete!\\n\")\nprint(\"\ud83d\udcda Resources:\")\nprint(\"   GitHub:      https:\/\/github.com\/PrismML-Eng\/Bonsai-demo\")\nprint(\"   HuggingFace: https:\/\/huggingface.co\/collections\/prism-ml\/bonsai\")\nprint(\"   Whitepaper:  https:\/\/github.com\/PrismML-Eng\/Bonsai-demo\/blob\/main\/1-bit-bonsai-8b-whitepaper.pdf\")\nprint(\"   Discord:     https:\/\/discord.gg\/prismml\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We launch the OpenAI-compatible llama-server and query Bonsai through the OpenAI Python client. We then build a lightweight Mini-RAG example that selects knowledge-base entries by keyword and injects them into the prompt, compare the Bonsai model family by size, context length, and compression, and finally shut down the local server cleanly. 
This closing section shows how Bonsai can fit into API-style workflows, grounded question-answering setups, and broader deployment scenarios beyond simple single-prompt inference.<\/p>\n<p>In conclusion, we built and ran a full Bonsai 1-bit LLM workflow in Google Colab and saw that extreme quantization can shrink a model by roughly 14\u00d7 relative to FP16 (3.44 GB to 0.24 GB for Bonsai-1.7B) while still supporting useful, fast, and flexible inference. We verified the runtime environment, launched the model locally, measured token throughput, and experimented with different prompting, sampling, context handling, and server-based integrations. Along the way, we connected the practical execution to the underlying quantization logic, which helps explain not just how to use Bonsai, but why its design matters for efficient AI deployment. By the end, we have a compact setup that demonstrates how 1-bit language models can make high-performance inference more accessible across constrained and mainstream hardware environments.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/LLM%20Projects\/bonsai_1bit_llm_advanced_colab_cuda_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Full Coding Notebook here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/18\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/\">A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we show how to run the Bonsai 1-bit large language model efficiently using GPU acceleration and PrismML\u2019s optimized GGUF deployment stack. We set up the environment, install the required dependencies, download the prebuilt llama.cpp binaries, and load the Bonsai-1.7B model for fast inference on CUDA. As we progress, we examine how 1-bit quantization works under the hood, why the Q1_0_g128 format is so memory-efficient, and how this makes Bonsai practical for lightweight yet capable language model deployment. We also test core inference, benchmarking, multi-turn chat, structured JSON generation, code generation, OpenAI-compatible server mode, and a small retrieval-augmented generation workflow, giving us a complete, hands-on view of how Bonsai operates in real-world use. 
<\/p>","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-84562","post","type-post","status-publish","format-standard","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, 
max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/ja\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/\" \/>\n<meta property=\"og:locale\" content=\"ja_JP\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/ja\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-04-19T15:19:17+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u57f7\u7b46\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u63a8\u5b9a\u8aad\u307f\u53d6\u308a\u6642\u9593\" \/>\n\t<meta name=\"twitter:data2\" content=\"16\u5206\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG\",\"datePublished\":\"2026-04-19T15:19:17+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/\"},\"wordCount\":767,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"ja\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/\",\"url\":\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/\",\"name\":\"A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\",\"datePublished\":\"2026-04-19T15:19:17+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#breadcrumb\"},\"inLanguage\":\"ja\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#primaryimage\",\"url\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\",\"contentUrl\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and 
RAG\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ja\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ja\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/ja\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/ja\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/","og_locale":"ja_JP","og_type":"article","og_title":"A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/ja\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-04-19T15:19:17+00:00","og_image":[{"url":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png","type":"","width":"","height":""}],"author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u57f7\u7b46\u8005":"admin NU","\u63a8\u5b9a\u8aad\u307f\u53d6\u308a\u6642\u9593":"16\u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"A Coding 
Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG","datePublished":"2026-04-19T15:19:17+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/"},"wordCount":767,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#primaryimage"},"thumbnailUrl":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"ja","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/","url":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/","name":"A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#primaryimage"},"thumbnailUrl":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png","datePublished":"2026-04-19T15:19:17+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#breadcrumb"},"inLanguage":"ja","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/"]}]},{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#primaryimage","url":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png","contentUrl":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png"},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/a-coding-tutorial-for-running-prismml-bonsai-1-bit-llm-on-cuda-with-gguf-benchmarking-chat-json-and-rag\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and 
RAG"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ja"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"ja","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin NU"},"url":"https:\/\/youzum.net\/ja\/members\/adminnu\/"}]}},"rttpg_featured_image_url":null,"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/ja\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/ja\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/ja\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a 
href=\"https:\/\/youzum.net\/ja\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/ja\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"In this tutorial, we show how to run the Bonsai 1-bit large language model efficiently using GPU acceleration and PrismML\u2019s optimized GGUF deployment stack. We set up the environment, install the required dependencies, download the prebuilt llama.cpp binaries, and load the Bonsai-1.7B model for fast inference on CUDA. As we progress, we examine how&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts\/84562","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/comments?post=84562"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/posts\/84562\/revisions"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/media?parent=84562"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/categories?post=84562"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/ja\/wp-json\/wp\/v2\/tags?post=84562"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}