
A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG

In this tutorial, we show how to run the Bonsai 1-bit large language model efficiently using GPU acceleration and PrismML’s optimized GGUF deployment stack. We set up the environment, install the required dependencies, download the prebuilt llama.cpp binaries, and load the Bonsai-1.7B model for fast inference on CUDA. As we progress, we examine how 1-bit quantization works under the hood, why the Q1_0_g128 format is so memory-efficient, and how this makes Bonsai practical for lightweight yet capable language model deployment. We also test core inference, benchmarking, multi-turn chat, structured JSON generation, code generation, OpenAI-compatible server mode, and a small retrieval-augmented generation workflow, giving us a complete, hands-on view of how Bonsai operates in real-world use.

import os, sys, subprocess, time, json, urllib.request, tarfile, textwrap


try:
   import google.colab
   IN_COLAB = True
except ImportError:
   IN_COLAB = False


def section(title):
   bar = "═" * 60
   print(f"\n{bar}\n  {title}\n{bar}")


section("1 · Environment & GPU Check")


def run(cmd, capture=False, check=True, **kw):
   return subprocess.run(
       cmd, shell=True, capture_output=capture,
       text=True, check=check, **kw
   )


gpu_info = run("nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader",
              capture=True, check=False)
if gpu_info.returncode == 0:
   print("✅ GPU detected:", gpu_info.stdout.strip())
else:
   print("⚠  No GPU found — inference will run on CPU (much slower).")


cuda_check = run("nvcc --version", capture=True, check=False)
if cuda_check.returncode == 0:
   for line in cuda_check.stdout.splitlines():
       if "release" in line:
           print("   CUDA:", line.strip())
           break


print(f"   Python {sys.version.split()[0]}  |  In Colab: {IN_COLAB}")


section("2 · Installing Python Dependencies")


run("pip install -q huggingface_hub requests tqdm openai")
print("✅ huggingface_hub, requests, tqdm, openai installed")


from huggingface_hub import hf_hub_download

We begin by importing the core Python modules that we need for system operations, downloads, timing, and JSON handling. We check whether we are running inside Google Colab, define a reusable section printer, and create a helper function to run shell commands cleanly from Python. We then verify the GPU and CUDA environment, print the Python runtime details, install the required Python dependencies, and prepare the Hugging Face download utility for the next stages.

section("3 · Downloading PrismML llama.cpp Prebuilt Binaries")


RELEASE_TAG = "prism-b8194-1179bfc"
BASE_URL    = f"https://github.com/PrismML-Eng/llama.cpp/releases/download/{RELEASE_TAG}"
BIN_DIR     = "/content/bonsai_bin"
os.makedirs(BIN_DIR, exist_ok=True)


def detect_cuda_build():
   r = run("nvcc --version", capture=True, check=False)
   for line in r.stdout.splitlines():
       if "release" in line:
           try:
               ver = float(line.split("release")[-1].strip().split(",")[0].strip())
               if ver >= 13.0: return "13.1"
               if ver >= 12.6: return "12.8"
               return "12.4"
           except ValueError:
               pass
   return "12.4"


cuda_build = detect_cuda_build()
print(f"   Detected CUDA build slot: {cuda_build}")


TAR_NAME = f"llama-{RELEASE_TAG}-bin-linux-cuda-{cuda_build}-x64.tar.gz"
TAR_URL  = f"{BASE_URL}/{TAR_NAME}"
tar_path = f"/tmp/{TAR_NAME}"


if not os.path.exists(f"{BIN_DIR}/llama-cli"):
   print(f"   Downloading: {TAR_URL}")
   urllib.request.urlretrieve(TAR_URL, tar_path)
   print("   Extracting …")
   with tarfile.open(tar_path, "r:gz") as t:
       t.extractall(BIN_DIR)
   for fname in os.listdir(BIN_DIR):
       fp = os.path.join(BIN_DIR, fname)
       if os.path.isfile(fp):
           os.chmod(fp, 0o755)
   print(f"✅ Binaries extracted to {BIN_DIR}")
   bins = sorted(f for f in os.listdir(BIN_DIR) if os.path.isfile(os.path.join(BIN_DIR, f)))
   print("   Available:", ", ".join(bins))
else:
   print(f"✅ Binaries already present at {BIN_DIR}")


LLAMA_CLI    = f"{BIN_DIR}/llama-cli"
LLAMA_SERVER = f"{BIN_DIR}/llama-server"


test = run(f"{LLAMA_CLI} --version", capture=True, check=False)
if test.returncode == 0:
   print(f"   llama-cli version: {test.stdout.strip()[:80]}")
else:
   print(f"⚠  llama-cli test failed: {test.stderr.strip()[:200]}")


section("4 · Downloading Bonsai-1.7B GGUF Model")


MODEL_REPO    = "prism-ml/Bonsai-1.7B-gguf"
MODEL_DIR     = "/content/bonsai_models"
GGUF_FILENAME = "Bonsai-1.7B.gguf"
os.makedirs(MODEL_DIR, exist_ok=True)
MODEL_PATH = os.path.join(MODEL_DIR, GGUF_FILENAME)


if not os.path.exists(MODEL_PATH):
   print(f"   Downloading {GGUF_FILENAME} (~248 MB) from HuggingFace …")
   MODEL_PATH = hf_hub_download(
       repo_id=MODEL_REPO,
       filename=GGUF_FILENAME,
       local_dir=MODEL_DIR,
   )
   print(f"✅ Model saved to: {MODEL_PATH}")
else:
   print(f"✅ Model already cached: {MODEL_PATH}")


size_mb = os.path.getsize(MODEL_PATH) / 1e6
print(f"   File size on disk: {size_mb:.1f} MB")


section("5 · Core Inference Helpers")


DEFAULT_GEN_ARGS = dict(
   temp=0.5,
   top_p=0.85,
   top_k=20,
   repeat_penalty=1.0,
   n_predict=256,
   n_gpu_layers=99,
   ctx_size=4096,
)


def build_llama_cmd(prompt, system_prompt="You are a helpful assistant.", **overrides):
   args = {**DEFAULT_GEN_ARGS, **overrides}
   formatted = (
       f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
       f"<|im_start|>user\n{prompt}<|im_end|>\n"
       f"<|im_start|>assistant\n"
   )
   safe_prompt = formatted.replace('"', '\\"')
   return (
       f'{LLAMA_CLI} -m "{MODEL_PATH}"'
       f' -p "{safe_prompt}"'
       f' -n {args["n_predict"]}'
       f' --temp {args["temp"]}'
       f' --top-p {args["top_p"]}'
       f' --top-k {args["top_k"]}'
       f' --repeat-penalty {args["repeat_penalty"]}'
       f' -ngl {args["n_gpu_layers"]}'
       f' -c {args["ctx_size"]}'
       f' --no-display-prompt'
       f' -e'
   )


def infer(prompt, system_prompt="You are a helpful assistant.", verbose=True, **overrides):
   cmd = build_llama_cmd(prompt, system_prompt, **overrides)
   t0 = time.time()
   result = run(cmd, capture=True, check=False)
   elapsed = time.time() - t0
   output = result.stdout.strip()
   if verbose:
       print(f"\n{'─'*50}")
       print(f"Prompt : {prompt[:100]}{'…' if len(prompt) > 100 else ''}")
       print(f"{'─'*50}")
       print(output)
       print(f"{'─'*50}")
       print(f"⏱  {elapsed:.2f}s  |  ~{len(output.split())} words")
   return output, elapsed


print("✅ Inference helpers ready.")


section("6 · Basic Inference — Hello, Bonsai!")


infer("What makes 1-bit language models special compared to standard models?")

We download and prepare the PrismML prebuilt llama.cpp CUDA binaries that power local inference for the Bonsai model. We detect the available CUDA version, choose the matching binary build, extract the downloaded archive, make the files executable, and verify that the llama-cli binary works correctly. After that, we download the Bonsai-1.7B GGUF model from Hugging Face, set up the model path, define the default generation settings, and build the core helper functions that format prompts and run inference.

section("7 · Q1_0_g128 Quantization — What's Happening Under the Hood")


print(textwrap.dedent("""
╔══════════════════════════════════════════════════════════════╗
║           Bonsai Q1_0_g128 Weight Representation            ║
╠══════════════════════════════════════════════════════════════╣
║  Each weight = 1 bit:  0  →  −scale                         ║
║                        1  →  +scale                         ║
║  Every 128 weights share one FP16 scale factor.             ║
║                                                              ║
║  Effective bits per weight:                                  ║
║    1 bit (sign) + 16/128 bits (shared scale) = 1.125 bpw    ║
║                                                              ║
║  Memory comparison for Bonsai-1.7B:                         ║
║    FP16:            3.44 GB  (1.0×  baseline)               ║
║    Q1_0_g128:       0.24 GB  (14.2× smaller!)               ║
║    MLX 1-bit g128:  0.27 GB  (12.8× smaller)                ║
╚══════════════════════════════════════════════════════════════╝
"""))


print("📐 Python demo of Q1_0_g128 quantization logic:\n")
import random
random.seed(42)
GROUP_SIZE   = 128
weights_fp16 = [random.gauss(0, 0.1) for _ in range(GROUP_SIZE)]
scale        = max(abs(w) for w in weights_fp16)
quantized    = [1 if w >= 0 else 0 for w in weights_fp16]
dequantized  = [scale if b == 1 else -scale for b in quantized]
mse          = sum((a - b) ** 2 for a, b in zip(weights_fp16, dequantized)) / GROUP_SIZE


print(f"  FP16 weights (first 8): {[f'{w:.4f}' for w in weights_fp16[:8]]}")
print(f"  1-bit repr  (first 8): {quantized[:8]}")
print(f"  Shared scale:          {scale:.4f}")
print(f"  Dequantized (first 8): {[f'{w:.4f}' for w in dequantized[:8]]}")
print(f"  MSE of reconstruction: {mse:.6f}")
memory_fp16 = GROUP_SIZE * 2
memory_1bit = GROUP_SIZE / 8 + 2
print(f"\n  Memory: FP16={memory_fp16}B  vs  Q1_0_g128={memory_1bit:.1f}B  "
      f"({memory_fp16/memory_1bit:.1f}× reduction)")
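The demo above keeps each sign as a Python int; in an actual Q1_0-style layout the 128 signs would be packed eight per byte, which is where the 16 bytes of signs plus a 2-byte scale in the memory figure come from. A minimal, illustrative packing sketch (not the exact PrismML on-disk layout):

```python
def pack_bits(bits):
    """Pack a list of 0/1 sign bits into bytes, 8 per byte (LSB first)."""
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for j, b in enumerate(bits[i:i + 8]):
            byte |= b << j
        packed.append(byte)
    return bytes(packed)

def unpack_bits(data, n):
    """Recover the first n sign bits from packed bytes."""
    return [(data[i // 8] >> (i % 8)) & 1 for i in range(n)]

signs  = [1, 0, 1, 1, 0, 0, 1, 0] * 16      # one group of 128 sign bits
packed = pack_bits(signs)
print(len(packed), "bytes of signs + 2-byte FP16 scale = 18 bytes per group")
assert unpack_bits(packed, 128) == signs     # lossless round trip
```

With 18 bytes per 128-weight group versus 256 bytes in FP16, the 14.2× figure from the comparison falls out directly.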


section("8 · Performance Benchmark — Tokens per Second")


def benchmark(prompt, n_tokens=128, n_runs=3, **kw):
   timings = []
   for i in range(n_runs):
       print(f"   Run {i+1}/{n_runs} …", end=" ", flush=True)
       _, elapsed = infer(prompt, verbose=False, n_predict=n_tokens, **kw)
       tps = n_tokens / elapsed
       timings.append(tps)
       print(f"{tps:.1f} tok/s")
   avg = sum(timings) / len(timings)
   print(f"\n  ✅ Average: {avg:.1f} tok/s  (over {n_runs} runs, {n_tokens} tokens each)")
   return avg


print("📊 Benchmarking Bonsai-1.7B on your GPU …")
tps = benchmark(
   "Explain the concept of neural network backpropagation step by step.",
   n_tokens=128, n_runs=3,
)


print("\n  Published reference throughputs (from whitepaper):")
print("  ┌──────────────────────┬─────────┬──────────────┐")
print("  │ Platform             │ Backend │ TG128 tok/s  │")
print("  ├──────────────────────┼─────────┼──────────────┤")
print("  │ RTX 4090             │ CUDA    │     674      │")
print("  │ M4 Pro 48 GB         │ Metal   │     250      │")
print(f"  │ Your GPU (measured)  │ CUDA    │  {tps:>7.1f}    │")
print("  └──────────────────────┴─────────┴──────────────┘")


section("9 · Multi-Turn Chat with Context Accumulation")


def chat(user_msg, system="You are a helpful assistant.", history=None, **kw):
   if history is None:
       history = []
   history.append(("user", user_msg))
   full = f"<|im_start|>system\n{system}<|im_end|>\n"
   for role, msg in history:
       full += f"<|im_start|>{role}\n{msg}<|im_end|>\n"
   full += "<|im_start|>assistant\n"
   safe = full.replace('"', '\\"').replace('\n', '\\n')
   cmd = (
       f'{LLAMA_CLI} -m "{MODEL_PATH}"'
       f' -p "{safe}" -e'
       f' -n 200 --temp 0.5 --top-p 0.85 --top-k 20'
       f' -ngl 99 -c 4096 --no-display-prompt'
   )
   result = run(cmd, capture=True, check=False)
   reply = result.stdout.strip()
   history.append(("assistant", reply))
   return reply, history


print("🗣  Starting a 3-turn conversation about 1-bit models …\n")
history = []
turns = [
   "What is a 1-bit language model?",
   "What are the main trade-offs compared to 4-bit or 8-bit quantization?",
   "How does Bonsai specifically address those trade-offs?",
]
for i, msg in enumerate(turns, 1):
   print(f"👤 Turn {i}: {msg}")
   reply, history = chat(msg, history=history)
   print(f"🤖 Bonsai: {reply}\n")
   time.sleep(0.5)


section("10 · Sampling Parameter Exploration")


creative_prompt = "Write a one-sentence description of a futuristic city powered entirely by 1-bit AI."
configs = [
   ("Precise / Focused",  dict(temp=0.1, top_k=10,  top_p=0.70)),
   ("Balanced (default)", dict(temp=0.5, top_k=20,  top_p=0.85)),
   ("Creative / Varied",  dict(temp=0.9, top_k=50,  top_p=0.95)),
   ("High entropy",       dict(temp=1.2, top_k=100, top_p=0.98)),
]


print(f'Prompt: "{creative_prompt}"\n')
for label, params in configs:
   out, _ = infer(creative_prompt, verbose=False, n_predict=80, **params)
   print(f"  [{label}]")
   print(f"    temp={params['temp']}, top_k={params['top_k']}, top_p={params['top_p']}")
   print(f"    → {out[:200]}\n")

We move from setup into experimentation by first running a basic inference call to confirm that the model is functioning properly. We then explain the Q1_0_g128 quantization format through a visual text block and a small Python demo that shows how 1-bit signs and shared scales reconstruct weights with strong memory savings. After that, we benchmark token generation speed, simulate a multi-turn conversation with accumulated history, and compare how different sampling settings affect the style and diversity of the model’s outputs.
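To make the temperature / top-k / top-p interaction concrete, here is a toy sampler over a hand-made token distribution. It is a simplified sketch of the general sampling recipe, not llama.cpp's actual implementation:

```python
import math, random

def sample(probs, temp=0.5, top_k=20, top_p=0.85, seed=None):
    """Toy temperature + top-k + top-p sampler over a dict of token -> probability."""
    rng = random.Random(seed)
    # temperature: divide log-probs by temp, then softmax-renormalise
    logits = {t: math.log(p) / temp for t, p in probs.items()}
    m = max(logits.values())
    exp = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exp.values())
    scaled = {t: v / z for t, v in exp.items()}
    # top-k: keep only the k most probable tokens
    ranked = sorted(scaled.items(), key=lambda kv: -kv[1])[:top_k]
    # top-p (nucleus): keep the smallest prefix whose cumulative mass reaches p
    kept, cum = [], 0.0
    for t, p in ranked:
        kept.append((t, p))
        cum += p
        if cum >= top_p:
            break
    # draw from the surviving tokens, renormalised
    total = sum(p for _, p in kept)
    r, acc = rng.random() * total, 0.0
    for t, p in kept:
        acc += p
        if acc >= r:
            return t
    return kept[-1][0]

dist = {"the": 0.5, "a": 0.3, "quantum": 0.15, "zebra": 0.05}
print(sample(dist, temp=0.1, top_k=2, top_p=0.7, seed=0))  # low temp + tight nucleus → "the"
```

Low temperature sharpens the distribution so much that the nucleus collapses to a single token, which mirrors why the "Precise / Focused" configuration above produces near-deterministic output.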

section("11 · Context Window — Long-Document Summarisation")


long_doc = (
   "The transformer architecture, introduced in 'Attention is All You Need' (Vaswani et al., 2017), "
   "replaced recurrent and convolutional networks with self-attention mechanisms. The key insight was "
   "that attention weights could be computed in parallel across the entire sequence, unlike RNNs, "
   "which process tokens one at a time. The original model stacked identical layers with "
   "multi-head self-attention and feed-forward sub-layers. Positional "
   "encodings inject sequence-order information since attention is permutation-invariant. Subsequent "
   "work removed the encoder (GPT family) or decoder (BERT family) to specialise for generation or "
   "understanding tasks respectively. Scaling laws (Kaplan et al., 2020) showed that loss decreases "
   "predictably with more compute, parameters, and data. This motivated the emergence of large "
   "language models, but the memory footprint of these models became prohibitive for edge and "
   "on-device deployment. Quantisation research sought to "
   "reduce the bit-width of weights from FP16/BF16 down to INT8, INT4, and eventually binary (1-bit). "
   "BitNet (Wang et al., 2023) was among the first to demonstrate that training with 1-bit weights from "
   "scratch could approach the quality of higher-precision models at scale. Bonsai (Prism ML, 2026) "
   "extended this to an end-to-end 1-bit deployment pipeline across CUDA, Metal, and mobile runtimes, "
   "achieving 14x memory reduction with the Q1_0_g128 GGUF format."
)


summarize_prompt = f"Summarize the following technical text in 3 bullet points:\n\n{long_doc}"
print(f"   Input length: ~{len(long_doc.split())} words")
out, elapsed = infer(summarize_prompt, n_predict=200, ctx_size=2048, verbose=False)
print("📝 Summary:")
for line in out.splitlines():
   print(f"   {line}")
print(f"\n⏱  {elapsed:.2f}s")


section("12 · Structured Output — Forcing JSON Responses")


json_system = (
   "You are a JSON API. Respond ONLY with valid JSON, no markdown, no explanation. "
   "Never include ```json fences."
)
json_prompt = (
   "Return a JSON object with keys: model_name, parameter_count, "
   "bits_per_weight, memory_gb, top_use_cases (array of 3 strings). "
   "Fill in values for Bonsai-1.7B."
)


raw, _ = infer(json_prompt, system_prompt=json_system, temp=0.1, n_predict=300, verbose=False)
print("Raw model output:")
print(raw)
print()


try:
   clean = raw.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
   data  = json.loads(clean)
   print("✅ Parsed JSON:")
   for k, v in data.items():
       print(f"   {k}: {v}")
except json.JSONDecodeError as e:
   print(f"⚠  JSON parse error: {e} — raw output shown above.")
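When fence-stripping alone fails, a more forgiving fallback is to scan for the first balanced JSON object anywhere in the raw text. A hedged sketch (the naive brace counter ignores braces inside string values):

```python
import json

def extract_json(text):
    """Parse the first balanced {...} object found in text; return None on failure.

    Note: the brace counter is naive and does not account for braces
    appearing inside JSON string values.
    """
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # malformed candidate; try the next opening brace
        start = text.find("{", start + 1)
    return None

raw = 'Sure! Here is the data:\n```json\n{"model": "Bonsai-1.7B", "bpw": 1.125}\n```'
print(extract_json(raw))
```

This recovers valid objects even when the model wraps them in chatty preamble or markdown fences.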


section("13 · Code Generation")


code_prompt = (
   "Write a Python function called `quantize_weights` that takes a list of float "
   "weights and a group_size, applies 1-bit Q1_0_g128-style quantization (sign bit + "
   "per-group FP16 scale), and returns the quantized bits and scale list. "
   "Include a docstring and a short usage example."
)
code_system = "You are an expert Python programmer. Return clean, well-commented Python code only."


code_out, _ = infer(code_prompt, system_prompt=code_system,
                   temp=0.2, n_predict=400, verbose=False)
print(code_out)


exec_ns = {}
try:
   exec(code_out, exec_ns)
   if "quantize_weights" in exec_ns:
       import random as _r
       test_w = [_r.gauss(0, 0.1) for _ in range(256)]
       bits, scales = exec_ns["quantize_weights"](test_w, 128)
       print("\n✅ Function executed successfully!")
       print(f"   Input  : {len(test_w)} weights")
       print(f"   Output : {len(bits)} bits, {len(scales)} scale values")
except Exception as e:
   print(f"\n⚠  Exec note: {e} (model output may need minor tweaks)")

We test the model on longer-context and structured tasks to better understand its practical capabilities. We feed a technical passage to the model for summarization, ask it to return strict JSON output, and then push it further by generating Python code that we immediately execute in the notebook. This helps us evaluate not only whether Bonsai can answer questions, but also whether it can follow formatting rules, return usable structured responses, and produce code that runs in practice.

section("14 · OpenAI-Compatible Server Mode")


SERVER_PORT = 8088
SERVER_URL  = f"http://localhost:{SERVER_PORT}"
server_proc = None


def start_server():
   global server_proc
   if server_proc and server_proc.poll() is None:
       print("   Server already running.")
       return
   cmd = (
       f"{LLAMA_SERVER} -m {MODEL_PATH} "
       f"--host 0.0.0.0 --port {SERVER_PORT} "
       f"-ngl 99 -c 4096 --no-display-prompt --log-disable 2>/dev/null"
   )
   server_proc = subprocess.Popen(cmd, shell=True,
                                  stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL)
   for _ in range(30):
       try:
           urllib.request.urlopen(f"{SERVER_URL}/health", timeout=1)
           print(f"✅ llama-server running at {SERVER_URL}")
           return
       except Exception:
           time.sleep(1)
   print("⚠  Server may still be starting up …")


def stop_server():
   global server_proc
   if server_proc:
       server_proc.terminate()
       server_proc.wait()
       print("   Server stopped.")


print("🚀 Starting llama-server …")
start_server()
time.sleep(2)


try:
   from openai import OpenAI
   client   = OpenAI(base_url=f"{SERVER_URL}/v1", api_key="no-key-needed")
   print("\n   Sending request via OpenAI client …")
   response = client.chat.completions.create(
       model="bonsai",
       messages=[
           {"role": "user",   "content": "What are three key advantages of 1-bit LLMs for mobile devices?"},
       ],
       max_tokens=200,
       temperature=0.5,
   )
   reply = response.choices[0].message.content
   print(f"\n🤖 Server response:\n{reply}")
   usage = response.usage
   print(f"\n   Prompt tokens    : {usage.prompt_tokens}")
   print(f"   Completion tokens: {usage.completion_tokens}")
   print(f"   Total tokens     : {usage.total_tokens}")
except Exception as e:
   print(f"⚠  OpenAI client error: {e}")


section("15 · Mini-RAG — Grounded Q&A with Context Injection")


KB = {
   "bonsai_1.7b": (
       "Bonsai-1.7B uses Q1_0_g128 quantization. It has 1.7B parameters, "
       "deployed size 0.24 GB, context length 32,768 tokens, and is based on "
       "the Qwen3-1.7B dense architecture with GQA attention."
   ),
   "bonsai_8b": (
       "Bonsai-8B uses Q1_0_g128 quantization. It supports up to 65,536 tokens "
       "of context. It achieves 3.0x faster token generation than FP16 on RTX 4090."
   ),
   "quantization": (
       "Q1_0_g128 packs each weight as a single sign bit (0=-scale, 1=+scale). "
       "Each group of 128 weights shares one FP16 scale factor, giving 1.125 bpw."
   ),
}


def rag_query(question):
   q = question.lower()
   relevant = []
   if "1.7" in q or "small" in q:  relevant.append(KB["bonsai_1.7b"])
   if "8b" in q or "context" in q: relevant.append(KB["bonsai_8b"])
   if "quant" in q or "bit" in q:  relevant.append(KB["quantization"])
   if not relevant:                 relevant = list(KB.values())
   context    = "\n".join(f"- {c}" for c in relevant)
   rag_prompt = (
       "Answer using only the context below. If the answer is not in the context, say so.\n\n"
       f"Context:\n{context}\n\nQuestion: {question}"
   )
   )
   ans, _ = infer(rag_prompt, n_predict=150, temp=0.1, verbose=False)
   print(f"❓ {question}")
   print(f"💡 {ans}\n")


print("Running RAG queries …\n")
rag_query("What is the deployed file size of the 1.7B model?")
rag_query("How does Q1_0_g128 quantization work?")
rag_query("What context length does the 8B model support?")
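The keyword routing in `rag_query` is deliberately hard-coded; a slightly more general, still embedding-free retriever can rank KB entries by word overlap with the question instead. A toy sketch (the `retrieve` helper and the tiny two-entry `kb` here are illustrative, not part of the tutorial's KB):

```python
def retrieve(question, kb, k=2):
    """Rank knowledge-base entries by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    scored = sorted(
        ((len(q_words & set(text.lower().split())), key) for key, text in kb.items()),
        reverse=True,
    )
    # keep the top-k entries that share at least one word with the question
    return [key for score, key in scored[:k] if score > 0]

kb = {
    "a": "Bonsai-1.7B deployed size is 0.24 GB with 32,768 token context.",
    "b": "Q1_0_g128 packs each weight as a single sign bit with a shared scale.",
}
print(retrieve("How big is the deployed 1.7B model?", kb))  # entry "a" wins on overlap
```

In a production pipeline this scoring step would typically be replaced by TF-IDF or embedding similarity, but the inject-context-then-infer flow stays identical.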


section("16 · Model Family Comparison")


print("""
┌─────────────────┬──────────┬────────────┬────────────────┬──────────────┬──────────────┐
│ Model           │ Params   │ GGUF Size  │ Context Len    │ FP16 Size    │ Compression  │
├─────────────────┼──────────┼────────────┼────────────────┼──────────────┼──────────────┤
│ Bonsai-1.7B     │  1.7 B   │  0.25 GB   │ 32,768 tokens  │   3.44 GB    │    14.2×     │
│ Bonsai-4B       │  4.0 B   │  ~0.6 GB   │ 32,768 tokens  │   ~8.0  GB   │    ~13×      │
│ Bonsai-8B       │  8.0 B   │  ~0.9 GB   │ 65,536 tokens  │  ~16.0  GB   │    ~13.9×    │
└─────────────────┴──────────┴────────────┴────────────────┴──────────────┴──────────────┘


Throughput (from whitepaper):
 RTX 4090  — Bonsai-1.7B:  674 tok/s (TG128) vs FP16 224 tok/s  →  3.0× faster
 M4 Pro    — Bonsai-1.7B:  250 tok/s (TG128) vs FP16  65 tok/s  →  3.8× faster
""")
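The size and compression columns follow directly from the bits-per-weight arithmetic; a quick sanity check for the 1.7B entry (the parameter count is approximate, so the FP16 figure lands slightly under the table's 3.44 GB):

```python
params  = 1.7e9              # approximate parameter count
bpw     = 1 + 16 / 128       # 1 sign bit + one FP16 scale shared by 128 weights = 1.125 bpw
q1_gb   = params * bpw / 8 / 1e9
fp16_gb = params * 16 / 8 / 1e9
print(f"Q1_0_g128 ≈ {q1_gb:.2f} GB, FP16 ≈ {fp16_gb:.2f} GB, ratio ≈ {fp16_gb / q1_gb:.1f}×")
```

The 14.2× ratio is just 16 bits divided by 1.125 bits, independent of model size, which is why the larger Bonsai variants show roughly the same compression.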


section("17 · Cleanup")


stop_server()
print("✅ Tutorial complete!\n")
print("📚 Resources:")
print("   GitHub:      https://github.com/PrismML-Eng/Bonsai-demo")
print("   HuggingFace: https://huggingface.co/collections/prism-ml/bonsai")
print("   Whitepaper:  https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-bonsai-8b-whitepaper.pdf")
print("   Discord:     https://discord.gg/prismml")

We launch the OpenAI-compatible llama-server to interact with Bonsai via the OpenAI Python client. We then build a lightweight Mini-RAG example by injecting relevant context into prompts, compare the broader Bonsai model family in terms of size, context length, and compression, and finally shut down the local server cleanly. This closing section shows how Bonsai can fit into API-style workflows, grounded question-answering setups, and broader deployment scenarios beyond simple single-prompt inference.

In conclusion, we built and ran a full Bonsai 1-bit LLM workflow in Google Colab and observed that extreme quantization can dramatically reduce model size while still supporting useful, fast, and flexible inference. We verified the runtime environment, launched the model locally, measured token throughput, and experimented with different prompting, sampling, context handling, and server-based integrations. Along the way, we also connected the practical execution to the underlying quantization logic, helping us understand not just how to use Bonsai, but why its design is important for efficient AI deployment. By the end, we have a compact but advanced setup that demonstrates how 1-bit language models can make high-performance inference more accessible across constrained and mainstream hardware environments.



The post A Coding Tutorial for Running PrismML Bonsai 1-Bit LLM on CUDA with GGUF, Benchmarking, Chat, JSON, and RAG appeared first on MarkTechPost.
