In this tutorial, we set out to recreate the spirit of the Hierarchical Reasoning Model (HRM) using a free Hugging Face model that runs locally. We walk through the design of a lightweight yet structured reasoning agent, where we act as both architects and experimenters. By breaking problems into subgoals, solving them with Python, critiquing the outcomes, and synthesizing a final answer, we experience how hierarchical planning and execution can enhance reasoning performance. This process lets us see, in real time, how a brain-inspired workflow can be implemented without requiring massive model sizes or expensive APIs. Check out the Paper and FULL CODES.

```python
!pip -q install -U transformers accelerate bitsandbytes rich

import os, re, json, textwrap, traceback
from typing import Dict, Any, List
from rich import print as rprint

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32
```

We begin by installing the required libraries and selecting the Qwen2.5-1.5B-Instruct model from Hugging Face. We set the data type based on GPU availability (bfloat16 on GPU, float32 on CPU) to ensure efficient model execution in Colab.

```python
tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=DTYPE,
    load_in_4bit=True
)
gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tok,
    return_full_text=False
)
```

We load the tokenizer and model, configure the model to run in 4-bit for memory efficiency, and wrap everything in a text-generation pipeline so we can interact with the model easily in Colab.

```python
def chat(prompt: str, system: str = "", max_new_tokens: int = 512, temperature: float = 0.3) -> str:
    msgs = []
    if system:
        msgs.append({"role": "system", "content": system})
    msgs.append({"role": "user", "content": prompt})
    inputs = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    out = gen(inputs, max_new_tokens=max_new_tokens,
              do_sample=(temperature > 0), temperature=temperature, top_p=0.9)
    return out[0]["generated_text"].strip()


def extract_json(txt: str) -> Dict[str, Any]:
    # Prefer a JSON object that closes the response; otherwise take the first one found.
    m = re.search(r"\{[\s\S]*\}$", txt.strip())
    if not m:
        m = re.search(r"\{[\s\S]*?\}", txt)
    try:
        return json.loads(m.group(0)) if m else {}
    except Exception:
        # Fallback: strip code fences and try to parse the remainder.
        s = re.sub(r"^```.*?\n|\n```$", "", txt, flags=re.S)
        try:
            return json.loads(s)
        except Exception:
            return {}
```

We define two helper functions: chat lets us send prompts to the model with optional system instructions and sampling controls, while extract_json parses structured JSON outputs from the model reliably, even if the response includes code fences or additional text.
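Since extract_json does the heavy lifting for every structured step, it is worth a quick check before involving the model. As a small illustrative test (the fenced string below is made up for the test), we confirm that a JSON object wrapped in Markdown fences still parses:

```python
# Illustrative check: the object should be recovered even when the model
# wraps its reply in a ```json ... ``` fence.
fenced = '```json\n{"subgoals": ["step 1", "step 2"], "final_format": "Answer: <value>"}\n```'
print(extract_json(fenced))
# -> {'subgoals': ['step 1', 'step 2'], 'final_format': 'Answer: <value>'}
```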
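A one-line smoke test of chat also helps confirm that the chat template and the 4-bit model are wired up correctly; the exact completion will vary between runs:

```python
# Smoke test (output varies with sampling): expect a short, sensible reply.
print(chat("In one word, what is the capital of France?"))
```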
```python
def extract_code(txt: str) -> str:
    # Pull the body of a ```python ... ``` block; fall back to the raw text.
    m = re.search(r"```(?:python)?\s*([\s\S]*?)```", txt, flags=re.I)
    return (m.group(1) if m else txt).strip()


def run_python(code: str, env: Dict[str, Any] | None = None) -> Dict[str, Any]:
    import io, contextlib
    g = {"__name__": "__main__"}; l = {}
    if env:
        g.update(env)
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, g, l)
        out = l.get("RESULT", g.get("RESULT"))
        return {"ok": True, "result": out, "stdout": buf.getvalue()}
    except Exception as e:
        return {"ok": False, "error": str(e), "trace": traceback.format_exc(), "stdout": buf.getvalue()}


PLANNER_SYS = """You are the HRM Planner.
Decompose the TASK into 2-4 atomic, code-solvable subgoals.
Return compact JSON only: {"subgoals":[...], "final_format":"<one-line answer format>"}."""

SOLVER_SYS = """You are the HRM Solver.
Given SUBGOAL and CONTEXT vars, output a single Python snippet.
Rules:
- Compute deterministically.
- Set a variable RESULT to the answer.
- Keep code short; stdlib only.
Return only a Python code block."""

CRITIC_SYS = """You are the HRM Critic.
Given TASK and LOGS (subgoal results), decide if final answer is ready.
Return JSON only: {"action":"submit"|"revise","critique":"...","fix_hint":"<if revise>"}."""

SYNTH_SYS = """You are the HRM Synthesizer.
Given TASK, LOGS, and final_format, output only the final answer (no steps).
Follow final_format exactly."""
```

We add two important pieces: utility functions and system prompts. The extract_code function pulls Python snippets from the model's output, while run_python executes those snippets and captures their results (note that exec provides no real sandboxing, so it is only suitable for the model's own short, trusted snippets). Alongside these, we define four role prompts (Planner, Solver, Critic, and Synthesizer) that guide the model to break tasks into subgoals, solve them with code, verify correctness, and finally produce a clean answer.
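Before wiring these into the planner loop, we can verify that the executor captures both stdout and RESULT as intended. This is a small illustrative check with a trivial computation:

```python
# Illustrative check: run_python should capture printed output and pick up RESULT.
demo = run_python("print('summing...')\nRESULT = sum(range(10))")
print(demo["ok"], demo["result"], repr(demo["stdout"]))
# -> True 45 'summing...\n'
```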
```python
def plan(task: str) -> Dict[str, Any]:
    p = f"TASK:\n{task}\nReturn JSON only."
    return extract_json(chat(p, PLANNER_SYS, temperature=0.2, max_new_tokens=300))


def solve_subgoal(subgoal: str, context: Dict[str, Any]) -> Dict[str, Any]:
    prompt = f"SUBGOAL:\n{subgoal}\nCONTEXT vars: {list(context.keys())}\nReturn Python code only."
    code = extract_code(chat(prompt, SOLVER_SYS, temperature=0.2, max_new_tokens=400))
    res = run_python(code, env=context)
    return {"subgoal": subgoal, "code": code, "run": res}


def critic(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    pl = [{"subgoal": L["subgoal"], "result": L["run"].get("result"), "ok": L["run"]["ok"]} for L in logs]
    out = chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(pl, ensure_ascii=False, indent=2) + "\nReturn JSON only.",
               CRITIC_SYS, temperature=0.1, max_new_tokens=250)
    return extract_json(out)


def refine(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    sys = "Refine subgoals minimally to fix issues. Return same JSON schema as planner."
    out = chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(logs, ensure_ascii=False) + "\nReturn JSON only.",
               sys, temperature=0.2, max_new_tokens=250)
    j = extract_json(out)
    return j if j.get("subgoals") else {}


def synthesize(task: str, logs: List[Dict[str, Any]], final_format: str) -> str:
    packed = [{"subgoal": L["subgoal"], "result": L["run"].get("result")} for L in logs]
    return chat("TASK:\n" + task + "\nLOGS:\n" + json.dumps(packed, ensure_ascii=False) +
                f"\nfinal_format: {final_format}\nOnly the final answer.",
                SYNTH_SYS, temperature=0.0, max_new_tokens=120).strip()


def hrm_agent(task: str, context: Dict[str, Any] | None = None, budget: int = 2) -> Dict[str, Any]:
    ctx = dict(context or {})
    trace, plan_json = [], plan(task)
    for round_id in range(1, budget + 1):
        logs = [solve_subgoal(sg, ctx) for sg in plan_json.get("subgoals", [])]
        # Feed each subgoal's result back into the shared context for later rounds.
        for L in logs:
            ctx_key = f"g{len(trace)}_{abs(hash(L['subgoal'])) % 9999}"
            ctx[ctx_key] = L["run"].get("result")
        verdict = critic(task, logs)
        trace.append({"round": round_id, "plan": plan_json, "logs": logs, "verdict": verdict})
        if verdict.get("action") == "submit":
            break
        plan_json = refine(task, logs) or plan_json
    final = synthesize(task, trace[-1]["logs"], plan_json.get("final_format", "Answer: <value>"))
    return {"final": final, "trace": trace}
```

We implement the full HRM loop: we plan subgoals, solve each one by generating and running Python (capturing RESULT), then critique the results, optionally refine the plan, and synthesize a clean final answer. We orchestrate these rounds in hrm_agent, carrying intermediate results forward as context so we iteratively improve and stop once the critic says "submit."

```python
ARC_TASK = textwrap.dedent("""
Infer the transformation rule from train examples and apply to test.
Return exactly: "Answer: <grid>", where <grid> is a Python list of lists of ints.
""").strip()

ARC_DATA = {
    "train": [
        {"inp": [[0,0],[1,0]], "out": [[1,1],[0,1]]},
        {"inp": [[0,1],[0,0]], "out": [[1,0],[1,1]]}
    ],
    "test": [[0,0],[0,1]]
}

res1 = hrm_agent(ARC_TASK, context={"TRAIN": ARC_DATA["train"], "TEST": ARC_DATA["test"]}, budget=2)
rprint("\n[bold]Demo 1 —
```
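Because hrm_agent returns the full trace alongside the answer, we can audit how each round went. As a minimal sketch (assuming res1 from the demo above has finished), we print each round's critic verdict and the synthesized answer:

```python
# Each trace entry holds the round's plan, per-subgoal code/results, and the
# critic's verdict, so we can see when the agent decided to submit.
for r in res1["trace"]:
    print(f"round {r['round']}: critic -> {r['verdict'].get('action')}")
print(res1["final"])
```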