In this tutorial, we set out to recreate the spirit of the Hierarchical Reasoning Model (HRM) using a free Hugging Face model that runs locally. We walk through the design of a lightweight yet structured reasoning agent, where we act as both architects and experimenters. By breaking problems into subgoals, solving them with Python, critiquing the outcomes, and synthesizing a final answer, we can experience how hierarchical planning and execution can enhance reasoning performance. This process enables us to see, in real-time, how a brain-inspired workflow can be implemented without requiring massive model sizes or expensive APIs. Check out the Paper and FULL CODES.
!pip -q install -U transformers accelerate bitsandbytes rich
import os, re, json, textwrap, traceback
from typing import Dict, Any, List
from rich import print as rprint
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32We begin by installing the required libraries and loading the Qwen2.5-1.5B-Instruct model from Hugging Face. We set the data type based on GPU availability to ensure efficient model execution in Colab.
tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
   MODEL_NAME,
   device_map="auto",
   torch_dtype=DTYPE,
   load_in_4bit=True
)
gen = pipeline(
   "text-generation",
   model=model,
   tokenizer=tok,
   return_full_text=False
)
We load the tokenizer and model, configure it to run in 4-bit for efficiency, and wrap everything in a text-generation pipeline so we can interact with the model easily in Colab. Check out the Paper and FULL CODES.
def chat(prompt: str, system: str = "", max_new_tokens: int = 512, temperature: float = 0.3) -> str:
   msgs = []
   if system:
       msgs.append({"role":"system","content":system})
   msgs.append({"role":"user","content":prompt})
   inputs = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
   out = gen(inputs, max_new_tokens=max_new_tokens, do_sample=(temperature>0), temperature=temperature, top_p=0.9)
   return out[0]["generated_text"].strip()
def extract_json(txt: str) -> Dict[str, Any]:
   m = re.search(r"{[sS]*}$", txt.strip())
   if not m:
       m = re.search(r"{[sS]*?}", txt)
   try:
       return json.loads(m.group(0)) if m else {}
   except Exception:
       # fallback: strip code fences
       s = re.sub(r"^```.*?n|n```$", "", txt, flags=re.S)
       try:
           return json.loads(s)
       except Exception:
           return {}We define helper functions: the chat function allows us to send prompts to the model with optional system instructions and sampling controls, while extract_json helps us parse structured JSON outputs from the model reliably, even if the response includes code fences or additional text. Check out the Paper and FULL CODES.
def extract_code(txt: str) -> str:
   m = re.search(r"```(?:python)?s*([sS]*?)```", txt, flags=re.I)
   return (m.group(1) if m else txt).strip()
def run_python(code: str, env: Dict[str, Any] | None = None) -> Dict[str, Any]:
   import io, contextlib
   g = {"__name__": "__main__"}; l = {}
   if env: g.update(env)
   buf = io.StringIO()
   try:
       with contextlib.redirect_stdout(buf):
           exec(code, g, l)
       out = l.get("RESULT", g.get("RESULT"))
       return {"ok": True, "result": out, "stdout": buf.getvalue()}
   except Exception as e:
       return {"ok": False, "error": str(e), "trace": traceback.format_exc(), "stdout": buf.getvalue()}
PLANNER_SYS = """You are the HRM Planner.
Decompose the TASK into 2–4 atomic, code-solvable subgoals.
Return compact JSON only: {"subgoals":[...], "final_format":"<one-line answer format>"}."""
SOLVER_SYS = """You are the HRM Solver.
Given SUBGOAL and CONTEXT vars, output a single Python snippet.
Rules:
- Compute deterministically.
- Set a variable RESULT to the answer.
- Keep code short; stdlib only.
Return only a Python code block."""
CRITIC_SYS = """You are the HRM Critic.
Given TASK and LOGS (subgoal results), decide if final answer is ready.
Return JSON only: {"action":"submit"|"revise","critique":"...", "fix_hint":"<if revise>"}."""
SYNTH_SYS = """You are the HRM Synthesizer.
Given TASK, LOGS, and final_format, output only the final answer (no steps).
Follow final_format exactly."""
We add two important pieces: utility functions and system prompts. The extract_code function pulls Python snippets from the model’s output, while run_python safely executes those snippets and captures their results. Alongside, we define four role prompts, Planner, Solver, Critic, and Synthesizer, which guide the model to break tasks into subgoals, solve them with code, verify correctness, and finally produce a clean answer. Check out the Paper and FULL CODES.
def plan(task: str) -> Dict[str, Any]:
   p = f"TASK:n{task}nReturn JSON only."
   return extract_json(chat(p, PLANNER_SYS, temperature=0.2, max_new_tokens=300))
def solve_subgoal(subgoal: str, context: Dict[str, Any]) -> Dict[str, Any]:
   prompt = f"SUBGOAL:n{subgoal}nCONTEXT vars: {list(context.keys())}nReturn Python code only."
   code = extract_code(chat(prompt, SOLVER_SYS, temperature=0.2, max_new_tokens=400))
   res = run_python(code, env=context)
   return {"subgoal": subgoal, "code": code, "run": res}
def critic(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
   pl = [{"subgoal": L["subgoal"], "result": L["run"].get("result"), "ok": L["run"]["ok"]} for L in logs]
   out = chat("TASK:n"+task+"nLOGS:n"+json.dumps(pl, ensure_ascii=False, indent=2)+"nReturn JSON only.",
              CRITIC_SYS, temperature=0.1, max_new_tokens=250)
   return extract_json(out)
def refine(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
   sys = "Refine subgoals minimally to fix issues. Return same JSON schema as planner."
   out = chat("TASK:n"+task+"nLOGS:n"+json.dumps(logs, ensure_ascii=False)+"nReturn JSON only.",
              sys, temperature=0.2, max_new_tokens=250)
   j = extract_json(out)
   return j if j.get("subgoals") else {}
def synthesize(task: str, logs: List[Dict[str, Any]], final_format: str) -> str:
   packed = [{"subgoal": L["subgoal"], "result": L["run"].get("result")} for L in logs]
   return chat("TASK:n"+task+"nLOGS:n"+json.dumps(packed, ensure_ascii=False)+
               f"nfinal_format: {final_format}nOnly the final answer.",
               SYNTH_SYS, temperature=0.0, max_new_tokens=120).strip()
def hrm_agent(task: str, context: Dict[str, Any] | None = None, budget: int = 2) -> Dict[str, Any]:
   ctx = dict(context or {})
   trace, plan_json = [], plan(task)
   for round_id in range(1, budget+1):
       logs = [solve_subgoal(sg, ctx) for sg in plan_json.get("subgoals", [])]
       for L in logs:
           ctx_key = f"g{len(trace)}_{abs(hash(L['subgoal']))%9999}"
           ctx[ctx_key] = L["run"].get("result")
       verdict = critic(task, logs)
       trace.append({"round": round_id, "plan": plan_json, "logs": logs, "verdict": verdict})
       if verdict.get("action") == "submit": break
       plan_json = refine(task, logs) or plan_json
   final = synthesize(task, trace[-1]["logs"], plan_json.get("final_format", "Answer: <value>"))
   return {"final": final, "trace": trace}We implement the full HRM loop: we plan subgoals, solve each by generating and running Python (capturing RESULT), then we critique, optionally refine the plan, and synthesize a clean final answer. We orchestrate these rounds in hrm_agent, carrying forward intermediate results as context so we iteratively improve and stop once the critic says “submit.” Check out the Paper and FULL CODES.
ARC_TASK = textwrap.dedent("""
Infer the transformation rule from train examples and apply to test.
Return exactly: "Answer: <grid>", where <grid> is a Python list of lists of ints.
""").strip()
ARC_DATA = {
   "train": [
       {"inp": [[0,0],[1,0]], "out": [[1,1],[0,1]]},
       {"inp": [[0,1],[0,0]], "out": [[1,0],[1,1]]}
   ],
   "test": [[0,0],[0,1]]
}
res1 = hrm_agent(ARC_TASK, context={"TRAIN": ARC_DATA["train"], "TEST": ARC_DATA["test"]}, budget=2)
rprint("n[bold]Demo 1 — ARC-like Toy[/bold]")
rprint(res1["final"])
WM_TASK = "A tank holds 1200 L. It leaks 2% per hour for 3 hours, then is refilled by 150 L. Return exactly: 'Answer: <liters>'."
res2 = hrm_agent(WM_TASK, context={}, budget=2)
rprint("n[bold]Demo 2 — Word Math[/bold]")
rprint(res2["final"])
rprint("n[dim]Rounds executed (Demo 1):[/dim]", len(res1["trace"]))We run two demos to validate the agent: an ARC-style task where we infer a transformation from train pairs and apply it to a test grid, and a word-math problem that checks numeric reasoning. We call hrm_agent with each task, print the final answers, and also display the number of reasoning rounds the ARC run takes.
In conclusion, we recognize that what we have built is more than a simple demonstration; it is a window into how hierarchical reasoning can make smaller models punch above their weight. By layering planning, solving, and critiquing, we empower a free Hugging Face model to perform tasks with surprising robustness. We leave with a deeper appreciation of how brain-inspired structures, when paired with practical, open-source tools, enable us to explore reasoning benchmarks and experiment creatively without incurring high costs. This hands-on journey shows us that advanced cognitive-like workflows are accessible to anyone willing to tinker, iterate, and learn.
Check out the Paper and FULL CODES. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post A Coding Guide to Building a Brain-Inspired Hierarchical Reasoning AI Agent with Hugging Face Models appeared first on MarkTechPost.
 
				 
				 
				 
					           
					           
					           
					           
					           
					          