{"id":86671,"date":"2026-04-28T15:38:07","date_gmt":"2026-04-28T15:38:07","guid":{"rendered":"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/"},"modified":"2026-04-28T15:38:07","modified_gmt":"2026-04-28T15:38:07","slug":"how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control","status":"publish","type":"post","link":"https:\/\/youzum.net\/it\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/","title":{"rendered":"How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control"},"content":{"rendered":"<p>In this tutorial, we build a simulated embodied vision agent that learns to perceive, plan, predict, and replan directly from pixel observations. We create a grid world rendered entirely in NumPy, in which the agent observes RGB frames rather than symbolic state variables, enabling us to simulate a simplified Vision-Language-Action-style pipeline. We train a lightweight world model that encodes visual input into a latent representation, predicts the next latent state conditioned on the action and the goal, and decodes it into the next frame. 
Using model predictive control in latent space, we enable the agent to sample possible action sequences, evaluate predicted outcomes, and execute the best action in a closed loop.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">import random, numpy as np, torch, torch.nn as nn, torch.nn.functional as F\nimport matplotlib.pyplot as plt\nfrom dataclasses import dataclass\nfrom typing import Tuple, Dict, List\nfrom torch.utils.data import Dataset, DataLoader\n\n\ntry:\n   from tqdm.auto import tqdm\nexcept Exception:\n   def tqdm(x, **kwargs): return x\n\n\nSEED = 7\nrandom.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)\n\n\n# Select the GPU when one is available; otherwise fall back to the CPU.\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nif device.type == \"cuda\":\n   torch.backends.cudnn.benchmark = True\n\n\n@dataclass\nclass WorldConfig:\n   grid_size: int = 8\n   cell_px: int = 14\n   max_steps: int = 45\n   n_obstacles: int = 8\n   spawn_margin: int = 1\n\n\nclass GridWorldRGBNoPIL:\n   ACTIONS = {0:(0,-1),1:(0,1),2:(-1,0),3:(1,0),4:(0,0)}\n   ACTION_NAMES = {0:\"UP\",1:\"DOWN\",2:\"LEFT\",3:\"RIGHT\",4:\"STAY\"}\n\n\n   def __init__(self, cfg: WorldConfig):\n       self.cfg = cfg\n       self.reset()\n\n\n   def reset(self) -&gt; Dict:\n       g = self.cfg.grid_size\n       self.steps = 0\n       def sample_empty(exclude=set()):\n           while True:\n               x = random.randint(self.cfg.spawn_margin, g-1-self.cfg.spawn_margin)\n               y = 
random.randint(self.cfg.spawn_margin, g-1-self.cfg.spawn_margin)\n               if (x,y) not in exclude: return (x,y)\n       self.obstacles = set()\n       ax, ay = sample_empty()\n       gx, gy = sample_empty(exclude={(ax,ay)})\n       used = {(ax,ay),(gx,gy)}\n       for _ in range(self.cfg.n_obstacles):\n           ox, oy = sample_empty(exclude=used)\n           self.obstacles.add((ox,oy))\n           used.add((ox,oy))\n       self.agent = (ax,ay)\n       self.goal = (gx,gy)\n       return {\"image\": self._render_u8()}\n\n\n   def _in_bounds(self, x, y):\n       return 0 &lt;= x &lt; self.cfg.grid_size and 0 &lt;= y &lt; self.cfg.grid_size\n\n\n   def _dist_to_goal(self, pos: Tuple[int,int]) -&gt; float:\n       x,y = pos; gx,gy = self.goal\n       return abs(x-gx)+abs(y-gy)\n\n\n   def _state_vector(self) -&gt; np.ndarray:\n       g = self.cfg.grid_size - 1\n       ax,ay = self.agent; gx,gy = self.goal\n       return np.array([ax\/g, ay\/g, gx\/g, gy\/g], dtype=np.float32)\n\n\n   def step(self, action: int):\n       self.steps += 1\n       dx, dy = self.ACTIONS[int(action)]\n       x,y = self.agent\n       nx, ny = x+dx, y+dy\n       if self._in_bounds(nx,ny) and (nx,ny) not in self.obstacles:\n           self.agent = (nx,ny)\n       done = (self.agent == self.goal) or (self.steps &gt;= self.cfg.max_steps)\n       d_prev = self._dist_to_goal((x,y))\n       d_now = self._dist_to_goal(self.agent)\n       reward = 0.1*(d_prev - d_now) + (1.0 if self.agent == self.goal else 0.0)\n       obs = {\"image\": self._render_u8()}\n       info = {\"state\": self._state_vector()}\n       return obs, float(reward), bool(done), info\n\n\n   def _render_u8(self) -&gt; np.ndarray:\n       g, s = self.cfg.grid_size, self.cfg.cell_px\n       H = W = g*s\n       bg = np.array([245,245,245], np.uint8)\n       gridline = np.array([220,220,220], np.uint8)\n       obstacle_c = np.array([220,70,70], np.uint8)\n       goal_c = np.array([60,180,75], np.uint8)\n       agent_c = 
np.array([65,105,225], np.uint8)\n       img = np.empty((H,W,3), np.uint8); img[...] = bg\n       img[::s,:,:] = gridline\n       img[:,::s,:] = gridline\n       def paint_cell(x,y,color):\n           y0,y1 = y*s,(y+1)*s\n           x0,x1 = x*s,(x+1)*s\n           img[y0+1:y1-1, x0+1:x1-1] = color\n       for (ox,oy) in self.obstacles: paint_cell(ox,oy, obstacle_c)\n       gx,gy = self.goal; paint_cell(gx,gy, goal_c)\n       ax,ay = self.agent; paint_cell(ax,ay, agent_c)\n       return img\n\n\ncfg = WorldConfig()\nenv = GridWorldRGBNoPIL(cfg)\nplt.figure(figsize=(3,3))\nplt.imshow(env.reset()[\"image\"]); plt.axis(\"off\"); plt.title(\"No-Pillow observation\"); plt.show()\n\n\ndef to_tensor_img_u8(img_u8: np.ndarray) -&gt; torch.Tensor:\n   return torch.from_numpy(img_u8).permute(2,0,1).float() \/ 255.0<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We fix the random seeds for reproducibility, define the lightweight grid-world configuration, and instantiate the environment. We implement the RGB renderer entirely in NumPy, so the agent perceives raw pixel observations without depending on an image library such as Pillow. 
We also define the state transition dynamics and prepare image-to-tensor conversion for model training.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">class TransitionDataset(Dataset):\n   def __init__(self, items): self.items = items\n   def __len__(self): return len(self.items)\n   def __getitem__(self, i): return self.items[i]\n\n\ndef collect_transitions(n_episodes=120):\n   items = []\n   e = GridWorldRGBNoPIL(cfg)\n   for _ in tqdm(range(n_episodes), desc=\"Collect\"):\n       obs = e.reset()\n       img_t = to_tensor_img_u8(obs[\"image\"])\n       for _ in range(cfg.max_steps):\n           a = random.randint(0,4)\n           obs2, r, done, info = e.step(a)\n           img_tp1 = to_tensor_img_u8(obs2[\"image\"])\n           st = torch.from_numpy(info[\"state\"]).float()\n           goal = st[2:4].clone()\n           items.append({\n               \"img_t\": img_t,\n               \"action\": torch.tensor(a, dtype=torch.long),\n               \"img_tp1\": img_tp1,\n               \"state_tp1\": st,\n               \"goal\": goal\n           })\n           img_t = img_tp1\n           if done: break\n   return items\n\n\nitems = collect_transitions(n_episodes=120)\nprint(\"Transitions:\", len(items))\nH, W = items[0][\"img_t\"].shape[1], items[0][\"img_t\"].shape[2]\ndl = DataLoader(TransitionDataset(items), batch_size=64, shuffle=True, num_workers=0, 
drop_last=True)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We collect rollout data by allowing the agent to interact randomly with the environment. We construct transitions that map the current image and action to the next image and state representation. We then wrap this data into a PyTorch Dataset and DataLoader to enable efficient mini-batch training.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">class Encoder(nn.Module):\n   def __init__(self, H, W, zdim=64):\n       super().__init__()\n       self.net = nn.Sequential(\n           nn.Conv2d(3, 24, 5, stride=2, padding=2), nn.ReLU(),\n           nn.Conv2d(24, 48, 5, stride=2, padding=2), nn.ReLU(),\n           nn.Conv2d(48, 64, 3, stride=2, padding=1), nn.ReLU(),\n       )\n       with torch.no_grad():\n           f = self.net(torch.zeros(1,3,H,W))\n       self.feat_shape = f.shape[1:]\n       self.fc = nn.Linear(int(np.prod(self.feat_shape)), zdim)\n   def forward(self, x):\n       return self.fc(self.net(x).flatten(1))\n\n\nclass Decoder(nn.Module):\n   def __init__(self, feat_shape, zdim=64):\n       super().__init__()\n       C,h,w = feat_shape\n       self.C,self.h,self.w = C,h,w\n       self.fc = nn.Linear(zdim, C*h*w)\n       self.net = nn.Sequential(\n           nn.ConvTranspose2d(C, 48, 4, stride=2, padding=1), nn.ReLU(),\n           nn.ConvTranspose2d(48, 24, 4, stride=2, padding=1), nn.ReLU(),\n           
nn.ConvTranspose2d(24, 16, 4, stride=2, padding=1), nn.ReLU(),\n           nn.Conv2d(16, 3, 3, padding=1),\n           nn.Sigmoid()\n       )\n   def forward(self, z):\n       x = self.fc(z).view(z.size(0), self.C, self.h, self.w)\n       return self.net(x)\n\n\nclass VLASimLite(nn.Module):\n   def __init__(self, H, W, zdim=64, adim=5):\n       super().__init__()\n       self.enc = Encoder(H,W,zdim)\n       self.dec = Decoder(self.enc.feat_shape, zdim)\n       self.aemb = nn.Embedding(adim, 16)\n       self.gnet = nn.Sequential(nn.Linear(2,16), nn.ReLU(), nn.Linear(16,16))\n       self.dyn = nn.Sequential(\n           nn.Linear(zdim+16+16, 128), nn.ReLU(),\n           nn.Linear(128, zdim)\n       )\n       self.state = nn.Sequential(\n           nn.Linear(zdim, 64), nn.ReLU(),\n           nn.Linear(64, 4),\n           nn.Sigmoid()\n       )\n   def encode(self, img): return self.enc(img)\n   def predict_next_latent(self, z, a, goal):\n       return self.dyn(torch.cat([z, self.aemb(a), self.gnet(goal)], dim=-1))\n   def decode(self, z): return self.dec(z)\n   def forward(self, img_t, a, goal):\n       z = self.encode(img_t)\n       z_next = self.predict_next_latent(z, a, goal)\n       return z_next, self.decode(z_next), self.state(z_next)\n\n\nmodel = VLASimLite(H,W,zdim=64,adim=5).to(device)\nopt = torch.optim.Adam(model.parameters(), lr=2e-3)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We define the compact Vision-Language-Action-inspired world model. We build a CNN encoder to compress visual input into a latent space and condition latent dynamics on actions and goals. 
We also add a decoder and a state-prediction head so the model can reconstruct future frames and predict structured state variables.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">def train(epochs=4):\n   model.train()\n   for ep in range(1, epochs+1):\n       losses = []\n       for b in tqdm(dl, desc=f\"Train {ep}\/{epochs}\"):\n           img_t = b[\"img_t\"].to(device)\n           a = b[\"action\"].to(device)\n           img_tp1 = b[\"img_tp1\"].to(device)\n           st_tp1 = b[\"state_tp1\"].to(device)\n           goal = b[\"goal\"].to(device)\n           z_next, img_pred, st_pred = model(img_t, a, goal)\n           loss = F.l1_loss(img_pred, img_tp1) + 3.0*F.mse_loss(st_pred, st_tp1) + 1e-4*z_next.pow(2).mean()\n           opt.zero_grad(set_to_none=True)\n           loss.backward()\n           nn.utils.clip_grad_norm_(model.parameters(), 2.0)\n           opt.step()\n           losses.append(loss.item())\n       print(\"Epoch\", ep, \"loss\", float(np.mean(losses)))\n\n\ntrain(epochs=4)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We train the world model with a weighted sum of three terms: an L1 reconstruction loss on the predicted next frame, an MSE loss on the predicted state (weighted by 3.0), and a small L2 penalty on the predicted latent. Because all three terms are computed from the predicted next latent, the encoder, dynamics, decoder, and state head are optimized jointly for forward prediction from pixels. 
We keep the architecture lightweight and training stable to ensure smooth execution in constrained runtimes.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">@torch.no_grad()\ndef mpc_action(img_t, horizon=6, n_candidates=120, action_space=5):\n   model.eval()\n   z = model.encode(img_t)\n   st_now = model.state(z)\n   goal = st_now[:,2:4].clamp(0,1)\n   cand = torch.randint(0, action_space, (n_candidates, horizon), device=device)\n   z_roll = z.repeat(n_candidates, 1)\n   goal_k = goal.repeat(n_candidates, 1)\n   for t in range(horizon):\n       z_roll = model.predict_next_latent(z_roll, cand[:,t], goal_k)\n   stT = model.state(z_roll)\n   dist = torch.abs(stT[:,0:2] - stT[:,2:4]).sum(dim=-1)\n   changes = (cand[:,1:] != cand[:,:-1]).float().mean(dim=1)\n   score = dist + 0.12*changes\n   best = torch.argmin(score)\n   return int(cand[best,0].item())\n\n\n@torch.no_grad()\ndef predict_next_frame(img_u8, action):\n   model.eval()\n   img_t = to_tensor_img_u8(img_u8).unsqueeze(0).to(device)\n   z = model.encode(img_t)\n   goal = model.state(z)[:,2:4].clamp(0,1)\n   a = torch.tensor([action], dtype=torch.long, device=device)\n   z_next = model.predict_next_latent(z, a, goal)\n   pred = model.decode(z_next)[0].detach().cpu().permute(1,2,0).numpy()\n   return (pred*255.0).clip(0,255).astype(np.uint8)\n\n\ndef run_episode(max_steps=45):\n   e = GridWorldRGBNoPIL(cfg)\n   obs = 
e.reset()\n   real, pred, acts, rews = [], [], [], []\n   for _ in range(max_steps):\n       img = obs[\"image\"]\n       real.append(img)\n       a = mpc_action(to_tensor_img_u8(img).unsqueeze(0).to(device), horizon=6, n_candidates=120)\n       pred.append(predict_next_frame(img, a))\n       obs, r, done, info = e.step(a)\n       acts.append(a); rews.append(r)\n       if done:\n           real.append(obs[\"image\"])\n           pred.append(pred[-1])\n           break\n   return real, pred, acts, rews\n\n\nreal, pred, acts, rews = run_episode()\nprint(\"Steps:\", len(acts), \"Return:\", round(sum(rews), 3))\n\n\ndef show(real, pred, acts, every=2, panels=8):\n   idxs = list(range(0, min(len(acts), every*panels), every))\n   n = len(idxs)\n   plt.figure(figsize=(2.4*n, 4.8))\n   for j,i in enumerate(idxs):\n       plt.subplot(2,n,j+1); plt.imshow(real[i]); plt.axis(\"off\"); plt.title(f\"Real t={i}\")\n       plt.subplot(2,n,n+j+1); plt.imshow(pred[i]); plt.axis(\"off\"); plt.title(f\"Pred | {GridWorldRGBNoPIL.ACTION_NAMES[acts[i]]}\")\n   plt.tight_layout(); plt.show()\n\n\nshow(real, pred, acts, every=2, panels=8)\nprint(\"Pipeline OK\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We implement model predictive control directly in latent space. At each step, we sample 120 random action sequences of length 6, roll them forward through the learned dynamics, and score each one by the predicted Manhattan distance between agent and goal plus a small penalty for frequent action changes. We execute only the first action of the lowest-scoring sequence and replan from the next observation. We then run the full perceive\u2013plan\u2013predict\u2013replan loop and visualize how the agent\u2019s predicted next frame compares with the actual environment frame.<\/p>\n<p>In conclusion, we implemented a complete perception\u2013planning\u2013prediction loop without relying on external rendering libraries. We trained a compact vision-based world model, used latent dynamics for forward simulation, and replanned at every step using MPC. 
By keeping the architecture lightweight and stable for constrained runtimes, we demonstrated how embodied agents can reason about future outcomes directly from visual inputs. This approach captures the core idea behind modern Vision-Language-Action systems, where perception and decision-making are tightly integrated within a predictive model of the environment.<\/p>\n<hr class=\"wp-block-separator aligncenter has-alpha-channel-opacity is-style-wide\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/Computer%20Vision\/embodied_vla_latent_mpc_agent_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">FULL CODES here<\/a><\/strong>.<\/p>\n
<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/27\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/\">How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we build an embodied simulation vision agent that learns to perceive, plan, predict, and replan directly from pixel observations. We create a fully NumPy-rendered grid world in which the agent observes RGB frames rather than symbolic state variables, enabling us to simulate a simplified Vision-Language-Action-style pipeline. We train a lightweight world model that encodes visual input into a latent representation, predicts future states conditioned on actions and goals, and reconstructs the next frame. Using model predictive control in latent space, we enable the agent to sample possible action sequences, evaluate predicted outcomes, and execute the best action in a closed loop. 
The full post then walks through data collection, the model definition, and training with a loss that combines state prediction and image 
reconstruction<\/p>","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-86671","post","type-post","status-publish","format-standard","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, 
max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/it\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/\" \/>\n<meta property=\"og:locale\" content=\"it_IT\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/it\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-04-28T15:38:07+00:00\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Scritto da\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Tempo di lettura stimato\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minuti\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control\",\"datePublished\":\"2026-04-28T15:38:07+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/\"},\"wordCount\":571,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"it-IT\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/\",\"url\":\"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/\",\"name\":\"How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control - 
YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"datePublished\":\"2026-04-28T15:38:07+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/#breadcrumb\"},\"inLanguage\":\"it-IT\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"it-IT\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association 
Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"it-IT\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"it-IT\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/it\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/it\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/","og_locale":"it_IT","og_type":"article","og_title":"How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/it\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-04-28T15:38:07+00:00","author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"Scritto da":"admin NU","Tempo di lettura stimato":"10 minuti"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/"},"author":{"name":"admin NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"How to Build a 
Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control","datePublished":"2026-04-28T15:38:07+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/"},"wordCount":571,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"it-IT","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/","url":"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/","name":"How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"datePublished":"2026-04-28T15:38:07+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/#breadcrumb"},"inLanguage":"it-IT","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/how-to-build-a-lightweight-vision-language-action-inspired-embodied-agent-with-latent-world-modeling-and-model-predictive-control\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"How to Build a Lightweight Vision-Language-Action-Inspired Embodied Agent with Latent World Modeling and Model Predictive Control"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"it-IT"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association 
Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"it-IT","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"it-IT","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin NU"},"url":"https:\/\/youzum.net\/it\/members\/adminnu\/"}]}},"rttpg_featured_image_url":null,"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/it\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/it\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/it\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a href=\"https:\/\/youzum.net\/it\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/it\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"In this tutorial, we build an embodied simulation vision agent that learns to perceive, plan, predict, and replan directly from pixel observations. We create a fully NumPy-rendered grid world in which the agent observes RGB frames rather than symbolic state variables, enabling us to simulate a simplified Vision-Language-Action-style pipeline. 
We train a lightweight world model&hellip;","_links":{"self":[{"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/posts\/86671","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/comments?post=86671"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/posts\/86671\/revisions"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/media?parent=86671"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/categories?post=86671"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/it\/wp-json\/wp\/v2\/tags?post=86671"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}