Humanoid Robot Control Stack: The Architecture Behind Figure 02 and Tesla Optimus Gen 3

Lede

When Figure released Figure 02’s parkour video in February 2026, the internet fixated on the acrobatics. Two seconds of backflip, thousands of comments. What the reels don’t show is the stack running beneath: the sensor fusion pipeline merging 12+ camera streams and a 9-DOF IMU at 1 kHz, the Model Predictive Controller (MPC) solving rigid-body dynamics 100 times per second, the Vision-Language-Action (VLA) model tokenizing “grab the orange” into motor commands, and the real-time safety kernel that stops the whole thing in 20 milliseconds if something goes wrong. Tesla Optimus Gen 3 runs a similar architecture: a symphony of perception, cognition, motion planning, and low-level actuation that looks seamless only because each layer is obsessively tuned.

This post dismantles that stack from first principles. Not vaporware marketing. Real code, real constraints, real failure modes. Whether you’re building a bipedal prototype, evaluating humanoid APIs for your warehouse, or just tired of hearing “but how does it really work?”, you’ll walk away understanding the physics, the algorithms, and the engineering decisions that separate the demos from the robots that actually ship.

Last Updated: April 18, 2026


TL;DR

  1. Four-layer stack: perception (sensor fusion) → cognition (VLA models) → motion planning (MPC/WBC) → actuation (motor control + safety).
  2. Perception fuses stereo vision, LiDAR, IMU, joint encoders, and force/torque sensors via Extended Kalman Filter (EKF) or Multi-State Constraint Kalman Filter (MSCKF).
  3. Whole-Body Control (WBC) uses MPC to solve contact-constrained rigid-body dynamics in real-time, handling multi-limb coordination and task prioritization.
  4. VLA models (RT-2, OpenVLA, Pi-0) bridge language and action—they tokenize high-level goals into primitive skill sequences executed by the motion planner.
  5. Real-time OS (PREEMPT_RT Linux, deterministic DDS middleware) guarantees control loop latency; watchdogs and safety envelopes enforce ISO 13482 certification.
  6. Modern humanoids (Figure 02, Optimus Gen 3, Atlas Electric) converge on similar architectures: discrete actuators with series elasticity, redundant sensing, and closed-loop feedback at >500 Hz.


Key Concepts

Before diving into stack architecture, anchor these definitions—they recur throughout.

Degrees of Freedom (DoF): Count of independent joint angles. A human arm has 7 DoF (shoulder 3, elbow 1, wrist 3); modern humanoids have 20–50 depending on hands and spine. More DoF = more redundancy (good for obstacle avoidance), higher control complexity (bad for real-time solvers).

Joint Torque (τ): Force of rotation applied at a joint, measured in Newton-meters (N·m). A humanoid control loop computes desired torques τ_des for each motor; low-level controllers track these commands.

Model Predictive Control (MPC): Optimization-based controller that predicts robot behavior over a 10–30 ms horizon, solves for optimal actions (joint accelerations, contact forces), and re-optimizes every time step. Handles constraints naturally (friction cones, joint limits).

Whole-Body Control (WBC): MPC framework that simultaneously controls all limbs and torso, respecting contact constraints (e.g., foot must not slip on ground). Enables coordinated multi-limb tasks.

Vision-Language-Action (VLA) Model: Neural network (e.g., RT-2, OpenVLA) trained on robot-action videos paired with language labels. Given “pick up cup,” outputs action tokens that map to primitive skills (reach, grasp, retract).

Proprioception: Sense of body position via joint encoders. IMU (inertial measurement unit) adds acceleration and angular velocity. Together, they estimate pose without external sensors.

SLAM (Simultaneous Localization and Mapping): Estimates the robot’s location while building a 3D map of the scene in real time, typically by fusing camera and IMU data; essential for navigation.

ROS 2 DDS: Middleware for inter-process communication; deterministic DDS Quality of Service (QoS) settings allow real-time control messages to meet latency deadlines.

PREEMPT_RT: Linux kernel patch that makes the scheduler deterministic, enabling hard real-time guarantees on ordinary CPUs (vs. dedicated real-time OS like QNX).


The Humanoid Control Stack: Top-Level View

[Figure: Humanoid control stack four-layer architecture]

A humanoid control stack is conceptually four layers, each with specific responsibilities:

  1. Perception Layer ingests raw sensor data (cameras, IMU, joint encoders, force sensors) and fuses them to estimate robot state—where the body is, what forces act on it, what objects surround it.

  2. Cognition Layer interprets high-level commands (“pick up the cup”) and decomposes them into task sequences. VLA models live here, converting language and scene understanding into behavior.

  3. Motion Planning Layer (WBC + MPC) translates task sequences into optimal motor commands while respecting physics constraints (gravity, contact friction, motor limits). This is the real-time beat—running 100+ times per second.

  4. Actuation Layer executes motor commands and monitors safety. Closed-loop feedback from torque sensors and joint encoders streams back to the motion planner for correction.

Key insight: Each layer decouples from the one below. Perception doesn’t care how MPC works; cognition doesn’t care whether a command used MPC or a hand-written policy. This separation makes subsystems testable and swappable—Figure 02 may use MPC, while Atlas Electric uses QP-based whole-body control, yet both achieve similar robustness because the interface (task targets + constraints) is the same.


Perception Layer: Fusing Vision, IMU, and Joint Sensors

[Figure: Sensor fusion pipeline — cameras, LiDAR, IMU, joint encoders, F/T sensors into EKF state estimate]

The perception layer is the humanoid’s sensory cortex. A robot with perfect actuators but no feedback cannot walk: uncorrected drift in IMU bias alone will topple it within seconds. Modern humanoids carry dozens of sensors across the body:

  • Stereo/depth cameras (e.g., Intel RealSense D455): 30–60 fps, RGB-D, useful out to several meters. Used for object detection, hand pose estimation, contact point inference.
  • LiDAR (e.g., Ouster OS1): 10–30 Hz, 360° range map, robust to lighting changes. High-level for navigation; used in foot placement planning.
  • 9-DOF IMU (gyroscope + accelerometer + magnetometer): 100–1000+ Hz. Gives orientation and gravity direction; gyro drift is the nemesis (accumulates 10–100°/hour without fusion).
  • Joint encoders: Absolute or incremental, 100–500+ Hz. Measure actual joint angle; allows the planner to detect if a motor didn’t execute the commanded torque.
  • Force/Torque (F/T) sensors: 500+ Hz, typically on wrists and feet. Measure contact forces; essential for balance feedback and compliant manipulation.

Sensor Fusion Strategy: The classic Extended Kalman Filter (EKF) or Multi-State Constraint Kalman Filter (MSCKF) merges these streams. The state estimate includes position, orientation (quaternion), linear and angular velocity, and sensor biases (IMU accel/gyro bias, encoder offsets). A process model predicts state forward (using IMU and joint velocities), and measurement updates from cameras and F/T sensors correct drift.

Example update cycle (1 kHz):
1. IMU reading at t; predict pose forward ~1 ms using last angular velocity.
2. Every 33 ms (30 fps), a camera frame arrives; extract hand bounding box, stereo-match, compute hand position in world frame.
3. EKF innovation: compare predicted hand pose to measured pose, compute covariance-weighted correction.
4. Repeat for foot contact forces from F/T sensors.
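
The cycle above can be sketched with a toy linear Kalman filter: a 1-D stand-in (position/velocity state, IMU-rate prediction, camera-rate correction), not any robot’s actual estimator.

```python
import numpy as np

dt = 0.001                                  # 1 kHz prediction rate
F = np.array([[1.0, dt], [0.0, 1.0]])       # state transition for [pos, vel]
B = np.array([[0.5 * dt**2], [dt]])         # effect of measured acceleration
H = np.array([[1.0, 0.0]])                  # camera observes position only
Q = np.eye(2) * 1e-6                        # process noise (IMU drift)
R = np.array([[1e-3]])                      # camera measurement noise

x = np.zeros((2, 1))                        # state estimate [pos, vel]
P = np.eye(2)                               # estimate covariance

def predict(accel):
    global x, P
    x = F @ x + B * accel
    P = F @ P @ F.T + Q

def update(z_pos):
    global x, P
    y = np.array([[z_pos]]) - H @ x         # innovation
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P

# 1 s of 0.1 m/s² acceleration: IMU predictions at 1 kHz,
# camera corrections every 33 steps (~30 fps), as in the cycle above.
for step in range(1000):
    t = (step + 1) * dt
    predict(accel=0.1)
    if step % 33 == 0:
        update(z_pos=0.5 * 0.1 * t * t)     # simulated stereo position fix
print(round(float(x[1, 0]), 3))             # velocity estimate ≈ 0.1 m/s
```

A real estimator tracks orientation quaternions and IMU biases in the state; the predict/update skeleton is the same.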

Why not just use visual SLAM? Vision alone drifts badly indoors (loop closure takes seconds) and fails in dark warehouses. Fusing IMU (high-frequency, drift-prone) with vision (slow, accurate) gives you fast response and low drift. LiDAR adds robustness but is power-hungry; many research robots use LiDAR, but production humanoids (Figure 02, Optimus Gen 3) rely mainly on stereo vision + IMU to reduce cost and power.


Whole-Body Control with MPC

[Figure: WBC + MPC closed loop — task specification → MPC solver → dynamics → torque command → feedback]

The motion planning layer is where the magic happens. Given a task (“move right hand to (x, y, z), keep feet on ground”), the whole-body controller computes optimal motor torques in real-time.

The MPC Problem

At each time step, the MPC solver formulates an optimization:

minimize  Σ_t ( ||task_error(t)||² + energy_cost(t) )
subject to:
  - rigid-body dynamics: M(q)q̈ + C(q, q̇)q̇ + G(q) = τ + Jᵀ f_contact
  - contact constraints: friction_cone(contact_forces), no penetration
  - actuator limits: τ_min ≤ τ ≤ τ_max, q̇_min ≤ q̇ ≤ q̇_max
  - task bounds: hand_position ∈ workspace, joint_angle ∈ [θ_min, θ_max]

The solver runs over a 10–30 ms prediction horizon (e.g., 30 prediction steps at 1 kHz), computing optimal accelerations and contact forces. Once solved, the first acceleration is integrated and converted to torque commands sent to motors.

Task-Space Prioritization

Humanoids are kinematically redundant: a 30+ DoF robot tracking only ~6 task DoF (hand position + orientation) has infinitely many joint solutions. WBC exploits this via task hierarchies:

  • Primary task: hand reaches (x, y, z) = target.
  • Secondary task: maintain balance (center of mass over support polygon).
  • Tertiary task: minimize joint velocities (energy efficient).

If the solver cannot satisfy all tasks, it respects priority order—hand accuracy trumps energy savings. This hierarchy is essential for manipulation: you can’t drop the cup to save 0.1 J of energy.
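
A common way to implement such hierarchies is nullspace projection: the secondary task consumes only the redundancy the primary task leaves over. A minimal sketch, with random Jacobians standing in for real kinematics:

```python
import numpy as np

np.random.seed(0)
J1 = np.random.randn(3, 7)          # primary: hand position (3 task DoF, 7 joints)
J2 = np.random.randn(1, 7)          # secondary: stand-in for a balance objective
dx1 = np.array([0.1, 0.0, 0.0])     # desired hand velocity
dx2 = np.array([0.05])              # desired secondary-task velocity

J1p = np.linalg.pinv(J1)
N1 = np.eye(7) - J1p @ J1           # projector onto the nullspace of task 1
# Secondary task acts only through the redundancy task 1 leaves over:
qd = J1p @ dx1 + N1 @ np.linalg.pinv(J2 @ N1) @ (dx2 - J2 @ (J1p @ dx1))

# Primary task is met exactly; secondary is met because 4 redundant DoF remain.
print(np.allclose(J1 @ qd, dx1), np.allclose(J2 @ qd, dx2))
```

If the nullspace were exhausted, the pseudoinverse would give the least-squares best effort for the secondary task, which is exactly the priority behavior described above.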

Contact Constraints

Humanoids walk and stand on feet; feet must not slip. The friction cone constraint encodes that tangential contact force F_xy is bounded by normal force F_z times friction coefficient μ:

sqrt(F_x² + F_y²) ≤ μ * F_z
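
A slip monitor is then a one-line check against this bound; the μ and safety margin below are assumed values, not any robot’s calibration.

```python
import math

MU = 0.6          # assumed rubber-on-dry-concrete friction coefficient
MARGIN = 0.9      # react before reaching 100% of the cone

def foot_slipping(fx, fy, fz, mu=MU, margin=MARGIN):
    """Flag the contact when tangential force approaches the friction cone."""
    if fz <= 0:                       # no positive normal force: foot unloaded
        return True
    return math.hypot(fx, fy) > margin * mu * fz

print(foot_slipping(10.0, 0.0, 100.0))   # 10 N tangential vs 54 N bound → False
print(foot_slipping(60.0, 0.0, 100.0))   # beyond the cone → True
```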

MPC solvers (e.g., quadratic-program solvers like OSQP, or commercial tools like Gurobi) handle these linear/quadratic constraints at kHz rates on CPUs, GPUs, or specialized hardware. NVIDIA’s Isaac ecosystem ships GPU-accelerated solver tooling; Figure’s stack likely uses something similar.

Whole-Body Dynamics Model

The rigid-body dynamics equation M(q)q̈ + C(q, q̇)q̇ + G(q) = τ + Jᵀ f_contact is the core constraint. The joint-space inertia matrix M (n×n for an n-DoF robot), Coriolis/centrifugal terms C, and gravity vector G are computed from the robot’s URDF (kinematic and inertial parameters). Errors in mass distribution, center-of-mass offsets, or friction coefficients cause model mismatch, which MPC compensates for via feedback (comparing predicted vs. measured accelerations).
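
The smallest possible instance of this equation, a single revolute joint swinging a point mass in the vertical plane, shows how the terms fit together (toy parameters, not a real robot’s URDF):

```python
import math

m, l, g = 2.0, 0.4, 9.81   # point mass (kg), link length (m), gravity (m/s²)

def inverse_dynamics(theta, theta_dot, theta_ddot):
    """τ = M(q)q̈ + C(q, q̇)q̇ + G(q) for one revolute joint."""
    M = m * l * l                     # inertia of a point mass about the joint
    C = 0.0                           # no Coriolis/centrifugal terms with one joint
    G = m * g * l * math.cos(theta)   # gravity torque (θ = 0 is horizontal)
    return M * theta_ddot + C * theta_dot + G

# Holding still horizontally costs exactly the gravity torque m·g·l:
print(round(inverse_dynamics(0.0, 0.0, 0.0), 3))   # → 7.848 N·m
```

For a 30+ DoF robot the same computation runs through a recursive Newton-Euler or articulated-body algorithm over the URDF tree instead of a closed-form expression.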


Vision-Language-Action (VLA) Models for High-Level Behavior

[Figure: VLA architecture — language command → scene understanding → action decomposition → skill library → WBC execution → closed-loop feedback]

High-level planning (e.g., “pick up the orange, put it in the bowl”) cannot be hand-coded for every environment. VLA models bridge language and action by training on large datasets of robot videos paired with language annotations.

How VLA Models Work

Models like RT-2 (Google DeepMind, 2023) and OpenVLA (Stanford-led, 2024) are based on Vision Transformers (ViT) or multimodal LLMs. The training pipeline:

  1. Dataset: 100k+ robot-action videos with language captions (e.g., “pick up red cube on left”).
  2. Training objective: Given image + language prompt, predict action tokens (e.g., one token per time step, each token representing a primitive skill like “reach forward,” “open gripper”).
  3. Inference: Feed real camera frame + new language prompt; model outputs skill sequence.

The output is not joint angles—it’s action tokens that map to skills. A skill library contains learned or hand-written primitives:

  • Reach: move end-effector to (x, y, z) using motion planner.
  • Grasp: close gripper with force feedback.
  • Retract: move hand back toward body.
  • Place: set object down with controlled descent.
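
Dispatching action tokens into such a library can be sketched as a lookup table; the skill names and string outputs below are purely illustrative, not Figure’s or Tesla’s actual API.

```python
def reach(target):
    return f"reach -> {target}"        # would call the motion planner

def grasp(force):
    return f"grasp at {force} N"       # would close the gripper with force feedback

def retract():
    return "retract"                   # would move the hand back toward the body

def place(target):
    return f"place at {target}"        # would lower the object under control

SKILLS = {"REACH": reach, "GRASP": grasp, "RETRACT": retract, "PLACE": place}

def execute(token_sequence):
    """Run decoded (skill_name, args) pairs in order."""
    return [SKILLS[name](*args) for name, args in token_sequence]

plan = [("REACH", ((0.4, 0.1, 0.9),)), ("GRASP", (15,)), ("RETRACT", ())]
print(execute(plan))
```

In a real stack each skill is a parameterized controller or learned policy, and execution is interruptible so closed-loop feedback can revise the plan mid-sequence.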

Grounding Language in Embodied Action

The key innovation: instead of training on language→joint angles (which doesn’t generalize), VLA models train on language→action tokens, where tokens are semantically tied to behaviors the robot can learn. This lets a model trained on Figure’s arm transfer (with fine-tuning) to Optimus’s arm because both have similar reach/grasp skills.

Inference on a humanoid’s edge GPU (NVIDIA Jetson or similar) runs the ViT encoder at roughly 0.5–2 s per frame, far too slow for closed-loop control, so VLA output is used only for skill selection while the skill routines themselves execute via MPC (100+ Hz). Closed-loop feedback (e.g., grasp force, object position) updates the skill parameters mid-execution.


Real-Time OS, Safety Envelope, and Certification

[Figure: PREEMPT_RT kernel with watchdog, safety envelope, and ISO 13482 certification boundaries]

A humanoid is a 50+ kg machine with 30+ motors. If the control loop stalls for 100 ms, gravity and momentum take over—a falling arm can cause injury. Real-time guarantees are not optional.

Deterministic Kernel

Most humanoid stacks run PREEMPT_RT—a Linux kernel patch that replaces standard Linux’s time-sharing scheduler (optimized for throughput) with a priority-based, preemptible scheduler (optimized for latency). With PREEMPT_RT:

  • The control loop (WBC + MPC) is a highest-priority real-time task (SCHED_FIFO priority 99).
  • Lower-priority tasks (perception, high-level planner, logging) cannot interrupt it.
  • Latency from sensor read to motor command is bounded (typically <1 ms on modern CPUs).

Without PREEMPT_RT, a task context switch could cause 10+ ms jitter, making 1 kHz closed-loop control unreliable.
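
Loop jitter is easy to measure empirically. This portable sketch only measures the achieved period (it does not set SCHED_FIFO, which needs root); on a stock kernel the worst case will visibly exceed the mean.

```python
import time

PERIOD_NS = 1_000_000                           # 1 kHz target period
gaps = []
deadline = time.perf_counter_ns() + PERIOD_NS
for _ in range(1000):
    while time.perf_counter_ns() < deadline:    # busy-wait until the deadline
        pass
    now = time.perf_counter_ns()
    gaps.append(now - (deadline - PERIOD_NS))   # period actually achieved
    deadline += PERIOD_NS

print(f"worst-case period: {max(gaps) / 1e6:.3f} ms vs 1.000 ms target")
```

On PREEMPT_RT with the loop pinned at SCHED_FIFO priority, the worst-case figure tightens toward the target; that tail, not the average, is what makes 1 kHz control reliable.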

Watchdog Timer

A watchdog is a separate hardware timer that fires an interrupt if the main loop doesn’t reset it within a deadline. If the loop stalls (e.g., software deadlock, power transient), the watchdog triggers emergency stop:

  1. Release motor brake (or cut motor power).
  2. Log fault code.
  3. Wait for human operator reset.

Optimus Gen 3 and Figure 02 both employ watchdogs; many industrial manipulators use them too.

Safety Envelope

A safety envelope is a set of constraints enforced at the motor controller level, independent of higher-level software:

  • Joint angle limits (e.g., knee cannot bend backward 180°).
  • Maximum joint velocity (limit peak speed to ~2 rad/s for slow, deliberate motion).
  • Maximum torque (clip to safe levels, e.g., 20 N·m for arm, 100 N·m for leg).
  • Thermal limits (motor can overheat if driven continuously; software monitors temperature and throttles).

Even if the WBC solver sends a command τ = 200 N·m and the motor is limited to 50 N·m, the hardware enforces the limit. This defense-in-depth approach is critical for ISO 13482 (personal care robots) and IEC 61508 (functional safety) compliance.
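
The firmware-level clamp can be sketched in a few lines; the limits and thermal policy below are assumed values for illustration, not any vendor’s actual numbers.

```python
LIMITS = {"arm": 20.0, "leg": 100.0}   # N·m, per the envelope above

def clamp_torque(tau_cmd, joint_group, temp_c, temp_max=85.0):
    """Enforce the torque limit regardless of what the solver commanded."""
    limit = LIMITS[joint_group]
    if temp_c > temp_max:              # thermal throttling: halve the budget
        limit *= 0.5
    return max(-limit, min(limit, tau_cmd))

print(clamp_torque(200.0, "arm", temp_c=40.0))   # solver asked for 200 N·m → 20.0
print(clamp_torque(200.0, "arm", temp_c=90.0))   # overheated motor → 10.0
```

The point of defense-in-depth is that this code runs in the motor controller, below and independent of the WBC software it protects against.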

ISO 13482 Compliance

ISO 13482 (“Safety of Personal Care Robots”) defines safety zones, force/pressure limits, and emergency stop semantics. A humanoid certified to ISO 13482 must:

  • Have an emergency stop button (hard-wired, immediate motor cut).
  • Enforce maximum contact force (e.g., <220 N on torso, <150 N on head).
  • Limit reach to prevent limb entanglement.
  • Perform regular self-checks (motor sensors, watchdog health).

Boston Dynamics Atlas Electric and Tesla Optimus Gen 3 are targeting ISO 13482 certification; most research robots do not yet meet it.


Comparing Modern Humanoid Stacks

| Robot | Compute | Actuation | VLA Model | Battery/Runtime |
| --- | --- | --- | --- | --- |
| Figure 02 | NVIDIA Jetson + edge GPU | Harmonic Drive + series elasticity, 38 DoF | Custom RL-based, RT-2 derivatives | 6 h continuous, 2 kWh battery |
| Tesla Optimus Gen 3 | Tesla custom silicon (Dojo-derived) | Soft actuators (pneumatic assist), 40 DoF | Tesla in-house (trained on Tesla factory data) | 5 h, 2.3 kWh |
| Boston Dynamics Atlas Electric | Proprietary Atlas compute stack | All-electric rotary actuators, 28 DoF | Diffusion-based world models (research phase) | 3 h, ~1.5 kWh |
| 1X Neo | NVIDIA Orin, modular (replicated arms) | Electric motors, series elastic, 28 DoF | Proprietary + OpenVLA integration | 4–6 h, 1.6 kWh |
| Agility Digit | NVIDIA Jetson Xavier, ROS 2 native | Harmonic Drive, 23 DoF | Lightweight fine-tuned VLM + skill library | 8 h (bipedal locomotion), 2 kWh |
| Unitree G1 | Unitree custom SoC | Series elastic servos, 35 DoF | In-house imitation learning | 5 h, 5.2 kWh (larger battery) |

Key observations:

  • Compute convergence: All use GPUs (NVIDIA or custom) + CPUs for deterministic control loops. No one is running on a Raspberry Pi.
  • Actuation: Series elasticity is standard (decouples motor inertia from output, enabling impact absorption and torque feedback control).
  • VLA adoption: Figure and Optimus are production-grade; others in research or early deployment. VLA maturity is the bottleneck for scaling—you need millions of hours of data or aggressive sim2real transfer.
  • Battery: 2–5 kWh, 3–8 h runtime. Weight/power tradeoff favors batteries over onboard generators (which are loud, inefficient).

Edge Cases and Failure Modes

Even well-engineered stacks fail. Here are the gotchas:

Foot Slip During Gait

Problem: Foot-ground contact force drifts below the friction cone threshold (wet floor, oily surface), and the foot slips. MPC-predicted motion diverges from reality.

Solution: Immediate feedback from F/T sensors detects slipping (tangential force exceeds friction bound), and the planner switches to a slower gait or requests human assistance.

Bimanual Planning Failure

Problem: Task requires both hands to grasp an object simultaneously, but kinematic constraints conflict (arms collide, or reach isn’t sufficient).

Solution: Humanoids fall back to sequential tasks or re-position (walk closer). Some research uses learned handover skills (“grip with left, transfer to right”).

Perception Occlusion

Problem: Target object is hidden behind another object, or shadows obscure it. VLA model loses confidence and may hallucinate the object.

Solution: High-quality models (RT-2 v2+) are trained to refuse uncertain predictions (“I don’t see the cup”). Humanoid requests clarification or re-positions for a better view.

Emergency Stop Semantics

Problem: If emergency stop is pressed mid-walk, what should happen? Sudden brake risks tumble. Gradual deceleration takes time.

Solution: ISO 13482 requires a safety-rated emergency stop (SIL 2, typically) that:
1. De-energizes motor drives (motors enter free-wheel or braked mode).
2. Activates ankle/hip servos to “brace for impact” (stiffen for stability).
3. Waits for operator to reset.

Real robots test this constantly in simulation and on the test rig.


Implementation Guide: Building a ROS 2 Humanoid Control Stack

If you’re prototyping a humanoid (or forking an existing design), here’s the roadmap:

1. Define Your Hardware URDF

Write a URDF (Unified Robot Description Format) XML file describing kinematics and inertia. Tools:
  • SolidWorks → URDF: the SolidWorks-to-URDF exporter plugin.
  • Manual: an XML template with link masses, inertia tensors, and joint axes.

Verify mass distribution; bad inertia causes divergence in MPC.
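
A minimal URDF skeleton (one link, one revolute joint) looks like this; the masses, inertias, and limits are toy values for a 0.4 m, 2 kg link, not a real robot’s parameters.

```xml
<?xml version="1.0"?>
<robot name="one_link_arm">
  <link name="base"/>
  <link name="upper_arm">
    <inertial>
      <mass value="2.0"/>
      <origin xyz="0 0 0.2"/>  <!-- CoM halfway along a 0.4 m link -->
      <!-- slender-rod inertia (1/12) m L^2 about the transverse axes -->
      <inertia ixx="0.027" iyy="0.027" izz="0.001" ixy="0" ixz="0" iyz="0"/>
    </inertial>
  </link>
  <joint name="shoulder_pitch" type="revolute">
    <parent link="base"/>
    <child link="upper_arm"/>
    <axis xyz="0 1 0"/>
    <limit lower="-1.57" upper="1.57" effort="20" velocity="2.0"/>
  </joint>
</robot>
```

The effort and velocity limits here should mirror the safety envelope enforced in firmware, so the planner never plans motions the hardware will refuse.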

2. Set Up ROS 2 Node Graph

Architecture:

perception_node (EKF fusion) → /robot_state (tf2 TF, joint states)
              ↓
          wbc_node (MPC solver, WBC) ← hardware_interface_node
              ↓
          motor_controller_drivers

Use ros2_control framework for motor interfacing—abstracts hardware details (CAN, Ethernet, SPI).

3. Configure DDS QoS

ROS 2 uses Data Distribution Service (DDS) for inter-process messaging. Set QoS for determinism:

qos_overrides:
  /robot_state:
    reliability: RELIABLE  # deliver all messages
    durability: TRANSIENT_LOCAL  # subscribers get recent messages on join
    deadline:
      sec: 0
      nsec: 10000000  # 10 ms deadline

4. Integrate MPC Solver

Options:
  • OSQP (open-source, C++, ~10 ms for a 30-DoF problem).
  • Gurobi (commercial, faster at ~2 ms, but licensed).
  • CasADi (open-source optimization language, Python interface).

Example pseudocode:

# MPC loop (1 kHz)
def control_loop():
  state = sensor_fusion_ekf.get_state()
  task_targets = cognition_layer.get_targets()

  q, qd, acc_pred = mpc_solver.solve(
    state=state,
    horizon=30,  # 30 ms
    tasks=task_targets
  )

  tau_des = qd_to_torque(q, qd, acc_pred, state.inertia)
  motor_controller.send_torque(tau_des)

5. Integrate a VLA Model

Download a pre-trained model (e.g., OpenVLA checkpoint from Hugging Face) and run inference on incoming camera frames:

# Illustrative wrapper API; see the OpenVLA repo for the real interface
from vla_model import VLAInference
import torch

vla = VLAInference.from_pretrained("openvla/openvla-7b").to("cuda")
action_tokens = vla.forward(image_tensor, language_prompt="pick up cup")
skill_sequence = token_decoder.decode(action_tokens)

Wrap in a ROS 2 service so the motion planner can call it.

6. Implement Watchdog & Safety

// Watchdog timer via the Linux /dev/watchdog interface
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

int watchdog_fd = open("/dev/watchdog", O_RDWR);
while (1) {
  // Control loop iteration runs here
  ioctl(watchdog_fd, WDIOC_KEEPALIVE, 0);  // pet the watchdog
  usleep(10000);                           // 10 ms period
}

7. Simulate & Test (Gazebo + Isaac Sim)

Before deploying to hardware:
  • Gazebo: free, ROS-native, good for kinematics validation.
  • NVIDIA Omniverse Isaac: more realistic physics, ray tracing for perception sims, supports digital twins.

Test scenarios:
  • Walking on slopes.
  • Grasping deformable objects.
  • Emergency stop latency.
  • Sensor dropout (e.g., camera fails → fall back to IMU-only state estimation).

8. Hardware Bring-Up

  1. Calibrate sensors (IMU bias, encoder zeros).
  2. Run open-loop motor tests (apply 5% torque, measure response time).
  3. Tuning: PID gains for motor controllers, MPC cost weights (task error vs. energy).
  4. Safety drills: e-stop, watchdog reset, thermal throttling.
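
For step 3, a textbook discrete PID against a toy first-order motor model is enough to practice gain tuning before touching hardware; the gains and plant constants below are illustrative.

```python
class PID:
    """Textbook discrete PID; anti-windup and derivative filtering omitted."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0

    def step(self, target, measured):
        err = target - measured
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

pid = PID(kp=2.0, ki=5.0, kd=0.0, dt=0.001)
vel = 0.0
for _ in range(5000):                     # 5 s of a simulated 1 kHz loop
    tau = pid.step(target=1.0, measured=vel)
    vel += (tau - 0.5 * vel) * 0.001      # toy first-order motor model
print(round(vel, 2))                      # settles at the 1.0 rad/s target
```

The same loop shape (re-tuned, with anti-windup) is what runs inside each motor controller; MPC sits one level above, supplying the torque or velocity targets.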

FAQ

Q: Why do humanoids need MPC instead of simpler PID control?

A: PID (proportional-integral-derivative) is reactive—it corrects errors after they occur. MPC is predictive; it looks ahead and prevents errors. For humanoids, this matters: a biped on the verge of tipping needs to shift balance before falling, not after. MPC’s predictive horizon (10–30 ms) is enough time to execute corrective actions.

Q: Do humanoids actually learn on-robot, or is it all pre-trained in simulation?

A: Mostly pre-trained. Figure 02, Optimus, and Atlas rely on sim2real transfer: train RL policies in simulation (unlimited data, fast), then fine-tune on real hardware (a few hours of data collection). Full on-robot learning is slow—humans take years to learn to walk; robots have weeks. The future likely involves continuous learning—logging on-robot failures and retraining in simulation overnight.

Q: What compute hardware do modern humanoids use?

A: NVIDIA Jetson series (Orin, Xavier) for most research robots; Tesla uses in-house silicon derived from Dojo (training cluster). Compute bottleneck is perception (VLA inference, ~0.5–2 s per frame on Jetson), not control (MPC is <10 ms). So roboticists are moving toward edge GPUs with low latency, not high throughput.

Q: Is ROS 2 real-time enough for humanoid control?

A: ROS 2 + PREEMPT_RT + properly configured DDS QoS? Yes, sufficient for 100–500 Hz control loops. Guarantees are not as strong as a dedicated real-time OS (QNX, VxWorks), but for research and early-stage production, it works. Determinism is the challenge, not throughput.

Q: What about safety standards? Can a humanoid go to market?

A: ISO 13482 and IEC 61508 are the relevant standards. Certification is expensive (~$500k) and slow (~12 months), requiring third-party audits, failure analysis (FMEA), and extensive testing. Most humanoids on the market are not yet certified; they’re deployed in controlled environments (factories, research labs). Home/consumer humanoids are 3–5 years away from mainstream certification.


Where Humanoids Are Heading in 2026–2027

The stack is maturing fast:

  • VLA convergence: Models are moving toward multi-modal inputs (language, image, point clouds, scene graphs). Future: “pick up any red object” generalizes without retraining.
  • Hardware efficiency: Soft actuators (pneumatic, series elastic with variable compliance) are enabling safer, more energy-efficient robots. Figure 02’s design hints at this.
  • Compute density: NVIDIA, Tesla, and custom-silicon teams are racing to embed control loops directly on-chip, reducing latency and power. Expect 10x compute improvement by 2027.
  • Certification: The first ISO 13482-certified humanoid is coming in late 2026. Once one ships, others will follow—regulatory barrier is a one-time cost.
  • Sim2real transfer: Generalization of policies across robot morphologies is advancing. A policy trained on Optimus will partially work on Atlas (with fine-tuning), shrinking training time.
  • End-to-end learning: Some research labs are training end-to-end models (camera → motor torque), bypassing explicit MPC. These are promising but not yet robust in production.

The key inflection point: when VLA models trained on multi-robot, multi-environment data become reliable enough to deploy with minimal site-specific tuning. That is the iPhone moment for humanoids, and it is likely less than two years away.


References

Foundational Papers

  • Khatib, O., Sentis, L., & Park, J.-H. (2007). “A Unified Framework for Whole-Body Humanoid Robot Control with Multiple Constraints and Contacts.” European Robotics Research Network (EURON) Grasping Workshop.
  • Brohan, A., et al. (2023). “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” arXiv:2307.15818.
  • Kim, M. J., et al. (2024). “OpenVLA: An Open-Source Vision-Language-Action Model.” arXiv:2406.09246.

Real-Time OS & Safety

  • Hart, D. (2005). “The Linux Kernel PREEMPT_RT Project.” ELC Proceedings.
  • ISO 13482:2014. “Safety of Personal Care Robots—Requirements and Test Methods.”

ROS 2 & Middleware

  • ROS 2 DDS QoS Best Practices: https://docs.ros.org/en/humble/Concepts/Intermediate/About-Quality-of-Service-Settings.html
  • ros2_control Framework: https://control.ros.org/

Sensor Fusion

  • Simon, D. (2006). Optimal State Estimation: Kalman, H-Infinity, and Nonlinear Approaches. Wiley.
  • Mourikis, A. I., & Roumeliotis, S. I. (2007). “A Multi-State Constraint Kalman Filter for Vision-aided Inertial Navigation.” IEEE ICRA.


This post reflects the state of the art as of April 2026. Humanoid stacks are evolving monthly; follow research at arXiv, IEEE ICRA/IROS, and GitHub releases from Boston Dynamics, Tesla, and 1X for the latest.
