This project explores reinforcement learning for continuous control in a foosball simulation built in MuJoCo. Our goal was to build a foosball simulation grounded in real gameplay video, then train Truncated Quantile Critics (TQC) agents that outperform a Soft Actor-Critic (SAC) baseline in interception rate and control stability.
Algorithm integration
Implemented TQC and revived the SAC baseline; trained and compared both on the foosball environment.
Vision & trajectory pipeline
OpenCV + ArUco calibration and ball tracking for (x,y,t) trajectories from real gameplay as ground truth for the sim.
Custom MuJoCo environment
Table from CAD, rods with sliding/rotation, ball in plane; tuned mass, friction, and damping to match real trajectories.
Reward design & training
Shaped rewards and termination rules; compared SAC vs TQC (stability, goal rate, episode length); full metrics in the paper.
Why TQC?
Both SAC and TQC are off-policy reinforcement learning algorithms for continuous control. TQC, however, replaces scalar Q-value estimates with quantile distributions and discards the highest quantiles when computing targets, statistically filtering out overestimated samples to produce smoother, more conservative value predictions. This matters in foosball, where the ball's motion is stochastic and the environment is noisy, so unchecked value overestimation can destabilize training.
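The truncation step above can be sketched in a few lines of NumPy. This is an illustration of the idea, not our training code (we trained with a full TQC implementation); the shapes and the `drop_per_critic` parameter are assumptions for the example.

```python
import numpy as np

def truncated_target(next_quantiles, reward, gamma, drop_per_critic):
    """TQC-style target: pool quantiles from all critics, sort them,
    drop the highest few, and back up the rest through the Bellman update.

    next_quantiles: (n_critics, n_quantiles) array for the next state.
    """
    pooled = np.sort(next_quantiles.flatten())
    n_drop = drop_per_critic * next_quantiles.shape[0]
    kept = pooled[: len(pooled) - n_drop]   # discard the most optimistic quantiles
    return reward + gamma * kept            # distributional target for each critic

# Two critics, three quantiles each; 9.0 is an overestimated outlier.
critic_q = np.array([[1.0, 2.0, 9.0],
                     [1.5, 2.5, 8.0]])
target = truncated_target(critic_q, reward=0.5, gamma=0.99, drop_per_critic=1)
```

Dropping the top quantiles removes the outlier before it can inflate the target, which is exactly the conservatism that helps in a noisy environment.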
Data Pipeline Details
Calibration: Collected checkerboard frames; estimated camera intrinsics/extrinsics (K, rvec/tvec) with OpenCV.
Undistortion: Applied lens undistortion to gameplay video using estimated parameters.
Ball state extraction: Computed per‑frame ball position and velocity from recordings.
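The last step, turning per-frame pixel detections into table-plane positions and velocities, can be sketched as below. The homography `H` stands in for the calibration result; its values here are hypothetical, and the finite-difference velocity is the simplest estimator, not necessarily the filtering we used.

```python
import numpy as np

def pixel_to_table(H, uv):
    """Map a pixel (u, v) to table-plane (x, y) via a planar homography H."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]                     # homogeneous divide

def ball_states(H, pixels, fps):
    """Per-frame (x, y) positions and finite-difference velocities."""
    xy = np.array([pixel_to_table(H, uv) for uv in pixels])
    v = np.diff(xy, axis=0) * fps           # (m / frame) * (frames / s) = m/s
    return xy, v

# Hypothetical pixels-to-meters homography and three detections at 60 fps.
H = np.diag([0.001, 0.001, 1.0])
xy, v = ball_states(H, [(0, 0), (120, 0), (240, 0)], fps=60)
```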
Ball Trajectory from OpenCV
We extracted ball trajectories from overhead gameplay video to obtain ground-truth (x,y,t) data for validating and tuning the simulator.
Monocular calibration: Implemented calibration using OpenCV with ArUco to detect the table corners and establish the playing-plane coordinate frame.
Ball tracking: Tracked the red ball frame-by-frame using color + contour detection to get (x,y,t) trajectories.
Ground truth for sim: Used this trajectory dataset as ground truth for tuning the simulator’s dynamics.
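A minimal stand-in for the color-based detection step is sketched below. Our pipeline used HSV thresholding plus contour detection in OpenCV; this NumPy-only version uses a plain RGB threshold and a mask centroid, and the threshold values are illustrative assumptions.

```python
import numpy as np

def red_ball_centroid(frame_rgb, r_min=150, gb_max=100):
    """Centroid of red pixels in an (H, W, 3) RGB frame.

    Simple threshold stand-in for the HSV + contour pipeline;
    returns None when no red pixels are found.
    """
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    mask = (r > r_min) & (g < gb_max) & (b < gb_max)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return float(xs.mean()), float(ys.mean())

# Synthetic frame with a 5x5 red blob centered at pixel (62, 42).
frame = np.zeros((100, 100, 3), dtype=np.uint8)
frame[40:45, 60:65, 0] = 255
cx, cy = red_ball_centroid(frame)
```

Running the detector once per frame and stamping each centroid with the frame time yields the (x,y,t) trajectories used as ground truth.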
Simulation Environment
We built a custom MuJoCo environment to match our physical table and support training TQC and SAC agents. The simulation is calibrated so that ball dynamics and rod control transfer meaningfully from sim to real.
Custom MuJoCo environment: Built with a CAD model for the table; calibrated simulation to match table dimensions and coordinate frame.
Rod and ball modeling: Modeled each rod with sliding + rotational joints; ball moves freely in the plane with contact against players and walls.
Physics tuning: Tuned ball mass, friction, and damping so simulated passes and rebounds have similar speed and travel distance to real trajectories.
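The joint layout described above can be sketched as a minimal MJCF fragment. This is an illustration, not our full CAD-derived model; all dimensions, masses, and friction/damping values here are placeholder assumptions, since the real ones were tuned against the recorded trajectories.

```xml
<!-- Minimal MJCF sketch: one rod with slide + hinge joints and a free ball.
     All numeric values are illustrative, not the tuned parameters. -->
<mujoco model="foosball-sketch">
  <worldbody>
    <body name="rod" pos="0 0 0.1">
      <!-- translation along the rod axis, plus rotation about it -->
      <joint name="rod_slide" type="slide" axis="0 1 0" range="-0.1 0.1" damping="0.5"/>
      <joint name="rod_spin" type="hinge" axis="0 1 0" damping="0.1"/>
      <geom type="cylinder" size="0.008 0.3" euler="90 0 0" mass="0.3"/>
    </body>
    <body name="ball" pos="0.2 0 0.02">
      <freejoint/>
      <geom type="sphere" size="0.017" mass="0.024" friction="0.4 0.005 0.0001"/>
    </body>
  </worldbody>
</mujoco>
```

In our environment the slide and hinge joints are the action space, while the ball's mass, friction, and damping are the parameters tuned to match real passes and rebounds.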
Reward & Episode Termination
We found that without shaping, the sparse “goal only” signal wasn’t enough to learn reasonable play in our time budget.
Penalties (per step): control effort (0.001 × squared action magnitude, discouraging violent rod rotation) and a −0.1 time penalty to discourage stalling.
Terminal events: goal scored (+10000, or −10000 for a self-goal); ball stuck (y-axis speed below 0.15 per frame for 40 consecutive steps); max episode length reached (3000 steps).
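The shaping above can be written out directly. The coefficients are the ones listed; the function and argument names are hypothetical, and this is a sketch of the reward logic rather than our exact environment code.

```python
import numpy as np

GOAL_BONUS = 10000.0   # terminal goal reward (negated for a self-goal)
TIME_PENALTY = 0.1     # per-step stalling penalty
CTRL_COEF = 0.001      # penalty coefficient on squared action magnitude

def step_reward(action, scored, self_goal):
    """Per-step reward: time and control penalties plus terminal goal bonuses."""
    r = -TIME_PENALTY
    r -= CTRL_COEF * float(np.sum(np.square(action)))  # discourage violent rod motion
    if scored:
        r += GOAL_BONUS
    if self_goal:
        r -= GOAL_BONUS
    return r

def ball_stuck(vy_history, thresh=0.15, window=40):
    """Terminate when |y-velocity| stays below thresh for `window` consecutive frames."""
    return len(vy_history) >= window and all(abs(v) < thresh for v in vy_history[-window:])

# Example: a non-terminal step with action [1.0, 2.0] costs 0.1 + 0.001 * 5.
r = step_reward(np.array([1.0, 2.0]), scored=False, self_goal=False)
```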
Results
SAC: Converges to a stable return plateau and a much higher goal rate, but episode length collapses (often tens of steps), so episodes end quickly.
TQC: Reaches higher return peaks at times but is less stable; maintains much longer episodes (often thousands of steps), keeping the ball in play longer but converting to goals less often than SAC.
Takeaway: SAC gives stable, consistent, goal-heavy play; TQC can achieve higher returns but is more sensitive to reward/termination design and tends toward longer rallies or timeouts rather than quick goals. Full learning curves and metrics are in the paper.
Course
This project was completed for Computational Aspects of Robotics at Columbia University. The paper below is our final report, written as a group.