3D Pose Estimation with FMPose3D#

Overview#

FMPose3D: monocular 3D pose estimation via flow matching, by Ti Wang, Xiaohang Yu, and Mackenzie Weygandt Mathis.

| Paper | Project Page | GitHub | PyPI |

FMPose3D lifts 2D keypoints from a single image into 3D poses using flow matching, a generative technique based on ODE sampling. It generates multiple plausible 3D pose hypotheses in just a few steps, then aggregates them using a reprojection-based Bayesian module (RPEA) for accurate predictions, achieving state-of-the-art results on human and animal 3D pose benchmarks.

This recipe shows how to use FMPose3D in DeepLabCut for monocular 3D pose estimation. Two pipelines are available:

| Pipeline | 2D Estimator | Skeleton | Joints |
|----------|--------------|----------|--------|
| Human    | HRNet + YOLO | H36M     | 17     |
| Animal   | DeepLabCut SuperAnimal | Animal3D | 26 |

Model weights are hosted on HuggingFace Hub and downloaded automatically on first use.

Prerequisites

Install the fmpose3d package before running this notebook:

pip install fmpose3d

A GPU is recommended but not required; CPU inference works out of the box.

Setup#

Import the DeepLabCut convenience wrapper and a few helpers.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

from deeplabcut.modelzoo.fmpose_3d.fmpose3d import get_fmpose3d_inference_api

Human Pose Estimation (end-to-end)#

The simplest way to get 3D human poses is the end-to-end pipeline. get_fmpose3d_inference_api creates an inference object that handles 2D detection and 3D lifting in a single predict call. Weights are downloaded automatically from HuggingFace on first use.

# Create the human pose API (downloads weights on first call)
human_api = get_fmpose3d_inference_api(
    model_type="fmpose3d_humans",
    device="cuda:0",  # use "cpu" if no GPU is available
)
# Run end-to-end inference on an image
image_path = "path/to/your/image.jpg"  # replace with your image path
result = human_api.predict(source=image_path)

print("3D poses (root-relative):", result.poses_3d.shape)        # (num_frames, 17, 3)
print("3D poses (world coords):", result.poses_3d_world.shape)   # (num_frames, 17, 3)

Accepted input sources#

predict (and prepare_2d) accept a variety of input types (see the sketch after this list):

  • A file path (str or Path) to a single image

  • A directory of images

  • A numpy array, either a single frame (H, W, C) or a batch (N, H, W, C)

  • A list of any of the above
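
For illustration, here is a minimal sketch covering each source type; the paths are placeholders for your own data.

# Minimal sketch of the accepted source types (paths are placeholders)
import cv2

res_file = human_api.predict(source="path/to/image.jpg")           # single image file
res_dir = human_api.predict(source="path/to/image_folder/")        # directory of images

frame = cv2.imread("path/to/image.jpg")                            # (H, W, C) numpy array
res_array = human_api.predict(source=frame)                        # single frame
res_batch = human_api.predict(source=np.stack([frame, frame]))     # (N, H, W, C) batch
res_list = human_api.predict(source=["path/to/image.jpg", frame])  # mixed list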

Animal Pose Estimation (end-to-end)#

Switching to the animal pipeline only requires changing model_type. This pipeline uses DeepLabCut SuperAnimal for 2D detection and outputs 26-joint Animal3D skeletons.

# Create the animal pose API
animal_api = get_fmpose3d_inference_api(
    model_type="fmpose3d_animals",
    device="cuda:0",
)

# Run inference
animal_image_path = "path/to/your/animal_image.jpg"
animal_result = animal_api.predict(source=animal_image_path)

print("3D poses:", animal_result.poses_3d.shape)              # (num_frames, 26, 3)
print("3D poses (regularized):", animal_result.poses_3d_world.shape)

Note

For animals, poses_3d_world contains limb-regularized poses (the skeleton is rotated so that the average limb direction is vertical) rather than poses mapped into world coordinates by a camera-to-world transform.

Two-Step Inference (2D then 3D)#

For more control, you can run the 2D and 3D stages separately. This is useful when you want to inspect or modify 2D keypoints before lifting.

api = get_fmpose3d_inference_api(model_type="fmpose3d_animals", device="cuda:0")

# Step 1: detect 2D keypoints
result_2d = api.prepare_2d(source=animal_image_path)

print("2D keypoints:", result_2d.keypoints.shape)   # (num_persons, num_frames, J, 2)
print("Confidence scores:", result_2d.scores.shape)  # (num_persons, num_frames, J)
print("Image size (H, W):", result_2d.image_size)
# Step 2: lift 2D keypoints to 3D
result_3d = api.pose_3d(
    keypoints_2d=result_2d.keypoints,
    image_size=result_2d.image_size,
)

print("Lifted 3D poses:", result_3d.poses_3d.shape)  # (num_frames, J, 3)

Lifting DeepLabCut 2D Predictions to 3D#

A common workflow is to use a DeepLabCut model you have already trained for 2D pose estimation, then lift those predictions to 3D with FMPose3D. The example below runs DLC inference with deeplabcut.analyze_images and feeds the resulting keypoints straight into the 3D lifter.

Keypoint compatibility

The FMPose3D lifter was trained on specific skeleton layouts (17 H36M joints for humans, 26 Animal3D joints for animals). Your DLC model's bodyparts must match one of these layouts for the lifted poses to be meaningful. If your skeleton differs, you will need to select or re-order the relevant subset of keypoints before calling pose_3d.
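
As a purely illustrative sketch of such a re-ordering (the bodypart names and mapping below are made up; derive them from your own project's config and the target skeleton definition):

# Hypothetical example: re-order DLC bodyparts into the lifter's expected joint order
import numpy as np

dlc_bodyparts = ["nose", "left_eye", "right_eye", "tail_base"]  # your model's order (example)
target_order = ["nose", "right_eye", "left_eye", "tail_base"]   # lifter's expected order (example)
idx = [dlc_bodyparts.index(name) for name in target_order]

keypoints_dlc = np.random.rand(5, len(dlc_bodyparts), 2)        # stand-in (num_frames, J, 2) array
keypoints_for_lifter = keypoints_dlc[:, idx, :]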

import deeplabcut

# ── 1. Run DLC 2D inference ───────────────────────────────────────────────
# analyze_images returns a dict mapping each image path to its predictions.
# Each prediction contains a "bodyparts" array of shape
# (num_individuals, num_bodyparts, 3) where 3 = (x, y, likelihood).

config_path = "path/to/my_dlc_project/config.yaml"
image_paths = ["frame_001.png", "frame_002.png", "frame_003.png"]

predictions = deeplabcut.analyze_images(
    config=config_path,
    images=image_paths,
    shuffle=1,
    device="cuda:0",
)

# ── 2. Extract (x, y) keypoints from each frame ──────────────────────────
# Stack all frames into a single array and take only the first individual.
all_bodyparts = np.stack([
    predictions[img]["bodyparts"][0]  # first individual per frame
    for img in image_paths
])  # shape: (num_frames, num_bodyparts, 3)

keypoints_2d = all_bodyparts[:, :, :2]  # drop likelihood -> (num_frames, J, 2)
print("keypoints_2d shape:", keypoints_2d.shape)
# ── 3. Lift DLC 2D keypoints to 3D ────────────────────────────────────────
# image_size = (height, width) of the frames the DLC model was run on.
import cv2

sample_img = cv2.imread(image_paths[0])
image_size = sample_img.shape[:2]  # (height, width)

api = get_fmpose3d_inference_api(model_type="fmpose3d_animals", device="cuda:0")
result_3d = api.pose_3d(
    keypoints_2d=keypoints_2d,
    image_size=image_size,
    seed=42,  # for reproducible sampling
)

print("3D poses (root-relative):", result_3d.poses_3d.shape)       # (num_frames, J, 3)
print("3D poses (post-processed):", result_3d.poses_3d_world.shape)

Tip

If you are working with video frames from deeplabcut.analyze_videos instead of individual images, you can read image_size from the video:

import cv2
cap = cv2.VideoCapture("path/to/video.mp4")
image_size = (int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)),
              int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)))
cap.release()

You will also need to load the keypoints from the .h5 file that analyze_videos produces:

import pandas as pd
df = pd.read_hdf("path/to/videoDLC_scorer.h5")
scorer = df.columns.get_level_values("scorer").unique()[0]
bodyparts = df[scorer].columns.get_level_values("bodyparts").unique()
coords = df[scorer].values.reshape(len(df), len(bodyparts), 3)
keypoints_2d = coords[:, :, :2]  # (num_frames, num_bodyparts, 2)
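
These arrays can then be lifted exactly as in step 3 above, for example:

result_3d = api.pose_3d(keypoints_2d=keypoints_2d, image_size=image_size)
print("Lifted video poses:", result_3d.poses_3d.shape)  # (num_frames, J, 3)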

Further Reading#