Semi-Automatic 3D data annotation using Structure from Motion (SfM) to create large scale datasets


Summary

This article offers a deep dive into using Structure from Motion (SfM) to supercharge 3D Machine Learning dataset annotation. Embracing these cutting-edge technologies can dramatically enhance the quality and efficiency of data annotation for researchers and practitioners alike.

The methods and concepts explored here are built upon the advanced techniques Tesla employs for annotating data in their autonomous vehicles. Dive into more details in the video below! (start at 1:26:33)

Introduction

For training machine learning (ML) models, the quality and accuracy of datasets play a crucial role in model performance. Annotating datasets, especially in 3D, presents a unique set of challenges. To name a few:

  • Annotating the same objects in all camera views is time-consuming and expensive, especially when dealing with large datasets.
  • Depending on the dataset and the objects of interest, a 2D object tracker may not work effectively.
  • Ensuring consistency for all annotations of the same object is difficult.

As stated by Andrej Karpathy, it is much more efficient to annotate the objects in the scene just once using a 3D representation of the real data (or 4D, if you consider the time domain) and propagate the labels to all images at once. The annotation can be extended from 3D bounding boxes to 3D polylines, polygons, and keypoints, to generate 2D segmentation masks, oriented boxes, and much more.

3D reconstructed scene from camera images

Camera images and reconstructed 3D semantic point cloud

To provide you with insights into how this can be done, this article explores how Structure from Motion (SfM) technologies can be leveraged to streamline the 2D and 3D dataset annotation process for static objects, enhancing both efficiency and precision.

Understanding Structure from Motion

Structure from Motion (SfM) is a photogrammetric range imaging technique for estimating 3D structures from 2D image sequences. It involves:

  • Feature Extraction/Matching: Identifying and matching key points in multiple images.
  • Camera Pose Estimation: Determining the position and orientation of the camera for each image.
  • Global Graph Optimization (Bundle Adjustment): Refining all camera poses at once, using the features detected across all the images as constraints.
  • 3D Reconstruction: Using the matched points and optimized camera poses to create a 3D scene model.
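
To make the first two steps more concrete, here is a minimal sketch using OpenCV (the image paths and the intrinsic matrix K are illustrative assumptions, not values from our pipeline): it extracts SIFT features from two overlapping images, matches them, and recovers the relative camera pose from the essential matrix.

import cv2
import numpy as np

# Illustrative inputs: two overlapping frames and an assumed intrinsic matrix K
img1 = cv2.imread("frame_000.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_001.jpg", cv2.IMREAD_GRAYSCALE)
K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 360.0],
              [0.0, 0.0, 1.0]])

# 1) Feature extraction
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 1) Feature matching with Lowe's ratio test
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# 2) Camera pose estimation: essential matrix + relative pose (rotation R, unit-length translation t)
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
print("Relative rotation:\n", R, "\nTranslation direction:\n", t.ravel())

A full SfM pipeline repeats this across many image pairs and then refines everything jointly in the bundle adjustment step.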

The architecture of a global SfM pipeline.

Given that SfM relies solely on visual data as input, it cannot recover the real-world scale, so the output 3D reconstruction is generated at an arbitrary scale. This limitation can be mitigated by using prior scene knowledge, adjusting the scale based on known object or marker sizes, or fusing information provided by other sensors/algorithms such as Visual-Inertial Odometry (VIO).
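
As a simple illustration of the "known object size" approach, the sketch below rescales a reconstruction so that the distance between two manually picked points matches a measured real-world distance (the point cloud files, the picked indices, and the measured length are all hypothetical):

import numpy as np

# Hypothetical inputs: reconstructed points (N x 3, arbitrary scale), camera centers (M x 3),
# two point indices on a known object, and its measured real-world length in meters.
points = np.load("dense_points.npy")           # assumed file with the SfM point cloud
cam_positions = np.load("camera_centers.npy")  # assumed file with the camera centers
idx_a, idx_b = 120, 345                        # hypothetical indices of two points on a known object
measured_length_m = 1.80                       # e.g., a fence segment measured with a tape

reconstructed_length = np.linalg.norm(points[idx_a] - points[idx_b])
scale = measured_length_m / reconstructed_length

# Applying the same scale factor to points and camera centers preserves the geometry
points_metric = points * scale
cam_positions_metric = cam_positions * scale
print(f"Scale factor applied: {scale:.4f}")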

If you want to understand how to generate a full-scale 3D reconstruction using ARKit, Visual SLAM, and SfM, please refer to this previous article. There, we explain how to ensure the generated model keeps the real-world scale.

Using SfM for Dataset Annotation

Two experiments were performed using two different datasets and SfM tools.

For the first, we used images from the Kitti dataset and the Meshroom software for 3D reconstruction of a street scene.

In the second experiment, we recorded our own dataset and ran the Colmap software to generate the dense point cloud and the camera poses.

The entire process for going from the images to the annotated labels is described below:

Workflow Overview

  1. Data Collection: Capture 2D images and video sequences using a mobile phone or camera.
  2. Run SfM: Use SfM algorithms to detect and match features across multiple images. Generate a 3D model from the matched features and camera poses.
  3. Data Annotation: Annotate the features in the 3D model using AR interfaces to add labels and metadata.
  4. 3D -> 2D Projection: Project the annotated 3D features onto each image to generate the 2D label files.

Detailed Steps

Step 1: Data Collection

We began by using a widely recognized benchmark dataset for autonomous vehicles. The images are from the Kitti Stereo dataset, and will be used for our initial experiment. We have selected a small sequence of 40 images in total.

Image sample. Kitti Stereo dataset

Images selected for 3D reconstruction

For the second experiment, we captured images and videos of the target scene using a cellphone camera in a street environment, similar to Tesla’s approach.

When recording the dataset, it is better to lock the camera focus while taking photos and videos. This can usually be done in the Pro mode of the camera app or by tapping and holding a position on the screen.

The goal is to acquire as many variations of perspective, light, and angles as possible from multiple static objects in the scene. The same object needs to be viewed by different images.

We will not use every frame recorded in the video, only frames that have enough motion relative to the previous one.
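
A simple way to perform this frame selection is to track feature displacement between consecutive frames and keep a frame only when the median displacement exceeds a threshold. The sketch below (the video path, output folder, and threshold are illustrative assumptions) uses OpenCV optical flow for this:

import os

import cv2
import numpy as np

VIDEO_PATH = "street_scene.mp4"   # hypothetical input video
MIN_MEDIAN_FLOW = 25.0            # median feature motion (pixels) required before keeping a frame

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture(VIDEO_PATH)
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
kept = 0
cv2.imwrite(f"frames/{kept:05d}.jpg", prev)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Track Shi-Tomasi corners from the last kept frame with Lucas-Kanade optical flow
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01, minDistance=10)
    if pts is None:
        continue
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    valid = status.ravel() == 1
    if valid.sum() < 20:
        continue
    displacement = np.linalg.norm(nxt[valid] - pts[valid], axis=-1)

    # Keep the frame only if the scene has moved enough since the last kept frame
    if np.median(displacement) > MIN_MEDIAN_FLOW:
        kept += 1
        cv2.imwrite(f"frames/{kept:05d}.jpg", frame)
        prev_gray = gray

cap.release()
print(f"Kept {kept + 1} frames for the SfM reconstruction")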

Some examples of collected images are presented below. A total of 430 images were captured. Carefully acquiring more images ensuring good quality and sharpness can improve the results.

Collected dataset using a mobile phone camera

Step 2: Run SfM

During the experiments, we explored two well-known open-source SfM tools:

  • Meshroom
  • Colmap

Both have advantages and disadvantages, and you may also want to test other implementations if you are not getting the desired results with your dataset.

For the first experiment, the Meshroom software was used. It has an intuitive user interface and is easy to install.

Drag and drop the images in the left panel. Then, click on the Start button at the top to run the pipeline. It might take a while depending on the number of images and the computer specs. You can follow the process status and steps in the ‘Task Manager’ tab.

Meshroom user interface and Kitti dataset 3D reconstruction

The output will contain an .obj mesh file and the cameras.sfm file, which can be extracted by right-clicking on the Texturing and StructureFromMotion nodes of the pipeline, respectively, as shown in the image below.

Meshroom SfM pipeline

The image below shows the mesh file obtained in the output of the texturing module. The mesh quality can be improved by adding more images from different perspectives and by improving the captured image resolution and quality.

3D reconstructed mesh and camera poses

Most labeling tools do not support mesh files, so you will also need to generate the dense point cloud (.ply). On Meshroom 2020 you can just click on the option “Save Raw Dense Point Cloud”.

For the second experiment, we used the Colmap software to perform the same process. The image below shows all the camera views from the dataset (in red) and the sparse point cloud. The next one shows the dense point cloud result.

Colmap sparse point cloud and camera poses

Colmap dense point cloud visualization

Step 3: Annotation

In the point cloud data, we can see poles, traffic signs, and other static objects. The 3D visualization allows us to annotate these objects only once in 3D. These labels can then be propagated by projecting them onto all the images in the dataset that view the object.

The chosen annotation tool was CVAT, given that it is well maintained, open-source, and easy to test online: https://www.cvat.ai/

Follow the steps below to set up the annotation process:

  • Convert the dense point cloud to .pcd format (a conversion sketch is shown right after this list)
  • Create a new Task and set up the label class names

CVAT labeling tool. Task setup

  • Create the folder structure below and copy the .pcd file to the point cloud folder. You can leave the ‘related_images’ folder empty.
  • Compress the two folders to a .zip file. This .zip file will be uploaded to CVAT
  • After finalizing the task creation, click on the created job (e.g., Job #1051147) to open the annotation window
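
For the conversion step above, a minimal sketch using the Open3D library (assuming the dense reconstruction was exported as dense.ply and the point cloud folder from the structure above already exists) could look like this:

import open3d as o3d

# Read the dense point cloud produced by Meshroom/Colmap (.ply) and write it as .pcd for CVAT
pcd = o3d.io.read_point_cloud("dense.ply")        # assumed output file name
print(pcd)                                        # quick sanity check: number of points
o3d.io.write_point_cloud("pointcloud/dense.pcd", pcd)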

The images below show an example of the annotated labels for the first and second experiments.

CVAT 3D annotation. Kitti dataset (first experiment)

CVAT 3D annotation. Second experiment

Export the annotation in the Task panel once all the objects are labeled.

Step 4: 3D -> 2D Projection

The next step is to project these 3D labels onto the 2D images used in the reconstruction. This process involves using the camera poses estimated during the SfM pipeline. Projecting 3D labels onto 2D images ensures that annotations can be utilized in various Machine Learning applications, including object detection and segmentation.

Extract Camera Poses

Meshroom and Colmap generate a set of camera poses during the SfM process. These poses include the rotation matrix and translation vector for each image frame, which describe the camera’s orientation and position in the 3D space.

  1. Open the Meshroom project and navigate to the “StructureFromMotion” node.
  2. Export the camera poses, typically found in a file named cameras.sfm or similar. (In Colmap, save the cameras.txt and images.txt files)
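
For Colmap, the exported images.txt stores one pose line per image: a quaternion and a translation that map world coordinates to camera coordinates. A minimal parsing sketch in Python could look like the following (the file path is an assumption; Meshroom's cameras.sfm is JSON and can be read similarly with the json module):

import numpy as np

def quat_to_rotmat(qw, qx, qy, qz):
    # Standard conversion of a unit quaternion (w, x, y, z) to a 3x3 rotation matrix
    return np.array([
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw),     2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw),     1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw),     2 * (qy * qz + qx * qw),     1 - 2 * (qx * qx + qy * qy)],
    ])

def read_colmap_images_txt(path):
    """Parse Colmap's images.txt into {image_name: 3x4 [R|t]} world-to-camera extrinsics."""
    extrinsics = {}
    with open(path) as f:
        lines = [l for l in f if not l.startswith("#")]
    # images.txt stores two lines per image: the pose line, then the 2D-points line
    for pose_line in lines[0::2]:
        elems = pose_line.split()
        if len(elems) < 10:
            continue
        qw, qx, qy, qz, tx, ty, tz = map(float, elems[1:8])
        name = elems[9]
        R = quat_to_rotmat(qw, qx, qy, qz)
        t = np.array([[tx], [ty], [tz]])
        extrinsics[name] = np.hstack([R, t])
    return extrinsics

extrinsics = read_colmap_images_txt("sparse/images.txt")  # the path is an assumption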

Project to 2D Image Plane

For each 3D label, transform its coordinates using the camera poses. The transformation involves applying the rotation and translation (camera extrinsic parameters) to convert the 3D coordinates from the model’s coordinate system to the camera’s coordinate system.

Once the 3D points are transformed into the camera’s coordinate system, project them onto the 2D image plane using the camera’s intrinsic parameters, which include the focal length and principal point.

Mathematical Formulation

The projection follows the standard pinhole camera model:

s [u, v, 1]^T = K [R | t] [X, Y, Z, 1]^T,   with   K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]

where:

  • [X, Y, Z] are the coordinates of the 3D bounding box vertices in the model’s coordinate system.
  • [u, v] are the coordinates of the projected point in image pixels.
  • [R|t] are the extrinsic parameters (rotation and translation from world to camera coordinates).
  • s is the depth of the point in the camera frame, removed by the perspective division.
  • fx, fy are the focal lengths expressed in pixel units.
  • (cx, cy) is the principal point (usually near the image center).

PseudoCode:

import cv2
import numpy as np

# Per-image intrinsic matrices [3,3] (fx, fy, cx, cy come from cameras.sfm / cameras.txt)
intrinsics = [np.array([[fx, 0.0, cx],
                        [0.0, fy, cy],
                        [0.0, 0.0, 1.0]]), ...]

# Per-image extrinsic matrices [R|t] of shape [3,4] (world -> camera)
extrinsics = [np.array([[1.0, 0.0, 0.0, tx],
                        [0.0, 1.0, 0.0, ty],
                        [0.0, 0.0, 1.0, tz]]), ...]

name_imgs = ['img1.jpg', 'img2.jpg', ...]

# One entry per annotated 3D box: its 8 vertices as an [8,3] array in world coordinates
annotation_bbox_vertices = [box1_3d_vertices, box2_3d_vertices, ...]

# Loop over all camera poses from the SfM output
for pos in range(len(name_imgs)):

    # visualization image
    rgb = cv2.imread(name_imgs[pos], 1)

    intrinsic_matrix = intrinsics[pos]   # [3,3]
    extrinsic_matrix = extrinsics[pos]   # [3,4]

    # Generate projection matrix
    P = np.matmul(intrinsic_matrix, extrinsic_matrix)  # [3,4]

    for vertices in annotation_bbox_vertices:
        # Append a column of ones to get homogeneous coordinates [8,4]
        vertices_hom = np.hstack([vertices, np.ones((vertices.shape[0], 1))])

        # Transform points from world coordinates to the image plane: [3,8]
        pts = np.matmul(P, vertices_hom.T)

        # Perspective division by the depth (third row), keeping (u, v) per vertex: [8,2]
        pts_2d = (pts[:2] / pts[2]).T

        # dtype int to match pixel coordinates
        pts_2d = np.round(pts_2d).astype(int)

        # Draw the projected 3D box and handle occlusion (helper from the annotation scripts)
        draw_projected_box3d(rgb, pts_2d, color, 2)

The images below show the 3D labels projected to the image.

3D labels projected to the image views. A single 3D label can be projected to all camera views

3D labels projected to the image views
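
To turn the projected vertices into the 2D labels used for object detection, one simple option is to take the axis-aligned bounding box of the eight projected points and clip it to the image bounds. The hypothetical helper below sketches that idea using the names from the pseudocode above; in practice, boxes whose vertices have negative depth (behind the camera) should be discarded before this step.

def projected_box_to_2d_label(pts_2d, img_w, img_h):
    """Axis-aligned 2D box (x_min, y_min, x_max, y_max) from the 8 projected vertices."""
    x_min, y_min = pts_2d.min(axis=0)
    x_max, y_max = pts_2d.max(axis=0)
    # Clip to the image so partially visible objects still produce valid labels
    x_min, x_max = max(0, x_min), min(img_w - 1, x_max)
    y_min, y_max = max(0, y_min), min(img_h - 1, y_max)
    if x_min >= x_max or y_min >= y_max:
        return None  # the object does not appear in this view
    return x_min, y_min, x_max, y_max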

Challenges and Considerations

Data Quality

The accuracy of SfM and AR annotations heavily depends on the quality of the captured data. Ensure high-resolution images and adequate lighting conditions during data collection. It is also a good practice to ensure homogeneous lighting conditions throughout the dataset, or at least smooth transitions between sequential images.

Label Occlusion

When reconstructing a scene and determining the camera position, it’s important to keep in mind that certain 3D labels may be obstructed by other objects in certain camera views. This must be taken into account and managed when projecting the labels. One approach to addressing this is to split a large dataset into smaller subsets.

Dynamic objects

Moving objects in the scene, such as other cars, will not be reconstructed, because points on those objects do not fit the motion of the static scene during the optimization process and are therefore discarded as outliers.

However, having knowledge of the camera poses makes it easier to label the dynamic objects in a few images and to extend these labels to other views by estimating a motion model.

Computational Resources

3D reconstruction and annotation can be computationally intensive. Access to powerful hardware and optimization techniques is essential.

Conclusion

Integrating Structure from Motion in a 3D data labeling pipeline offers a promising solution to the challenges faced in machine learning data preparation. By substantially improving the labeling time in large datasets and enhancing accuracy, these technologies pave the way for more robust and effective ML models.

This article only scratches the surface of how tasks surrounding AI model training and validation can be optimized. Other computer vision techniques can also be applied to build a robust solution.

If you’re looking for an engineered solution, please contact us. We have extensive experience in training, validating, and deploying complex AI models for companies worldwide and would be happy to provide a free consultation.

About us

At dtLabs, we believe that great research comes from a combination of technical excellence and a commitment to solving real-world problems. Our private research lab is staffed by world-class experts who are dedicated to pushing the boundaries of science and technology. We are passionate about what we do, and we bring that passion to every project we work on. Our goal is to create solutions that not only meet our clients’ needs but exceed their expectations. When you work with dtLabs, you can trust that you’re partnering with a team that is committed to excellence in every aspect of our work.
