Implement Visual SLAM in MATLAB
Visual simultaneous localization and mapping (vSLAM) is the process of calculating the position and orientation of a camera with respect to its surroundings while simultaneously mapping the environment. The process uses only visual inputs from the camera. Applications for visual SLAM include augmented reality, robotics, and autonomous driving. For a general description of why SLAM matters and how it works for different applications, see What is SLAM?
Visual SLAM algorithms are broadly classified into two categories, depending on how they estimate the camera motion. The indirect, feature-based method uses feature points of images to minimize the reprojection error. The direct method uses the overall brightness of images to minimize the photometric error. Computer Vision Toolbox™ provides functions for performing feature-based visual SLAM. The workflow consists of map initialization, tracking, local mapping, loop detection, and drift correction.
The workflow described in this overview applies to images taken by a pinhole camera. To use the visual SLAM workflow with images taken by a fisheye camera, first convert the fisheye camera into a virtual pinhole camera.
Terms Used in Visual SLAM
Visual SLAM literature uses these common terms:
Key Frames — A subset of video frames that contain cues for localization and tracking. Two consecutive key frames usually indicate a large visual change caused by a camera movement.
Map Points — A list of 3-D world points that represent the map of the environment reconstructed from the key frames.
Covisibility Graph — A graph with key frames as nodes. Two key frames are connected by an edge if they share common map points. The weight of an edge is the number of shared map points.
Recognition Database — A database that stores the visual word-to-image mapping based on the input bag of features. Determine whether a place has been visited in the past by searching the database for an image that is visually similar to the query image.
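The covisibility graph defined above can be sketched with the MATLAB graph object alone. This is a minimal illustration, not part of the toolbox workflow: the observation lists in obs and the minSharedPoints threshold are hypothetical values chosen for the example.

```matlab
% Hypothetical observations: obs{k} lists the IDs of the map points seen in key frame k
obs = {[1 2 3 4 5], [3 4 5 6], [5 6 7 8], [9 10]};
numKeyFrames = numel(obs);

% Edge weight = number of map points shared by two key frames
W = zeros(numKeyFrames);
for i = 1:numKeyFrames
    for j = i+1:numKeyFrames
        W(i,j) = numel(intersect(obs{i}, obs{j}));
    end
end
W = W + W.';                 % symmetric adjacency matrix

% Zero entries produce no edge, so key frame 4 stays isolated
covisGraph = graph(W);
disp(covisGraph.Edges)

% Keep only strongly connected pairs (threshold chosen for illustration)
minSharedPoints = 2;
weakEdges = find(covisGraph.Edges.Weight < minSharedPoints);
strongGraph = rmedge(covisGraph, weakEdges);
```

Here key frames 1 and 2 share three map points, so their edge survives the threshold, while the edge between key frames 1 and 3, which share a single point, is removed.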
Typical Feature-based Visual SLAM Workflow
To construct a feature-based visual SLAM pipeline on a sequence of images, follow these steps:
Initialize Map — Initialize the map of 3-D points from two image frames. Compute the 3-D points and relative camera pose by using triangulation based on 2-D feature correspondences.
Track Features — For each new frame, estimate the camera pose by matching features in the current frame to features in the last key frame.
Create Local Map — If you identify the current frame as a key frame, create a new 3-D map of points. Use bundle adjustment to refine the camera pose and 3-D points.
Detect Loops — Detect loops for each key frame by comparing the current frame to all previous key frames using the bag-of-features approach.
Correct Drift — Optimize the pose graph to correct the drift in the camera poses of all the key frames.
The figure illustrates a typical feature-based visual SLAM workflow. It also shows the points at which data is stored or retrieved from objects that manage the data.
Key Frame and Map Data Management
Use the view set, point set, and transformation objects to manage key frames and map data.
Use an imageviewset object to manage data associated with the odometry and mapping process. The object contains data as a set of views and pairwise connections between views. You can also use the object to build and update a pose graph.
Each view consists of the absolute camera pose and the feature points extracted from the image. Each view, with its unique identifier (view ID), within the view set forms a node of the pose graph.
Each connection stores information that links one view to another view. The connection includes the indices of matched features between the views, the relative transformation between the connected views, and the uncertainty in computing the measurement. Each connection forms an edge in the pose graph.
Use a rigidtform3d object input with imageviewset to store the absolute camera poses and the relative camera poses of odometry edges. Use a simtform3d object input with imageviewset to store the relative camera poses of loop-closure edges.
Use a worldpointset object to store correspondences between 3-D map points and 2-D image points across camera views.
The worldpointset object stores the 3-D locations of map points.
The worldpointset object also stores the view IDs of the key frames that observe the map points.
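The following sketch shows one way the view set and the world point set might be populated for two key frames. It assumes Computer Vision Toolbox R2022b or later. All point coordinates, poses, and matches are hypothetical values invented for illustration, not the output of a real detector.

```matlab
% Hypothetical 2-D feature locations (pixels) detected in two key frames
points1 = [100 120; 220 140; 310 260];
points2 = [ 90 118; 205 139; 300 255];

% Absolute poses: view 1 at the world origin, view 2 shifted 1 unit along x
absPose1  = rigidtform3d;
absPose2  = rigidtform3d(eye(3), [1 0 0]);
relPose12 = rigidtform3d(eye(3), [1 0 0]);   % pose of view 2 relative to view 1

% Each view is a node, and the connection is an odometry edge of the pose graph
vSet = imageviewset;
vSet = addView(vSet, 1, absPose1, Points=points1);
vSet = addView(vSet, 2, absPose2, Points=points2);
matches = uint32([1 1; 2 2; 3 3]);           % feature k in view 1 matches feature k in view 2
vSet = addConnection(vSet, 1, 2, relPose12, Matches=matches);

% Hypothetical triangulated map points, one per matched feature pair
xyzPoints = [-1.0 0.2 8.0; 0.1 0.3 9.5; 0.9 1.1 7.2];
wpSet = worldpointset;
[wpSet, pointIdx] = addWorldPoints(wpSet, xyzPoints);
wpSet = addCorrespondences(wpSet, 1, pointIdx, matches(:,1));
wpSet = addCorrespondences(wpSet, 2, pointIdx, matches(:,2));

% Query which map points view 2 observes, and through which features
[ptsInView2, featIdxInView2] = findWorldPointsInView(wpSet, 2);
```

Because both views here have identity rotation, the relative pose is simply the difference of the two translations. With rotated views, the relative pose must be computed from both absolute poses.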
Map Initialization
To initialize mapping, you must match features between two images, estimate the relative camera pose, and triangulate initial 3-D world points. This workflow commonly uses Speeded-Up Robust Features (SURF) and Oriented FAST and Rotated BRIEF (ORB) point features. The map initialization workflow consists of detecting, extracting, and matching features, and then estimating the relative camera pose, finding the 3-D locations of the matched features, and refining the initial map. Finally, store the resulting key frames and map points in an image view set and a world point set, respectively.
|Workflow Step||Description|
|1. Detect||Detect SURF, ORB, or SIFT features.|
|2. Extract||Extract feature vectors and their corresponding locations in a binary or intensity image.|
|3. Match||Obtain the indices of the matching features between two feature sets.|
|4. Estimate relative camera pose from matched feature points||Compute a homography or estimate the fundamental matrix from matching point pairs, and then compute the relative camera pose, represented as a rigidtform3d object.|
|5. Find 3-D locations of the matched feature points||Find the 3-D locations of matching pairs of undistorted image points.|
|6. Refine initial map||Refine the 3-D map points and camera poses to minimize reprojection errors.|
|7. Manage data for initial map and key frames||Add the two views, formed by the feature points and their absolute poses, to the image view set. Add the odometry edge, defined by the connection between successive key views and formed by the relative pose transformation between the cameras, to the image view set. Add the initial map points to the world point set. Add the 3-D to 2-D projection correspondences between the key frames and the map points to the world point set.|
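Steps 4 and 5 of the table can be sketched on synthetic data, which avoids depending on particular image files. The sketch assumes Computer Vision Toolbox R2022b or later. The intrinsics, the scene points in xyzTrue, and the second camera pose are hypothetical, and exact projections stand in for the detected and matched features of steps 1 through 3. Refinement (step 6) is omitted for brevity.

```matlab
rng(0);  % make the synthetic scene and RANSAC sampling reproducible

% Assumed pinhole intrinsics: focal length, principal point, image size [rows cols]
intrinsics = cameraIntrinsics([800 800], [320 240], [480 640]);

% Hypothetical ground truth: 100 points in front of both cameras
xyzTrue   = [3*rand(100,1)-1.5, 2*rand(100,1)-1, 8+4*rand(100,1)];
pose1     = rigidtform3d;                       % first camera defines the world frame
pose2True = rigidtform3d([0 -5 0], [1 0 0]);    % small rotation, unit-length baseline

% Stand-in for steps 1-3: exact projections replace detected and matched features
matched1 = world2img(xyzTrue, pose2extr(pose1), intrinsics);
matched2 = world2img(xyzTrue, pose2extr(pose2True), intrinsics);

% Step 4: fundamental matrix, then relative pose (translation is known only up to scale)
[F, inliers] = estimateFundamentalMatrix(matched1, matched2, Method="RANSAC");
in1 = matched1(inliers,:);
in2 = matched2(inliers,:);
relPose = estrelpose(F, intrinsics, in1, in2);

% Step 5: triangulate with camera 1 at the origin and camera 2 at the relative pose
camMatrix1 = cameraProjection(intrinsics, pose2extr(pose1));
camMatrix2 = cameraProjection(intrinsics, pose2extr(relPose));
[xyz, reprojErr, isInFront] = triangulate(in1, in2, camMatrix1, camMatrix2);

fprintf("Median reprojection error: %.3g pixels\n", median(reprojErr));
```

With real images, estrelpose can return more than one candidate pose when the geometry is ambiguous, so a production pipeline would check the result before triangulating.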
Tracking
The tracking workflow uses every frame to determine when to insert a new key frame. Use these steps for the tracking workflow.
|Match extracted features||Match extracted features from the current frame with features in the last key frame that have known 3-D locations.|
|Estimate camera pose||Estimate the current camera pose.|
|Project map points||Project the map points observed by the last key frame into the current frame.|
|Search for feature correspondences||Search for feature correspondences within spatial constraints.|
|Refine camera pose||Refine the camera pose with 3-D to 2-D correspondence by performing a motion-only bundle adjustment.|
|Identify local map points||Identify points in the view and points that correspond to point tracks.|
|Search for more feature correspondences||Search for more feature correspondences in the current frame, which contains projected local map points.|
|Refine camera pose||Refine the camera pose with 3-D to 2-D correspondence by performing a motion-only bundle adjustment.|
|Store new key frame||If you determine that the current frame is a new key frame, add it and its connections to covisible key frames to the image view set.|
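The pose estimation and refinement rows of the table can be sketched as follows, again on synthetic data and assuming Computer Vision Toolbox R2022b or later. The map points, the true pose, and the noise level are hypothetical. In a real pipeline, the 3-D to 2-D correspondences come from feature matching against the last key frame.

```matlab
rng(1);  % reproducible synthetic data
intrinsics = cameraIntrinsics([800 800], [320 240], [480 640]);

% Hypothetical map points already triangulated from earlier key frames
xyzMap = [3*rand(150,1)-1.5, 2*rand(150,1)-1, 8+4*rand(150,1)];

% Hypothetical true pose of the current frame and noisy 2-D observations of the map points
truePose    = rigidtform3d([0 3 0], [0.4 0 0.2]);
imagePoints = world2img(xyzMap, pose2extr(truePose), intrinsics) + 0.3*randn(150,2);

% Estimate the camera pose from 3-D to 2-D correspondences, rejecting outliers
[initPose, inlierIdx] = estworldpose(imagePoints, xyzMap, intrinsics, ...
    MaxReprojectionError=3);

% Motion-only bundle adjustment: refine the pose while the map points stay fixed
refinedPose = bundleAdjustmentMotion(xyzMap(inlierIdx,:), imagePoints(inlierIdx,:), ...
    initPose, intrinsics, PointsUndistorted=true);

% Project the map points into the frame, for example to guide a radius-based search
projected = world2img(xyzMap, pose2extr(refinedPose), intrinsics);
```

The projected locations are the kind of position estimate that a spatially constrained feature search can use in the next matching pass.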
Feature matching is critical in the tracking workflow. Use the matchFeaturesInRadius function to return more putative matches when an estimate of the positions of the matched feature points is available. The workflow uses two feature matching functions:
matchFeatures: Returns the indices of the matching features in the two input feature sets.
matchFeaturesInRadius: Returns the indices of the matching features in the two input feature sets that satisfy spatial constraints.
To get a greater number of matched feature pairs, increase the values of the matching threshold arguments of the matchFeatures and matchFeaturesInRadius functions. You can discard the outlier pairs after performing bundle adjustment in the local mapping step.
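As a rough illustration of the two matching functions, the sketch below matches ORB features between a sample image and a shifted copy of it. The choice of MatchThreshold and MaxRatio as the arguments to relax is an assumption of this example, as are the specific values, the 5-pixel radius, and the use of the cameraman.tif sample image. The known shift stands in for a position estimate that a real pipeline would obtain by projecting map points.

```matlab
I1 = imread("cameraman.tif");          % sample image shipped with Image Processing Toolbox
shift = [15 10];                       % known [x y] motion between the two frames, in pixels
I2 = imtranslate(I1, shift);

% Detect and extract ORB features in both frames
[f1, v1] = extractFeatures(I1, detectORBFeatures(I1));
[f2, v2] = extractFeatures(I2, detectORBFeatures(I2));

% Looser thresholds admit more putative matches (and more outliers)
strictPairs = matchFeatures(f1, f2, MatchThreshold=10, MaxRatio=0.6);
loosePairs  = matchFeatures(f1, f2, MatchThreshold=50, MaxRatio=0.9);

% With a position estimate available, restrict the search to a radius around it
expectedLoc = v1.Location + shift;     % predicted location of each frame-1 feature in frame 2
radius = 5;                            % pixels
radiusPairs = matchFeaturesInRadius(f1, f2, v2, expectedLoc, radius, ...
    MatchThreshold=50, MaxRatio=0.9);

fprintf("strict: %d, loose: %d, in radius: %d\n", ...
    size(strictPairs,1), size(loosePairs,1), size(radiusPairs,1));
```

The spatial constraint lets the looser thresholds be used with less risk, because a candidate far from its predicted location is rejected regardless of descriptor similarity.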
Local Mapping
Perform local mapping for every key frame. Follow these steps to create new map points.
|Connect key frames||Find the covisible key frames of the current key frame.|
|Search for matches in connected key frames||For each unmatched feature point in the current key frame, search for a match among the unmatched points in the covisible key frames.|
|Compute location for new matches||Compute the 3-D locations of the matched feature points.|
|Store new map points||Add the new map points to the world point set.|
|Store 3-D to 2-D correspondences||Add new 3-D to 2-D correspondences to the world point set.|
|Update odometry connection||Update the connection between the current key frame and its covisible frames with more feature matches.|
|Store representative view of 3-D points||Update representative view ID and corresponding feature index.|
|Store distance limits and viewing direction of 3-D points||Update distance limits and mean viewing direction.|
Refine the pose of the current key frame, the poses of covisible key frames, and all the map points observed in these key frames. For improved performance, only include strongly connected, covisible key frames in the refinement process.
|Remove outliers||Remove outlier map points with large reprojection errors from the world point set.|
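The outlier removal step can be sketched as a reprojection check against a single key frame. This is a simplified illustration, assuming Computer Vision Toolbox R2022b or later: the map points, the simulated bad observation, and the 4-pixel threshold maxReprojErr are hypothetical, and a full pipeline would evaluate the error across all key frames that observe each point.

```matlab
intrinsics   = cameraIntrinsics([800 800], [320 240], [480 640]);
keyFramePose = rigidtform3d;                      % hypothetical key frame at the world origin

% Hypothetical map points and the image locations where the key frame observed them
xyzPoints = [-1 0.5 8; 0 0 10; 1 -0.5 9; 0.5 0.5 12];
observed  = world2img(xyzPoints, pose2extr(keyFramePose), intrinsics);
observed(3,:) = observed(3,:) + [25 -30];         % simulate one bad correspondence

wpSet = worldpointset;
[wpSet, pointIdx] = addWorldPoints(wpSet, xyzPoints);
wpSet = addCorrespondences(wpSet, 1, pointIdx, 1:4);

% Reprojection error of each map point in the key frame
projected = world2img(wpSet.WorldPoints(pointIdx,:), pose2extr(keyFramePose), intrinsics);
reprojErr = vecnorm(projected - observed, 2, 2);

% Drop map points whose error exceeds an illustrative threshold
maxReprojErr = 4;                                 % pixels, assumed value
outlierIdx = pointIdx(reprojErr > maxReprojErr);
wpSet = removeWorldPoints(wpSet, outlierIdx);
```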
This table compares the camera poses, map points, and number of cameras for each of the bundle adjustment functions used in 3-D reconstruction.
|Function||Camera Poses||Map Points||Number of Cameras|
Loop Detection
Because errors accumulate, using visual odometry alone can lead to drift, which can result in severe inaccuracies over long distances. Graph-based SLAM helps to correct the drift. To use it, detect loop closures by recognizing a previously visited location. A common approach is to use this bag-of-features workflow:
|Construct bag of visual words||Construct a bag of visual words for place recognition.|
|Create recognition database||Create a recognition database to map visual words to images.|
|Identify loop closure candidates||Search for images that are similar to the current key frame. Identify consecutive images as loop closure candidates if they are similar to the current frame. Otherwise, add the current key frame to the recognition database.|
|Compute relative camera pose for loop closure candidates||For each loop closure candidate, compute the relative camera pose between the candidate key frame and the current key frame.|
|Close loop||Close the loop by adding a loop closure edge with the relative camera pose to the image view set.|
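The idea behind the recognition database can be shown with a toy inverted index written in base MATLAB. This is a conceptual stand-in, not the toolbox implementation: the vocabulary size, the visual word IDs, and the minScore threshold are all hypothetical, and the check for consecutive similar images described in the table is left out.

```matlab
% Hypothetical visual-word IDs (vocabulary of 8 words) observed in four stored key frames
storedWords = {[1 1 2 3], [3 4 4 5], [5 6 7 7], [1 2 2 3]};
queryWords  = [1 2 3 3];                     % words observed in the current key frame
vocabSize   = 8;

% Word histograms, normalized to unit length so the dot product is a cosine similarity
toHist = @(w) accumarray(w(:), 1, [vocabSize 1]);
unit   = @(h) h / norm(h);
db = cellfun(@(w) unit(toHist(w)), storedWords, UniformOutput=false);
db = [db{:}];                                % vocabSize-by-numImages matrix
q  = unit(toHist(queryWords));

% Inverted index: for each visual word, the stored key frames that contain it
invIndex = arrayfun(@(w) find(db(w,:) > 0), 1:vocabSize, UniformOutput=false);

% Only key frames that share at least one word with the query need to be scored
candidates = unique([invIndex{unique(queryWords)}]);
scores = q.' * db(:, candidates);

% Keep candidates above an illustrative similarity threshold
minScore = 0.75;
loopCandidates = candidates(scores > minScore);
```

In this example, key frame 3 shares no visual words with the query, so the inverted index excludes it before any similarity is computed. Key frames 1 and 4 pass the threshold, and key frame 2 does not.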
Drift Correction
The imageviewset object internally updates the pose graph as views and connections are added. To minimize drift, perform pose graph optimization by using the optimizePoses function once sufficient loop closures are added. The optimizePoses function returns an imageviewset object with the optimized absolute pose transformations for each view.
You can use the createPoseGraph function to return the pose graph as a MATLAB® digraph object. You can use graph algorithms in MATLAB to inspect, view, or modify the pose graph. Use the optimizePoseGraph function from Navigation Toolbox™ to optimize the modified pose graph, and then use the updateView function to update the camera poses in the view set.
Visualization
To develop the visual SLAM system, you can use the following visualization functions.
|Display an image|
|Display matched feature points in two images|
|Plot image view set views and connections|
|Plot a camera in 3-D coordinates|
|Plot 3-D point cloud|
|Visualize streaming 3-D point cloud data|
- What is SLAM?
- Structure from Motion Overview
- Visual Localization in a Parking Lot
- Stereo Visual SLAM for UAV Navigation in 3D Simulation
- Monocular Visual Simultaneous Localization and Mapping
- Stereo Visual Simultaneous Localization and Mapping
- Develop Visual SLAM Algorithm Using Unreal Engine Simulation (Automated Driving Toolbox)