3D Point Cloud processing tutorial by F. Poux
The ultimate guide to subsample 3D point clouds from scratch, with Python. Two efficient methods are shown to import, process, structure as a voxel grid, and visualise LiDAR data.
In this article, I will give you my two favourite 3D processes for quickly structuring and sub-sampling point cloud data with Python. You will also be able to automate, export, visualize and integrate results into your favourite 3D software, without any coding experience. I will focus on code optimization while using a minimum number of libraries (mainly NumPy) so that you can extend what you learnt with very high flexibility! Ready 😁?
Point cloud datasets are marvellous! You can get a geometric description of world entities by discretizing them through a bunch of points, which, aggregated together, resemble the shape — the environment — of interest.
But a major problem with 3D point clouds is that the data density may be more than necessary for a given application. This often leads to higher computational cost in subsequent data processing or visualisation. To make the dense point clouds more manageable, their data density can be reduced. This article provides you with the knowledge and actual scripts to implement sub-sampling methods for reducing point cloud data density.
Let us dive in 🤿!
Ha, I tricked you 🙃. Before diving directly into the implementation of sampling strategies, let us first review the typical sub-sampling methods for point cloud data thinning. These include the random, the minimal distance and the grid (often tagged as uniform) methods. The random method is the simplest for reducing data density: a specified number of data points is selected randomly.
In the minimal distance method, the data point selection is constrained by a minimum distance so that no data point in the selected subset is closer to another data point than the minimum distance specified.
In the grid method (which can be uniform), a grid structure — the handier being a voxel grid structure — is created and a representative data point is selected.
The latter two methods can achieve a more homogeneous spatial distribution of data points in the reduced point cloud. In such cases, the average data spacing is determined by the minimal distance or the voxel edge length specified.
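The minimal distance method described above can be sketched in a few lines of NumPy. Note that this is an illustrative greedy implementation of my own (the function name and the toy data are made up for the example), not the code we will develop later in this tutorial:

```python
import numpy as np

def min_distance_subsample(points, d_min):
    # Greedy sketch: keep a point only if it lies farther than d_min
    # from every point already kept. Simple but O(n * k) — fine for a demo.
    kept = [points[0]]
    for p in points[1:]:
        if np.min(np.linalg.norm(np.asarray(kept) - p, axis=1)) >= d_min:
            kept.append(p)
    return np.asarray(kept)

# Toy example: five points spaced 1 m apart, thinned to a 2 m minimal spacing
pts = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0], [4, 0, 0]])
print(min_distance_subsample(pts, 2.0))  # keeps the points at x = 0, 2 and 4
```

As announced, the average spacing of the result is directly driven by the d_min parameter.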
Okay for the theory, let us put it into action 🤠!
In the previous article below, we saw how to set up an environment easily with Anaconda and how to use the IDE Spyder for managing your code. I recommend continuing in this fashion if you set yourself up to becoming a fully-fledged Python app developer 😆.
But hey, if you prefer to do everything from scratch in the next 5 minutes, I also give you access to a Google Colab notebook that you will find at the end of the article. There is nothing to install; you can just save it to your google drive and start working with it, also using the free datasets from Step 2 👇.
In this tutorial, we will extend the scope, and test on a point cloud obtained through an aerial LiDAR survey. This is an excellent opportunity to introduce you to the great Open Data platform: Open Topography. It is a collaborative data repository for LiDAR users. Through a web map, you can select a region of interest, and download the related point cloud dataset with its metadata in different file formats (.laz, .las or as an ASCII file).
At this phase, what is important to know is that you can easily process both the ASCII file and the .las file with Python (the .laz is trickier). The .las file is far more compressed than the ASCII file (355 MB vs 1026 MB for the example in this guide), but it will necessitate that you use a library called LasPy. So now, if you need 3D point cloud datasets over a large region, you know where you can find great datasets easily 🗺️.
🤓 Note: For this how-to guide, you can use the point cloud in this repository, which I already filtered, colourized and translated so that you are in the optimal conditions. If you want to visualize and play with it beforehand without installing anything, you can check out the webGL version.
Okay, now that we are set up, let us write some code 💻. First, we install the library package that is missing to read .las files. If you are using Anaconda, I suggest you run the following command, looking up the conda-forge channel:
conda install -c conda-forge laspy
Else, in general, you can use the pip package installer for Python by typing:
pip install laspy
Then, let us import the necessary libraries within the script (NumPy and LasPy), and load the .las file in a variable called point_cloud:
import numpy as np
import laspy as lp

input_path="gdrive/My Drive/10-MEDIUM/DATA/Point Cloud Sample/"
dataname="your_file"  # name of your .las file, without the extension
point_cloud = lp.file.File(input_path + dataname + ".las", mode="r")  # with laspy >= 2.0, use lp.read(...) instead
Nice, we are almost ready! What is great is that the LasPy library also gives a structure to the point_cloud variable, and we can use straightforward methods to get, for example, the X, Y, Z, Red, Green and Blue fields. Let us do this to separate coordinates from colours, and put them in NumPy arrays:
points = np.vstack((point_cloud.x, point_cloud.y, point_cloud.z)).transpose()
colors = np.vstack((point_cloud.red, point_cloud.green, point_cloud.blue)).transpose()
🤓 Note: We use a vertical stack method from NumPy, and we have to transpose it to get from a (3 x n) to an (n x 3) matrix of the point cloud.
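To see concretely what the stacking and transposition do to the array shapes, here is a tiny toy run (synthetic coordinates standing in for the point_cloud fields):

```python
import numpy as np

# Three toy coordinate arrays of n = 4 points each (stand-ins for
# point_cloud.x, point_cloud.y and point_cloud.z)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0, 7.0])
z = np.array([8.0, 9.0, 10.0, 11.0])

stacked = np.vstack((x, y, z))      # shape (3, n): one row per coordinate
points = stacked.transpose()        # shape (n, 3): one row per point
print(stacked.shape, points.shape)  # (3, 4) (4, 3)
print(points[0])                    # first point: [0. 4. 8.]
```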
And we are set up! Moving on to step 3 👇.
We will focus on decimation and voxel grid sampling. Now is the time to pick a side 🙂
💡 Hint: I will give you code scripts that actually maximize the use of NumPy, but know that you can achieve similar results with widely different implementations (or through importing other packages). The main difference is often the execution time. The goal is to have the best execution runtime while having a readable script.
Strategy 1: Point Cloud Decimation
If we define a point cloud as an (m x n) matrix, then the decimated cloud is obtained by keeping one row every factor rows of this matrix:
Slicing a list in Python is pretty simple with the command l[start:end:step]. To shorten and parametrize the expression, you can just write the lines:
factor=160
decimated_points = points[::factor]
🤓 Note: Running this will keep 1 row every 160 rows, thus dividing the size of the original point cloud by 160. It goes from 13 993 118 points to 87 457 points.
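You can verify this behaviour on a smaller synthetic array (the numbers below are scaled down from the real dataset for the demo):

```python
import numpy as np

# Synthetic stand-in for the LiDAR array: an (n x 3) matrix of points
points = np.random.rand(139_931, 3)
factor = 160

# Slicing with a step keeps 1 row every `factor` rows, so the resulting
# size is the original size divided by the factor (rounded up)
decimated_points = points[::factor]
print(len(points), "->", len(decimated_points))  # 139931 -> 875
```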
Strategy 2: Point Cloud Grid Sampling
The grid subsampling strategy will be based on the division of the 3D space in regular cubic cells called voxels. For each cell of this grid, we will only keep one representative point. This point, the representant of the cell, can be chosen in different ways. For example, it can be the barycenter of the points in that cell, or the closest point to it.
We will work in two sub-steps.
1. First, we create a grid structure over the points. For this, we actually want to initially compute the bounding box of the point cloud (i.e. the dimensions of the box that encloses all the points). Then, we can discretize the bounding box into small cubic grids: the voxels. These are obtained by setting the length, width and height of the voxel (all equal, since it is a cube), but they could also be set by giving the number of desired voxels in the three directions of the bounding box.
voxel_size=6
nb_vox=np.ceil((np.max(points, axis=0) - np.min(points, axis=0))/voxel_size)
🤓 Note: You can see the little axis=0 that is actually fundamental if you want to be sure you apply the max method “per column”. The difference between the maxima and the minima, divided by the voxel_size, gives the extent of the cloud in voxel units; the element-wise ceil then rounds up so that partially covered cells are counted, returning the number of voxels in each direction. With a cubic size of 6 m, we get 254 voxels along X, 154 voxels along Y and 51 along Z: 1 994 916 (for now empty) voxels.
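Here is the same computation on a toy bounding box, so you can check the per-axis counts by hand (synthetic points, not the article's dataset):

```python
import numpy as np

# Three synthetic points spanning a 25 m x 13 m x 5 m bounding box
points = np.array([[0.0, 0.0, 0.0],
                   [10.0, 4.0, 2.0],
                   [25.0, 13.0, 5.0]])
voxel_size = 6

extents = np.max(points, axis=0) - np.min(points, axis=0)  # [25. 13. 5.]
nb_vox = np.ceil(extents / voxel_size)   # voxels needed along each axis
print(nb_vox)                            # [5. 3. 1.]
print(int(np.prod(nb_vox)))              # 15 voxels in the full grid
```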
2. For each small voxel, we test if it contains one or more points. If it does, we keep it, and we take note of the points indexes that we will have to link to each voxel.
non_empty_voxel_keys, inverse, nb_pts_per_voxel = np.unique(((points - np.min(points, axis=0)) // voxel_size).astype(int), axis=0, return_inverse=True, return_counts=True)
idx_pts_vox_sorted = np.argsort(inverse)
🤓 Note: We want to work with indices rather than coordinates, for simplicity and efficiency. The little script above is a super-compact way to return the “designation” of each non-empty voxel (non_empty_voxel_keys). On top, we want to access the points that are linked to each voxel (idx_pts_vox_sorted), and how many there are (nb_pts_per_voxel). This is done by first looking up unique values based on the integer “indices” gathered for each point. The argsort method then returns the indexes of the points, sorted by voxel, that we can later link to each voxel.
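To make the mechanics of this one-liner concrete, here is a tiny toy run with hand-made voxel indices (not the real data):

```python
import numpy as np

# Toy voxel indices for 5 points: each row is the [i, j, k] cell of a point
vox_idx = np.array([[0, 0, 0],
                    [1, 0, 0],
                    [0, 0, 0],
                    [1, 0, 0],
                    [2, 1, 0]])

keys, inverse, counts = np.unique(vox_idx, axis=0,
                                  return_inverse=True, return_counts=True)
inverse = inverse.ravel()  # ensure a flat index array

print(keys)                 # the 3 non-empty voxels, one [i, j, k] per row
print(inverse)              # for each point, which voxel it falls in
print(counts)               # how many points each voxel contains: [2 2 1]
print(np.argsort(inverse))  # point indexes grouped voxel by voxel
```

Points 0 and 2 share voxel [0, 0, 0], points 1 and 3 share voxel [1, 0, 0], and point 4 sits alone in [2, 1, 0], which is exactly what the counts report.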
3. Finally, we compute the representant of each voxel. I will illustrate this for both the barycenter (grid_barycenter) and the closest point to the barycenter (grid_candidate_center).
💡 Hint: The use of Python dictionaries to keep the points in each voxel is my recommendation. This sparse structure is more adapted than full arrays, which would use all your memory on bigger point clouds. A dictionary cannot take an [i, j, k] vector of coordinates as a key if it is a list, but converting it to a tuple (i, j, k) will make it work.
- We initialise self-explanatory variables, among which a counter that keeps track of how many points have already been assigned to previous voxels.
- We create a loop that will iterate over each non-empty voxel, while allowing us to work with both the index idx of the array and the value vox, which is actually the [i, j, k] designation of the voxel:
for idx,vox in enumerate(non_empty_voxel_keys):
- Then (don’t forget to indent) we feed the loop with a way to complete the voxel_grid dictionary with the points contained in each voxel.
- Still in the loop, you can now pick/compute the representative of the voxel. It can be the barycenter that you append to the list of all barycenters:
- Or it can be the closest point to the barycenter (uses Euclidean distances):
- Finally, don’t forget to update your counter, to make sure the selection in the array of points is correct:
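Putting the bullet points of this third step together, a minimal end-to-end sketch of the voxel loop could look like this (the variable names such as last_seen are my own self-explanatory labels, and the data is synthetic):

```python
import numpy as np

# Synthetic point cloud (stand-in for the LiDAR data) and a coarse grid
points = np.random.rand(1000, 3) * 20.0
voxel_size = 6

non_empty_voxel_keys, inverse, nb_pts_per_voxel = np.unique(
    ((points - np.min(points, axis=0)) // voxel_size).astype(int),
    axis=0, return_inverse=True, return_counts=True)
idx_pts_vox_sorted = np.argsort(inverse.ravel())

voxel_grid = {}
grid_barycenter, grid_candidate_center = [], []
last_seen = 0  # counter: how many points have been assigned so far

for idx, vox in enumerate(non_empty_voxel_keys):
    # gather the points of this voxel (tuple(vox) so it can be a dict key)
    voxel_grid[tuple(vox)] = points[
        idx_pts_vox_sorted[last_seen:last_seen + nb_pts_per_voxel[idx]]]
    # representative 1: the barycenter of the voxel's points
    grid_barycenter.append(np.mean(voxel_grid[tuple(vox)], axis=0))
    # representative 2: the voxel point closest to that barycenter
    grid_candidate_center.append(voxel_grid[tuple(vox)][
        np.linalg.norm(voxel_grid[tuple(vox)]
                       - np.mean(voxel_grid[tuple(vox)], axis=0),
                       axis=1).argmin()])
    # update the counter so the next slice starts at the right point
    last_seen += nb_pts_per_voxel[idx]

print(len(points), "points ->", len(grid_barycenter), "voxel representatives")
```

Each iteration slices the voxel-sorted index array, which is why updating last_seen at the end of the loop body is essential.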
🤓 Note: Most of my M.Sc. students will accomplish the task with a bunch of nested “for” or “while” loops. It does work, but it is not the most efficient. You have to know that Python is not very optimized for loops. Thus, when processing point clouds (which are often massive), you should aim at a minimal number of loops and a maximum amount of “vectorization”. With NumPy, this is done through “broadcasting”, a means of vectorizing array operations so that looping occurs in C instead of Python (more efficient). Take the time to digest what I do in this third step (especially the details of playing with indexes and voxels), or check out the Google Colab script for more in-depth information.
This voxel sampling strategy is usually very efficient, relatively uniform, and useful for downstream processes (but that extends beyond the scope of the current tutorial). However, you should know that while the point spacing can be controlled by the size of the grid, we cannot “accurately” control the number of sampled points.
To simply visualize your results in-line (i.e. within Python), you can use the matplotlib library, with its 3D toolkit (see the previous article for understanding what happens under the hood). Run the following commands, illustrated over the decimated point cloud:
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

decimated_colors = colors[::factor]
ax = plt.axes(projection='3d')
ax.scatter(decimated_points[:,0], decimated_points[:,1], decimated_points[:,2], c = decimated_colors/65535, s=0.01)
plt.show()
🤓 Note: Looking at the number of possible points, I would not recommend in-line visualisation with classical libraries such as matplotlib if your subsampled results exceed the million mark.
In the very likely event your point cloud is too heavy to visualize this way, you can export the data in a file format digestible by your software of choice. To get an ASCII file, you can use the command:
np.savetxt(output_path+dataname+"_voxel-best_point_%s.poux" % (voxel_size), grid_candidate_center, delimiter=";", fmt="%s")
🤓 Note: A “;”-delimited ASCII file is created, ending with .poux 🤭. The “fmt” argument makes sure the writing follows a standard form, here as strings.
💡 Hint: If you also want to make operations to retrieve the colour of the voxel representatives, be careful with the NumPy dtype of the sum of colours. The colour type “uint16” can take values from 0 to 65535, so an addition can overflow. Change to a larger type when summing, and cast back to uint16 (or uint8) after the final division.
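Here is a small illustration of that overflow pitfall on toy colour values (the numbers are made up for the demo):

```python
import numpy as np

# Two toy uint16 colours, each near the top of the 0..65535 range
a = np.array([60000, 60000, 60000], dtype=np.uint16)
b = np.array([50000, 50000, 50000], dtype=np.uint16)

# Wrong: adding two uint16 arrays wraps around past 65535
# (60000 + 50000 = 110000, which becomes 110000 mod 65536 = 44464)
naive_mean = (a + b) // 2

# Right: promote to a wider type before adding, cast back after dividing
safe_mean = ((a.astype(np.uint32) + b) // 2).astype(np.uint16)

print(naive_mean)  # [22232 22232 22232]  <- garbage due to overflow
print(safe_mean)   # [55000 55000 55000]  <- the correct average
```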
Now that you addressed steps 1 to 4, it is time to create functions and put them together in an automated fashion 🤖. Basically, we want (a) to load the data and libraries, (b) set parameter values, (c) declare functions, (d) call them when needed, and (e) return some kind of result. This can be to show in-line the results and/or to export a sampled point cloud file to be used in your 3D software, outside of Python.
You already know how to do a, b and e, so let us focus on c and d 🎯. To create a function, you can just follow the template provided below:
def cloud_decimation(points, factor):
    # YOUR CODE TO EXECUTE
    return decimated_points
🤓 Note: The function created is called cloud_decimation, and eats two arguments, which are points and factor. It will execute the desired code written inside and return the variable decimated_points when it is called. And to call a function, nothing more straightforward: simply write cloud_decimation(point_cloud, 6) (the same way you would use the function print(), but here you have two arguments to fill with the values/variables that you want to pass to the function).
By creating a suite of functions, you can then use your script directly, just changing the set parameters at the beginning.
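As a sketch of what such a suite of functions could look like, here are the two strategies wrapped up (the names and parameters are illustrative; the grid_subsampling variant below uses np.add.reduceat as an alternative, fully vectorized way to compute the barycenters, not the dictionary loop from step 3):

```python
import numpy as np

def cloud_decimation(points, factor):
    # keep 1 point every `factor` points
    return points[::factor]

def grid_subsampling(points, voxel_size):
    # barycenter of each non-empty voxel, computed without a Python loop:
    # sort the points voxel by voxel, then sum each voxel's segment in one
    # vectorized call with np.add.reduceat
    _, inverse, counts = np.unique(
        ((points - points.min(axis=0)) // voxel_size).astype(int),
        axis=0, return_inverse=True, return_counts=True)
    order = np.argsort(inverse.ravel())
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    sums = np.add.reduceat(points[order], starts, axis=0)
    return sums / counts[:, None]

# set parameters, call the functions, and return/inspect the results
points = np.random.rand(5000, 3) * 30.0  # synthetic stand-in data
print(cloud_decimation(points, 160).shape)
print(grid_subsampling(points, voxel_size=6).shape)
```

With this in place, experimenting boils down to changing factor or voxel_size at the top of the script and re-running it.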
The full code is accessible here: Google Colab notebook.
Additionally, you can check out the follow-up article if you want to extend your capabilities using the library Open3D, and learn specific commands related to 3D point clouds and 3D mesh processing.