diff --git a/thesis/Main.tex b/thesis/Main.tex index 41d594b..9b76ce3 100755 --- a/thesis/Main.tex +++ b/thesis/Main.tex @@ -70,7 +70,8 @@ %\usepackage[disable]{todonotes} \DeclareRobustCommand{\threadtodo}[4]{% - \todo[disable, + \todo[inline, + %\todo[disable, backgroundcolor=red!20, bordercolor=red!50, textcolor=black!80, @@ -406,11 +407,11 @@ Machine learning based anomaly detection methods can utilize techniques from all {explain basic idea, unfixed architecture, infomax, mention usecases, dimension reduction} {dimensionality reduction, useful for high dim data $\rightarrow$ pointclouds from lidar} -Autoencoders are a type of neural network architecture, whose main goal is learning to encode input data into a representative state, from which the same input can be reconstructed, hence the name. They typically consist of two parts, an encoder and a decoder. The encoder learns to extract the most significant features from the input and convert them into another representation. The reconstruction goal ensures that the most prominent features of the input get retained during the encoding phase, due to the inability of reconstruction without the relevant information. The decoder learns to reconstruct the original input from its encoded representation, by minimizing the error between the original input data and the autoencoder's output. This optimization goal creates uncertainty when categorizing autoencoders as an unsupervised method, although literature commonly defines them as such. While they do not require any labeling of the input data, their optimization target can still calculate the error between the output and the optimal target, which is typically not available for unsupervised methods. For this reason, they are sometimes proposed to be a case of self-supervised learning, a type of machine learning where the data itself can be used to generate a supervisory signal without the need for a domain expert to provide one. +Autoencoders are a type of neural network architecture, whose main goal is learning to encode input data into a representative state, from which the same input can be reconstructed, hence the name. They typically consist of two functions, an encoder and a decoder, with a latent space in between them, as depicted in the toy example in Figure~\ref{fig:autoencoder_general}. The encoder learns to extract the most significant features from the input and to convert them into the input's latent space representation. The reconstruction goal ensures that the most prominent features of the input get retained during the encoding phase, due to the inherent inability to reconstruct the input if too much relevant information is missing. The decoder simultaneously learns to reconstruct the original input from its encoded latent space representation by minimizing the error between the input sample and the autoencoder's output. This optimization goal creates uncertainty when categorizing autoencoders as an unsupervised method, although literature commonly defines them as such. While they do not require any labeling of the input data, their optimization target can still calculate the error between the output and the optimal target, which is typically not available for unsupervised methods. For this reason, they are sometimes proposed to be a case of self-supervised learning, a type of machine learning where the data itself can be used to generate a supervisory signal without the need for a domain expert to provide one.
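+
+To make the encoder--decoder structure and the reconstruction objective more concrete, the following minimal sketch (in PyTorch; the fully connected layers and their sizes are purely illustrative and not the architecture used later in this thesis) shows how the reconstruction error itself provides the training signal:
+
+\begin{verbatim}
+# Minimal autoencoder sketch; layer sizes are placeholders, not the
+# architecture used in this thesis.
+import torch
+import torch.nn as nn
+
+class ToyAutoencoder(nn.Module):
+    def __init__(self, input_dim=784, latent_dim=32):
+        super().__init__()
+        # Encoder: compresses the input into a lower-dimensional latent code.
+        self.encoder = nn.Sequential(
+            nn.Linear(input_dim, 128), nn.ReLU(),
+            nn.Linear(128, latent_dim))
+        # Decoder: reconstructs the input from the latent code.
+        self.decoder = nn.Sequential(
+            nn.Linear(latent_dim, 128), nn.ReLU(),
+            nn.Linear(128, input_dim))
+
+    def forward(self, x):
+        z = self.encoder(x)        # latent representation
+        return self.decoder(z), z  # reconstruction and latent code
+
+model = ToyAutoencoder()
+x = torch.rand(16, 784)                    # a batch of flattened toy inputs
+x_hat, z = model(x)
+loss = nn.functional.mse_loss(x_hat, x)    # reconstruction error as loss
+loss.backward()
+\end{verbatim}
+
+Choosing the latent dimension smaller than the input dimension already hints at the dimensionality-reduction use case discussed below.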
\fig{autoencoder_general}{figures/autoencoder_principle_placeholder.png}{PLACEHOLDER - An illustration of autoencoders' general architecture and reconstruction task.} -\todo[inline, color=green!40]{explain figure} +%\todo[inline, color=green!40]{explain figure} \todo[inline, color=green!40]{Paragraph about Variational Autoencoders? generative models vs discriminative models, enables other common use cases such as generating new data by changing parameterized generative distribution in latent space - VAES are not really relevant, maybe leave them out and just mention them shortly, with the hint that they are important but too much to explain since they are not key knowledge for this thesis} One key use case of autoencoders is to employ them as a dimensionality reduction technique. In that case, the latent space between the encoder and decoder is of a lower dimensionality than the input data itself. Due to the aforementioned reconstruction goal, the shared information between the input data and its latent space representation is maximized, which is known as following the infomax principle. Once trained, such an autoencoder may be used to generate lower-dimensional representations of the given data type, enabling more efficient computations that may have been infeasible on the original data. DeepSAD uses an autoencoder in a pre-training step to achieve this goal, among others. @@ -635,7 +636,7 @@ To ensure our chosen dataset meets the needs of reliable degradation quantificat \begin{enumerate} \item \textbf{Data Modalities:}\\ - The dataset must include LiDAR sensor data, since we decided to train and evaluate our method on what should be the most universally used sensor type in the given domain. To keep our method as generalized as possible, we chose to only require range-based point cloud data and forego sensor-specific data such as intensity or reflectivity, though it may be of interest for future work. It is also desirable to have complementary visual data such as camera images, for better context and manual verification and understanding of the data. + The dataset must include LiDAR sensor data, since we decided to train and evaluate our method on what should be the most universally used sensor type in the given domain. To keep our method as generalized as possible, we chose to only require range-based point cloud data and forego sensor-specific data such as intensity or reflectivity, though it may be of interest for future work. It is also desirable to have complementary visual data such as camera images, for better context, manual verification and understanding of the data. \item \textbf{Context \& Collection Method:}\\ To mirror the real-world conditions of autonomous rescue robots, the data should originate from locations such as subterranean environments (tunnels, caves, collapsed structures), which closely reflect what would be encountered during rescue missions. Ideally, it should be captured from a ground-based, self-driving robot platform in motion instead of aerial, handheld, or stationary collection, to ensure similar circumstances to the target domain.
@@ -820,12 +821,12 @@ Taken together, the percentage of missing points and the proportion of near-sens \newsection{preprocessing}{Preprocessing Steps and Labeling} \threadtodo -{explain preprocessing and rationale behind labeling techniques} +{explain preprocessing and rationale} {raw dataset has been explained, how did we change the data before use} -{projection because easier autoencoder, manual and experiment-based labeling explained} -{method and preprared data known $\rightarrow$ next explain experimental setup} +{projection because easier autoencoder, how does projection work, index-based} +{preprocessing known $\rightarrow$ how did we label the data for use} -\newsubsubsectionNoTOC{Preprocessing} +%\newsubsubsectionNoTOC{Preprocessing} %\todo{describe how 3d lidar data was preprocessed (2d projection), labeling} %\todo[inline]{screenshots of 2d projections?} %As described in section~\ref{sec:algorithm_description} the method we want to evaluate is datatype agnostic and can be adjusted to work with any kind of data. The data from~\cite{subter} that we will train on is a point cloud per scan created by the lidar sensor which contains up to 65536 points with \emph{X}, \emph{Y}, and \emph{Z} coordinates (in meters) per point. To adjust the architecture of Deep SAD to work with a specific datatype, we have to define an autoencoder architecture that works for the given datatype. While autoencoders can be created for any datatype, as~\cite{autoencoder_survey} points out over 60\% of research papers pertaining autoencoders in recent years look at image classification and reconstruction, so we have a better understanding of their architectures for two dimensional images than for three dimensional point clouds. -As described in Section~\ref{sec:algorithm_description}, the method under evaluation is data type agnostic and can be adapted to work with any kind of data. In our case, we train on point clouds from~\cite{subter}, where each scan produced by the LiDAR sensor contains up to 65,536 points, with each point represented by its \emph{X}, \emph{Y}, and \emph{Z} coordinates. To tailor the Deep SAD architecture to this specific data type, we would need to design an autoencoder suitable for processing three-dimensional point clouds. Although autoencoders can be developed for various data types, as noted in~\cite{autoencoder_survey}, over 60\% of recent research on autoencoders focuses on two-dimensional image classification and reconstruction. Consequently, there is a more established understanding of architectures for images compared to those for three-dimensional point clouds. +As described in Section~\ref{sec:algorithm_description}, the method under evaluation is data type agnostic and can be adapted to work with any kind of data by choosing a suitable autoencoder architecture. In our case, the input data are point clouds produced by a LiDAR sensor. Each point cloud contains up to 65,536 points, with each point represented by its \emph{X}, \emph{Y}, and \emph{Z} coordinates. To tailor the Deep SAD architecture to this specific data type, we would need to design an autoencoder suitable for processing three-dimensional point clouds. Although autoencoders can be developed for various data types, \citetitle{autoencoder_survey}~\cite{autoencoder_survey} notes that over 60\% of recent research on autoencoders focuses on two-dimensional image classification and reconstruction.
Consequently, there is a more established understanding of autoencoder architectures for images compared to those for three-dimensional point clouds. %\todo[inline, color=green!40]{to achieve this transformation we used the helpful measurement index and channel present in each measurement point of the dataset which allowed a perfect reconstruction of the 2d projection without calculating the pixel position in the projection of each measurement via angles which in our experience typically leads to some ambiguity in the projection (multiple measurements mapping to the same pixel due to precision loss/other errors) the measurement index increases even for unavailable measurements (no ray return) so we can simply create the 2d projection by mapping the normalized range (FIXME really normalized) value to the pixel position y = channel, x = measurement index. by initalizing the array to NaN values originally we have a 2d data structure with the range values and NaN on pixel positions where originally no measurement took place (missing measurements in scans due to no ray return)} %For this reason we decided to preprocess the point clouds by converting them to two dimensional grayscale images using spherical projection. Additionally, \cite{degradation_quantification_rain}-which we modeled our approach after-successfully chose this approach. In the projected image each measurement is encoded to a single pixel, whose grayscale value $v$ is the normalized range of the measurement $v = \sqrt{\emph{X}^2 + \emph{Y}^2 + \emph{Z}^2}$. Due to the settings of the datasets' lidar, this results in images with the resolution of 2048 pixels wide by 32 pixels tall. Missing measurements of the point cloud are mapped to pixels with a brightness of 0. To create the mapping we used the measurements indices and channels which are available since the dataset contains dense point clouds and which can be used since the point indices are ordered from 0 to 65535 horizontally ascending channel by channel. For point clouds without indices which can be directly mapped, as is often the case for sparse ones, it would be necessary to use the pitch and yaw angles to the sensor origin to map each point to a pixel on the projection. -For this reason and to simplify the architecture, we converted the point clouds into two-dimensional grayscale images using a spherical projection. This approach—also employed successfully in \cite{degradation_quantification_rain}—encodes each LiDAR measurement as a single pixel, where the pixel’s grayscale value is determined by the normalized range, calculated as $v = \sqrt{\emph{X}^2 + \emph{Y}^2 + \emph{Z}^2}$. Given the LiDAR sensor's configuration, the resulting images have a resolution of 2048 pixels in width and 32 pixels in height. Missing measurements in the point cloud are mapped to pixels with a brightness value of 0. +For this reason and to simplify the architecture, we converted the point clouds into two-dimensional grayscale images using a spherical projection. This approach—proven successful in related work~\cite{degradation_quantification_rain}—encodes each LiDAR measurement as a single pixel, where the pixel’s grayscale value is determined by the normalized range, calculated as $v = \frac{1}{\sqrt{\emph{X}^2 + \emph{Y}^2 + \emph{Z}^2}}$. Given the LiDAR sensor's configuration, the resulting images have a resolution of 2048 pixels in width and 32 pixels in height. Missing measurements in the point cloud are mapped to pixels with a brightness value of $v = 0$.
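+
+A minimal sketch of this projection is given below (in NumPy; the array layout and field names are assumptions made for illustration, and the index-based mapping it relies on is explained in the following paragraph):
+
+\begin{verbatim}
+# Sketch of the spherical projection via index/channel mapping.
+# Assumes a dense scan given as an (N, 3) array of X, Y, Z coordinates
+# together with per-point channel and global measurement indices
+# (field names are illustrative, not taken from the dataset).
+import numpy as np
+
+WIDTH, HEIGHT = 2048, 32              # image resolution from the text above
+
+def project_scan(xyz, channel, index):
+    """Map each measurement to one pixel of a 2048x32 range image."""
+    image = np.zeros((HEIGHT, WIDTH), dtype=np.float32)  # 0 = missing return
+    ranges = np.linalg.norm(xyz, axis=1)                  # Euclidean range
+    valid = np.isfinite(ranges) & (ranges > 0)            # skip missing points
+    # Pixel value follows the range encoding given in the text above;
+    # depending on the sensor it may still need clipping to [0, 1].
+    values = 1.0 / ranges[valid]
+    row = channel[valid]            # y = channel
+    col = index[valid] % WIDTH      # x = measurement index within the channel
+    image[row, col] = values
+    return image
+\end{verbatim}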
To create this mapping, we leveraged the available measurement indices and channel information inherent in the dense point clouds, which are ordered from 0 to 65,535 in a horizontally ascending, channel-by-channel manner. For sparser point clouds without such indices, one would need to rely on the pitch and yaw angles relative to the sensor's origin to correctly map each point to its corresponding pixel. @@ -847,28 +848,38 @@ Figure~\ref{fig:data_projections} displays two examples of LiDAR point cloud projections to aid in the reader’s understanding. Although the original point clouds were converted into grayscale images with a resolution of 2048×32 pixels, these raw images can be challenging to interpret. To enhance human readability, we applied the viridis colormap and vertically stretched the images so that each measurement occupies multiple pixels in height. The top projection is derived from a scan without artificial smoke—and therefore minimal degradation—while the lower projection comes from an experiment where artificial smoke introduced significant degradation. -\todo[inline, color=green!40]{add same projections as they are used in training} +\todo[inline, color=green!40]{add same projections as they are used in training? grayscale without vertical scaling} -\fig{data_projections}{figures/data_2d_projections.png}{Two-dimensional projections of two pointclouds, one from an experiment without degradation and one from an experiment with artifical smoke as degradation} +\fig{data_projections}{figures/data_2d_projections.png}{Two-dimensional projections of two point clouds, one from an experiment without degradation and one from an experiment with artificial smoke as degradation. To aid the reader's perception, the images are vertically stretched and a colormap has been applied to the pixels' normalized range values, while the actual training data is grayscale.} -\newsubsubsectionNoTOC{Labeling} + + +%\newsubsubsectionNoTOC{Labeling} + +\threadtodo +{explain labeling techniques and rationales} +{raw dataset has been explained, how did we label the data before use} +{experiment-based labeling, problems, manual labeling for training, both for evaluation} +{method and labeled preprocessed data known $\rightarrow$ explain experimental setup} %\todo[inline, color=green!40]{another important preprocessing step is labeling of the lidar frames as normal/anormal. this is one hand used during training (experiments with zero labeled up to most of the data being labeled) and on the other hand is important for evaluation of the method performance. originally we do not have any labels on the data regarding degradation and no analog values from another sensor which measures current smoke particles in the air. our simple approach was to label all frames from experiments which included artifical degradation by fog machine smoke as anomalous and all frames from experiments without artifical degradation as normal.} %We discussed the requirements to data labels in section~\ref{sec:data}, where we mentioned the challenges but also importance of correctly labeled data, especially for evaluation.
Since to our knowledege no public dataset with objective labels regarding dataset degradation of lidar data in subterranean environments is available and the dataset chosen for evaluation in this thesis \cite{subter} does not contain any explicit data or measurements about the dedata degradation, we had to choose a method of how we would label the data ourselves for evaluation. After considering multiple avenues, we decided to simply label all point clouds created during experiments with artifical smoke present as anomalies and all point clouds from other experiments as normal data. -We discussed the challenges and importance of obtaining correctly labeled data in Section~\ref{sec:data_req}, particularly for evaluation purposes. Since, to our knowledge, no public dataset provides objective labels for LiDAR data degradation in subterranean environments—and the dataset selected for this thesis \cite{subter} lacks explicit measurements of degradation—we had to develop our own labeling approach. After considering several options, we decided to label all point clouds from experiments with artificial smoke as anomalies, while point clouds from experiments without smoke were labeled as normal data. +The remaining challenge was labeling a large enough portion of the dataset in a reasonably accurate manner, the difficulties and general approach of which we described in Section~\ref{sec:data_req}. Since, to our knowledge, neither our chosen dataset nor any other publicly available dataset provides objective labels for LiDAR data degradation in the SAR domain, we had to define our own labeling approach. With objective measures of degradation unavailable, we explored alternative labeling methods—such as using statistical properties like the number of missing measurements per point cloud or the higher incidence of erroneous measurements near the sensor, both of which we described in Section~\ref{sec:data_req}. Ultimately, we were concerned that these statistical approaches might lead the method to simply mimic the statistical evaluation rather than to quantify degradation in a generalized and robust manner. After considering these options, we decided to label all point clouds from experiments with artificial smoke as anomalies, while point clouds from experiments without smoke were labeled as normal data. This labeling strategy—based on the presence or absence of smoke—is fundamentally an environmental indicator, independent of the intrinsic data properties recorded during the experiments. %\todo[inline, color=green!40]{this simple labeling method is quite flawed since we do not label based on the actual degradation of the scan (not by some kind of threshold of analog measurement threshold, statistical info about scan) since (TODO FIXME) this would result in training which only learns this given metric (example missing measurement points) which would make this methodology useless since we could simply use that same measurement as an more simple way to quantify the scan's degradation. } %This simplistic approach has both Advantages and disadvantages. The approach is simple to implement and provides a clear and straightforward distinction between normal and anomalous data. As a negative, there are clearly point clouds without subjective degradation present in the experiments with added degradation, which-using this method-get labeled as anomalies even though for actual trainging and evaluation purposes they should not be labeleld as such.
Since we do not have an objective measure available, we looked into other ways to label the data such as statistical data about missing measurements per point cloud or the aforementioned phenomenon of more erroneous measurements up close to the sensor in degraded environments, but we feared that any statistical property of the data or any combination of them would only result in the method learning to replicate those statistical evaluations rather than to actually quantify the degradation in a generalized way. The classification of wether smoke was present during an experiment or not is different here in that it is not dependent on the data but is rather an expression of the environment itself, during the recording of the data. -This simplistic labeling approach has both advantages and disadvantages. On the positive side, it is easy to implement and creates a clear distinction between normal and anomalous data. However, its simplicity is also its drawback: some point clouds from experiments with artificial smoke do not exhibit perceptible degradation, yet they are still labeled as anomalies. In an ideal scenario, these point clouds should not be classified as degraded for training and evaluation purposes. +This simplistic labeling approach has both advantages and disadvantages. On the positive side, it is easy to implement and creates a clear distinction between normal and anomalous data. However, its simplicity is also its drawback: some point clouds from experiments with artificial smoke do not exhibit perceptible degradation, yet they are still labeled as anomalies. The reason for this is that, during the three non-static anomalous experiments, the sensor platform starts recording in a tunnel roughly 20 meters from the smoke machine's location. It then approaches the smoke machine, navigates the room for some time, and then leaves the room, distancing itself from the smoke machine once again. Since the artificial smoke's density is far higher near the machine it originates from, the beginning and end of the anomalous experiments capture point clouds which do not exhibit perceptible degradation. -Since an objective measure of degradation is unavailable, we explored alternative labeling methods—such as using statistical properties like the number of missing measurements per point cloud or the higher incidence of erroneous measurements near the sensor in degraded environments. Ultimately, we were concerned that these statistical approaches might lead the method to simply mimic the statistical evaluation rather than to quantify degradation in a generalized and robust manner. Notably, our labeling strategy—based on the presence or absence of smoke—is fundamentally an environmental indicator, independent of the intrinsic data properties recorded during the experiments. +Concerned that the incorrectly labeled data might negatively impact the training of DeepSAD, we chose to manually remove the anomalous labels from the beginning and end of each anomalous experiment, resulting in \dots. +\todo[inline]{only strongly degraded samples, how many vs earlier, all normal experiments were kept labeled since no smoke was ever present, even though general dust in subt environment may be present in a few samples} +\todo[inline]{this allowed us to safely use the manually labeled anomalies for semi-supervised training, but we found that for evaluation the manually labeled samples were so degraded that multiple methods were able to create a perfect classifier - making evaluation on this not really meaningful.
so we also kept the experiment-based labeling, with the knowledge that there may be a cap on how well any method may perform on this data, since it contains x percent more samples for which it is unclear whether they should be anomalies - some definitely should not } %\todo[inline]{TODO maybe evaluate based on different thresholds? missing datapoints, number of detected outliers, number of particles in phantom circle around sensor?} -\todo[inline]{maybe also mention that we considered labeling using output of down-the-pipeline algorithm (e.g., SLAM) and how it performs/how confident it is and retrospectively label the quality of the data based on that} +%\todo[inline]{maybe also mention that we considered labeling using output of down-the-pipeline algorithm (e.g., SLAM) and how it performs/how confident it is and retrospectively label the quality of the data based on that} \newchapter{experimental_setup}{Experimental Setup} \newsection{autoencoder_architecture}{Deep SAD Autoencoder Architecture}