wip rework setup chpt
@@ -1534,6 +1534,63 @@
\verb https://odds.cs.stonybrook.edu
\endverb
\endentry
\entry{lenet}{article}{}{}
\name{author}{4}{}{%
{{hash=c7a5b066431b979612b716532b228554}{%
family={Lecun},
familyi={L\bibinitperiod},
given={Y.},
giveni={Y\bibinitperiod}}}%
{{hash=bbfb0f3936c83b7b099561e6f0e32ef3}{%
family={Bottou},
familyi={B\bibinitperiod},
given={L.},
giveni={L\bibinitperiod}}}%
{{hash=419350ebbeb4eba5351469f378dee007}{%
family={Bengio},
familyi={B\bibinitperiod},
given={Y.},
giveni={Y\bibinitperiod}}}%
{{hash=00f962380d25c4d7f23fa6c7e926c3ed}{%
family={Haffner},
familyi={H\bibinitperiod},
given={P.},
giveni={P\bibinitperiod}}}%
}
\list{publisher}{2}{%
{Institute of Electrical}%
{Electronics Engineers (IEEE)}%
}
\strng{namehash}{dd2ddc978fe083bcff1aa1379cd19643}
\strng{fullhash}{4dd3ca3cdc8023700c28169734d6ad61}
\strng{fullhashraw}{4dd3ca3cdc8023700c28169734d6ad61}
\strng{bibnamehash}{4dd3ca3cdc8023700c28169734d6ad61}
\strng{authorbibnamehash}{4dd3ca3cdc8023700c28169734d6ad61}
\strng{authornamehash}{dd2ddc978fe083bcff1aa1379cd19643}
\strng{authorfullhash}{4dd3ca3cdc8023700c28169734d6ad61}
\strng{authorfullhashraw}{4dd3ca3cdc8023700c28169734d6ad61}
\field{sortinit}{6}
\field{sortinithash}{b33bc299efb3c36abec520a4c896a66d}
\field{labelnamesource}{author}
\field{labeltitlesource}{title}
\field{issn}{0018-9219}
\field{journaltitle}{Proceedings of the IEEE}
\field{number}{11}
\field{title}{Gradient-based learning applied to document recognition}
\field{volume}{86}
\field{year}{1998}
\field{pages}{2278\bibrangedash 2324}
\range{pages}{47}
\verb{doi}
\verb 10.1109/5.726791
\endverb
\verb{urlraw}
\verb http://dx.doi.org/10.1109/5.726791
\endverb
\verb{url}
\verb http://dx.doi.org/10.1109/5.726791
\endverb
\endentry
\enddatalist
\endrefsection
\endinput
BIN  thesis/Main.pdf: Binary file not shown.
113  thesis/Main.tex
@@ -65,6 +65,8 @@
% \draftcopyName{ENTWURF}{160}

\usepackage{xcolor}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage[colorinlistoftodos]{todonotes}
%\usepackage[disable]{todonotes}
\usepackage{makecell}
@@ -907,11 +909,12 @@ In the following sections, we detail our adaptations to this framework:
\begin{itemize}
\item Data integration: preprocessing and loading the dataset from \citetitle{subter}.
\item Model architecture: configuring DeepSAD’s encoder to match our pointcloud input format, contrasting two distinct neural network architectures to investigate their impact on the method's output.
\item Training \& evaluation: training DeepSAD alongside two classical baselines—Isolation Forest and one-class SVM—and comparing their degradation-quantification performance.
\item Experimental environment: the hardware and software stack used, with typical training and inference runtimes.
\end{itemize}

Together, these components define the full experimental pipeline, from data preprocessing to the evaluation metrics we use to compare methods.

\section{Framework \& Data Preparation}
% Combines: Framework Initialization + Data Integration
@@ -924,9 +927,9 @@ In the following sections, we detail our adaptations to this framework:
{codebase, github, dataloading, training, testing, baselines}
{codebase understood $\rightarrow$ how was it adapted}

DeepSAD's PyTorch implementation includes standardized datasets such as MNIST and CIFAR-10 as well as datasets from \citetitle{odds}~\cite{odds}, together with suitable network architectures for the corresponding datatypes. The framework can train and test DeepSAD and a number of baseline algorithms, namely SSAD, OCSVM, Isolation Forest, KDE and SemiDGM, on the loaded data and evaluate their performance by calculating the area under the ROC curve for each algorithm. We adapted this implementation, originally developed for Python 3.7, to work with Python 3.12, changed the data loading to support our chosen dataset, added DeepSAD models that work with the lidar projection datatype, and added further evaluation methods and an inference module.

\newsubsubsectionNoTOC{SubTER dataset preprocessing, train/test splits, and label strategy}

\threadtodo
{explain how dataloading was adapted}
@@ -937,18 +940,18 @@ The PyTorch implementation of the DeepSAD framework includes the MNIST, Fashion-
%dataset in rosbag format (one bag file per experiment) was preprocessed as mentioned in chapter X by projecting the 3d lidar data (xzy pointcloud) using a spherical projection in a python script and saved as a npy araray of dimensions frames by height by width with value normalized distance (1 over sqrt(distance)) using numpy save method for simplicity while loading and to avoid having to do this preprocessing during each experiment. the projection was done using the meta information in the bag which includes the channel (height/row) and the index which is available since the data is non-sparse/dense, which means that for each possible measurement a data is available in the original rosbag even if the sensor did not record a return ray for this measurement, which means there is no data and it could be left out in a sparse array saving file size. this is very helpful since it allows the direct mapping of all measurements to the spherical projection using channel as the height index and measurement index modulo (measurements / channel) as the width index for each measurement. the reason that this is useful is that otherwise the projection would have to be calculated, meaning the angles between the origin and each point from the point cloud would have to be used to reconstruct the mapping between each measurement and a pixel in the projection. we also tried this method originally which lead to many ambiguities in the mappings were sometimes multiple measurements were erroneously mapped to the same pixel with no clear way to differentiate between which of them was mapped incorrectly. this is most likely due to quantification errors, systematic and sporadic measurement errors and other unforseen problems. for these reasons the index based mapping is a boon to us in this dataset. it should also be mentioned that lidar sensors originally calculate the distance to an object by measuring the time it takes for an emitted ray to return (bg chapter lidar ref) and the point cloud point is only calculated using this data and the known measurement angles. for this reason it is typically possible to configure lidar sensors to provide this original data which is basically the same as the 2d projection directly, without having to calculate it from the pointcloud.
%\todo[inline]{why normalize range?}
The raw SubTER dataset is provided as one ROS bag file per experiment, each containing a dense 3D point cloud from the Ouster OS1-32 LiDAR. To streamline training and avoid repeated heavy computation, we project these point clouds offline into 2D “range images” and save them as NumPy arrays. We apply a spherical projection that maps each LiDAR measurement to a pixel in a 2D image of size Height × Width, where Height = number of vertical channels (32) and Width = measurements per rotation (2048). Instead of computing per-point azimuth and elevation angles at runtime, we exploit the sensor’s metadata:

\begin{itemize}
\item \textbf{Channel index:} directly gives the row (vertical position) of each measurement.
\item \textbf{Measurement index:} by taking the measurement index modulo Width, we obtain the column (horizontal position) in the 360° sweep.
\end{itemize}

This direct use of the measurement index is only possible because the SubTER data is dense: every possible channel × measurement pair appears in the bag, even if the LiDAR did not record a return, so we can perform a direct 1:1 mapping without collisions or missing entries. This avoids the ambiguities we previously encountered when reconstructing the projection via angle computations alone, which sometimes mapped multiple points to the same pixel due to numerical errors in angle estimation.

For each projected pixel, we compute $\mathbf{v}_i = 1/\sqrt{{x_i}^2 + {y_i}^2 + {z_i}^2}$, where $\mathbf{v}_i$ is the reciprocal range value assigned to each pixel in the projection and $x_i$, $y_i$ and $z_i$ are the corresponding measurement's 3D coordinates. This transformation both compresses the dynamic range and emphasizes close-range returns—critical for detecting near-sensor degradation. We then save the resulting tensor of shape (Number of Frames, Height, Width) using NumPy’s save function. Storing precomputed projections allows rapid data loading during training and evaluation.
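To make the index-based projection concrete, the following minimal NumPy sketch builds one reciprocal-range image from dense per-measurement arrays and stacks it into the saved tensor layout. The variable names and the dummy data are illustrative assumptions, not the thesis's actual preprocessing script, and the exact normalization used there may differ in detail.

    import numpy as np

    H, W = 32, 2048  # vertical channels x measurements per rotation (Ouster OS1-32)

    def project_frame(channel_idx, meas_idx, xyz):
        """Map one dense LiDAR frame to an (H, W) reciprocal-range image."""
        rows = channel_idx                 # channel index directly gives the row
        cols = meas_idx % W                # measurement index modulo W gives the column
        rng = np.linalg.norm(xyz, axis=1)  # Euclidean range of each measurement
        img = np.zeros((H, W), dtype=np.float32)
        valid = rng > 0                    # missing returns stay at 0
        img[rows[valid], cols[valid]] = 1.0 / rng[valid]  # reciprocal range v_i
        return img

    # Dummy dense frame: one slot per channel/measurement pair (illustration only).
    n = H * W
    channel_idx = np.repeat(np.arange(H), W)
    meas_idx = np.arange(n)
    xyz = np.random.uniform(-20.0, 20.0, size=(n, 3))

    frames = np.stack([project_frame(channel_idx, meas_idx, xyz)])  # (num_frames, H, W)
    np.save("projection_example.npy", frames)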
Many modern LiDARs can be configured to output range images directly, which would bypass the need for post-hoc projection. When available, such native range-image streams can further simplify preprocessing or even allow skipping this step completely.

\newsubsubsectionNoTOC{Any implementation challenges or custom data loaders}
@@ -968,7 +971,7 @@ The first labeling scheme, called \emph{experiment-based labels}, assigns
\]
At load time, any file with “smoke” in its name is treated as anomalous (label \(-1\)), and all others (normal experiments) are labeled \(+1\).
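A minimal sketch of this experiment-based labeling rule, assuming the preprocessed files keep the experiment name in their file name; the example file names are hypothetical.

    from pathlib import Path

    def experiment_label(npy_path: Path) -> int:
        """Experiment-based label: -1 (anomalous) for smoke experiments, +1 otherwise."""
        return -1 if "smoke" in npy_path.name.lower() else 1

    # Hypothetical file names, for illustration only.
    for name in ["run_corridor_normal.npy", "run_smoke_2.npy"]:
        print(name, experiment_label(Path(name)))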
To obtain a second source of ground truth, we also support \emph{manually-defined labels}. A companion JSON file specifies a start and end frame index for each of the four smoke experiments—defining the interval of unequivocal degradation. During loading, the second label $y_{\mathrm{man}}$ is assigned as follows:

\[
y_{\mathrm{man}} =
@@ -995,28 +998,30 @@ When using semi-supervised mode, we begin with the manually-defined evaluation l
To obtain robust performance estimates on our relatively small dataset, we implement $k$-fold cross-validation. A single integer parameter, \texttt{num\_folds}, controls the number of splits. We use scikit-learn’s \texttt{KFold} (from \texttt{sklearn.model\_selection}) with \texttt{shuffle=True} and a fixed random seed to partition each experiment’s frames into \texttt{num\_folds} disjoint folds. Training then proceeds across $k$ rounds, each time training on $(k-1)/k$ of the data and evaluating on the remaining $1/k$. In our experiments, we set \texttt{num\_folds=5}, yielding an 80/20 train/evaluation split per fold.
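The split itself is a standard scikit-learn call; the sketch below shows the intended usage, with the frame count and the seed value chosen purely for illustration.

    import numpy as np
    from sklearn.model_selection import KFold

    num_folds = 5             # yields an 80/20 train/evaluation split per fold
    seed = 42                 # illustrative fixed seed; the actual value is a config detail

    frames = np.arange(1200)  # stand-in for one experiment's frame indices
    kf = KFold(n_splits=num_folds, shuffle=True, random_state=seed)

    for fold, (train_idx, eval_idx) in enumerate(kf.split(frames)):
        # train on (k-1)/k of the frames, evaluate on the remaining 1/k
        print(f"fold {fold}: {len(train_idx)} train frames, {len(eval_idx)} eval frames")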
For inference (i.e.\ model validation on held-out experiments), we provide a second \texttt{Dataset} class that loads a single experiment's NumPy file (no k-fold splitting), does not assign any labels to the frames, and does not shuffle them, preserving temporal order. This setup enables seamless, frame-by-frame scoring of complete runs—crucial for analyzing degradation dynamics over an entire traversal.
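A minimal sketch of such an inference dataset, assuming the precomputed projections are stored as described above; the class name and tensor layout are illustrative rather than the exact class in our codebase.

    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class InferenceExperimentDataset(Dataset):
        """Loads a single experiment's precomputed projections in temporal order.

        No labels are assigned and no shuffling is performed, so item i of the
        dataset corresponds to frame i of the traversal.
        """

        def __init__(self, npy_path: str):
            self.frames = np.load(npy_path)  # shape: (num_frames, H, W)

        def __len__(self) -> int:
            return len(self.frames)

        def __getitem__(self, idx: int) -> torch.Tensor:
            # add a channel dimension -> (1, H, W), as expected by the encoders
            return torch.from_numpy(self.frames[idx]).unsqueeze(0).float()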
\section{Model Configuration \& Evaluation Protocol}

Since the neural network architecture trained in the DeepSAD method is not fixed, as described in section~\ref{sec:algorithm_details}, but rather chosen based on the input data, we also had to choose an autoencoder architecture befitting our preprocessed lidar data projections. Since \citetitle{degradation_quantification_rain}~\cite{degradation_quantification_rain} reported success in training DeepSAD on similar data, we first adapted the network architecture utilized by them to our use case; it is based on the simple and well-understood LeNet architecture~\cite{lenet}. Additionally, we were interested in evaluating the importance and impact of a well-suited network architecture for DeepSAD's performance and therefore designed a second network architecture, henceforth referred to as the "efficient architecture", which incorporates a few modern techniques befitting our use case.

\newsubsubsectionNoTOC{Network architectures (LeNet variant, custom encoder) and how they suit the point‑cloud input}

\todo[inline]{STANDARDIZE ALL DIMENSIONS TO (CHANNEL, WIDTH, HEIGHT)}

The LeNet-inspired autoencoder can be split into an encoder network (figure~\ref{fig:setup_arch_lenet_encoder}) and a decoder network (figure~\ref{fig:setup_arch_lenet_decoder}) with a latent space in between the two parts. Such an arrangement is typical for autoencoder architectures, as we discussed in section~\ref{sec:autoencoder}. The encoder network is simultaneously DeepSAD's main training architecture, which, once trained, is used to infer the degradation quantification in our use case.
%The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder}) consists of two convolution steps with pooling layers, and finally a dense layer which populates the latent space. \todo[inline]{lenet explanation from chatgpt?} The first convolutional layer uses a 3x3 kernel and outputs 8 channels, which depicts the number of features/structures/patterns the network can learn to extract from the input and results in an output dimensionality of 2048x32x8 which is reduced to 1024x16x8 by a 2x2 pooling layer. \todo[inline]{batch normalization, and something else like softmax or relu blah?} The second convolution reduces the 8 channels to 4 with another 3x3 kernel \todo[inline]{why? explain rationale} and is followed by another 2x2 pooling layer resulting in a 512x8x4 dimensionality, which is then flattened and input into a dense layer. The dense layer's output dimension is the chosen latent space dimensionality, which is as previously mentioned another tuneable hyperparameter.
\fig{setup_arch_lenet_encoder}{diagrams/arch_lenet_encoder}{UNFINISHED - Visualization of the original LeNet-inspired encoder architecture.}

The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder}) is a compact convolutional neural network that reduces image data into a lower-dimensional latent space. It consists of two stages of convolution, normalization, non-linear activation, and pooling, followed by a dense layer that defines the latent representation. Conceptually, the convolutional layers learn small filters that detect visual patterns in the input (such as edges or textures). Batch normalization ensures that these learned signals remain numerically stable during training, while a LeakyReLU activation introduces non-linearity, allowing the network to capture more complex relationships. Pooling operations then downsample the feature maps, which reduces the spatial size of the data and emphasizes the most important features. Finally, a dense layer transforms the extracted feature maps into the latent space, which serves as the data's reduced-dimensionality representation.

Concretely, the first convolutional layer uses a $3\times 3$ kernel with 8 output channels, corresponding to 8 learnable filters. For input images of size $2048\times 32\times 1$, this produces an intermediate representation of shape $2048\times 32\times 8$, which is reduced to $1024\times 16\times 8$ by a $2\times 2$ pooling layer. The second convolutional layer again applies a $3\times 3$ kernel but outputs 4 channels, followed by another pooling step, resulting in a feature map of shape $512\times 8\times 4$. This feature map is flattened and passed into a fully connected layer. The dimensionality of the output of this layer corresponds to the latent space, whose size is a tunable hyperparameter chosen according to the needs of the application.
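The following PyTorch sketch mirrors the layer sequence and shapes described above, with tensors in PyTorch's channel-first layout (so $2048\times 32\times 1$ becomes $1\times 32\times 2048$). Padding, bias settings and the pooling type are assumptions and not the exact thesis implementation.

    import torch
    import torch.nn as nn

    class LeNetLikeEncoder(nn.Module):
        """Sketch of the LeNet-inspired encoder: two conv/BN/LeakyReLU/pool stages
        followed by a dense layer into the latent space."""

        def __init__(self, latent_dim: int = 128):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, padding=1, bias=False),   # -> (8, 32, 2048)
                nn.BatchNorm2d(8),
                nn.LeakyReLU(),
                nn.AvgPool2d(2),                                         # -> (8, 16, 1024)
                nn.Conv2d(8, 4, kernel_size=3, padding=1, bias=False),   # -> (4, 16, 1024)
                nn.BatchNorm2d(4),
                nn.LeakyReLU(),
                nn.AvgPool2d(2),                                         # -> (4, 8, 512)
            )
            self.fc = nn.Linear(4 * 8 * 512, latent_dim)                 # flatten -> latent

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, 1, 32, 2048) spherical projection with one channel
            return self.fc(self.features(x).flatten(start_dim=1))

    z = LeNetLikeEncoder(latent_dim=128)(torch.zeros(1, 1, 32, 2048))
    print(z.shape)  # torch.Size([1, 128])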
% Its decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) is a mirrored version of the encoder, with a dense layer after the latent space and two pairs of 2x2 upsampling and transpose convolution layers which use 4 and 8 input channels respectively with the second one reducing its output to one channel resulting in the 2048x32x1 output dimensionality, equal to the input's, which is required for the autoencoding objective to be possible.
\fig{setup_arch_lenet_decoder}{diagrams/arch_lenet_decoder}{UNFINISHED - Visualization of the original LeNet-inspired decoder architecture.}

The decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) mirrors the encoder and reconstructs the input from its latent representation. A dense layer first expands the latent vector into a feature map of shape $512\times 8\times 4$, which is then upsampled and refined in two successive stages. Each stage consists of an interpolation step that doubles the spatial resolution, followed by a transpose convolution that learns how to add structural detail. The first stage operates on 4 channels, and the second on 8 channels, with the final transpose convolution reducing the output to a single channel. The result is a reconstructed output of size $2048\times 32 \times 1$, matching the original input dimensionality required for the autoencoding objective.
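A matching decoder sketch under the same assumptions; the activation placement and interpolation mode are illustrative.

    import torch
    import torch.nn as nn

    class LeNetLikeDecoder(nn.Module):
        """Sketch of the mirrored decoder: dense layer to a (4, 8, 512) feature map,
        then two [2x upsample -> transpose convolution] stages back to (1, 32, 2048)."""

        def __init__(self, latent_dim: int = 128):
            super().__init__()
            self.fc = nn.Linear(latent_dim, 4 * 8 * 512)
            self.deconv = nn.Sequential(
                nn.Upsample(scale_factor=2),                           # -> (4, 16, 1024)
                nn.ConvTranspose2d(4, 8, kernel_size=3, padding=1),    # -> (8, 16, 1024)
                nn.LeakyReLU(),
                nn.Upsample(scale_factor=2),                           # -> (8, 32, 2048)
                nn.ConvTranspose2d(8, 1, kernel_size=3, padding=1),    # -> (1, 32, 2048)
            )

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            x = self.fc(z).view(-1, 4, 8, 512)  # expand latent vector to a feature map
            return self.deconv(x)

    x_hat = LeNetLikeDecoder(latent_dim=128)(torch.zeros(1, 128))
    print(x_hat.shape)  # torch.Size([1, 1, 32, 2048])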
%\todo[inline]{what problems and possible improvements did we find when investigating this architecture}
@@ -1038,73 +1043,65 @@ The decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) mirrors the
%Since our understanding of the kinds of degradation present in lidar data is limited we want to make sure the network is capable of capturing many types of degradation patterns. To increase the network's chance of learning to identify such patterns we explored a new network architecture possessing a more square RF when viewed in terms of degrees and also included some other improvements.
Even though the LeNet-inspired encoder proved capable of achieving our degradation quantification objective in initial experiments, we identified several shortcomings that motivated the design of a second, more efficient architecture. The most important issue concerns the shape of the CNN's receptive field (RF), which describes the region of the input that influences a single output activation. Its size and aspect ratio determine which structures the network can effectively capture: if the RF is too small, larger patterns cannot be detected, while an excessively large RF may hinder the network from learning to recognize fine details. For standard image data, the RF is often expressed as a symmetric $n \times n$ region, but in principle it can be computed independently per axis.

\todo[inline]{RF concept figure}

The issue with the RF shape arises because spinning multi-beam LiDARs often produce point clouds with dense horizontal but limited vertical resolution. In our case, this results in an angular resolution of approximately $0.99^{\circ}$/pixel vertically and $0.18^{\circ}$/pixel horizontally \todo[inline]{double-check with calculation graphic/table}. Consequently, the LeNet-inspired encoder’s calculated receptive field of $16 \times 16$ pixels translates to an angular size of $15.88^{\circ} \times 2.81^{\circ}$, which is highly rectangular in angular space. Such a mismatch risks limiting the network’s ability to capture degradation patterns that extend differently across the two axes.
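As a plausibility check of the quoted angular sizes, the following short calculation converts pixel receptive fields to degrees, assuming roughly $31.76^{\circ}$ of vertical field of view over 32 channels (an assumption consistent with the 0.99°/pixel figure) and a full 360° sweep over 2048 columns.

    # Worked check of the angular receptive-field sizes quoted above.
    deg_per_px_v = 31.76 / 32     # assumed vertical FOV of ~31.76 deg over 32 channels
    deg_per_px_h = 360.0 / 2048   # full horizontal sweep over 2048 columns

    def angular_rf(rf_v_px: int, rf_h_px: int) -> tuple[float, float]:
        """Convert a receptive field in pixels to degrees (vertical, horizontal)."""
        return rf_v_px * deg_per_px_v, rf_h_px * deg_per_px_h

    print(angular_rf(16, 16))   # LeNet-inspired encoder: ~(15.88, 2.81) degrees
    print(angular_rf(10, 52))   # efficient encoder:      ~(9.93, 9.14) degrees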
\todo[inline]{add schematic showing rectangular angular RF overlaid on LiDAR projection}
%\todo[inline]{start by explaining lenet architecture, encoder and decoder split, encoder network is the one being trained during the main training step, together as autoencoder during pretraining, decoder of lenet pretty much mirrored architecture of encoder, after preprocessing left with image data (2d projections, grayscale = 1 channel) so input is 2048x32x1. convolutional layers with pooling afterwards (2 convolution + pooling) convolutions to multiple channels (8, 4?) each channel capable of capturing a different pattern/structure of input. fully connected layer before latent space, latent space size not fixed since its also a hyperparameter and depended on how well the normal vs anomalous data can be captured and differentiated in the dimensionality of the latent space}
%\todo[inline]{batch normalization, relu? something....}

To adjust for this, we decided to modify the network architecture and included further modifications to improve the method's performance. The encoder (see figure~\ref{fig:setup_arch_ef_encoder}) follows the same general idea as the LeNet-inspired encoder, but incorporates the following modifications (a short sketch of one such building block follows the list):
\begin{itemize}
\item \textbf{Non-square convolution kernels.} Depthwise-separable convolutions with kernel size $3 \times 17$ are used instead of square kernels, resulting in an RF of $10 \times 52$ pixels, corresponding to $9.93^{\circ} \times 9.14^{\circ}$, substantially more balanced than the LeNet-inspired network's RF.
\item \textbf{Circular padding along azimuth.} The horizontal axis is circularly padded to respect the wrap-around of $360^{\circ}$ LiDAR data, preventing artificial seams at the image boundaries.
\item \textbf{Aggressive horizontal pooling.} A $1 \times 4$ pooling operation is applied early in the network, which reduces the over-sampled horizontal resolution (2048~px to 512~px) while keeping vertical detail intact.
\item \textbf{Depthwise-separable convolutions with channel shuffle.} Inspired by MobileNet and ShuffleNet, this reduces the number of parameters and computations while retaining representational capacity, making the network more suitable for embedded platforms, while simultaneously allowing more learnable channels without increasing computational demand.
\item \textbf{Max pooling.} Standard max pooling is used instead of average pooling, since it preserves sharp activations that are often indicative of localized degradation.
\item \textbf{Channel compression before latent mapping.} After feature extraction, a $1 \times 1$ convolution reduces the number of channels before flattening, which lowers the parameter count of the final fully connected layer without sacrificing feature richness.
\end{itemize}
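The sketch below combines several of the listed ideas into one hypothetical encoder stage: circular padding along the azimuth only, a $3 \times 17$ depthwise-separable convolution, channel shuffle, and $1 \times 4$ max pooling. Channel counts and layer ordering are assumptions, not the actual SubTer\_Efficient code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EfficientBlock(nn.Module):
        """Sketch of one efficient-encoder stage."""

        def __init__(self, in_ch: int = 8, out_ch: int = 16, groups: int = 2):
            super().__init__()
            self.groups = groups
            # depthwise 3x17 convolution; width padding is applied manually (circular)
            self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=(3, 17),
                                       padding=(1, 0), groups=in_ch, bias=False)
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
            self.bn = nn.BatchNorm2d(out_ch)
            self.act = nn.LeakyReLU()
            self.pool = nn.MaxPool2d(kernel_size=(1, 4))  # shrink only the horizontal axis

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = F.pad(x, (8, 8, 0, 0), mode="circular")    # wrap-around along azimuth
            x = self.pointwise(self.depthwise(x))          # depthwise-separable convolution
            n, c, h, w = x.shape                           # channel shuffle (ShuffleNet-style)
            x = x.view(n, self.groups, c // self.groups, h, w).transpose(1, 2).reshape(n, c, h, w)
            return self.pool(self.act(self.bn(x)))

    y = EfficientBlock()(torch.zeros(1, 8, 32, 2048))
    print(y.shape)  # torch.Size([1, 16, 32, 512])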
\fig{setup_arch_ef_encoder}{diagrams/arch_ef_encoder}{UNFINISHED - Visualization of the efficient encoder architecture.}

\paragraph{Encoder.}

\paragraph{Decoder.}
The decoder (see figure~\ref{fig:setup_arch_ef_decoder}) mirrors the encoder’s structure but introduces changes to improve reconstruction stability (a sketch of one upsampling stage follows the figure below):
\begin{itemize}
\item \textbf{Nearest-neighbor upsampling followed by convolution.} Instead of relying solely on transposed convolutions, each upsampling stage first enlarges the feature map using parameter-free nearest-neighbor interpolation, followed by a depthwise-separable convolution. This strategy reduces the risk of checkerboard artifacts while still allowing the network to learn fine detail.
\item \textbf{Asymmetric upsampling schedule.} Horizontal resolution is restored more aggressively (e.g., scale factor $1 \times 4$) to reflect the anisotropic downsampling performed in the encoder.
\item \textbf{Final convolution with circular padding.} The output is generated using a $(3 \times 17)$ convolution with circular padding along the azimuth, similar to the encoder, ensuring consistent treatment of the 360° LiDAR input.
\end{itemize}

\fig{setup_arch_ef_decoder}{diagrams/arch_ef_decoder}{UNFINISHED - Visualization of the efficient decoder architecture.}
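Analogously, one decoder upsampling stage could look like the following sketch: nearest-neighbor upsampling restricted to the horizontal axis, followed by a depthwise-separable convolution with circular padding. The channel counts and scale factors per stage are again assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EfficientUpStage(nn.Module):
        """Sketch of one efficient-decoder upsampling stage."""

        def __init__(self, in_ch: int = 16, out_ch: int = 8):
            super().__init__()
            self.up = nn.Upsample(scale_factor=(1, 4), mode="nearest")
            self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=(3, 17),
                                       padding=(1, 0), groups=in_ch, bias=False)
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
            self.act = nn.LeakyReLU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.up(x)                               # restore horizontal resolution
            x = F.pad(x, (8, 8, 0, 0), mode="circular")  # wrap-around along azimuth
            return self.act(self.pointwise(self.depthwise(x)))

    y = EfficientUpStage()(torch.zeros(1, 16, 32, 512))
    print(y.shape)  # torch.Size([1, 8, 32, 2048])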
Even though both encoders were designed for the same input dimensionality of $2048 \times 32$, their computational requirements differ significantly. To quantify this, we compared the number of trainable parameters and the number of multiply–accumulate operations (MACs) for different latent space sizes used in our experiments.

\begin{table}[h]
\centering
\caption{Comparison of parameter count and MACs for SubTer\_LeNet and SubTer\_Efficient encoders across different latent space sizes.}
\begin{tabular}{c|cc|cc}
\toprule
\multirow{2}{*}{Latent dim} & \multicolumn{2}{c|}{SubTer\_LeNet} & \multicolumn{2}{c}{SubTer\_Efficient} \\
& Params & MACs & Params & MACs \\
\midrule
32 & 8.40M & 17.41G & 1.17M & 2.54G \\
64 & 16.38M & 17.41G & 1.22M & 2.54G \\
128 & 32.35M & 17.41G & 1.33M & 2.54G \\
256 & 64.30M & 17.41G & 1.55M & 2.54G \\
512 & 128.19M & 17.41G & 1.99M & 2.54G \\
768 & 192.07M & 17.41G & 2.43M & 2.54G \\
1024 & 255.96M & 17.41G & 2.87M & 2.54G \\
\bottomrule
\end{tabular}
\label{tab:lenet_vs_efficient}
\end{table}

\todo[inline]{rework table and calculate with actual scripts and network archs in deepsad codebase}
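One possible way to produce such counts is the third-party thop package; the sketch below profiles a stand-in module only, since the actual encoder definitions live in the adapted DeepSAD codebase and, as the note above says, the table is still to be recomputed with the real networks.

    import torch
    from thop import profile  # third-party package: pip install thop

    # Placeholder module standing in for an encoder; not one of the thesis networks.
    encoder = torch.nn.Sequential(
        torch.nn.Conv2d(1, 8, kernel_size=3, padding=1),
        torch.nn.AvgPool2d(2),
        torch.nn.Flatten(),
        torch.nn.Linear(8 * 16 * 1024, 128),
    )

    dummy = torch.zeros(1, 1, 32, 2048)              # one spherical projection
    macs, params = profile(encoder, inputs=(dummy,), verbose=False)
    print(f"{params / 1e6:.2f}M parameters, {macs / 1e9:.2f}G MACs")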
As can be seen, the efficient encoder requires an order of magnitude fewer parameters and significantly fewer operations while maintaining a comparable representational capacity. The key reasons are the use of depthwise-separable convolutions, aggressive pooling along the densely sampled horizontal axis, and a channel-squeezing strategy before the fully connected layer. Interestingly, the efficient network also processes more intermediate channels (up to 32 compared to only 8 in the LeNet variant), which increases its ability to capture a richer set of patterns despite the reduced computational cost. This combination of efficiency and representational power makes the efficient encoder a more suitable backbone for our anomaly detection task.

\todo[inline]{mention that as we see in AE results the efficient arch is capable of reproducing inputs better and especially so in lower dimensional latent spaces}
File diff suppressed because it is too large