reworked lenet arch

Jan Kowalczyk
2025-08-18 11:30:47 +02:00
parent 0cd9d4ba1b
commit e1c13be697


@@ -1002,11 +1002,17 @@ Since the neural network architecture trained in the deepsad method is not fixed
\newsubsubsectionNoTOC{Network architectures (LeNet variant, custom encoder) and how they suit the point cloud input}
The LeNet-inspired autoencoder can be split into an encoder network (figure~\ref{fig:setup_arch_lenet_encoder}) and a decoder network (figure~\ref{fig:setup_arch_lenet_decoder}) with a latent space in between the two parts. Such an arrangement is typical for autoencoder architectures, as discussed in section~\ref{sec:autoencoder}. The encoder network is simultaneously DeepSAD's main training architecture, which, once trained, is used to infer the degradation quantification in our use case.
%The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder}) consists of two convolution steps with pooling layers, and finally a dense layer which populates the latent space. \todo[inline]{lenet explanation from chatgpt?} The first convolutional layer uses a 3x3 kernel and outputs 8 channels, which depicts the number of features/structures/patterns the network can learn to extract from the input and results in an output dimensionality of 2048x32x8 which is reduced to 1024x16x8 by a 2x2 pooling layer. \todo[inline]{batch normalization, and something else like softmax or relu blah?} The second convolution reduces the 8 channels to 4 with another 3x3 kernel \todo[inline]{why? explain rationale} and is followed by another 2x2 pooling layer resulting in a 512x8x4 dimensionality, which is then flattened and input into a dense layer. The dense layer's output dimension is the chosen latent space dimensionality, which is as previously mentioned another tuneable hyperparameter.
The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder}) is a compact convolutional neural network that reduces image data into a lower-dimensional latent space. It consists of two stages of convolution, normalization, non-linear activation, and pooling, followed by a dense layer that defines the latent representation. At a high level, the convolutional layers learn small filters that detect visual patterns in the input (such as edges or textures). Batch normalization ensures that these learned signals remain numerically stable during training, while the LeakyReLU activation introduces non-linearity, allowing the network to capture more complex relationships. Pooling operations then downsample the feature maps, which reduces the spatial size of the data and emphasizes the most important features. Finally, a dense layer transforms the extracted feature maps into the latent space, which serves as the data's reduced-dimensionality representation.
Concretely, the first convolutional layer uses a $3\times 3$ kernel with 8 output channels, corresponding to 8 learnable filters. For input images of size $2048\times 32\times 1$, this produces an intermediate representation of shape $2048\times 32\times 8$, which is reduced to $1024\times 16\times 8$ by a $2\times 2$ pooling layer. The second convolutional layer again applies a $3\times 3$ kernel but outputs 4 channels, followed by another pooling step, resulting in a feature map of shape $512\times 8\times 4$. This feature map is flattened and passed into a fully connected layer. The dimensionality of the output of this layer corresponds to the latent space, whose size is a tunable hyperparameter chosen according to the needs of the application.
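To make the layer sequence concrete, the following PyTorch-style sketch outlines one possible implementation of such an encoder. It is an illustration of the dimensions discussed above rather than the exact training code; in particular the padding, bias terms, activation slope, and the default latent dimensionality shown here are assumptions.
\begin{verbatim}
import torch
import torch.nn as nn

class LeNetEncoder(nn.Module):
    """Illustrative sketch of the LeNet-inspired encoder (not the exact code)."""
    def __init__(self, latent_dim: int = 32):  # latent_dim is a tunable hyperparameter
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),  # 1x2048x32 -> 8x2048x32
            nn.BatchNorm2d(8),
            nn.LeakyReLU(),
            nn.MaxPool2d(2),                            # -> 8x1024x16
            nn.Conv2d(8, 4, kernel_size=3, padding=1),  # -> 4x1024x16
            nn.BatchNorm2d(4),
            nn.LeakyReLU(),
            nn.MaxPool2d(2),                            # -> 4x512x8
        )
        self.fc = nn.Linear(4 * 512 * 8, latent_dim)    # flatten -> latent space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, 1, 2048, 32), i.e. one spherical projection image
        return self.fc(self.features(x).flatten(1))
\end{verbatim}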
% Its decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) is a mirrored version of the encoder, with a dense layer after the latent space and two pairs of 2x2 upsampling and transpose convolution layers which use 4 and 8 input channels respectively with the second one reducing its output to one channel resulting in the 2048x32x1 output dimensionality, equal to the input's, which is required for the autoencoding objective to be possible.
The decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) mirrors the encoder and reconstructs the input from its latent representation. A dense layer first expands the latent vector into a feature map of shape $512\times 8\times 4$, which is then upsampled and refined in two successive stages. Each stage consists of an interpolation step that doubles the spatial resolution, followed by a transpose convolution that learns how to add structural detail. The first stage operates on 4 channels, and the second on 8 channels, with the final transpose convolution reducing the output to a single channel. The result is a reconstructed output of size $2048\times 32 \times 1$, matching the original input dimensionality required for the autoencoding objective.
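A matching sketch of the decoder, under the same assumptions as the encoder sketch above, could look as follows.
\begin{verbatim}
class LeNetDecoder(nn.Module):
    """Illustrative sketch of the mirrored decoder (not the exact code)."""
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 4 * 512 * 8)    # latent -> 4x512x8 feature map
        self.stage1 = nn.Sequential(
            nn.Upsample(scale_factor=2),                # 4x512x8 -> 4x1024x16
            nn.ConvTranspose2d(4, 8, kernel_size=3, padding=1),
            nn.BatchNorm2d(8),
            nn.LeakyReLU(),
        )
        self.stage2 = nn.Sequential(
            nn.Upsample(scale_factor=2),                # 8x1024x16 -> 8x2048x32
            nn.ConvTranspose2d(8, 1, kernel_size=3, padding=1),  # -> 1x2048x32
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        x = self.fc(z).view(-1, 4, 512, 8)
        return self.stage2(self.stage1(x))
\end{verbatim}
During pretraining the two parts are chained as an autoencoder, reconstructing the input from the encoder's latent vector, while the subsequent DeepSAD training and inference use only the encoder.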
\todo[inline]{what problems and possible improvements did we find when investigating this architecture}
\todo[inline]{starting point - receptive field, possible loss of information due to narrow RF during convolutions, which motivated us to investigate the impact of an improved arch}
@@ -1021,9 +1027,15 @@ The receptive field of a neural network is a measure of how many input data affe
The receptive field of convolutional neural networks is often given as a single integer $n$, implying that a symmetric receptive field of size $n \times n$ affects each output neuron \todo[inline]{refer to graphic 2}. Formally, the RF can be calculated independently for each input dimension and does not have to be square.
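Concretely, the receptive field can be tracked per axis with the standard recurrence, where $k_l$ and $s_l$ denote the kernel size and stride of layer $l$ along that axis, $j_l$ the cumulative stride after layer $l$, and $r_l$ the receptive field after layer $l$, starting from $r_0 = j_0 = 1$:
\[
r_l = r_{l-1} + (k_l - 1)\, j_{l-1}, \qquad j_l = s_l\, j_{l-1}.
\]
Because kernel sizes and strides may differ between the two image axes, evaluating this recurrence per axis can yield a non-square receptive field.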
For the LeNet-inspired encoder we calculated the RF size to be \todo[inline]{result and calculation here}. The problem we identified stems from the vast difference in resolution per axis in the input data. Due to the working principle of spinning lidar sensors, the point clouds they produce typically have a denser resolution along the x axis than along the y axis. This is the result of a fixed number of vertical channels (typically 32 to 128) rotating about the sensor's vertical axis, with each channel often producing over 1000 measurements across the 360 degree field of view. We can therefore calculate the measurements per degree for both axes, which is tantamount to pixels per degree for our spherical projection images. At A ppd, the horizontal axis offers nearly 10 times the pixels per degree of the vertical axis at B ppd \todo[inline]{show calculation and correct values of ppd}. This discrepancy in resolution per axis means that a square pixel RF becomes disproportionate when viewed in terms of degrees, namely N by M degrees of RF.
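To illustrate how the pixels per degree follow from the sensor and projection parameters, assume that the $2048\times 32$ projection maps the full $360^\circ$ horizontal field of view to its 2048 columns and a vertical field of view of $\mathrm{FOV}_y$ degrees (the exact sensor values are still to be inserted above):
\[
\mathrm{ppd}_x = \frac{2048}{360^\circ} \approx 5.7, \qquad \mathrm{ppd}_y = \frac{32}{\mathrm{FOV}_y}.
\]
For vertical fields of view in the typical range of a few tens of degrees, $\mathrm{ppd}_y$ is therefore roughly an order of magnitude smaller than $\mathrm{ppd}_x$.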
Since our understanding of the kinds of degradation present in lidar data is limited, we want to make sure the network is capable of capturing many types of degradation patterns. To increase the network's chance of learning to identify such patterns, we investigated the impact of an adapted architecture whose receptive field is better matched to this per-axis resolution.
\todo[inline]{start by explaining lenet architecture, encoder and decoder split, encoder network is the one being trained during the main training step, together as autoencoder during pretraining, decoder of lenet pretty much mirrored architecture of encoder, after preprocessing left with image data (2d projections, grayscale = 1 channel) so input is 2048x32x1. convolutional layers with pooling afterwards (2 convolution + pooling) convolutions to multiple channels (8, 4?) each channel capable of capturing a different pattern/structure of input. fully connected layer before latent space, latent space size not fixed since its also a hyperparameter and depended on how well the normal vs anomalous data can be captured and differentiated in the dimensionality of the latent space}
\todo[inline]{batch normalization, relu? something....}