The receptive field of a convolutional neural network describes the region of the input that influences a single output activation. Its size and aspect ratio determine which structures the network can effectively capture: if the RF is too small, larger patterns cannot be detected, while an excessively large RF may blur fine details. For standard image data, the RF is often expressed as a symmetric $n \times n$ region \todo[inline]{add schematic of square RF}, but in principle it can be computed independently per axis.
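The per-axis computation can be sketched as follows. The two-block conv/pool stack is a hypothetical example, chosen because it reproduces the 16-pixel receptive field attributed to the LeNet-inspired encoder later in this section; it is not read from the actual network definition.

```python
# Per-axis receptive-field (RF) computation, assuming dilation 1.
# Each layer is a (kernel, stride) pair along ONE axis; the vertical and
# horizontal axes can be fed different lists when kernels or strides
# are anisotropic.
def receptive_field(layers):
    rf, jump = 1, 1  # jump = cumulative stride (distance between RF centers)
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Hypothetical stack of two (5x5 conv, 2x2 max-pool) blocks:
print(receptive_field([(5, 1), (2, 2), (5, 1), (2, 2)]))  # 16
```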
\fig{setup_rf_concept}{diagrams/rf_figure}{UNFINISHED - rf concept}
\todo[inline]{RF concept figure}

In the case of spherical LiDAR projections, the input has a highly unbalanced resolution due to the sensor geometry. A fixed number of vertical channels (typically 32--128) sweeps across the horizontal axis, producing thousands of measurements per channel. This results in a pixel-per-degree resolution of approximately $0.99^{\circ}$/pixel vertically and $0.18^{\circ}$/pixel horizontally \todo[inline]{double-check with calculation graphic/table}. Consequently, the LeNet-inspired encoder’s calculated receptive field of $16 \times 16$ pixels translates to an angular size of $15.88^{\circ} \times 2.81^{\circ}$, which is highly rectangular in angular space. Such a mismatch risks limiting the network’s ability to capture degradation patterns that extend differently across the two axes. \todo[inline]{add schematic showing rectangular angular RF overlaid on LiDAR projection}
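These angular figures can be reproduced in a few lines. The horizontal resolution assumes a full $360^{\circ}$ sweep over 2048 columns; using the rounded $0.99^{\circ}$/px vertical value gives $\approx 15.84^{\circ}$, close to the $15.88^{\circ}$ quoted above (which presumably stems from the unrounded channel spacing).

```python
# Angular extent of the 16x16-pixel RF on the spherical projection.
H_COLS = 2048          # horizontal measurements per channel
HFOV = 360.0           # assumed full horizontal sweep, degrees
h_res = HFOV / H_COLS  # ~0.176 deg/px horizontally
v_res = 0.99           # deg/px vertically (rounded value from the text)

rf = 16  # receptive field side length in pixels
print(rf * v_res, rf * h_res)  # ~15.84 deg x ~2.81 deg
```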
\end{itemize}
The resulting output has the same dimensionality as the input ($32 \times 2048 \times 1$), enabling the autoencoding objective.
Even though both encoders were designed for the same input dimensionality of $32 \times 2048$, their computational requirements differ significantly. To quantify this, we compare the number of trainable parameters and the number of multiply–accumulate operations (MACs) for the latent space sizes used in our experiments.

\begin{table}[h]
	\centering
	\caption{Comparison of parameter count and MACs for the SubTer\_LeNet and SubTer\_Efficient encoders across different latent space sizes.}
	\begin{tabular}{c|cc|cc}
		\toprule
		\multirow{2}{*}{Latent dim} & \multicolumn{2}{c|}{SubTer\_LeNet} & \multicolumn{2}{c}{SubTer\_Efficient} \\
		 & Params & MACs & Params & MACs \\
		\midrule
		32   & 8.40M   & 17.41G & 1.17M & 2.54G \\
		64   & 16.38M  & 17.41G & 1.22M & 2.54G \\
		128  & 32.35M  & 17.41G & 1.33M & 2.54G \\
		256  & 64.30M  & 17.41G & 1.55M & 2.54G \\
		512  & 128.19M & 17.41G & 1.99M & 2.54G \\
		768  & 192.07M & 17.41G & 2.43M & 2.54G \\
		1024 & 255.96M & 17.41G & 2.87M & 2.54G \\
		\bottomrule
	\end{tabular}
	\label{tab:lenet_vs_efficient}
\end{table}

\todo[inline]{rework table and calculate with actual scripts and network archs in deepsad codebase}
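As a rough sketch of where such numbers come from, per-layer costs can be tallied with simple closed-form counts. The flattened-feature size below is a hypothetical value inferred from the table's growth rate ($\approx 0.25$M parameters per latent unit), not read from the codebase.

```python
# Closed-form cost counts for the two layer types that dominate the table.
def conv2d_macs(cin, cout, k, h_out, w_out):
    # Standard 2D conv (no bias): one (k*k*cin)-long dot product per output element.
    return cin * k * k * cout * h_out * w_out

def dense_params(fan_in, fan_out, bias=True):
    # Fully connected layer: weight matrix plus optional bias vector.
    return fan_in * fan_out + (fan_out if bias else 0)

# The LeNet column grows by ~0.25M parameters per latent unit, which matches a
# final dense layer fed by a ~250k-element flattened feature map (assumed size):
FLAT = 249_550  # hypothetical flattened-feature size
print(dense_params(FLAT, 1024) / 1e6)  # ~255.5M, i.e. nearly all of the 255.96M total
```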
As the table shows, the Efficient encoder requires between one and two orders of magnitude fewer parameters (depending on the latent dimension) and roughly $7\times$ fewer operations, while maintaining comparable representational capacity. The key reasons are the use of depthwise separable convolutions, aggressive pooling along the densely sampled horizontal axis, and a channel-squeezing step before the fully connected layer. Notably, the Efficient network also processes more intermediate channels (up to 32, compared to only 8 in the LeNet variant), which increases its ability to capture a richer set of patterns despite the reduced computational cost. This combination of efficiency and representational power makes the Efficient encoder the more suitable backbone for our anomaly detection task.
\todo[inline]{mention that as we see in AE results the efficient arch is capable of reproducing inputs better and especially so in lower dimensional latent spaces}

\threadtodo
{how was training/testing adapted (networks overview), inference, ae tuning}