hyperparam section & setup rework
@@ -1177,6 +1177,43 @@
|
||||
\verb https://proceedings.mlr.press/v80/ruff18a.html
|
||||
\endverb
|
||||
\endentry
|
||||
\entry{deep_learning_book}{book}{}{}
|
||||
\name{author}{3}{}{%
|
||||
{{hash=5d2585c11210cf1d4512e6e0a03ec315}{%
|
||||
family={Goodfellow},
|
||||
familyi={G\bibinitperiod},
|
||||
given={Ian},
|
||||
giveni={I\bibinitperiod}}}%
|
||||
{{hash=40a8e4774982146adc2688546f54efb2}{%
|
||||
family={Bengio},
|
||||
familyi={B\bibinitperiod},
|
||||
given={Yoshua},
|
||||
giveni={Y\bibinitperiod}}}%
|
||||
{{hash=ccec1ccd2e1aa86960eb2e872c6b7020}{%
|
||||
family={Courville},
|
||||
familyi={C\bibinitperiod},
|
||||
given={Aaron},
|
||||
giveni={A\bibinitperiod}}}%
|
||||
}
|
||||
\list{publisher}{1}{%
|
||||
{MIT Press}%
|
||||
}
|
||||
\strng{namehash}{d17e6557c5836d2d978179999ea1037f}
|
||||
\strng{fullhash}{3ae53fe582e8a815b118d26947eaa326}
|
||||
\strng{fullhashraw}{3ae53fe582e8a815b118d26947eaa326}
|
||||
\strng{bibnamehash}{3ae53fe582e8a815b118d26947eaa326}
|
||||
\strng{authorbibnamehash}{3ae53fe582e8a815b118d26947eaa326}
|
||||
\strng{authornamehash}{d17e6557c5836d2d978179999ea1037f}
|
||||
\strng{authorfullhash}{3ae53fe582e8a815b118d26947eaa326}
|
||||
\strng{authorfullhashraw}{3ae53fe582e8a815b118d26947eaa326}
|
||||
\field{sortinit}{5}
|
||||
\field{sortinithash}{20e9b4b0b173788c5dace24730f47d8c}
|
||||
\field{labelnamesource}{author}
|
||||
\field{labeltitlesource}{title}
|
||||
\field{note}{\url{http://www.deeplearningbook.org}}
|
||||
\field{title}{Deep Learning}
|
||||
\field{year}{2016}
|
||||
\endentry
|
||||
\entry{lidar_errormodel_particles}{inproceedings}{}{}
|
||||
\name{author}{4}{}{%
|
||||
{{hash=5e35bb636f146553847f059f1d9b0112}{%
|
||||
|
||||
BIN
thesis/Main.pdf
Binary file not shown.
@@ -357,7 +357,7 @@ The pre-training results are used in two more key ways. First, the encoder weigh
|
||||
|
||||
In the main training step, DeepSAD's network is trained with SGD backpropagation. The unlabeled training data is used with the goal of minimizing a data-enclosing hypersphere. Since one of the preconditions of training is a significant prevalence of normal data over anomalies in the training set, normal samples collectively cluster tightly around the centroid, while the rarer anomalous samples contribute less to the optimization and therefore remain further from the hypersphere center. The labeled data carries binary class labels marking each sample as either normal or anomalous. Labeled anomalies are pushed away from the center by defining their optimization target as maximizing the distance between them and $\mathbf{c}$. Labeled normal samples are treated similarly to unlabeled samples, with the difference that DeepSAD includes a hyperparameter controlling the proportion with which labeled and unlabeled data contribute to the overall optimization. The resulting network has learned to map normal data samples closer to $\mathbf{c}$ in the latent space and anomalies further away.
|
||||
|
||||
\fig{deepsad_procedure}{diagrams/deepsad_procedure/deepsad_procedure}{(WORK IN PROGRESS) Depiction of DeepSAD's training procedure, including data flows and tweakable hyperparameters.}
|
||||
\fig{deepsad_procedure}{diagrams/deepsad_procedure/deepsad_procedure}{Overview of the DeepSAD workflow. Training starts with unlabeled data and optional labeled samples, which are used to pre-train an autoencoder, compute the hypersphere center, and then perform main training with adjustable weighting of labeled versus unlabeled data. During inference, new samples are encoded and their distance to the hypersphere center is used as an anomaly score, with larger distances indicating stronger anomalies.}
|
||||
|
||||
To infer whether a previously unseen data sample is normal or anomalous, the sample is passed in a forward pass through the fully trained network. During inference, the centroid $\mathbf{c}$ must be known in order to calculate the geometric distance of the sample's latent representation to $\mathbf{c}$. This distance serves as an anomaly score, which correlates with the likelihood of the sample being anomalous. Due to differences in input data type, training success and latent space dimensionality, the magnitude of the anomaly score has to be judged individually for each trained network: a score that signifies normal data for one network may clearly indicate an anomaly for another. Since the geometric distance between two points is a continuous scalar, post-processing of the score is necessary if a binary classification into normal and anomalous is desired.
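To make the scoring step concrete, a minimal sketch is given below, assuming a trained PyTorch-style encoder \texttt{phi} and a known center \texttt{c}; the threshold \texttt{tau} is a hypothetical post-processing choice and not part of DeepSAD itself.

\begin{verbatim}
import torch

@torch.no_grad()
def anomaly_score(phi, x, c):
    """Squared Euclidean distance of the latent representation to c."""
    z = phi(x)                              # forward pass, shape (batch, d)
    return torch.sum((z - c) ** 2, dim=1)   # larger score = more anomalous

def classify(scores, tau):
    """Optional post-processing: threshold the analog score."""
    return scores > tau                     # True = flagged as anomalous
\end{verbatim}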
|
||||
|
||||
@@ -394,15 +394,36 @@ From \ref{eq:deepsvdd_optimization_objective} it is easy to understand Deep SAD'
|
||||
The first term of \ref{eq:deepsad_optimization_objective} stays mostly the same, differing only in that the $m$ introduced labeled data samples are included in its normalization. The second term is newly introduced to incorporate the labeled data samples with strength $\eta$, by either minimizing or maximizing the distance between a sample's latent representation and $\mathbf{c}$ depending on its label $\tilde{y}$. The third term is kept identical to Deep SVDD as standard L2 regularization. It can also be observed that for $m = 0$ labeled samples, Deep SAD falls back to the optimization objective of Deep SVDD and can therefore be used in a completely unsupervised fashion as well.
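For illustration, a compact sketch of how these terms could be evaluated on a mini-batch is shown below, with labels encoded as $\tilde{y} = +1$ (labeled normal), $\tilde{y} = -1$ (labeled anomalous) and $0$ (unlabeled); the constant \texttt{eps} and handling the L2 term via the optimizer's weight decay are implementation assumptions rather than part of the method's definition.

\begin{verbatim}
import torch

def deepsad_batch_loss(z, y, c, eta, eps=1e-6):
    """z: latent representations, y in {+1, -1, 0}, c: hypersphere center."""
    dist = torch.sum((z - c) ** 2, dim=1)       # squared distance to c
    unlabeled = dist                            # first term, per sample
    labeled = eta * (dist + eps) ** y.float()   # second term: pulls labeled
                                                # normals in, pushes labeled
                                                # anomalies away
    loss = torch.where(y == 0, unlabeled, labeled)
    return loss.mean()  # batch-wise stand-in for the 1/(n+m) normalization;
                        # the third (L2) term is usually added as weight decay
\end{verbatim}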
|
||||
|
||||
|
||||
% \newsubsubsectionNoTOC{Hyperparameters}
|
||||
%
|
||||
% The neural network architecture of DeepSAD is not fixed but rather dependent on the data type the algorithm is supposed to operate on. This is due to the way it employs an autoencoder for pre-training and the encoder part of the network for its main training step. This makes the adaption of an autoencoder architecture suitable to the specific application necessary but also allows for flexibility in choosing a fitting architecture depending on the application's requirements. For this reason the specific architecture employed, may be considered a hyperparameter of the Deep SAD algorithm.
|
||||
%
|
||||
% The network architecture also includes the latent space dimensionality, which can be freely chosen depending on the input data's dimensionality and complexity as well as the encoder network's capability of learning the input data's patterns. While the latent space dimensionality arises from the architecture, in most cases it is trivial to adjust its size without having to adapt the architecture too strongly. In addition to that it is also quite an important parameter since choosing an incorrect latent space dimensionality may worsen performance in multiple ways. Too large latent spaces increase the likeliness of overfitting during training and oftentimes harshly increase the number of operations (MACs/parameters) required during training as well as during inference. Latent spces which are chosen too small may result in the inability of the network to encode enough of the input data's patterns into the bottleneck resulting in poor performance of the method. For autoencoders a latent space size is oftentimes chosen using a hyperparameter search to find an elbow in the resulting network's loss although it is unclear if this common wisdom from autoencoders translates to the deepsad training which only re-uses the autoencoder's encoder and its weights to start from a point that is relatively good at encoding the input data into the bottleneck but will then go on to train the network to actually separate the latent represenations of normal and anomalous samples.
|
||||
%
|
||||
% The hyperparameter $\eta$ which is typically 1 or close to it, defines the weight that is given to the labeled samples in the semi-supervised training of DeepSAD in comparison to the unlabeled samples. Setting $\eta > 1$ puts more
|
||||
% emphasis on the labeled data whereas $\eta < 1$ emphasizes the unlabeled data. Its exact influence can be seen in DeepSAD's optimization objective in equation~\ref{eq:deepsad_optimization_objective}.
|
||||
%
|
||||
% The learning rates for the autoencoder $L_A$ and the main training $L_M$ are also hyperparameters to be chosen and if one chooses so they can also use an adaptive learning rate. These are handled like in most neural network trainings.
|
||||
%
|
||||
% The number of epochs to train the autoencoder $E_A$ and the main training $E_M$ can also be freely chosen and depend on compute resources or the implementation of early stopping criteria.
|
||||
|
||||
|
||||
\newsubsubsectionNoTOC{Hyperparameters}
|
||||
|
||||
The neural network architecture of DeepSAD is not fixed but depends on the data type the algorithm is supposed to operate on. This is due to the way it employs an autoencoder for pre-training and the encoder part of that network for its main training step. This makes it necessary to adapt an autoencoder architecture suitable to the specific application, but it also allows flexibility in choosing a fitting architecture depending on the application's requirements. For this reason the specific architecture employed may be considered a hyperparameter of the Deep SAD algorithm.
|
||||
DeepSAD relies on several hyperparameters that influence different stages of the algorithm. The most relevant ones are summarized and discussed below, followed by a brief configuration sketch after the list.
|
||||
|
||||
\todo[inline]{latent space size, talk about auto encoder performance, trying out sensible dimensionalities and find reconstruction elbow, choose smallest possible, but as large as necessary}
|
||||
\todo[inline]{latent space size for AE shows that most likely all of the important data may be captured inside this dim (since reconstruction is possible), but we may only require some of the encoded patterns to differentiate normal from anomaly, so smaller may still be possible? should this be discussed here or not? maybe only discuss AE considerations and then move this discussion to discussion / results}
|
||||
|
||||
\todo[inline]{eta, think of possible important scenarios, learning rate, epochs}
|
||||
\todo[inline]{todo new structure}
|
||||
\begin{itemize}
|
||||
\item \textbf{Network architecture $\mathcal{\phi}$} \\
|
||||
The encoder architecture determines the representational capacity of the model. Because DeepSAD builds on an autoencoder for pre-training, the architecture must be expressive enough to reconstruct the input data during pretraining, yet compact enough to support separation of normal and anomalous samples in the latent space. The choice of architecture is therefore data-dependent: convolutional encoders are commonly used for images, while fully connected or sequence-oriented encoders may be more suitable for other data modalities. The architecture directly constrains which patterns the network can learn and thus strongly shapes the latent space structure.
|
||||
\item \textbf{Latent space dimensionality $\mathbb{R}^d$} \\
|
||||
The size of the latent bottleneck is a critical parameter. If the dimensionality $d$ is chosen too small, the network cannot encode all relevant information, leading to information loss and weak representations. If $d$ is too large, the network risks overfitting by encoding irrelevant detail, while also increasing computational cost. These insights stem from the autoencoder literature \cite{deep_learning_book}, but it is unclear whether they apply directly to DeepSAD: here the autoencoder serves only for pretraining, and the encoder is subsequently fine-tuned with a different objective. Thus, the optimal choice of $d$ may not coincide with the value that would be ideal for autoencoder reconstruction alone.
|
||||
\item \textbf{Label weighting $\eta$} \\
|
||||
The parameter $\eta$ controls the relative contribution of labeled versus unlabeled data in the DeepSAD objective. With $\eta = 1$, every sample contributes with the same weight regardless of whether it is labeled. Larger values of $\eta$ emphasize the labeled data, pulling labeled normals closer to the center and pushing labeled anomalies further away; smaller values emphasize the unlabeled data, effectively reducing the influence of the labels. This hyperparameter is unique to DeepSAD and mediates the transition between unsupervised and semi-supervised training. Its impact depends not only on its numerical value but also on the quantity and quality of the available labels.
|
||||
\item \textbf{Learning rates $L_A$ and $L_M$} \\
|
||||
Two learning rates are defined: $L_A$ for the autoencoder pretraining and $L_M$ for the main DeepSAD training. The learning rate sets the step size of the gradient descent updates and thereby controls the stability and speed of training. If it is too high, the optimization may diverge or oscillate; if it is too low, convergence becomes excessively slow and training may get stuck in poor local minima. Adaptive schemes such as Adam can be applied to mitigate poor choices.
|
||||
\item \textbf{Number of epochs $E_A$ and $E_M$} \\
|
||||
The number of training epochs specifies how many full passes over the dataset are made in pretraining ($E_A$) and in the main DeepSAD training ($E_M$). More epochs allow the model to fit more closely to the training data, but also increase the risk of overfitting to noise or mislabeled samples. In practice, the effective number of epochs depends on dataset size, network architecture, and whether early stopping is applied.
|
||||
\end{itemize}
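The configuration sketch referenced above groups these choices in one hypothetical object; all names and default values are illustrative and do not reflect the reference implementation's interface.

\begin{verbatim}
from dataclasses import dataclass

@dataclass
class DeepSADConfig:
    encoder: str = "lenet_inspired"  # network architecture phi (data-dependent)
    latent_dim: int = 128            # latent space dimensionality d
    eta: float = 1.0                 # weighting of labeled vs. unlabeled data
    lr_ae: float = 1e-5              # L_A: autoencoder pre-training learning rate
    lr_main: float = 1e-5            # L_M: main training learning rate
    epochs_ae: int = 50              # E_A: pre-training epochs
    epochs_main: int = 150           # E_M: main training epochs
\end{verbatim}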
|
||||
|
||||
\newchapter{data_preprocessing}{Data and Preprocessing}
|
||||
|
||||
@@ -761,21 +782,19 @@ We adapted the baseline implementations to our data loader and input format, and
|
||||
|
||||
\newsection{setup_experiments_environment}{Experiment Overview \& Computational Environment}
|
||||
|
||||
Across all experiments we vary three factors: (i) latent space dimensionality, (ii) encoder architecture (LeNet-inspired vs. Efficient), and (iii) the amount of semi-supervision (labeling regime). To keep results comparable, we fix the remaining training hyperparameters: all autoencoders are pretrained for 50~epochs at a learning rate of $1\cdot 10^{-5}$; all DeepSAD models are then trained for 150~epochs at the same learning rate. The DeepSAD label-weighting parameter is kept at $\eta = 1$ for all runs. Every configuration is evaluated with 5-fold cross-validation, and we report fold means.
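The evaluation protocol can be sketched as follows, assuming an array-like \texttt{dataset} and a hypothetical routine \texttt{train\_and\_evaluate} that trains one configuration on the given indices and returns a single metric.

\begin{verbatim}
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(dataset, train_and_evaluate, n_splits=5, seed=0):
    """Evaluate one configuration per fold and return the fold mean."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [train_and_evaluate(train_idx, test_idx)
              for train_idx, test_idx in kf.split(dataset)]
    return float(np.mean(scores))   # fold mean reported per configuration
\end{verbatim}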
|
||||
|
||||
We first search over the latent bottleneck size by pretraining autoencoders only. For both encoder backbones, we evaluate latent sizes $32, 64, 128, 256, 512, 768,$ and $1024$. The goal is to identify compact yet expressive representations before moving to anomaly detection.
|
||||
|
||||
Our experimental setup consisted of two stages. First, we conducted a hyperparameter search over the latent space dimensionality by pretraining the autoencoders alone. For both the LeNet-inspired and the Efficient network, we evaluated latent space sizes of $32, 64, 128, 256, 384, 512, 768,$ and $1024$. Each autoencoder was trained for 50~epochs with a learning rate of $1\cdot 10^{-5}$, and results were averaged across 5-fold cross-validation. The goal of this stage was to identify the ``elbow point'' in reconstruction loss curves, which serves as a practical indicator of a sufficiently expressive, yet compact, representation.
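Such a search could be organized roughly as below, with \texttt{pretrain\_autoencoder} as a hypothetical routine that pretrains one autoencoder and returns its mean validation reconstruction loss; the elbow itself is read off the resulting curve rather than computed automatically.

\begin{verbatim}
def latent_dim_search(pretrain_autoencoder, latent_sizes):
    """Collect mean reconstruction losses per latent size for an elbow plot."""
    losses = {d: pretrain_autoencoder(latent_dim=d, epochs=50, lr=1e-5)
              for d in latent_sizes}
    for d, loss in sorted(losses.items()):
        print(f"d={d:5d}  mean reconstruction loss={loss:.4f}")
    return losses
\end{verbatim}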
|
||||
Using the same latent sizes and backbones, we train full DeepSAD models initialized from the pretrained encoders. This stage tests how representation size and architecture transfer to anomaly detection performance under different levels of supervision.
|
||||
|
||||
Second, we trained the full DeepSAD models on the same latent space sizes in order to investigate how autoencoder performance transfers to anomaly detection performance. Specifically, we aimed to answer whether poor autoencoder reconstructions necessarily imply degraded DeepSAD results, or whether the two stages behave differently. To disentangle these effects, both network architectures (LeNet-inspired and Efficient) were trained with identical configurations, allowing for a direct architectural comparison.
|
||||
|
||||
Furthermore, we examined the effect of semi-supervised labeling on DeepSAD’s performance. As summarized in Table~\ref{tab:labeling_regimes}, three labeling regimes were tested, ranging from fully unsupervised training to progressively larger amounts of supervision:
|
||||
We study three regimes, from unsupervised to strongly supervised (see Table~\ref{tab:labeling_regimes} for proportions within the training folds); a sketch of how an individual regime is instantiated follows the list:
|
||||
\begin{itemize}
|
||||
\item \textbf{Unsupervised:} $(0,0)$ labeled samples of (normal, anomalous) data.
|
||||
\item \textbf{Unsupervised:} $(0,0)$ labeled (normal, anomalous) samples.
|
||||
\item \textbf{Low supervision:} $(50,10)$ labeled samples.
|
||||
\item \textbf{High supervision:} $(500,100)$ labeled samples.
|
||||
\end{itemize}
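How one of these regimes could be instantiated within a training fold is sketched below (hypothetical helper); \texttt{frame\_labels} is assumed to be an array of the experiment-based classes ``normal'', ``anomalous'' and ``unknown'', and only unambiguously anomalous frames are eligible for anomaly labels.

\begin{verbatim}
import numpy as np

def assign_semisupervised_labels(frame_labels, n_normal, n_anomalous, seed=0):
    """Return y in {+1, -1, 0}: labeled normal, labeled anomalous, unlabeled."""
    rng = np.random.default_rng(seed)
    y = np.zeros(len(frame_labels), dtype=int)            # default: unlabeled
    normal_idx = np.flatnonzero(frame_labels == "normal")
    anomaly_idx = np.flatnonzero(frame_labels == "anomalous")
    y[rng.choice(normal_idx, n_normal, replace=False)] = +1
    y[rng.choice(anomaly_idx, n_anomalous, replace=False)] = -1
    return y                        # "unknown" frames always stay unlabeled
\end{verbatim}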
|
||||
|
||||
The percentages reported in Table~\ref{tab:labeling_regimes} are relative to the training folds after 5-fold cross-validation. Here, the classes “normal,” “anomalous,” and “unknown” follow the same definition as in the experiment-based labeling scheme. In particular, the “unknown” category arises because for semi-supervised anomaly labels we only used the manually selected, unambiguous degradation intervals from smoke experiments. Frames outside of these intervals were treated as “unknown” rather than anomalous, so as to prevent mislabeled data from being used during training. This design choice ensured that the inclusion of labeled samples could not inadvertently reduce performance by introducing additional label noise.
|
||||
|
||||
Percentages in Table~\ref{tab:labeling_regimes} are computed relative to the training split of each fold (80\% of the data). The classes “normal,” “anomalous,” and “unknown” follow the experiment-based scheme. Importantly, for semi-supervised labels we \emph{only} use hand-selected, unambiguous smoke intervals; frames outside these intervals in smoke runs are treated as “unknown” (not anomalous) to avoid injecting mislabeled data into training.
|
||||
|
||||
\begin{table}[h]
|
||||
\centering
|
||||
@@ -795,8 +814,7 @@ The percentages reported in Table~\ref{tab:labeling_regimes} are relative to the
|
||||
\label{tab:labeling_regimes}
|
||||
\end{table}
|
||||
|
||||
All models were pre-trained for 50~epochs and then trained for 150~epochs with the same learning rate of $1\cdot 10^{-5}$ and evaluated with 5-fold cross-validation.
|
||||
Table~\ref{tab:exp_grid} summarizes the full experiment matrix.
|
||||
Combining $7$ latent sizes, $2$ architectures, and $3$ labeling regimes yields $7 \times 2 \times 3 = 42$ DeepSAD configurations per evaluation protocol. Table~\ref{tab:exp_grid} summarizes the grid.
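The grid can be enumerated directly; the lists below mirror the values stated above and the names are illustrative.

\begin{verbatim}
from itertools import product

latent_sizes = [32, 64, 128, 256, 512, 768, 1024]      # 7 values
architectures = ["lenet_inspired", "efficient"]        # 2 backbones
label_regimes = [(0, 0), (50, 10), (500, 100)]         # (normal, anomalous)

configs = list(product(latent_sizes, architectures, label_regimes))
assert len(configs) == 7 * 2 * 3 == 42
\end{verbatim}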
|
||||
|
||||
\begin{table}[h]
|
||||
\centering
|
||||
@@ -819,10 +837,6 @@ Table~\ref{tab:exp_grid} summarizes the full experiment matrix.
|
||||
\label{tab:exp_grid}
|
||||
\end{table}
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
All of these experiments were run on a computational environment whose hardware and software stack is summarized in Table~\ref{tab:system_setup}.
|
||||
|
||||
\begin{table}[p]
|
||||
|
||||
@@ -593,6 +593,13 @@
|
||||
month = sep,
|
||||
day = "16",
|
||||
language = "English",
|
||||
},
|
||||
@book{deep_learning_book,
|
||||
title = {Deep Learning},
|
||||
author = {Ian Goodfellow and Yoshua Bengio and Aaron Courville},
|
||||
publisher = {MIT Press},
|
||||
note = {\url{http://www.deeplearningbook.org}},
|
||||
year = {2016},
|
||||
}
|
||||
|
||||
|
||||
|
||||
Binary file not shown.
@@ -40,15 +40,15 @@
|
||||
%\draw[arrow] (latent.east) -- node{} (autoenc.west);
|
||||
|
||||
\begin{pgfonlayer}{foreground}
|
||||
\node[stepsbox, below=of process] (pretrainproc) {Train Autoencoder $\mathcal{W}_{E}$ \\ optimize Autoencoding Objective \\ for $E_A$ Epochs \\ with $L_A$ Learning Rate \\ No Labels Used / Required};
|
||||
\node[outputbox, below=.1 of pretrainproc] (pretrainout) {\boxtitle{Outputs} $\mathcal{W}$: Encoder / DeepSAD Network \\ $\mathbf{w_{E}}$: Encoder Network Weights};
|
||||
\node[stepsbox, below=of process] (pretrainproc) {Train Autoencoder $\mathcal{\phi}_{AE}$ \\ optimize Autoencoding Objective \\ for $E_A$ Epochs \\ with $L_A$ Learning Rate \\ No Labels Used / Required};
|
||||
\node[outputbox, below=.1 of pretrainproc] (pretrainout) {\boxtitle{Outputs} $\mathcal{\phi}$: Encoder / DeepSAD Network \\ $\mathcal{W}_E$: Encoder Network Weights};
|
||||
\end{pgfonlayer}
|
||||
\begin{pgfonlayer}{background}
|
||||
\node[procbox, fit=(pretrainproc) (pretrainout), label={[label distance = 1, name=pretrainlab]above:{\textbf{Pre-Training of Autoencoder}}}] (pretrain) {};
|
||||
\end{pgfonlayer}
|
||||
|
||||
\begin{pgfonlayer}{foreground}
|
||||
\node[hlabelbox, below=1.26 of hyper] (autoencarch) {\boxtitle{Autoencoder Architecture} $\mathcal{W}_{E}$: Autoencoder Network \\ $\mathbb{R}^d$: Latent Space Size };
|
||||
\node[hlabelbox, below=1.26 of hyper] (autoencarch) {\boxtitle{Autoencoder Architecture} $\mathcal{\phi}_{AE}$: Autoencoder Network \\ $\mathbb{R}^d$: Latent Space Size };
|
||||
\node[hlabelbox, below=.1 of autoencarch] (pretrainhyper) {\boxtitle{Hyperparameters} $E_A$: Number of Epochs \\ $L_A$: Learning Rate AE};
|
||||
\end{pgfonlayer}
|
||||
\begin{pgfonlayer}{background}
|
||||
@@ -62,7 +62,7 @@
|
||||
% \draw[arrow] (node cs:name=autoenc,angle=196) |- (node cs:name=pretrain,angle=5);
|
||||
|
||||
\begin{pgfonlayer}{foreground}
|
||||
\node[stepsbox, below=1.4 of pretrain] (calccproc) {Init Network $\mathcal{W}$ with $\mathbf{w_{E}}$ \\ Forward Pass on all data \\ Hypersphere center $\mathbf{c}$ is mean \\ of all Latent Representation};
|
||||
\node[stepsbox, below=1.4 of pretrain] (calccproc) {Init Network $\mathcal{\phi}$ with $\mathcal{W}_E$ \\ Forward Pass on all data \\ Hypersphere center $\mathbf{c}$ is mean \\ of all Latent Representations};
|
||||
\node[outputbox, below=.1 of calccproc] (calccout) {\boxtitle{Outputs} $\mathbf{c}$: Hypersphere Center};
|
||||
\end{pgfonlayer}
|
||||
\begin{pgfonlayer}{background}
|
||||
@@ -77,8 +77,8 @@
|
||||
%\draw[arrow] (node cs:name=traindata,angle=-45) |- node[arrowlabel]{all training data, labels removed} (node cs:name=calcc,angle=200);
|
||||
|
||||
\begin{pgfonlayer}{foreground}
|
||||
\node[stepsbox, below=1.4 of calcc] (maintrainproc) {Init Network $\mathcal{W}$ with $\mathbf{w_{E}}$ \\ Train Network $\mathcal{W}$ \\ optimize DeepSAD Objective\\ for $E_M$ Epochs \\ with $L_M$ Learning Rate \\ Considers Labels with $\eta$ strength};
|
||||
\node[outputbox, below=.1 of maintrainproc] (maintrainout) {\boxtitle{Outputs} $\mathcal{W}$: DeepSAD Network \\ $\mathbf{w}$: DeepSAD Network Weights \\ $\mathbf{c}$: Hypersphere Center};
|
||||
\node[stepsbox, below=1.4 of calcc] (maintrainproc) {Init Network $\mathcal{\phi}$ with $\mathcal{W}_E$ \\ Train Network $\mathcal{\phi}$ \\ optimize DeepSAD Objective\\ for $E_M$ Epochs \\ with $L_M$ Learning Rate \\ Considers Labels with $\eta$ strength};
|
||||
\node[outputbox, below=.1 of maintrainproc] (maintrainout) {\boxtitle{Outputs} $\mathcal{\phi}$: DeepSAD Network \\ $\mathcal{W}$: DeepSAD Network Weights \\ $\mathbf{c}$: Hypersphere Center};
|
||||
\end{pgfonlayer}
|
||||
\begin{pgfonlayer}{background}
|
||||
\node[procbox, fit=(maintrainproc) (maintrainout), label={[label distance = 1, name=maintrainlab]above:{\textbf{Main Training}}}] (maintrain) {};
|
||||
@@ -102,7 +102,7 @@
|
||||
|
||||
|
||||
\begin{pgfonlayer}{foreground}
|
||||
\node[stepsbox, below=1.4 of maintrain] (inferenceproc) {Init Network $\mathcal{W}$ with $\mathbf{w}$ \\Forward Pass on sample = $\mathbf{p}$ \\ Calculate Distance $\mathbf{p} \rightarrow \mathbf{c}$ \\ Distance = Anomaly Score};
|
||||
\node[stepsbox, below=1.4 of maintrain] (inferenceproc) {Init Network $\mathcal{\phi}$ with $\mathcal{W}$ \\Forward Pass on sample = $\mathbf{p}$ \\ Calculate Distance $\mathbf{p} \rightarrow \mathbf{c}$ \\ Distance = Anomaly Score};
|
||||
\node[outputbox, below=.1 of inferenceproc] (inferenceout) {\boxtitle{Outputs} Anomaly Score (Analog Value) \\ Higher for Anomalies};
|
||||
\end{pgfonlayer}
|
||||
\begin{pgfonlayer}{background}
|
||||
|
||||