reworked results chpt
@@ -1561,15 +1561,13 @@ Since only per-sample reconstruction losses were retained during pretraining, we
% --- Section: DeepSAD Training Results ---

\newsection{results_deepsad}{DeepSAD Detection Performance}

Due to the challenges of ground truth quality, evaluation results must be interpreted with care. As introduced earlier, we consider two complementary evaluation schemes:
\begin{itemize}
\item \textbf{Experiment-based labels:} An objective way to assign anomaly labels to all frames from degraded runs. However, this also marks many near-normal frames at the start and end of runs as anomalous. Because these labels are known to be partly incorrect, they lower the maximum achievable AP: even a perfect model cannot separate the mislabeled near-normal frames from true anomalies.
\item \textbf{Hand-labeled labels:} A cleaner ground truth, containing only clearly degraded frames. This removes the mislabeled intervals and allows nearly perfect separation. However, it also simplifies the task too much, because borderline cases are excluded entirely.
\end{itemize}
Table~\ref{tab:results_ap} summarizes average precision (AP) across latent dimensions, labeling regimes, and methods. Under experiment-based evaluation, both DeepSAD variants consistently outperform the baselines, reaching AP values of around 0.60–0.66, compared to 0.21 for Isolation Forest and 0.31–0.49 for OC-SVM. Under hand-labeled evaluation, DeepSAD achieves nearly perfect AP in all settings, while the baselines remain much lower. This contrast shows that the lower AP under experiment-based evaluation is not a weakness of DeepSAD itself but a direct result of mislabeled samples in the evaluation data. The hand-labeled scheme therefore confirms that DeepSAD separates clearly normal from clearly degraded frames very well, while also highlighting that label noise must be kept in mind when interpreting the experiment-based results.
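The size of this effect can be illustrated with a simple back-of-the-envelope bound; the notation is introduced here only for illustration and is not used elsewhere in this chapter. If only a fraction $r^{*} = N_{\mathrm{clear}} / N_{\mathrm{anom}}$ of the experiment-based anomaly labels corresponds to clearly degraded frames, an otherwise perfect detector retrieves exactly these frames first and holds a precision of $1$ up to recall $r^{*}$; beyond that point it can only gain recall by descending into frames that look normal, so its average precision is roughly bounded by
\[
\mathrm{AP}_{\max} \;\approx\; r^{*} + (1 - r^{*})\,\bar{p}_{\mathrm{tail}},
\]
where $\bar{p}_{\mathrm{tail}}$ denotes the (low) mean precision over the remaining recall range in which the mislabeled frames are found at near-chance level. This is consistent with the high-precision plateau up to a recall of roughly 0.5 reported for DeepSAD below.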
\begin{table}[t]
\centering
@@ -1614,29 +1612,32 @@ The contrast between the two evaluation schemes indicates, on the one hand, that
\end{tabularx}
\end{table}

The precision--recall curves (Figure~\ref{fig:prc_representative}) illustrate these effects more clearly. For DeepSAD, precision stays close to 1 up to a recall of about 0.5, after which it drops off sharply. This plateau corresponds to the fraction of truly degraded frames in the anomalous set. Once recall moves beyond this point, the evaluation demands that the model also “find” the mislabeled anomalies near the run boundaries. To do so, the decision threshold must be lowered so far that many truly normal frames are also flagged, which causes precision to collapse. The baselines behave differently: OC-SVM shows a smooth but weaker decline without a strong high-precision plateau, while Isolation Forest detects only a few extreme anomalies before collapsing to near-random performance. These operational differences are hidden in a single AP number but are important for judging how the methods would behave in deployment.

Taken together, the two evaluation schemes provide complementary insights. The experiment-based labels offer a noisy but realistic setting that shows how methods cope with ambiguous data, while the hand-labeled labels confirm that DeepSAD can achieve nearly perfect separation when the ground truth is clean. The combination of both evaluations makes clear that (i) DeepSAD is stronger than the baselines under both conditions, (ii) the apparent performance limits under experiment-based labels are mainly due to label noise, and (iii) interpreting the results requires care, since performance drops in the curves often reflect mislabeled samples rather than model failures. At the same time, both schemes remain binary classifications and therefore cannot directly evaluate the central question of whether anomaly scores can serve as a continuous measure of degradation. For this reason, we extend the analysis in Section~\ref{sec:results_inference}, where inference on entire unseen experiments provides a more intuitive demonstration of the methods’ potential for quantifying LiDAR degradation in practice.

\fig{prc_representative}{figures/results_prc.png}{Representative precision–recall curves over all latent dimensionalities for semi-labeling regime 0/0 from experiment-based evaluation labels. DeepSAD maintains a large high-precision operating region before collapsing; OC-SVM declines more smoothly but exhibits high standard deviation between folds; IsoForest collapses quickly and remains flat. DeepSAD's fall-off is at least partly due to known mislabeled evaluation targets.}
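The AP values in Table~\ref{tab:results_ap} and curves of this kind can be computed from per-frame anomaly scores and binary evaluation labels with standard tooling. The snippet below is a minimal sketch using scikit-learn rather than an excerpt of our evaluation code; the file names are illustrative placeholders, and for the hand-labeled scheme the ambiguous frames are assumed to have been removed beforehand.
\begin{verbatim}
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Illustrative inputs: one anomaly score per lidar frame (higher = more
# anomalous) and one binary label per frame for each evaluation scheme.
scores = np.load("anomaly_scores.npy")
labels_experiment = np.load("labels_experiment_based.npy")
labels_hand = np.load("labels_hand_labeled.npy")

for name, labels in [("experiment-based", labels_experiment),
                     ("hand-labeled", labels_hand)]:
    ap = average_precision_score(labels, scores)
    precision, recall, _ = precision_recall_curve(labels, scores)
    print(f"{name}: AP = {ap:.3f}")
\end{verbatim}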
\paragraph{Effect of latent space dimensionality.}
Figure~\ref{fig:latent_dim_ap} shows how average precision changes with latent dimension under the experiment-based evaluation. The best performance is reached with compact latent spaces (32–128), while performance drops as the latent dimension grows. This can be explained by how the latent space controls the separation between normal and anomalous samples. Small bottlenecks act as a form of regularization, keeping the representation compact and making it easier to distinguish clear anomalies from normal frames. Larger latent spaces increase model capacity, but this extra flexibility also allows more overlap between normal frames and the mislabeled anomalies from the evaluation data. As a result, the model struggles more to keep the two groups apart.

This effect is clearly visible in the precision--recall curves. For DeepSAD at all dimensionalities we observe the same high initial precision and the steep drop once the evaluation demands that mislabeled anomalies be included. However, the sharpness of this drop depends on the latent size: at 32 dimensions the fall is comparatively gradual, while at 1024 it is almost vertical. In practice, this means that higher-dimensional latent spaces amplify the label-noise problem and lead to sudden precision collapses once the clear anomalies have been detected. Compact latent spaces are therefore more robust under noisy evaluation conditions and appear to be the safer choice for real-world deployment.

\fig{latent_dim_ap}{figures/results_ap_over_latent.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD shows an inverse correlation between AP and latent dimension.}
%\newsection{results_semi}{Effect of semi-supervised labeling regime}
% Referring back to the results in table~\ref{tab:results_ap} compares AP across labeling regimes (0/0, 50/10, 500/100). Surprisingly, the unsupervised regime (0/0) often performs best; adding labels does not consistently help, likely due to label noise and the scarcity/ambiguity of anomalous labels. Baselines (which do not use labels) are stable across regimes.
%
% \todo[inline]{rework this discussion of semi-supervised labeling and how it affected our results}
\paragraph{Effect of semi-supervised labeling.}
Table~\ref{tab:results_ap} shows that the unsupervised regime \((0/0)\) achieves the best AP, while the lightly supervised regime \((50/10)\) performs worst. With many labels \((500/100)\), performance improves again but remains slightly below the unsupervised case. This pattern also appears under the hand-labeled evaluation, which excludes mislabeled frames. The drop with light supervision therefore cannot be explained by noisy evaluation targets, but must stem from the training process itself.

The precision--recall curves in Figure~\ref{fig:prc_over_semi} show that the overall curve shapes are similar across regimes, but shifted relative to one another in line with the AP ordering \((0/0) > (500/100) > (50/10)\). We attribute these shifts to overfitting: when only a few anomalies are labeled, the model fits them too strongly, and if those examples differ too much from other anomalies, generalization suffers. This explains why lightly supervised training performs even worse than unsupervised training, which avoids this bias.

%\fig{prc_over_semi}{figures/results_prc_over_semi.png}{Precision--recall curves at latent dimension~32 for all three labeling regimes (unsupervised $(0,0)$, lightly supervised $(50,10)$, heavily supervised $(500,100)$), shown separately for the LeNet-inspired (left) and Efficient (right) encoders. Each subplot also includes the baseline methods (Isolation Forest, OC-SVM) for reference. The curves highlight how semi-supervised labels influence DeepSAD: in the lightly labeled regime, LeNet exhibits a gradual precision decay and high variance across folds, whereas Efficient retains a flat high-precision region until the usual sharp drop. With many labels, both architectures return to behavior close to the unsupervised case, although performance remains slightly lower. These plots illustrate that a small amount of supervision can destabilize training, while larger labeled sets reduce this effect without clearly surpassing the unsupervised baseline.}
\fig{prc_over_semi}{figures/results_prc_over_semi.png}{Precision--recall curves at latent dimension~32 for all three labeling regimes (unsupervised, lightly supervised, heavily supervised), shown separately for the LeNet-inspired (left) and Efficient (right) encoders. Baseline methods are included for comparison. Latent dimension~32 is shown as it achieved the best overall AP and is representative of the typical PRC shapes across dimensions.}

The LeNet variant illustrates this effect most clearly, showing unusually high variance across folds in the lightly supervised case. In several folds, precision drops atypically early, which supports the idea that the model has overfit to a poorly chosen subset of labeled anomalies. The Efficient variant is less affected, maintaining more stable precision plateaus, which suggests it is more robust to such overfitting; we observe this consistently for nearly all latent dimensionalities.

With many labels \((500/100)\), the results become more stable again and the PRC curves closely resemble the unsupervised case, only shifted slightly left. A larger and more diverse set of labeled anomalies reduces the risk of unlucky sampling and improves generalization, but it still cannot fully match the unsupervised regime, where no overfitting to a specific labeled subset occurs. The only exception is an outlier at latent dimension 512 for LeNet, where the curve again resembles the lightly supervised case, likely due to label sampling effects amplified by higher latent capacity.

In summary, three consistent patterns emerge: (i) a very small number of labels can hurt performance by causing overfitting to specific examples, (ii) many labels reduce this problem but still do not surpass unsupervised generalization, and (iii) encoder architecture strongly affects robustness, with LeNet being more sensitive to unstable behavior than Efficient.

% --- Section: Inference on Held-Out Experiments ---

\newsection{results_inference}{Inference on Held-Out Experiments}
@@ -1654,36 +1655,13 @@ Among the four approaches, the strongest separation is achieved by DeepSAD (Effi
For Isolation Forest, the anomaly scores are already elevated in the clean experiment, which prevents reliable differentiation between normal and degraded runs and makes the method unsuitable in this context.

It is important to note that the score axes are scaled individually per method, so comparisons should focus on relative separation rather than absolute values.

%Since the raw anomaly scores were highly noisy across all methods, \textbf{exponential moving average (EMA) smoothing} was applied.
%EMA was chosen because it is causal (does not rely on future data), which makes it usable in real-world online detection scenarios.
%Although it introduces a small time delay, this delay is shorter than for other techniques such as running averages.

Because the raw anomaly scores produced by the different methods are on incomparable scales (depending, for example, on network architecture or latent space dimensionality), we first applied a $z$-score normalization. The $z$-score rescales each value by its deviation from the mean relative to the standard deviation, so that outputs from different models can be read as the number of standard deviations they deviate from normal behavior. To allow comparison between the clean and degraded experiments, the mean and standard deviation were estimated exclusively from the clean experiment and then used to normalize the degraded scores as well. This ensures that increases in the degraded runs are interpreted relative to the distribution of the clean baseline, whereas computing separate $z$-scores per experiment would only reveal deviations within each run individually and not enable a meaningful cross-experiment comparison.

The $z$-scores nevertheless remain method-specific: the relative separation between clean and degraded runs can be compared within a method, but absolute scales across methods are not directly comparable, as the differing axis ranges of the subplots indicate. After normalization, the resulting time series were still highly noisy, which motivated exponential moving average (EMA) smoothing. EMA was chosen because it is causal (it does not rely on future data) and thus suitable for real-time inference. Although it introduces a small time delay, this delay is shorter than for other smoothing techniques such as running averages.
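Written out, the two post-processing steps take a simple form; the notation below is generic and introduced only for illustration (in particular, the smoothing factor $\alpha \in (0,1]$ stands for whatever value is chosen in practice, and no specific setting is implied here). With $\mu_{\mathrm{clean}}$ and $\sigma_{\mathrm{clean}}$ estimated from the clean experiment, a raw anomaly score $s_t$ at frame $t$ is normalized and smoothed as
\[
z_t = \frac{s_t - \mu_{\mathrm{clean}}}{\sigma_{\mathrm{clean}}},
\qquad
\tilde{z}_t = \alpha\, z_t + (1 - \alpha)\, \tilde{z}_{t-1},
\qquad
\tilde{z}_0 = z_0 .
\]
The recursion uses only current and past values, which is what makes the EMA causal and hence usable during real-time inference. A minimal code sketch of the same steps, with illustrative array names and an arbitrary smoothing factor, could look as follows:
\begin{verbatim}
import numpy as np

def ema(x, alpha=0.1):
    # Causal exponential moving average: each output depends only on
    # the current and previous values of x.
    out = np.empty(len(x), dtype=float)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1.0 - alpha) * out[t - 1]
    return out

# Illustrative inputs: raw anomaly scores of a clean and a degraded run.
scores_clean = np.load("scores_clean_run.npy")
scores_degraded = np.load("scores_degraded_run.npy")

# Normalization statistics come from the clean run only.
mu, sigma = scores_clean.mean(), scores_clean.std()
z_clean = ema((scores_clean - mu) / sigma)
z_degraded = ema((scores_degraded - mu) / sigma)
\end{verbatim}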
The red method curves can also be compared with the blue and green statistical indicators (missing points and near-sensor particle hits). While some similarities in shape may suggest that the methods partly capture these statistics, such interpretations should be made with caution. The anomaly detection models are expected to have learned additional patterns that are not directly observable from simple statistics, and these may also contribute to their ability to separate degraded from clean data.
% \newchapter{conclusion_future_work}{Conclusion and Future Work}
% \todo[inline]{overall we were successful in showing that AD can be used for degradation quantification but there is quite some future work before usable and especially obstacles in data availability and ground truth came up during thesis which show that more work in that direction is required for progress in this field}

% \newsection{conclusion_data}{Missing Ground-Truth as an Obstacle}

% \todo[inline]{biggest obstacle missing ground-truth, discuss what ground-truth is? we have a missing comprehensive understanding of what we mean when we say degradation so future work would be a better understanding but not only as theoretical which errors can occur but rather also how they can affect users of the data. complex models which not only assume dispersed smoke but also single origin of smoke which creates dense objects that may confuse algorithms. }

% \todo[inline]{we also discussed that objective ground truth would be hard to collect with typical smoke sensors not only due to the aforementioned question what degradation is and if the amount/density/particle size of smoke would even be enough to get a full picture of degradation for such use-cases but also due to the different nature of smoke-sensors which collect data only at the local point in space where the sensor is located and on the other hand lidar sensors which use lasers to collect data about the environment from a distance, resulting in differing information which may not be all that useful as ground truth}

% \todo[inline]{most likely user is SLAM whose mapping will most likely also be used by other decision algorithms, instead of directly using lidar data. so maybe future work could assign degradation based on difference between previously mapped ground truth of 3d model of world and of output from SLAM algorithms, maybe subjective labeling will be necessary especially where single points of origin result in smoke clouds which are seen as solid but morphing objects by slam algorithms.}

% \todo[inline]{since quantification is goal the binary kind of ground-truth is also a bit lacking in that it allows us to show classification performance of methods and subjective judgement of inference shows that it very well might work not only as classifier but as regression/quantification for in-between cases, but nonetheless analog evaluation targets would strongly improve confidence in quantification performance instead of only classification performance.}

% \newsection{conclusion_ad}{Insights into DeepSAD and AD for Degradation Quantification}

% \todo[inline]{we've shown that deepsad principle works and is usable for this use-case even when looking at feasibility due to runtimes. we've also shown that generally AD may be used although more complex algorithms such as DeepSAD outperform simple methods like isoforest. interestingly in our case the semi-supervised nature of deepsad could not show improved performance, although due to our noisy evaluation data its hard to interpret the results in a meaningful way for this. we have shown that choosing the correct architecture for the encoder can make quite a difference not only in pre-training performance but also in training of the AD method, although once again the noisy evaluation targets make interpretation of performance hard to say definitively how strongly this affects the outcome. an interesting future work in this case could be to evaluate different architectures for known datasets with good evaluation targets to show how important the choice of encoder architecture really is for DeepSAD and maybe related methods. another interesting avenue of research could be the introduction of per sample weighted semi-supervised targets, if analog ground-truth is available this could allow DeepSAD to better learn to quantify the anomalous nature of a sample instead of simply training it to classify samples.}

% \newsection{conclusion_open_questions}{Open Questions for Degradation Quantification}

% \todo[inline]{possible avenues we know of but did not look into are: difference between two temporally related frames as input to how much degradation there is, sensor fusion with other sensors (ultrasonic sensor may look through dense clouds which lidar cannot penetrate), tested architecture of DeepSAD use full 360 degree pointcloud as input which could be problematic if degradation occurs only in part of pointcloud (specific direction). possible solution is smaller input window (segment point cloud into sections depending on horizontal and or vertical angles) and calculate anomaly score/degradation quantification per section and therefore per direction of the sensor. this was tested in a small capacity and proved quite promising but we left it as future work in this thesis. this is also related to the problem that current solution only works for data from pointclouds with exact same resolution and could be used to enable the technique to work with multiple types of lidar, although unclear if different resolution per angle will work/affect performance of DeepSAD since this was not tested}
\newchapter{conclusion_future_work}{Conclusion and Future Work}
This thesis set out to answer the following research question, formulated in Chapter~\ref{chp:introduction}:

@@ -1704,14 +1682,6 @@ The main contributions of this thesis can be summarized as follows:

\item \textbf{Feasibility study:} An exploration of runtime, temporal inference plots, and downstream applicability, indicating that anomaly scores correlate with degradation trends and could provide a foundation for future quantification methods.
\end{itemize}
% This thesis investigated the feasibility of using anomaly detection (AD) methods, and in particular DeepSAD, for quantifying lidar degradation in subterranean environments for use-cases such as autonomous rescue robots. The main findings can be summarized as follows:
% \begin{itemize}
% \item AD can in principle be applied to the problem of degradation quantification, with DeepSAD clearly outperforming simpler baselines such as Isolation Forest and OC-SVM (cf.~\ref{sec:results_deepsad}).
% \item Evaluation is severely constrained by the lack of reliable ground truth, which limits interpretation of quantitative results and hampers the assessment of degradation quantification even just for binary classification but especially beyond that.
% \item Despite these challenges, our results provide encouraging evidence that anomaly detection scores correlate with degradation trends (cf.~\ref{sec:results_inference}), motivating future work toward more reliable evaluation protocols and more expressive ground truth.
% \end{itemize}
\newsection{conclusion_data}{Missing Ground Truth as an Obstacle}
%The most significant obstacle identified in this work is the absence of robust and comprehensive ground truth for lidar degradation. As discussed in Chapter~\ref{chp:data_preprocessing}, it is not trivial to define what “degradation” precisely means in practice. While intuitive descriptions exist (e.g., dispersed smoke, dense clouds, localized plumes), these translate poorly into objective evaluation targets. Future work should therefore not only refine the conceptual understanding of degradation but also connect it to downstream users of the data (e.g., SLAM or mapping algorithms), where errors manifest differently.