wip conclusion

Jan Kowalczyk
2025-09-22 14:13:03 +02:00
parent 9ec73c5992
commit f93bbaeec1
2 changed files with 76 additions and 11 deletions


@@ -88,8 +88,8 @@
}
\DeclareRobustCommand{\threadtodo}[4]{%
\todo[inline,
% \todo[disable,
% \todo[inline,
\todo[disable,
backgroundcolor=red!20,
bordercolor=red!50,
textcolor=black!80,
@@ -601,6 +601,7 @@ The neural network architecture of DeepSAD is not fixed but rather dependent on
\todo[inline]{latent space size for AE shows that most likely all of the important data may be captured inside this dim (since reconstruction is possible) but we may only require some of the encoded patterns to differentiate normal from anomaly so smaller may still be possible? should this be discussed here or not? maybe only discuss AE considerations and then move this discussion to discussion / results}
\todo[inline]{eta, think of possible important scenarios, learning rate, epochs}
\todo[inline]{todo new structure}
%\todo[inline, color=green!40]{Core idea of the algorithm is to learn a transformation to map input data into a latent space where normal data clusters close together and anomalous data gets mapped further away. To achieve this the method first includes a pretraining step of an auto-encoder to extract the most relevant information, second it fixes a hypersphere center in the auto-encoder's latent space as a target point for normal data and third it trains the network to map normal data closer to that hypersphere center. Fourth, the resulting network can map new data into this latent space and interpret its distance from the hypersphere center as an anomaly score which is larger the more anomalous the datapoint is}
%\todo[inline, color=green!40]{explanation pre-training step: architecture of the autoencoder is dependent on the input data shape, but any data shape is generally permissible. for the autoencoder we do not need any labels since the optimization target is always the input itself. the latent space dimensionality can be chosen based on the input data's complexity (search citations). generally a higher dimensional latent space has more learning capacity but tends to overfit more easily (find cite). the pre-training step is used to find weights for the encoder which generally extract robust and critical data from the input because TODO read deepsad paper (cite deepsad). as training data typically all data (normal and anomalous) is used during this step.}
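% A minimal sketch of the scoring rule described in the note above (our reading of the DeepSAD formulation; the notation is assumed, not copied from the implementation): after training, a sample $\mathbf{x}$ is scored by its squared distance to the fixed hypersphere center $\mathbf{c}$ in latent space,
% \[ s(\mathbf{x}) = \lVert \phi(\mathbf{x};\mathcal{W}^{*}) - \mathbf{c} \rVert^{2}, \]
% where $\phi(\cdot;\mathcal{W}^{*})$ is the trained encoder and larger $s(\mathbf{x})$ indicates a more anomalous sample.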
@@ -1170,7 +1171,7 @@ To compare the computational efficiency of the two architectures we show the num
1024 & 16.78M & 44.17M & 8.39M & 37.95M & 33.56M & 87.46M & 16.79M & 184.75M \\
\hline
\end{tabularx}
\caption{Comparison of parameter count and MACs for SubTer\_LeNet and SubTer\_Efficient encoders across different latent space sizes.}
\caption{Comparison of parameter count and MACs for DeepSAD LeNet-inspired and DeepSAD Efficient encoders across different latent space sizes.}
\label{tab:params_lenet_vs_efficient}
\end{table}
@@ -1575,7 +1576,6 @@ Due to the challenges of ground truth quality, evaluation results must be interp
\item \textbf{Hand-labeled labels}, which include only clearly degraded frames. These remove the mislabeled intervals and allow nearly perfect classification. While this evaluation is useful to show that performance losses in the experiment-based scheme stem from label noise, it would be uninformative in isolation because the targets become too easily distinguishable.
\end{itemize}
\subsection{Overall results}
Table~\ref{tab:results_ap} gives an overview of average precision (AP) across all latent dimensions, labeling regimes, and methods. Under experiment-based labels, both DeepSAD variants consistently outperform the baselines, achieving AP values around 0.60--0.66 compared to 0.21 for IsoForest and 0.31--0.49 for OC-SVM. This demonstrates that even with noisy evaluation data, DeepSAD provides substantially stronger discriminative ability. Under hand-labeled evaluation, DeepSAD reaches nearly perfect AP across all settings, while the baselines remain much lower (IsoForest around 0.23--0.24, OC-SVM between 0.4 and 0.7).
The contrast between the two evaluation schemes indicates, on the one hand, that the reduced AP in the experiment-based evaluation is largely due to mislabeled or ambiguous samples at the start and end of degraded runs. On the other hand, the perfect classification performance in the hand-labeled evaluation also reflects that only clearly degraded samples remain, meaning that borderline cases were removed entirely. This makes it impossible to assess how DeepSAD handles frames with weak or gradual degradation: the results show that it excels at separating clearly normal from clearly degraded samples, but they do not tell us whether it can reliably classify in-between cases where subjective judgment would otherwise be required. Consequently, both evaluation schemes are informative in complementary ways: experiment-based labels allow relative comparison under noisy, realistic conditions, while hand-labeled labels demonstrate the upper bound of performance when ambiguous samples are excluded.
@@ -1669,12 +1669,12 @@ Representative precision--recall curves illustrate how methods differ in their
\fig{prc_representative}{figures/results_prc.png}{Representative precision--recall curves over all latent dimensionalities for semi-labeling regime 0/0 from experiment-based evaluation labels. DeepSAD maintains a large high-precision operating region before collapsing; OC-SVM declines more smoothly but exhibits high standard deviation between folds; IsoForest collapses quickly and remains flat. DeepSAD's fall-off is at least partly due to known mislabeled evaluation targets.}
\subsection{Effect of latent space dimensionality}
\newsection{results_latent}{Effect of latent space dimensionality}
Figure~\ref{fig:latent_dim_ap} plots AP versus latent dimension under the experiment-based evaluation. DeepSAD benefits from compact latent spaces (e.g., 32--128), with diminishing or negative returns at larger codes. Baseline methods are largely flat across dimensions, reflecting their reliance on fixed embeddings. (Hand-labeled results saturate and are shown in the appendix.)
\fig{latent_dim_ap}{figures/latent_dim_ap.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD benefits from smaller codes; baselines remain flat.}
\subsection{Effect of semi-supervised labeling regime}
\newsection{results_semi_labeling}{Effect of semi-supervised labeling regime}
Figure~\ref{fig:labeling_regime_ap} compares AP across labeling regimes (0/0, 50/10, 500/100). Surprisingly, the unsupervised regime (0/0) often performs best; adding labels does not consistently help, likely due to label noise and the scarcity/ambiguity of anomalous labels. Baselines (which do not use labels) are stable across regimes.
\fig{labeling_regime_ap}{figures/labeling_regime_ap.png}{AP across semi-supervised labeling regimes. Unsupervised training often performs best; added labels do not yield consistent gains under noisy conditions.}
@@ -1713,12 +1713,77 @@ While some similarities in shape may suggest that the methods partly capture the
The anomaly detection models are expected to have learned additional patterns that are not directly observable from simple statistics, and these may also contribute to their ability to separate degraded from clean data.
\newchapter{conclusion_future_work}{Conclusion and Future Work}
\newsection{conclusion}{Conclusion}
\todo[inline]{summarize what has been achieved}
% \newchapter{conclusion_future_work}{Conclusion and Future Work}
% \todo[inline]{overall we were successful in showing that AD can be used for degradation quantification but there is quite some future work before usable and especially obstacles in data availability and ground truth came up during thesis which show that more work in that direction is required for progress in this field}
% \newsection{conclusion_data}{Missing Ground-Truth as an Obstacle}
% \todo[inline]{biggest obstacle missing ground-truth, discuss what ground-truth is? we have a missing comprehensive understanding of what we mean when we say degradation so future work would be a better understanding but not only as theoretical which errors can occur but rather also how they can affect users of the data. complex models which not only assume dispersed smoke but also single origin of smoke which creates dense objects that may confuse algorithms. }
% \todo[inline]{we also discussed that objective ground truth would be hard to collect with typical smoke sensors not only due to the aforementioned question what degradation is and if the amount/density/particle size of smoke would even be enough to get a full picture of degradation for such use-cases but also due to the different nature of smoke-sensors which collect data only at the local point in space where the sensor is located and on the other hand lidar sensors which use lasers to collect data about the environment from a distance, resulting in differing information which may not be all that useful as ground truth}
% \todo[inline]{most likely user is SLAM whose mapping will most likely also be used by other decision algorithms, instead of directly using lidar data. so maybe future work could assign degradation based on difference between previously mapped ground truth of 3d model of world and of output from SLAM algorithms, maybe subjective labeling will be necessary especially where single points of origin result in smoke clouds which are seen as solid but morphing objects by slam algorithms.}
% \todo[inline]{since quantification is goal the binary kind of ground-truth is also a bit lacking in that it allows us to show classification performance of methods and subjective judgement of inference shows that it very well might work not only as classifier but as regression/quantification for in-between cases, but nonetheless analog evaluation targets would strongly improve confidence in quantification performance instead of only classification performance.}
% \newsection{conclusion_ad}{Insights into DeepSAD and AD for Degradation Quantification}
% \todo[inline]{we've shown that deepsad principle works and is usable for this use-case even when looking at feasibility due to runtimes. we've also shown that generally AD may be used although more complex algorithms such as DeepSAD outperform simple methods like isoforest. interestingly in our case the semi-supervised nature of deepsad could not show improved performance, although due to our noisy evaluation data it's hard to interpret the results in a meaningful way for this. we have shown that choosing the correct architecture for the encoder can make quite a difference not only in pre-training performance but also in training of the AD method, although once again the noisy evaluation targets make interpretation of performance hard to say definitively how strongly this affects the outcome. an interesting future work in this case could be to evaluate different architectures for known datasets with good evaluation targets to show how important the choice of encoder architecture really is for DeepSAD and maybe related methods. another interesting avenue of research could be the introduction of per sample weighted semi-supervised targets, if analog ground-truth is available this could allow DeepSAD to better learn to quantify the anomalous nature of a sample instead of simply training it to classify samples.}
% \newsection{conclusion_open_questions}{Open Questions for Degradation Quantification}
% \todo[inline]{possible avenues we know of but did not look into are: difference between two temporally related frames as input to how much degradation there is, sensor fusion with other sensors (ultrasonic sensor may look through dense clouds which lidar cannot penetrate), tested architecture of DeepSAD use full 360 degree pointcloud as input which could be problematic if degradation occurs only in part of pointcloud (specific direction). possible solution is smaller input window (segment point cloud into sections depending on horizontal and or vertical angles) and calculate anomaly score/degradation quantification per section and therefore per direction of the sensor. this was tested in a small capacity and proved quite promising but we left it as future work in this thesis. this is also related to the problem that current solution only works for data from pointclouds with exact same resolution and could be used to enable the technique to work with multiple types of lidar, although unclear if different resolution per angle will work/affect performance of DeepSAD since this was not tested}
\newsection{future_work}{Future Work}
\todo[inline]{confirm results with real smoke data}
\newchapter{conclusion_future_work}{Conclusion and Future Work}
This thesis set out to answer the following research question, formulated in Chapter~\ref{chp:introduction}:
\begin{quote}
Can autonomous robots quantify the reliability of lidar sensor data in hazardous environments to make more informed decisions?
\end{quote}
Based on the experiments presented in Chapter~\ref{sec:results_deepsad} and Chapter~\ref{sec:results_inference}, we conclude that anomaly detection (AD) methods—and in particular DeepSAD—can indeed be applied to this problem in principle. DeepSAD outperforms simpler baselines such as Isolation Forest and OC-SVM, and its anomaly scores correlate with degradation trends over time. However, the absence of robust ground truth and the reliance on noisy or ambiguous labels significantly limit the reliability of quantitative evaluation. As a result, while AD shows clear potential for enabling autonomous robots to assess the reliability of their lidar data, further work is required before this capability can be translated into practical decision-making support.
\paragraph{Contributions.}
The main contributions of this thesis can be summarized as follows:
\begin{itemize}
\item \textbf{Empirical evaluation:} A systematic comparison of DeepSAD against Isolation Forest and OC-SVM for lidar degradation detection, demonstrating that DeepSAD consistently outperforms simpler baselines.
\item \textbf{Analysis of latent dimensionality:} An investigation of how representation size influences performance and stability under noisy labels, revealing that smaller latent spaces are more robust in this setting.
\item \textbf{Analysis of semi-supervised training labels:} An evaluation of different semi-supervised labeling regimes, showing that in our case purely unsupervised training yielded the best performance. Adding a small number of labels reduced performance, while a higher ratio of labels led to partial recovery. This pattern may indicate overfitting effects, although interpretation is complicated by the presence of mislabeled evaluation targets.
\item \textbf{Analysis of encoder architecture:} A comparison between a LeNet-inspired and an Efficient encoder showed that the choice of architecture has a decisive influence on DeepSAD's performance. The Efficient encoder outperformed the LeNet-inspired baseline not only during autoencoder pretraining but also in anomaly detection. While the exact magnitude of this improvement is difficult to quantify due to noisy evaluation targets, the results underline the importance of encoder design for representation quality in DeepSAD.
\item \textbf{Feasibility study:} An exploration of runtime, temporal inference plots, and downstream applicability, indicating that anomaly scores correlate with degradation trends and could provide a foundation for future quantification methods.
\end{itemize}
% This thesis investigated the feasibility of using anomaly detection (AD) methods, and in particular DeepSAD, for quantifying lidar degradation in subterranean environments for use-cases such as autonomous rescue robots. The main findings can be summarized as follows:
% \begin{itemize}
% \item AD can in principle be applied to the problem of degradation quantification, with DeepSAD clearly outperforming simpler baselines such as Isolation Forest and OC-SVM (cf.~\ref{sec:results_deepsad}).
% \item Evaluation is severely constrained by the lack of reliable ground truth, which limits interpretation of quantitative results and hampers the assessment of degradation quantification even just for binary classification but especially beyond that.
% \item Despite these challenges, our results provide encouraging evidence that anomaly detection scores correlate with degradation trends (cf.~\ref{sec:results_inference}), motivating future work toward more reliable evaluation protocols and more expressive ground truth.
% \end{itemize}
\newsection{conclusion_data}{Missing Ground Truth as an Obstacle}
%The most significant obstacle identified in this work is the absence of robust and comprehensive ground truth for lidar degradation. As discussed in Chapter~\ref{chp:data_preprocessing}, it is not trivial to define what “degradation” precisely means in practice. While intuitive descriptions exist (e.g., dispersed smoke, dense clouds, localized plumes), these translate poorly into objective evaluation targets. Future work should therefore not only refine the conceptual understanding of degradation but also connect it to downstream users of the data (e.g., SLAM or mapping algorithms), where errors manifest differently.
The most significant obstacle identified in this work is the absence of robust and comprehensive ground truth for lidar degradation. As discussed in Chapter~\ref{chp:data_preprocessing}, it is not trivial to define what “degradation” precisely means in practice. Although error models for lidar and theoretical descriptions of how airborne particles affect laser returns exist, these models typically quantify errors at the level of individual points (e.g., missing returns, spurious near-range hits). Such metrics, however, may not be sufficient to assess the impact of degraded data on downstream users. A point cloud with relatively few, but highly localized errors—such as those caused by a dense smoke cloud—may lead a SLAM algorithm to misinterpret the region as a solid obstacle. In contrast, a point cloud with a greater number of dispersed errors might be easier to filter and thus cause little to no disruption in mapping. Consequently, the notion of “degradation” must extend beyond point-level error statistics to encompass how different error patterns propagate to downstream perception and decision-making modules.
Collecting objective ground truth is complicated by the mismatch between available sensors. Smoke sensors provide only local measurements of density and particle size, while lidar captures environmental structure from a distance. Neither modality alone can serve as a perfect proxy for the other, and their differing measurement principles mean that ground truth derived from one may not fully reflect the challenges faced by the other.
One promising direction is to evaluate degradation not directly on raw lidar frames but via its downstream impact. For example, future work could assess degradation based on discrepancies between a previously mapped 3D environment model and the output of a SLAM algorithm operating under degraded conditions. In such a setup, subjective labeling may still be required in special cases (e.g., dense smoke clouds treated as solid objects by SLAM), but it would anchor evaluation closer to the ultimate users of the data.
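As a minimal sketch of how such a discrepancy could be summarized (the function name, distance threshold, and nearest-neighbour criterion are our assumptions for illustration, not part of the experiments in this thesis), the fraction of SLAM points without a nearby correspondence in the registered reference map could serve as a simple degradation proxy:
\begin{verbatim}
# Sketch only: degradation proxy from the discrepancy between a reference
# map and a SLAM-produced point cloud (names and threshold are illustrative).
import numpy as np
from scipy.spatial import cKDTree

def map_discrepancy(reference_pts, slam_pts, max_dist=0.2):
    """Fraction of SLAM points farther than max_dist (metres) from the
    registered reference map; higher values suggest stronger degradation."""
    tree = cKDTree(reference_pts)           # index the reference map
    dists, _ = tree.query(slam_pts, k=1)    # nearest-neighbour distances
    return float(np.mean(dists > max_dist))
\end{verbatim}
Such a proxy would inherit the registration error of the SLAM pipeline itself, so it should be read as an anchor for labeling rather than as an exact measure of degradation.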
Finally, the binary ground truth employed here is insufficient for the quantification goal. As shown in Section~\ref{sec:results_inference}, DeepSAD's anomaly scores appear suitable not only for classification but also for expressing intermediate levels of degradation. Analog evaluation targets would therefore be highly valuable, as they would allow assessing whether anomaly scores correlate linearly or monotonically with degradation severity rather than only separating “normal” from “degraded.”
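One concrete option for such an assessment (an assumption about future evaluation, not an analysis performed in this thesis) would be a rank correlation such as Spearman's $\rho$ between anomaly scores $s_i$ and analog degradation targets $d_i$, which, in the absence of ties, reduces to
\[
\rho = 1 - \frac{6 \sum_{i=1}^{N} \bigl(\operatorname{rank}(s_i) - \operatorname{rank}(d_i)\bigr)^{2}}{N\,(N^{2}-1)},
\]
so that values close to $1$ indicate that higher scores consistently correspond to more severe degradation, without requiring a linear relationship.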
\newsection{conclusion_ad}{Insights into DeepSAD and AD for Degradation Quantification}
This work has shown that the DeepSAD principle is applicable to lidar degradation data and yields promising performance both in terms of accuracy and runtime feasibility (cf.~\ref{sec:setup_experiments_environment}). Compared to simple baselines such as Isolation Forest and OC-SVM, DeepSAD achieves significantly better discrimination of degraded frames. However, in our experiments the semi-supervised component of DeepSAD did not lead to measurable improvements, which may be attributable to the noisy evaluation targets (cf.~\ref{sec:results_semi}).
We also observed that the choice of encoder architecture is critical. As discussed in Section~\ref{sec:results_deepsad}, the Efficient architecture consistently outperformed the LeNet-inspired baseline in pretraining and contributed to stronger downstream performance. The influence of encoder design on DeepSAD training merits further study under cleaner evaluation conditions. In particular, benchmarking different encoder architectures on datasets with high-quality ground truth could clarify how much of DeepSAD's performance gain stems from representation quality versus optimization.
Future work could also explore per-sample weighting of semi-supervised targets. If analog ground truth becomes available, this would allow DeepSAD to better capture varying degrees of degradation by treating supervision as a graded signal rather than a binary label.
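A possible formulation (a sketch under the assumption that a per-sample weight $w_j \geq 0$ can be derived from analog ground truth; this was not implemented in this thesis) would scale each labeled term of the DeepSAD objective individually:
\[
\min_{\mathcal{W}} \;
\frac{1}{n+m}\sum_{i=1}^{n} \bigl\lVert \phi(\mathbf{x}_i;\mathcal{W}) - \mathbf{c} \bigr\rVert^{2}
+ \frac{\eta}{n+m}\sum_{j=1}^{m} w_j \bigl(\lVert \phi(\tilde{\mathbf{x}}_j;\mathcal{W}) - \mathbf{c} \rVert^{2}\bigr)^{\tilde{y}_j}
+ \frac{\lambda}{2}\sum_{\ell=1}^{L} \bigl\lVert \mathbf{W}^{\ell} \bigr\rVert_{F}^{2},
\]
where strongly degraded labeled samples would receive larger weights $w_j$ and borderline samples smaller ones, turning the supervision into a graded rather than a purely binary signal.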
\newsection{conclusion_open_questions}{Open Questions and Future Directions}
Several promising avenues remain open for future exploration:
\begin{itemize}
\item \textbf{Temporal modeling:} Instead of treating frames independently, future methods could directly model the difference between temporally consecutive frames to capture dynamic aspects of degradation.
\item \textbf{Sensor fusion:} Combining lidar with complementary sensors (e.g., ultrasonic sensors that penetrate dense clouds) could mitigate blind spots inherent to single-sensor evaluation.
\item \textbf{Input segmentation:} The DeepSAD architecture tested here processed full 360° lidar scans, which may obscure localized degradation. Segmenting point clouds into angular sectors and computing anomaly scores per sector could provide more fine-grained quantification (see the sketch after this list). Preliminary tests in this direction were promising but were not pursued further in this thesis.
\item \textbf{Cross-sensor generalization:} Current experiments assume identical sensor resolution. Extending the method to work across different lidar types, including those with varying angular resolutions, remains an open question and would enhance applicability in heterogeneous robotic fleets.
\end{itemize}
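As a minimal sketch of the per-sector idea mentioned above (the sector count and the generic \texttt{score\_fn} are assumptions for illustration; this is not the pipeline evaluated in this thesis), a full scan could be split by azimuth and each sector scored independently:
\begin{verbatim}
# Sketch only: per-sector anomaly scores over horizontal angular sectors.
import numpy as np

def per_sector_scores(points, score_fn, n_sectors=8):
    """points: (N, 3) array; score_fn maps a point subset to a scalar score.
    Empty sectors would need explicit handling in practice."""
    azimuth = np.arctan2(points[:, 1], points[:, 0])  # angle in [-pi, pi]
    idx = ((azimuth + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    return [score_fn(points[idx == s]) for s in range(n_sectors)]
\end{verbatim}
Per-sector scores would then indicate the direction in which degradation occurs, at the cost of giving the detector less spatial context per input.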
In summary, while this thesis demonstrates the feasibility of using anomaly detection for lidar degradation quantification, significant challenges remain. Chief among them are the definition and collection of ground truth, the development of analog evaluation targets, and architectural adaptations for more complex real-world scenarios. Addressing these challenges will be critical for moving from proof-of-concept to practical deployment in rescue robotics and beyond.