fixed plots

Jan Kowalczyk
2025-10-21 19:04:19 +02:00
parent 8f983b890f
commit 7b5accb6c5
25 changed files with 1917 additions and 165 deletions

Binary file not shown.


@@ -427,7 +427,7 @@ To ensure our chosen dataset meets the needs of reliable degradation quantificat
\begin{enumerate}
\item \textbf{Data Modalities:}\\
-The dataset must include \rev{LiDAR} sensor data, since we decided to train and evaluate our method on what should be the most universally used sensor type in the given domain. To keep our method as generalized as possible, we chose to only require range-based point cloud data and \rev{opt out of} sensor-specific data such as intensity or reflectivity, though it may be of interest for future work. It is also desirable to have complementary visual data, such as camera images, for better context, manual verification, and understanding of the data.
+The dataset must include \rev{LiDAR} sensor data, since we decided to train and evaluate our method on what should be the most universally used sensor type in the given domain. To keep our method as generalized as possible, we chose to only require range-based point cloud data and neglect sensor-specific data such as intensity or reflectivity, though it may be of interest for future work. It is also desirable to have complementary visual data, such as camera images, for better context, manual verification, and understanding of the data.
\item \textbf{Context \& Collection Method:}\\
To mirror the real-world conditions of autonomous rescue robots, the data should originate from locations such as subterranean environments (tunnels, caves, collapsed structures), which closely reflect what would be encountered during rescue missions. Ideally, it should be captured from a ground-based, self-driving robot platform in motion instead of aerial, handheld, or stationary collection, to ensure similar circumstances to the target domain.
@@ -514,7 +514,7 @@ In the anomalous experiments, the artificial smoke machine appears to have been
Regarding the dataset volume, the 10 normal experiments ranged from 88.7 to 363.1 seconds, with an average duration of 157.65 seconds. At a capture rate of 10 frames per second, these experiments yield 15,765 non-degraded point clouds. In contrast, the 4 anomalous experiments, including one stationary experiment lasting 11.7 seconds and another extending to 62.1 seconds, averaged 47.33 seconds, resulting in 1,893 degraded point clouds. In total, the dataset comprises 17,658 point clouds, with approximately 89.28\% classified as non-degraded (normal) and 10.72\% as degraded (anomalous). The distribution of experimental data is visualized in \rev{Figure}~\ref{fig:data_points_pie}.
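As a quick sanity check (our own arithmetic, not part of the original dataset description), the frame counts follow directly from the stated durations and the 10 Hz capture rate:

\[
10 \times 157.65\,\mathrm{s} \times 10\,\mathrm{Hz} = 15{,}765,
\qquad
4 \times 47.33\,\mathrm{s} \times 10\,\mathrm{Hz} \approx 1{,}893,
\]
\[
\frac{15{,}765}{17{,}658} \approx 89.28\,\%,
\qquad
\frac{1{,}893}{17{,}658} \approx 10.72\,\%.
\]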
-\fig{data_points_pie}{figures/data_points_pie.png}{Pie chart visualizing the amount and distribution of normal and anomalous point clouds in \cite{subter}\rev{.}}
+\fig{data_points_pie}{figures/data_points_pie.png}{Pie chart visualizing the amount and distribution of normal and anomalous LiDAR frames (i.e., point clouds) in \cite{subter}\rev{.}}
The artificial smoke introduces measurable changes that clearly separate the \textit{anomalous} runs from the \textit{normal} baseline. One change is a larger share of missing points per scan: smoke particles scatter or absorb the laser beam before it reaches a solid target, so the sensor reports an error instead of a distance. Figure~\ref{fig:data_missing_points} shows the resulting right-shift of the missing-point histogram, a known effect for \rev{LiDAR} sensors in aerosol-filled environments. Another demonstrative effect is the appearance of many spurious returns very close to the sensor; these near-field points arise when back-scatter from the aerosol itself is mistaken for a surface echo. The box plot in \rev{Figure}~\ref{fig:particles_near_sensor} confirms a pronounced increase in sub-50 cm hits under smoke, a range at which we do not expect any non-erroneous measurements. Both effects are consistent with the behaviour reported in \rev{\cite{when_the_dust_settles}}.
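To make the two indicators concrete, here is a minimal sketch of how they could be computed per scan, assuming the scan is already available as a range image in metres with missing returns encoded as 0 (the array layout and encoding are our assumptions, not the dataset's actual schema):

import numpy as np

def degradation_indicators(range_img: np.ndarray) -> tuple[float, float]:
    """Return (missing-point share, near-sensor share) for one LiDAR scan.

    Assumes a range image in metres where 0 marks a missing return; near-sensor
    hits are counted below 0.5 m, mirroring the box plot discussed above.
    """
    total = range_img.size
    missing_share = np.count_nonzero(range_img == 0) / total
    near_sensor_share = np.count_nonzero((range_img > 0) & (range_img < 0.5)) / total
    return missing_share, near_sensor_share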
@@ -533,9 +533,9 @@ For this reason and to simplify the architecture, we converted the point clouds
To create this mapping, we leveraged the available measurement indices and channel information inherent in the dense point clouds, which are ordered from 0 to 65,535 in a horizontally ascending, channel-by-channel manner. For sparse point clouds without such indices, one would need to rely on the pitch and yaw angles relative to the sensor's origin to correctly map each point to its corresponding pixel, although this often leads to ambiguous mappings due to numerical errors in angle estimation.
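A minimal sketch of both mapping strategies, assuming a 2048x32 sensor (65,536 measurement slots per scan); the field names (`measurement_id`, `ring`, `ranges`) and the vertical field of view are placeholders, not the dataset's actual schema:

import numpy as np

H, W = 32, 2048  # assumed sensor layout: 32 channels x 2048 columns = 65,536 slots

def project_by_index(measurement_id, ring, ranges):
    """Index-based mapping for dense clouds that carry per-point indices."""
    img = np.zeros((H, W), dtype=np.float32)  # 0 encodes a missing return
    img[ring, measurement_id % W] = ranges
    return img

def project_by_angles(x, y, z, ranges):
    """Angle-based fallback for sparse clouds; prone to ambiguous pixel
    assignments when the yaw/pitch estimates are numerically noisy."""
    yaw = np.arctan2(y, x)                                   # [-pi, pi)
    pitch = np.arcsin(z / np.maximum(ranges, 1e-6))
    cols = ((yaw + np.pi) / (2 * np.pi) * W).astype(int) % W
    fov_lo, fov_hi = np.radians(-16.6), np.radians(16.6)     # placeholder vertical FoV
    rows = ((pitch - fov_lo) / (fov_hi - fov_lo) * (H - 1)).round().astype(int)
    rows = np.clip(rows, 0, H - 1)
    img = np.zeros((H, W), dtype=np.float32)
    img[rows, cols] = ranges
    return img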
-Figure~\ref{fig:data_projections} displays two examples of \rev{LiDAR} point cloud projections to aid in the readers understanding. Although the original point clouds were converted into grayscale images with a resolution of 2048×32 pixels, these raw images can be challenging to interpret. To enhance human readability, we applied the viridis colormap and vertically stretched the images so that each measurement occupies multiple pixels in height. The top projection is derived from a scan without artificial smoke—and therefore minimal degradation—while the lower projection comes from an experiment where artificial smoke introduced significant degradation.
+Figure~\ref{fig:data_projections} displays two examples of \rev{LiDAR} point cloud projections to aid in the reader's understanding. Although the original point clouds were converted into grayscale images with a resolution of 2048×32 pixels, these raw images can be challenging to interpret. To enhance human readability, we applied the viridis colormap and vertically stretched the images so that each measurement occupies multiple pixels in height. The projection in (a) is derived from a scan without artificial smoke—and therefore minimal degradation—while the projection in (b) comes from an experiment where artificial smoke introduced significant degradation.
-\fig{data_projections}{figures/data_2d_projections.png}{Two-dimensional projections of two point clouds, one from an experiment without degradation and one from an experiment with artificial smoke as degradation. To aid the reader's perception, the images are vertically stretched, and a colormap has been applied to the pixels' reciprocal range values, while the actual training data is grayscale.}
+\fig{data_projections}{figures/data_2d_projections.png}{Two-dimensional projections of two point clouds, (a) from an experiment without degradation and (b) from an experiment with artificial smoke as degradation. To aid the reader's perception, the images are vertically stretched, and a colormap has been applied to the pixels' reciprocal range values, while the actual training data is grayscale.}
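Purely for illustration, a short sketch of the visualization step from the caption (reciprocal range, viridis colormap, vertical stretching); `range_img` is assumed to be one of the 32x2048 projections from above:

import matplotlib.pyplot as plt
import numpy as np

def show_projection(range_img: np.ndarray, stretch: int = 8) -> None:
    """Display a range image with reciprocal range values, viridis, and taller rows."""
    recip = np.where(range_img > 0, 1.0 / np.maximum(range_img, 1e-6), 0.0)
    stretched = np.repeat(recip, stretch, axis=0)  # each row occupies `stretch` pixels
    plt.imshow(stretched, cmap="viridis", aspect="auto")
    plt.axis("off")
    plt.show()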
The remaining challenge was labeling a large enough portion of the dataset in a reasonably accurate manner, a task whose difficulties and general approach we described in \rev{Section}~\ref{sec:data_req}. Since, to our knowledge, neither our chosen dataset nor any other publicly available one provides objective labels for \rev{LiDAR} data degradation in the SAR domain, we had to define our own labeling approach. With objective measures of degradation unavailable, we explored alternative labeling methods—such as using \rev{the statistical} properties like the number of missing measurements per point cloud or the higher incidence of erroneous measurements near the sensor that we described in \rev{Section~\ref{sec:data_dataset}}. Ultimately, we were concerned that these statistical approaches might lead the method to simply mimic the statistical evaluation rather than to quantify degradation in a generalized and robust manner. After considering these options, we decided to label all point clouds from experiments with artificial smoke as anomalies, while point clouds from experiments without smoke were labeled as normal data. This labeling strategy—based on the presence or absence of smoke—is fundamentally an environmental indicator, independent of the intrinsic data properties recorded during the experiments.
@@ -553,7 +553,7 @@ Afraid that the incorrectly labeled data may negatively impact DeepSAD's semi-su
Under both evaluation schemes, all frames from normal experiments were marked as normal, since they appear to have produced high-quality data throughout. A visualization of how the two evaluation schemes measure up in terms of the number of samples per class can be seen in \rev{Figure}~\ref{fig:data_eval_labels}.
-\fig{data_eval_labels}{figures/data_eval_labels.png}{Pie charts visualizing the number of normal and anomalous labels applied to the dataset per labeling scheme. A large part of the experiment-based anomalous labels had to be removed for the manually-defined scheme, since, subjectively, they were either clearly or possibly not degraded.}
+\fig{data_eval_labels}{figures/data_eval_labels.png}{Pie charts visualizing the number of normal and anomalous labels applied to the dataset for (a) the experiment-based labeling scheme and (b) the manually-defined labeling scheme. A large part of the experiment-based anomalous labels had to be removed for the manually-defined scheme, since, subjectively, they were either clearly or possibly not degraded.}
By evaluating and comparing both approaches, we hope to demonstrate a more thorough performance investigation than with only one of the two \rev{labeling schemes}.
@@ -775,9 +775,9 @@ To address this, we instead rely on Precision--Recall Curves (PRC)~\cite{prc},
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}.
\]
-In our evaluation, this distinction proved practically significant. As illustrated in Figure~\ref{fig:roc_vs_prc}, ROC AUC values for Isolation Forest and DeepSAD appear similarly strong (0.693 vs. 0.782), suggesting comparable performance. However, the PRC reveals a clear divergence: while DeepSAD maintains high precision across recall levels, Isolation Forest suffers a steep decline in precision as recall increases, due to a high number of false positives. The resulting Average Precision (AP)—the area under the PRC—is much lower for Isolation Forest (0.207 vs. 0.633), offering a more realistic account of its performance under imbalance.
+In our evaluation, this distinction proved practically significant. As illustrated in Figure~\ref{fig:roc_vs_prc}, the ROC AUC values in (a) appear similarly strong for both Isolation Forest and DeepSAD (0.693 vs. 0.782), suggesting comparable performance. However, the PRC in (b) reveals a clear divergence: while DeepSAD maintains high precision across recall levels, Isolation Forest suffers a steep decline in precision as recall increases, due to a high number of false positives. The resulting Average Precision (AP)—the area under the PRC—is much lower for Isolation Forest (0.207 vs. 0.633), offering a more realistic account of its performance under imbalance.
-\figc{roc_vs_prc}{figures/setup_roc_vs_prc.png}{Comparison of ROC and PRC for the same evaluation run. ROC fails to reflect the poor performance of Isolation Forest, which misclassifies many normal samples as anomalous at lower thresholds. The PRC exposes this effect, resulting in a substantially lower AP for Isolation Forest than for DeepSAD.}{width=.9\textwidth}
+\figc{roc_vs_prc}{figures/setup_roc_vs_prc.png}{Comparison of ROC (a) and PRC (b) for the same evaluation run. ROC fails to reflect the poor performance of Isolation Forest, which misclassifies many normal samples as anomalous at lower thresholds. The PRC exposes this effect, resulting in a substantially lower AP for Isolation Forest than for DeepSAD.}{width=.9\textwidth}
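The gap between the two metrics is easy to reproduce on synthetic scores; here is a small scikit-learn sketch (the score distributions below are invented for illustration and are not the thesis results):

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.random(10_000) < 0.10  # ~10% anomalies, roughly the dataset's imbalance

# Hypothetical anomaly scores: the "strong" detector separates classes clearly,
# the "weak" one only slightly, qualitatively mimicking DeepSAD vs. Isolation Forest.
strong = rng.normal(0, 1, 10_000) + 2.5 * y_true
weak = rng.normal(0, 1, 10_000) + 0.8 * y_true

for name, scores in [("strong", strong), ("weak", weak)]:
    print(f"{name}: ROC AUC={roc_auc_score(y_true, scores):.3f}, "
          f"AP={average_precision_score(y_true, scores):.3f}")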
In addition to cross-validated performance comparisons, we also apply the trained models to previously unseen, temporally ordered experiments to simulate inference in realistic conditions. Since each method produces scores on a different scale—with different signs and ranges—raw scores are not directly comparable. To enable comparison, we compute a $z$-score~\cite{zscore} per frame, defined as the number of standard deviations a score deviates from the mean of the normal data. To perform the normalization, we compute the mean and standard deviation of anomaly scores on a clean reference experiment. These values are then used to normalize scores from degraded experiments, making it easy to see how much each method's output deviates from its own baseline under degradation. It also facilitates a unified view across methods, even though their outputs are otherwise heterogeneous. In this way, $z$-score normalization supports threshold-free interpretation and enables consistent model comparison during inference.
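A minimal sketch of this normalization step (function and argument names are ours, not the evaluation code's):

import numpy as np

def zscore_against_clean(scores: np.ndarray, clean_scores: np.ndarray) -> np.ndarray:
    """Express per-frame anomaly scores as standard deviations away from the
    mean observed on a clean reference experiment."""
    mu = float(np.mean(clean_scores))
    sigma = float(np.std(clean_scores)) + 1e-12  # guard against a zero spread
    return (np.asarray(scores, dtype=float) - mu) / sigma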
@@ -953,7 +953,7 @@ Together, these results provide a comprehensive overview of the computational re
\newchapter{results_discussion}{Results and Discussion}
-In this chapter, we present the \rev{evaluation experiments, outlined in Chapter~\ref{chp:experimental_setup}}. We begin in Section~\ref{sec:results_pretraining} with the pretraining stage, where the two autoencoder architectures were trained across multiple latent space dimensionalities. These results provide insight into the representational capacity of each architecture. In Section~\ref{sec:results_deepsad}, we turn to the main experiments: training DeepSAD models and benchmarking them against baseline algorithms (Isolation Forest and OCSVM). Finally, in Section~\ref{sec:results_inference}, we present inference results on \rev{data} that were held out during training. These plots illustrate how the algorithms behave when applied sequentially to unseen \rev{data}, offering a more practical perspective on their potential for real-world rescue robotics applications.
+In this chapter, we present the \rev{evaluation experiments, based on the experimental setup described in Chapter~\ref{chp:experimental_setup}}. We begin in Section~\ref{sec:results_pretraining} with the pretraining stage, where the two autoencoder architectures were trained across multiple latent space dimensionalities. These results provide insight into the representational capacity of each architecture. In Section~\ref{sec:results_deepsad}, we turn to the main experiments: training DeepSAD models and benchmarking them against baseline algorithms (Isolation Forest and OCSVM). Finally, in Section~\ref{sec:results_inference}, we present inference results on \rev{data} that were held out during training. These plots illustrate how the algorithms behave when applied sequentially to unseen \rev{data}, offering a more practical perspective on their potential for real-world rescue robotics applications.
% --- Section: Autoencoder Pretraining Results ---
\newsection{results_pretraining}{Autoencoder Pretraining Results}
@@ -1046,7 +1046,7 @@ The precision--recall curves \rev{for experiment-based evaluation} (Figure~\ref{
Taken together, the two evaluation schemes provide complementary insights. The experiment-based labels offer a noisy but realistic setting that shows how methods cope with ambiguous data, while the manually-defined labels confirm that DeepSAD can achieve nearly perfect separation when the ground truth is clean. The combination of both evaluations makes clear that (i) DeepSAD is stronger than the baselines under both conditions, (ii) the apparent performance limits under experiment-based labels are mainly due to label noise, and (iii) interpreting results requires care, since performance drops in the curves often reflect mislabeled samples rather than model failures. At the same time, both schemes remain binary classifications and therefore cannot directly evaluate the central question of whether anomaly scores can serve as a continuous measure of degradation. For this reason, we extend the analysis in Section~\ref{sec:results_inference}, where inference on entire unseen experiments is used to provide a more intuitive demonstration of the method's potential for quantifying \rev{LiDAR} degradation in practice.
-\fig{prc_representative}{figures/results_prc.png}{Representative precision--recall curves over all latent dimensionalities for semi-labeling regime 0/0 from experiment-based evaluation labels. DeepSAD maintains a large high-precision operating region before collapsing; OCSVM declines smoother but exhibits high standard deviation between folds; IsoForest collapses quickly and remains flat. DeepSAD's fall-off is at least partly due to known mislabeled evaluation targets.}
+\fig{prc_representative}{figures/results_prc.png}{Representative precision--recall curves (a)--(g) over all latent dimensionalities 32--1024 for semi-labeling regime 0/0 from experiment-based evaluation labels. DeepSAD maintains a large high-precision operating region before collapsing; OCSVM declines more smoothly but exhibits high standard deviation between folds; IsoForest collapses quickly and remains flat. DeepSAD's fall-off is at least partly due to known mislabeled evaluation targets.}
\FloatBarrier
@@ -1066,7 +1066,7 @@ Table~\ref{tab:results_ap} shows that the unsupervised regime \((0/0)\) achieves
The precision--recall curves in Figure~\ref{fig:prc_over_semi} show that the overall curve shapes are similar across regimes, but shifted relative to one another in line with the AP ordering \((0/0) > (500/100) > (50/10)\). We attribute these shifts to overfitting: when only a few anomalies are labeled, the model fits them too strongly, and if those examples differ too much from other anomalies, generalization suffers. This explains why lightly supervised training performs even worse than unsupervised training, which avoids this bias.
-\figc{prc_over_semi}{figures/results_prc_over_semi.png}{\rev{PRCs} at latent dimension~32 for all three labeling regimes (unsupervised, lightly supervised, heavily supervised), shown separately for the LeNet-inspired (\rev{top}) and Efficient (\rev{bottom}) encoders. Baseline methods are included for comparison. Latent dimension~32 is shown as it achieved the best overall AP and is representative of the typical PRC shapes across dimensions.}{width=.7\textwidth}
+\figc{prc_over_semi}{figures/results_prc_over_semi.png}{\rev{PRCs} from experiment-based evaluation for all three labeling regimes (unsupervised, lightly supervised, heavily supervised), shown separately for the LeNet-inspired (\rev{a}) and Efficient (\rev{b}) encoders. Baseline methods are included for comparison. Latent dimension~32 is shown as it achieved the best overall AP and is representative of the typical PRC shapes across dimensions.}{width=.7\textwidth}
The LeNet variant illustrates this effect most clearly, showing unusually high variance across folds in the lightly supervised case. In several folds, precision drops atypically early, which supports the idea that the model has overfit to a poorly chosen subset of labeled anomalies. The Efficient variant is less affected, maintaining more stable precision plateaus, which suggests it is more robust to such overfitting, a behavior we observe consistently for nearly all latent dimensionalities.
@@ -1084,13 +1084,13 @@ These frame-by-frame time-axis plots simulate online inference and illustrate ho
%\fig{results_inference_normal_vs_degraded}{figures/results_inference_normal_vs_degraded.png}{Comparison of anomaly detection methods with statistical indicators across clean (dashed) and degraded (solid) experiments. Each subplot shows one method (DeepSAD--LeNet, DeepSAD--Efficient, OCSVM, Isolation Forest). Red curves denote how strongly the anomaly score deviates from clean-experiment baseline; blue and green curves denote the percentage of missing \rev{LiDAR} points and near-sensor particle hits, respectively. Latent Space Dimensionality was 32 and semi-supervised labeling regime was 0 normal and 0 anomalous samples during training.}
-\fig{results_inference_normal_vs_degraded}{figures/results_inference_normal_vs_degraded.png}{Comparison of inference on unseen experiment for clean (dashed) vs. degraded (solid) experiments. Every subplot compares one method to statistical indicators. Red curves show each method's anomaly score deviation from its clean baseline; blue and green curves indicate the percentage of missing \rev{LiDAR} points and near-sensor particle hits, respectively. Latent dimension: 32; training regime: 0 normal, 0 anomalous samples.}
+\fig{results_inference_normal_vs_degraded}{figures/results_inference_normal_vs_degraded.png}{Comparison of inference on unseen data for clean (dashed) vs. degraded (solid) experiments. Each subplot, (a)--(d), compares one method's anomaly score deviation from its clean baseline (red) to statistical indicators (blue and green), which indicate the percentage of missing \rev{LiDAR} points and near-sensor particle hits, respectively. Latent dimension: 32; training regime: 0 normal, 0 anomalous samples. Smoothed with EMA $\alpha=0.1$.}
As discussed in Section~\ref{sec:setup_baselines_evaluation}, we apply $z$-score normalization to enable comparison of the different methods during inference. After normalization, the resulting time series were still highly noisy, which motivated the application of exponential moving average (EMA) smoothing. EMA was chosen because it is causal (does not rely on future data) and thus suitable for real-time inference. Although it introduces a small time delay, this delay is shorter than for other smoothing techniques, such as running averages.
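For reference, a causal EMA of the kind described here can be written in a few lines (sketch; the update uses only past samples, so it can run online):

import numpy as np

def ema(x, alpha: float = 0.1) -> np.ndarray:
    """Causal exponential moving average: y[t] = alpha*x[t] + (1-alpha)*y[t-1]."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]
    for t in range(1, len(x)):
        y[t] = alpha * x[t] + (1.0 - alpha) * y[t - 1]
    return y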
The plots in Figure~\ref{fig:results_inference_normal_vs_degraded} highlight important differences in how well the tested methods distinguish between normal and degraded sensor conditions. The plots show how strongly each method's scores deviate from their clean-data baseline and include statistical indicators (missing points and near-sensor particle hits) in blue and green.
-Among the four approaches, the strongest separation is achieved by DeepSAD (Efficient), followed by DeepSAD (LeNet), then OCSVM. For Isolation Forest, the anomaly scores are already elevated in the clean experiment, which prevents reliable differentiation between normal and degraded runs and makes the method unsuitable in this context.
+Among the four approaches, the strongest separation is achieved by DeepSAD Efficient (b), followed by DeepSAD LeNet (a), then OCSVM (c). For Isolation Forest (d), the anomaly scores are already elevated in the clean experiment, which prevents reliable differentiation between normal and degraded runs and makes the method unsuitable in this context.
When comparing the methods to the statistical indicators, some similarities in shape may suggest that the methods partly capture these statistics, although such interpretations should be made with caution.
The anomaly detection models are expected to have learned additional patterns that are not directly observable from simple statistics, and these may also contribute to their ability to separate degraded from clean data.

Ten binary image files changed (regenerated figure images, before/after sizes between 26 KiB and 1.4 MiB); contents not shown.

@@ -1,9 +1,9 @@
\addcontentsline{toc}{chapter}{Abstract}
\begin{center}\Large\bfseries Abstract\end{center}\vspace*{1cm}\noindent
-Autonomous robots are increasingly used in search and rescue (SAR) missions. In these missions, LiDAR sensors are often the most important source of environmental data. However, LiDAR data can degrade under hazardous conditions, especially when airborne particles such as smoke or dust are present. This degradation can lead to errors in mapping and navigation and may endanger both the robot and humans. Therefore, robots need a way to estimate the reliability of their LiDAR data, so \rev{that} they can make better-informed decisions.
+Autonomous robots are increasingly used in search and rescue (SAR) missions. In these missions, LiDAR sensors are often the most important source of environmental data. However, LiDAR data can degrade under hazardous conditions, especially when airborne particles such as smoke or dust are present. This degradation can lead to errors in mapping and navigation and may endanger both the robot and humans. Therefore, robots need a way to estimate the reliability of their LiDAR data, so that they can make better-informed decisions.
\bigskip
-This thesis investigates whether anomaly detection methods can be used to quantify LiDAR data degradation \rev{caused by airborne particles such as smoke and dust}. We apply a semi-supervised deep learning approach called DeepSAD, which produces an anomaly score for each LiDAR scan, serving as a measure of data reliability.
+This thesis investigates whether anomaly detection methods can be used to quantify LiDAR data degradation caused by airborne particles such as smoke and dust. We apply a semi-supervised deep learning approach called DeepSAD, which produces an anomaly score for each LiDAR scan, serving as a measure of data reliability.
\bigskip
We evaluate this method against baseline methods on a subterranean dataset that includes LiDAR scans degraded by artificial smoke. Our results show that DeepSAD consistently outperforms the baselines and can clearly distinguish degraded from normal scans. At the same time, we find that the limited availability of labeled data and the lack of robust ground truth remain major challenges. Despite these limitations, our work demonstrates that anomaly detection methods are a promising tool for LiDAR degradation quantification in SAR scenarios.


@@ -1,6 +1,6 @@
{ pkgs, ... }:
let
-  native_dependencies = with pkgs.python312Packages; [
+  native_dependencies = with pkgs.python311Packages; [
    torch-bin
    torchvision-bin
    aggdraw # for visualtorch
@@ -16,7 +16,7 @@ in
  packages = native_dependencies ++ tools;
  languages.python = {
    enable = true;
-    package = pkgs.python312;
+    package = pkgs.python311;
    uv = {
      enable = true;
      sync.enable = true;


@@ -12,7 +12,7 @@ import numpy as np
import polars as pl
# CHANGE THIS IMPORT IF YOUR LOADER MODULE IS NAMED DIFFERENTLY
-from plot_scripts.load_results import load_pretraining_results_dataframe
+from load_results import load_pretraining_results_dataframe
# ----------------------------
# Config
@@ -78,8 +78,8 @@ def build_arch_curves_from_df(
"overall": (dims, means, stds), "overall": (dims, means, stds),
} } } }
""" """
if "split" not in df.columns: # if "split" not in df.columns:
raise ValueError("Expected 'split' column in AE dataframe.") # raise ValueError("Expected 'split' column in AE dataframe.")
if "scores" not in df.columns: if "scores" not in df.columns:
raise ValueError("Expected 'scores' column in AE dataframe.") raise ValueError("Expected 'scores' column in AE dataframe.")
if "network" not in df.columns or "latent_dim" not in df.columns: if "network" not in df.columns or "latent_dim" not in df.columns:
@@ -88,7 +88,7 @@ def build_arch_curves_from_df(
raise ValueError(f"Expected '{label_field}' column in AE dataframe.") raise ValueError(f"Expected '{label_field}' column in AE dataframe.")
# Keep only test split # Keep only test split
df = df.filter(pl.col("split") == "test") # df = df.filter(pl.col("split") == "test")
groups: dict[tuple[str, int], dict[str, list[float]]] = {} groups: dict[tuple[str, int], dict[str, list[float]]] = {}
@@ -201,7 +201,7 @@ def plot_multi_loss_curve(arch_results, title, output_path, colors=None):
plt.xlabel("Latent Dimensionality") plt.xlabel("Latent Dimensionality")
plt.ylabel("Test Loss") plt.ylabel("Test Loss")
plt.title(title) # plt.title(title)
plt.legend() plt.legend()
plt.grid(True, alpha=0.3) plt.grid(True, alpha=0.3)
plt.xticks(all_dims) plt.xticks(all_dims)


@@ -171,28 +171,28 @@ def plot_combined_timeline(
            range(num_bins), near_sensor_binned, color=color, linestyle="--", alpha=0.6
        )
-        # Add vertical lines for manually labeled frames if available
-        if all_paths[i].with_suffix(".npy").name in manually_labeled_anomaly_frames:
-            begin_frame, end_frame = manually_labeled_anomaly_frames[
-                all_paths[i].with_suffix(".npy").name
-            ]
-            # Convert frame numbers to normalized timeline positions
-            begin_pos = (begin_frame / exp_len) * (num_bins - 1)
-            end_pos = (end_frame / exp_len) * (num_bins - 1)
-            # Add vertical lines with matching color and loose dotting
-            ax1.axvline(
-                x=begin_pos,
-                color=color,
-                linestyle=":",
-                alpha=0.6,
-            )
-            ax1.axvline(
-                x=end_pos,
-                color=color,
-                linestyle=":",
-                alpha=0.6,
-            )
+        # # Add vertical lines for manually labeled frames if available
+        # if all_paths[i].with_suffix(".npy").name in manually_labeled_anomaly_frames:
+        #     begin_frame, end_frame = manually_labeled_anomaly_frames[
+        #         all_paths[i].with_suffix(".npy").name
+        #     ]
+        #     # Convert frame numbers to normalized timeline positions
+        #     begin_pos = (begin_frame / exp_len) * (num_bins - 1)
+        #     end_pos = (end_frame / exp_len) * (num_bins - 1)
+        #     # Add vertical lines with matching color and loose dotting
+        #     ax1.axvline(
+        #         x=begin_pos,
+        #         color=color,
+        #         linestyle=":",
+        #         alpha=0.6,
+        #     )
+        #     ax1.axvline(
+        #         x=end_pos,
+        #         color=color,
+        #         linestyle=":",
+        #         alpha=0.6,
+        #     )
    # Customize axes
    ax1.set_xlabel("Normalized Timeline")
@@ -202,7 +202,7 @@ def plot_combined_timeline(
ax1.set_ylabel("Missing Points (%)") ax1.set_ylabel("Missing Points (%)")
ax2.set_ylabel("Points with <0.5m Range (%)") ax2.set_ylabel("Points with <0.5m Range (%)")
plt.title(title) # plt.title(title)
# Create legends without fixed positions # Create legends without fixed positions
# First get all lines and labels for experiments # First get all lines and labels for experiments
@@ -221,7 +221,8 @@ def plot_combined_timeline(
    )
    # Create single legend in top right corner with consistent margins
-    fig.legend(all_handles, all_labels, loc="upper right", borderaxespad=4.8)
+    # fig.legend(all_handles, all_labels, loc="upper right", borderaxespad=2.8)
+    fig.legend(all_handles, all_labels, bbox_to_anchor=(0.95, 0.99))
    plt.grid(True, alpha=0.3)


@@ -122,8 +122,8 @@ def plot_data_points_pie(normal_experiment_frames, anomaly_experiment_frames):
    # prepare data for pie chart
    labels = [
-        "Normal Lidar Frames\nNon-Degraded Pointclouds",
-        "Anomalous Lidar Frames\nDegraded Pointclouds",
+        "Normal Lidar Frames\nNon-Degraded Point Clouds",
+        "Anomalous Lidar Frames\nDegraded Point Clouds",
    ]
    sizes = [total_normal_frames, total_anomaly_frames]
    explode = (0.1, 0)  # explode the normal slice
@@ -150,9 +150,9 @@ def plot_data_points_pie(normal_experiment_frames, anomaly_experiment_frames):
va="center", va="center",
color="black", color="black",
) )
plt.title( # plt.title(
"Distribution of Normal and Anomalous\nPointclouds in all Experiments (Lidar Frames)" # "Distribution of Normal and Anomalous\nPointclouds in all Experiments (Lidar Frames)"
) # )
plt.tight_layout() plt.tight_layout()
# save the plot # save the plot


@@ -5,7 +5,6 @@ from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
-from pointcloudset import Dataset
# define data path containing the bag files
all_data_path = Path("/home/fedex/mt/data/subter")
@@ -82,7 +81,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
    plt.figure(figsize=(10, 5))
    plt.hist(missing_points_normal, bins=100, alpha=0.5, label="Normal Experiments")
    plt.hist(missing_points_anomaly, bins=100, alpha=0.5, label="Anomaly Experiments")
-    plt.title(title)
+    # plt.title(title)
    plt.xlabel("Number of Missing Points")
    plt.ylabel("Number of Pointclouds")
    plt.legend()
@@ -109,7 +108,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
label="Anomaly Experiments", label="Anomaly Experiments",
orientation="horizontal", orientation="horizontal",
) )
plt.title(title) # plt.title(title)
plt.xlabel("Number of Pointclouds") plt.xlabel("Number of Pointclouds")
plt.ylabel("Number of Missing Points") plt.ylabel("Number of Missing Points")
plt.legend() plt.legend()
@@ -142,7 +141,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
label="Anomaly Experiments", label="Anomaly Experiments",
density=True, density=True,
) )
plt.title(title) # plt.title(title)
plt.xlabel("Number of Missing Points") plt.xlabel("Number of Missing Points")
plt.ylabel("Density") plt.ylabel("Density")
plt.legend() plt.legend()
@@ -169,7 +168,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
label="Anomaly Experiments (With Artifical Smoke)", label="Anomaly Experiments (With Artifical Smoke)",
density=True, density=True,
) )
plt.title(title) # plt.title(title)
plt.xlabel("Percentage of Missing Lidar Measurements") plt.xlabel("Percentage of Missing Lidar Measurements")
plt.ylabel("Density") plt.ylabel("Density")
# display the x axis as percentages # display the x axis as percentages
@@ -210,7 +209,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
        alpha=0.5,
        label="Anomaly Experiments",
    )
-    plt.title(title)
+    # plt.title(title)
    plt.xlabel("Number of Missing Points")
    plt.ylabel("Normalized Density")
    plt.legend()


@@ -5,7 +5,6 @@ from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
-from pointcloudset import Dataset
# define data path containing the bag files
all_data_path = Path("/home/fedex/mt/data/subter")
@@ -164,7 +163,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
    plt.gca().set_yticklabels(
        ["{:.0f}%".format(y * 100) for y in plt.gca().get_yticks()]
    )
-    plt.title("Particles Closer than 0.5m to the Sensor")
+    # plt.title("Particles Closer than 0.5m to the Sensor")
    plt.ylabel("Percentage of measurements closer than 0.5m")
    plt.tight_layout()
    plt.savefig(output_datetime_path / f"particles_near_sensor_boxplot_{rt}.png")
@@ -186,7 +185,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
    plt.gca().set_yticklabels(
        ["{:.0f}%".format(y * 100) for y in plt.gca().get_yticks()]
    )
-    plt.title("Particles Closer than 0.5m to the Sensor")
+    # plt.title("Particles Closer than 0.5m to the Sensor")
    plt.ylabel("Percentage of measurements closer than 0.5m")
    plt.ylim(0, 0.05)
    plt.tight_layout()


@@ -112,18 +112,27 @@ cmap = get_colormap_with_special_missing_color(
    args.colormap, args.missing_data_color, args.reverse_colormap
)
-# --- Create a figure with 2 vertical subplots ---
+# --- Create a figure with 2 vertical subplots and move titles to the left ---
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(10, 5))
-for ax, frame, title in zip(
+# leave extra left margin for the left-side labels
+fig.subplots_adjust(left=0.14, hspace=0.05)
+for ax, frame, label in zip(
    (ax1, ax2),
    (frame1, frame2),
-    (
-        "Projection of Lidar Frame without Degradation",
-        "Projection of Lidar Frame with Degradation (Artifical Smoke)",
-    ),
+    ("(a)", "(b)"),
):
    im = ax.imshow(frame, cmap=cmap, aspect="auto", vmin=global_vmin, vmax=global_vmax)
-    ax.set_title(title)
+    # place the "title" to the left, vertically centered relative to the axes
+    ax.text(
+        -0.02,  # negative x places text left of the axes (in axes coordinates)
+        0.5,
+        label,
+        transform=ax.transAxes,
+        va="center",
+        ha="right",
+        fontsize=12,
+    )
    ax.axis("off")
# Adjust layout to fit margins for a paper


@@ -260,11 +260,11 @@ def baseline_transform(clean: np.ndarray, other: np.ndarray, mode: str):
def pick_method_series(gdf: pl.DataFrame, label: str) -> Optional[np.ndarray]:
-    if label == "DeepSAD (LeNet)":
+    if label == "DeepSAD LeNet":
        sel = gdf.filter(
            (pl.col("network") == "subter_LeNet") & (pl.col("model") == "deepsad")
        )
-    elif label == "DeepSAD (efficient)":
+    elif label == "DeepSAD Efficient":
        sel = gdf.filter(
            (pl.col("network") == "subter_efficient") & (pl.col("model") == "deepsad")
        )
@@ -311,8 +311,8 @@ def compare_two_experiments_progress(
    include_stats: bool = True,
):
    methods = [
-        "DeepSAD (LeNet)",
-        "DeepSAD (efficient)",
+        "DeepSAD LeNet",
+        "DeepSAD Efficient",
        "OCSVM",
        "Isolation Forest",
    ]
@@ -392,8 +392,8 @@ def compare_two_experiments_progress(
    axes = axes.ravel()
    method_to_axidx = {
-        "DeepSAD (LeNet)": 0,
-        "DeepSAD (efficient)": 1,
+        "DeepSAD LeNet": 0,
+        "DeepSAD Efficient": 1,
        "OCSVM": 2,
        "Isolation Forest": 3,
    }
@@ -404,6 +404,8 @@ def compare_two_experiments_progress(
    if not stats_available:
        print("[WARN] One or both stats missing. Subplots will include methods only.")
+    letters = ["a", "b", "c", "d"]
    for label, axidx in method_to_axidx.items():
        ax = axes[axidx]
        yc = curves_clean.get(label)
@@ -412,7 +414,7 @@ def compare_two_experiments_progress(
            ax.text(
                0.5, 0.5, "No data", ha="center", va="center", transform=ax.transAxes
            )
-            ax.set_title(label)
+            ax.set_title(f"({letters[axidx]}) {label}")
            ax.grid(True, alpha=0.3)
            continue
@@ -435,6 +437,7 @@ def compare_two_experiments_progress(
        )
        ax.set_ylabel(y_label)
        ax.set_title(label)
+        ax.set_title(f"({letters[axidx]}) {label}")
        ax.grid(True, alpha=0.3)
        # Right axis #1 (closest to plot): Missing points (%)
@@ -550,11 +553,11 @@ def compare_two_experiments_progress(
    for ax in axes:
        ax.set_xlabel("Progress through experiment (%)")
-    fig.suptitle(
-        f"AD Method vs Stats Inference — progress-normalized\n"
-        f"Transform: z-score normalized to non-degraded experiment | EMA(α={EMA_ALPHA_METHODS})",
-        fontsize=14,
-    )
+    # fig.suptitle(
+    #     f"AD Method vs Stats Inference — progress-normalized\n"
+    #     f"Transform: z-score normalized to non-degraded experiment | EMA(α={EMA_ALPHA_METHODS})",
+    #     fontsize=14,
+    # )
    fig.tight_layout(rect=[0, 0, 1, 0.99])
    out_name = (


@@ -161,7 +161,7 @@ def _ensure_dim_axes(fig_title: str):
    fig, axes = plt.subplots(
        nrows=4, ncols=2, figsize=(12, 16), constrained_layout=True
    )
-    fig.suptitle(fig_title, fontsize=14)
+    # fig.suptitle(fig_title, fontsize=14)
    axes = axes.ravel()
    return fig, axes
@@ -213,11 +213,13 @@ def plot_grid_from_df(
    legend_labels = []
    have_legend = False
+    letters = ["a", "b", "c", "d", "e", "f", "g", "h"]
    for i, dim in enumerate(LATENT_DIMS):
        if i >= 7:
            break  # last slot reserved for legend
        ax = axes[i]
-        ax.set_title(f"Latent Dim. = {dim}")
+        ax.set_title(f"({letters[i]}) Latent Dim. = {dim}")
        ax.grid(True, alpha=0.3)
        if kind == "roc":


@@ -260,9 +260,9 @@ def make_figures_for_dim(
    fig_roc, axes = plt.subplots(
        nrows=2, ncols=1, figsize=(7, 10), constrained_layout=True
    )
-    fig_roc.suptitle(
-        f"ROC — {EVALS_LABELS[eval_type]} — Latent Dim.={latent_dim}", fontsize=14
-    )
+    # fig_roc.suptitle(
+    #     f"ROC — {EVALS_LABELS[eval_type]} — Latent Dim.={latent_dim}", fontsize=14
+    # )
    _plot_panel(
        axes[0],
@@ -272,7 +272,7 @@ def make_figures_for_dim(
        latent_dim=latent_dim,
        kind="roc",
    )
-    axes[0].set_title("DeepSAD (LeNet) + Baselines")
+    axes[0].set_title("(a) DeepSAD (LeNet) + Baselines")
    _plot_panel(
        axes[1],
@@ -282,7 +282,7 @@ def make_figures_for_dim(
        latent_dim=latent_dim,
        kind="roc",
    )
-    axes[1].set_title("DeepSAD (Efficient) + Baselines")
+    axes[1].set_title("(b) DeepSAD (Efficient) + Baselines")
    out_roc = out_dir / f"roc_{latent_dim}_{eval_type}.png"
    fig_roc.savefig(out_roc, dpi=150, bbox_inches="tight")
@@ -292,9 +292,9 @@ def make_figures_for_dim(
    fig_prc, axes = plt.subplots(
        nrows=2, ncols=1, figsize=(7, 10), constrained_layout=True
    )
-    fig_prc.suptitle(
-        f"PRC — {EVALS_LABELS[eval_type]} — Latent Dim.={latent_dim}", fontsize=14
-    )
+    # fig_prc.suptitle(
+    #     f"PRC — {EVALS_LABELS[eval_type]} — Latent Dim.={latent_dim}", fontsize=14
+    # )
    _plot_panel(
        axes[0],
@@ -304,7 +304,7 @@ def make_figures_for_dim(
        latent_dim=latent_dim,
        kind="prc",
    )
-    axes[0].set_title("DeepSAD (LeNet) + Baselines")
+    axes[0].set_title("(a)")
    _plot_panel(
        axes[1],
@@ -314,7 +314,7 @@ def make_figures_for_dim(
        latent_dim=latent_dim,
        kind="prc",
    )
-    axes[1].set_title("DeepSAD (Efficient) + Baselines")
+    axes[1].set_title("(b)")
    out_prc = out_dir / f"prc_{latent_dim}_{eval_type}.png"
    fig_prc.savefig(out_prc, dpi=150, bbox_inches="tight")


@@ -6,6 +6,7 @@ readme = "README.md"
requires-python = ">=3.11.9" requires-python = ">=3.11.9"
dependencies = [ dependencies = [
"pandas>=2.3.2", "pandas>=2.3.2",
"pointcloudset>=0.11.0",
"polars>=1.33.0", "polars>=1.33.0",
"pyarrow>=21.0.0", "pyarrow>=21.0.0",
"tabulate>=0.9.0", "tabulate>=0.9.0",

tools/uv.lock (generated, 1890 changed lines): diff suppressed because it is too large.