diff --git a/thesis/Main.tex b/thesis/Main.tex
index 0aa5a6a..66cbc15 100755
--- a/thesis/Main.tex
+++ b/thesis/Main.tex
@@ -761,7 +761,32 @@ In our setting, the raw input dimensionality ($2048 \times 32$ per frame) is too
 We adapted the baseline implementations to our data loader and input format and added support for multiple evaluation targets per frame (two labels per data point), reporting both results per experiment. Both baselines, like DeepSAD, output continuous anomaly scores, which allows us to evaluate them directly without committing to a fixed threshold.
-TODO transition to evaluation metrics, talk about typical ones like F1 scores (single threshold) so we go on to talk about ROC AUC, well known but can suffer from having class imbalance (especially as in our case) maybe calculation and example. say we saw these exact problems in our results so we decided to report mAP which is similar to roc auc but not as sensitive in regards to class imbalance (show with formula why its not) and then go on to explain that its basically the AUC of PRCs, which are more fitting curves for our usecase due to the same stability for class imbalance (like mAP) but for multiple thresholds (unlike F1) and shape can also give more insight than simple mAP alone.
+%TODO transition to evaluation metrics, talk about typical ones like F1 scores (single threshold) so we go on to talk about ROC AUC, well known but can suffer from having class imbalance (especially as in our case) maybe calculation and example. say we saw these exact problems in our results so we decided to report mAP which is similar to roc auc but not as sensitive in regards to class imbalance (show with formula why its not) and then go on to explain that its basically the AUC of PRCs, which are more fitting curves for our usecase due to the same stability for class imbalance (like mAP) but for multiple thresholds (unlike F1) and shape can also give more insight than simple mAP alone.
+
+\paragraph{Evaluation Metrics}
+
+We have no continuous ground truth, only binary labels that are noisy and somewhat subjective, as discussed in Section~\ref{sec:preprocessing}. All models output continuous anomaly scores, but on different scales: DeepSAD reports the geometric distance to the hypersphere center and is therefore always positive, Isolation Forest reports the deviation from the mean tree depth (negative or positive, with bounds depending on the tree sizes), and OC-SVM reports the signed distance to the decision boundary (likewise negative or positive). It is therefore infeasible to manually define a single threshold at which to evaluate the models, which rules out metrics such as accuracy or the F1 score that require a fixed operating point.
+
+We instead turn to evaluation curves, which characterize performance over all possible thresholds. The most widely known and used is the ROC curve, summarized for single-number comparison by its area under the curve (ROC AUC). It plots the true positive rate (TPR) against the false positive rate (FPR) across all thresholds, but it tends to be overly optimistic on imbalanced datasets such as ours, which are typical for anomaly detection tasks. This problem is discussed in~\cite{roc_vs_prc}, which concludes that the precision--recall curve (PRC) and its area under the curve, also called Average Precision (AP), are more robust in such cases.
+
+The PRC plots precision (of the points predicted as anomalies, how many are actually anomalies) against recall (of all true anomalies, how many were found). Starting from a very strict threshold, precision is perfect but almost nothing is recalled; as the threshold is lowered, recall increases at the cost of precision, since normal data increasingly gets flagged as anomalous. The exact definitions are:
+
+\[
+  \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \quad
+  \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}.
+\]
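+
+To make the effect of class imbalance concrete, consider a purely hypothetical example (the numbers are chosen for illustration and are not taken from our data): a test set with 1000 normal frames and 10 anomalous ones, and a detector that at some threshold flags 8 of the anomalies (TP $=8$) together with 100 normal frames (FP $=100$). The ROC coordinates look strong, $\text{FPR} = 100/1000 = 0.1$ and $\text{TPR} = 8/10 = 0.8$, because the large number of normal frames dilutes the false positives in the FPR denominator. The precision of $8/108 \approx 0.07$, however, reveals that more than $90\%$ of the raised alarms are false, which is exactly the failure mode the PRC exposes.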
+
+In our own evaluation, ROC's overestimation proved problematic enough that reporting it alone was not meaningful. Figure~\ref{fig:roc_vs_prc} visualizes this with a ROC and a PRC computed from the same data for both Isolation Forest and DeepSAD. While the two methods look similar in the ROC and reach similarly good ROC AUC values, the PRC shows that Isolation Forest performs considerably worse: it wrongly flags a large amount of normal data when trying to recall more anomalies, which the ROC hides. The PRC is more sensitive to this class imbalance, and its AUC (the AP) reflects it clearly, being much lower for Isolation Forest than for DeepSAD (ROC AUC 0.693 vs.\ 0.782, but AP 0.207 vs.\ 0.633 for Isolation Forest and DeepSAD, respectively).
+
+\figc{roc_vs_prc}{figures/setup_roc_vs_prc.png}{ROC and PRC computed from the same data (our own runs of DeepSAD and Isolation Forest). The ROC is not sensitive enough to the class imbalance and therefore does not reveal that Isolation Forest wrongly identifies a large amount of normal data as anomalous when trying to recall more true anomalies; the PRC makes this visible.}{width=.9\textwidth}
+
+To summarize PRC performance in a single number, we use \textbf{Average Precision (AP)}, the area under the PRC. AP captures the model's precision across all recall levels and provides a robust, threshold-independent score. It is computed as the step-wise sum
+\[
+  \text{AP} = \sum_{n} \left(R_n - R_{n-1}\right) P_n,
+\]
+where $P_n$ and $R_n$ denote the precision and recall at the $n$-th threshold.
+
+Due to the absence of reliable thresholds and the rarity of degraded data, traditional metrics such as accuracy, F1, and ROC AUC do not faithfully reflect anomaly detection quality in our case. PR curves and AP, in contrast, emphasize model behavior on the positive class and penalize false alarms appropriately. This makes them both theoretically and empirically more suitable for evaluating degradation scoring in imbalanced, unsupervised settings.
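+
+For concreteness, the following minimal sketch shows how both summary scores can be computed; it is illustrative only, using scikit-learn on synthetic scores (the class sizes and score distributions are placeholder assumptions, not our data), and is not part of our evaluation pipeline:
+
+\begin{verbatim}
+# Illustrative sketch only: ROC AUC vs. Average Precision (AP)
+# on a synthetic, heavily imbalanced score distribution.
+import numpy as np
+from sklearn.metrics import roc_auc_score, average_precision_score
+
+rng = np.random.default_rng(0)
+y_true = np.concatenate([np.zeros(1000), np.ones(10)])  # ~1% anomalies
+scores = np.concatenate([rng.normal(0.0, 1.0, 1000),    # normal frames
+                         rng.normal(2.0, 1.0, 10)])     # anomalous frames
+
+# ROC AUC typically looks strong here, while AP is far lower,
+# exposing the many false alarms that the ROC view hides.
+print("ROC AUC:", roc_auc_score(y_true, scores))
+print("AP     :", average_precision_score(y_true, scores))
+\end{verbatim}
+
+Both functions operate directly on the raw, continuous scores, so they apply unchanged to DeepSAD, Isolation Forest, and OC-SVM without choosing a decision threshold.
+
 \newsection{setup_experiments_environment}{Experiment Overview \& Computational Environment}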