Program Metrics

Achievement of metrics is a performance indicator under IARPA research programs. IARPA has defined Video LINCS program metrics to evaluate effectiveness of the proposed solutions in achieving the stated program goals and objectives, and to determine whether satisfactory progress is being made.

TA-1 - Re-Identification (ReID)

For TA-1 – ReID, the goal is to re-identify the same object throughout a video collection. Ideally, all appearances of the same object should be declared to be the same object by Video LINCS systems, without any misses and without matches to different objects (false positives).

ReID performance will be evaluated using two metrics: \(IDF_1\) and HOTA [https://arxiv.org/abs/2009.07736].

1. \(IDF_1\) Measure

\(IDF_1\) (ID \(F_1\)-score) [https://arxiv.org/abs/1609.01775] is a widely used evaluation metric in multi-object tracking (MOT) that measures the accuracy of identifying and consistently tracking individual objects across video frames. It is the harmonic mean of ID precision and ID recall, which assesses how well the tracker maintains the correct identities of objects over time. Unlike traditional metrics (e.g., MOTA) that focus only on object detection or localization, \(IDF_1\) emphasizes identity preservation, making it particularly important for evaluating tracking performance in scenarios where maintaining consistent object IDs is critical.

To calculate \(IDF_1\), the \(TP\), \(FP\), and \(FN\) values can be extracted directly from the \(HOTA\) calculation and used to compute the \(IDF_1\) score. In this program, all \(HOTA\) metrics have been extended to include \(IDF_1\).
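As a minimal sketch of the harmonic-mean definition above, the following assumes the three counts are already available as integers. The function and argument names are illustrative and are not taken from any program tooling.

```python
def id_precision(tp: int, fp: int) -> float:
    """ID precision: fraction of predicted detections with the correct identity."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def id_recall(tp: int, fn: int) -> float:
    """ID recall: fraction of ground-truth detections with the correct identity."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def idf1(tp: int, fp: int, fn: int) -> float:
    """IDF1 = 2*TP / (2*TP + FP + FN), the harmonic mean of ID precision and ID recall."""
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there is nothing to score.
    return 2 * tp / denom if denom else 0.0
```

The closed form avoids computing precision and recall separately, and it stays defined when one of them is zero.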

2. HOTA (Higher Order Tracking Accuracy)

The HOTA metric is an evaluation paradigm for Multi-Object Tracking (MOT) designed to balance detection, association, and localization in a single unified score. Its components are:

  • Localization Accuracy: Measures the spatial alignment between predicted and ground-truth boxes.

  • Detection Accuracy (\(DetA_\alpha\)): Measures the tracker’s ability to find all objects and avoid false positives. It is calculated using the Jaccard Index (IoU) over the set of all detections.

  • Association Accuracy (\(AssA_\alpha\)): Measures the tracker’s ability to maintain correct identities over time by averaging temporal alignment between matched trajectories.

For a given localization threshold \(\alpha\) (typically IoU), \(HOTA_\alpha\) is the geometric mean of detection and association accuracy:

\[HOTA_\alpha = \sqrt{DetA_\alpha \cdot AssA_\alpha}\]

Detection accuracy is calculated using the Jaccard Index (IoU) formulation over the set of all detections across the entire sequence, where \(TP_\alpha\), \(FP_\alpha\), and \(FN_\alpha\) are the sets of true positive, false positive, and false negative detections at threshold \(\alpha\):

\[DetA_\alpha = \frac{|TP_\alpha|}{|TP_\alpha| + |FP_\alpha| + |FN_\alpha|}\]
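The detection accuracy formula can be sketched as a small function. It assumes the detection counts at a single threshold have already been obtained from matching; the name `det_a` is illustrative only.

```python
def det_a(tp: int, fp: int, fn: int) -> float:
    """Detection accuracy at one threshold: |TP| / (|TP| + |FP| + |FN|)."""
    denom = tp + fp + fn
    # Convention assumed here: return 0.0 when there are no detections at all.
    return tp / denom if denom else 0.0
```

For example, 70 true positives, 20 false positives, and 10 misses give a detection accuracy of 70 / 100 = 0.7.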

Association accuracy measures the tracker’s ability to maintain correct identities over time, expressed as the average temporal alignment between matched trajectories. It is defined by averaging the association similarity \(\mathcal{A}(c)\) over all true positive matches \(c \in TP_\alpha\):

\[AssA_\alpha = \frac{1}{|TP_\alpha|} \sum_{c \in TP_\alpha} \mathcal{A}(c)\]

The association similarity \(\mathcal{A}(c)\) for a match \(c\) is calculated using the counts of True Positive Associations (TPA), False Positive Associations (FPA), and False Negative Associations (FNA):

\[\mathcal{A}(c) = \frac{|TPA(c)|}{|TPA(c)| + |FPA(c)| + |FNA(c)|}\]
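The two association formulas above can be sketched as follows. This is a sketch under stated assumptions, not the program's evaluation code: it assumes that matching at one threshold has already been done, that each true positive is given as a `(gt_id, pr_id)` pair, and that FNA and FPA are counted as the remaining detections of the same ground-truth identity and the same predicted identity, respectively. All names are hypothetical.

```python
from collections import Counter


def ass_a(matches, gt_ids, pr_ids):
    """Association accuracy at one threshold.

    matches: list of (gt_id, pr_id), one entry per true positive match c.
    gt_ids:  ground-truth identity of every ground-truth detection (all frames).
    pr_ids:  predicted identity of every predicted detection (all frames).
    """
    if not matches:
        return 0.0
    pair_count = Counter(matches)  # |TPA(c)| for each (gt_id, pr_id) pair
    gt_count = Counter(gt_ids)     # total detections per ground-truth identity
    pr_count = Counter(pr_ids)     # total detections per predicted identity
    total = 0.0
    for g, p in matches:
        tpa = pair_count[(g, p)]
        fna = gt_count[g] - tpa    # same ground-truth identity, not matched to p
        fpa = pr_count[p] - tpa    # same predicted identity, not matched to g
        total += tpa / (tpa + fpa + fna)
    return total / len(matches)
```

As a usage example, a ground-truth track of four detections that the tracker splits into two predicted identities of two detections each yields \(\mathcal{A}(c) = 2 / (2 + 0 + 2) = 0.5\) for every match, so the association accuracy is 0.5.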

The final HOTA score is obtained by integrating \(HOTA_\alpha\) over the localization threshold, which is approximated by averaging over \(N = 19\) thresholds ranging from 0.05 to 0.95 in increments of 0.05:

\[HOTA = \int_0^1 HOTA_\alpha \, d\alpha \approx \frac{1}{N} \sum_{\alpha \in \{0.05, 0.10, \dots, 0.95\}} HOTA_\alpha\]
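A minimal sketch of the final averaging step, assuming detection and association accuracy have already been computed at each of the 19 thresholds and are passed as two sequences in threshold order. The names `ALPHAS`, `hota_alpha`, and `hota` are illustrative.

```python
import math

# The 19 localization thresholds: 0.05, 0.10, ..., 0.95.
ALPHAS = [round(0.05 * k, 2) for k in range(1, 20)]


def hota_alpha(det_a: float, ass_a: float) -> float:
    """Geometric mean of detection and association accuracy at one threshold."""
    return math.sqrt(det_a * ass_a)


def hota(det_a_per_alpha, ass_a_per_alpha) -> float:
    """Average HOTA_alpha over all thresholds in ALPHAS."""
    if not (len(det_a_per_alpha) == len(ass_a_per_alpha) == len(ALPHAS)):
        raise ValueError("expected one DetA and one AssA value per threshold")
    scores = [hota_alpha(d, a) for d, a in zip(det_a_per_alpha, ass_a_per_alpha)]
    return sum(scores) / len(scores)
```

Because of the geometric mean, a tracker cannot compensate for poor association with strong detection, or the reverse: if either term is zero at a threshold, \(HOTA_\alpha\) is zero at that threshold.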

TA-2 - Geo-Localization

For TA-2 – Geo-localization, ground-plane accuracy is measured by computing the position error in meters averaged over all object appearances. This metric evaluates the precision of 3D coordinates (Longitude, Latitude, and Altitude).

Position Error Calculation

The position error is calculated as the Euclidean (\(L_2\)) norm of the difference vector between the system prediction and the ground truth:

\[\text{Error} = ||(\text{long, lat, alt})_{\text{Sys}} - (\text{long, lat, alt})_{\text{GT}}||_2\]

Where:

  • \(Sys\): The predicted coordinates from the system.

  • \(GT\): The corresponding ground-truth coordinates.

  • \(||\cdot||_2\): Represents the square root of the sum of squared differences across all dimensions.
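The position error and its average over object appearances can be sketched as below. One assumption is made loudly here: for the result to be in meters, the three coordinate components are taken to be already expressed in a common metric frame. The text does not specify how longitude and latitude are converted, so that step is left out. The function names are hypothetical.

```python
import math


def position_error(sys_xyz, gt_xyz) -> float:
    """Euclidean (L2) distance between a predicted and a ground-truth 3D point."""
    return math.dist(sys_xyz, gt_xyz)


def mean_position_error(sys_points, gt_points) -> float:
    """Position error averaged over all object appearances."""
    if len(sys_points) != len(gt_points) or not sys_points:
        raise ValueError("need equally sized, non-empty point lists")
    errors = [position_error(s, g) for s, g in zip(sys_points, gt_points)]
    return sum(errors) / len(errors)
```

For instance, a prediction offset by 3 m and 4 m along two axes, with no altitude error, has a position error of 5 m.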

Similarity Score Transformation

\(L_2\) distance tells us how far apart two points are. However, its range is \([0, \infty)\), which is unbounded. To obtain a normalized similarity score in the \((0, 1]\) range, we convert the distance \(d\) (the position error in meters) into a similarity score \(S\) using exponential decay:

\[S = e^{-\frac{d}{10}}\]

The decay is gradual and makes intuitive sense for real-world, human-scale spatial distances. The negative exponential converts an unbounded distance in \([0, \infty)\) into a bounded similarity in \((0, 1]\), with a distance of 0 translating into a similarity of 1. The \(\frac{d}{10}\) term normalizes the distance values so that human-scale distances correspond to reasonable similarity values. The constant divisor (10 here) defines the “strictness” of the evaluation; smaller values penalize small distance errors more heavily.
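A minimal sketch of the transformation, with the divisor exposed as a parameter so its effect on strictness can be explored. The parameter name `scale` and its default of 10 follow the formula above; the function name is illustrative.

```python
import math


def similarity(distance_m: float, scale: float = 10.0) -> float:
    """Map a non-negative distance to a similarity in (0, 1] via exp(-d / scale)."""
    if distance_m < 0 or scale <= 0:
        raise ValueError("distance must be >= 0 and scale must be > 0")
    return math.exp(-distance_m / scale)
```

With the default divisor, a distance of 0 m gives a similarity of 1, and a distance of 10 m gives \(e^{-1} \approx 0.368\). Halving the divisor to 5 makes the same 10 m error score \(e^{-2} \approx 0.135\).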