Main contributions
- Propose a new Transformer architecture dedicated to visual tracking.
- The whole method is end-to-end and does not need any postprocessing steps such as the cosine window or bounding box smoothing, which largely simplifies existing tracking pipelines.
- The proposed trackers achieve state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed.
A Simple Baseline Based on Transformer
![Untitled](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/d73459ed-1629-4fab-a072-67886bf50628/Untitled.png)
- The baseline method only considers spatial information and achieves impressive performance.
Backbone
<aside>
💡 Feature extraction from search region and template
</aside>
- Used a vanilla ResNet for feature extraction (a minimal sketch follows this list)
- The last stage and the fully-connected layers are removed
- The input of the backbone is a pair of images
- Target object: $z \in \mathbb{R}^{3\times H_z \times W_z}$
- Current frame: $x\in \mathbb{R}^{3\times H_x \times W_x}$
- After passing through the backbone, the template $z$ and the search image $x$ are mapped to two feature maps
- $f_z\in \mathbb{R}^{C\times {H_z \over s}\times {W_z \over s}}$, $f_x\in \mathbb{R}^{C\times {H_x \over s}\times {W_x \over s}}$
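A minimal sketch of this feature-extraction step, assuming a torchvision ResNet-50 truncated after `layer3` (so $C = 1024$ and stride $s = 16$); the class name `TrackerBackbone` and the input sizes in the example are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torchvision

class TrackerBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weights for the sketch; a pretrained ResNet would be used in practice.
        resnet = torchvision.models.resnet50(weights=None)
        # Keep conv1 ... layer3; drop the last stage (layer4), avgpool and fc.
        self.body = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )

    def forward(self, z, x):
        # z: (B, 3, Hz, Wz) template, x: (B, 3, Hx, Wx) search region
        fz = self.body(z)   # (B, C, Hz/s, Wz/s) with C = 1024, s = 16
        fx = self.body(x)   # (B, C, Hx/s, Wx/s)
        return fz, fx

# Example: 128x128 template and 320x320 search region (sizes are illustrative)
backbone = TrackerBackbone()
fz, fx = backbone(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 320, 320))
print(fz.shape, fx.shape)  # torch.Size([1, 1024, 8, 8]) torch.Size([1, 1024, 20, 20])
```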
Encoder
<aside>
💡 The self-attention module inside the encoder learns the relationships among all elements of the input features.
</aside>
- Preprocessing before feeding into the encoder (see the sketch after this list)
    - Used a bottleneck layer to reduce the channel number from $C$ to $d$
    - The feature maps are flattened and concatenated along the spatial dimension.
        → Shape after flatten + concat: $\left({H_z \over s}{W_z \over s}+{H_x \over s}{W_x \over s}\right) \times d$
    - Add sinusoidal positional embeddings because the original Transformer is permutation-invariant.
- The encoder captures the feature dependencies among all elements in the sequence and reinforces the original features with global contextual information.
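A minimal sketch of the bottleneck, flatten/concat, and encoder steps, assuming PyTorch's `nn.TransformerEncoder` with $d = 256$ and 6 layers. The simplified 1D sinusoidal embedding added once to the input is used here purely for illustration; the names `EncoderPart` and `sinusoidal_embedding` are assumptions, not the paper's code.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(length, d):
    # Simplified 1D sinusoidal positional embedding over the flattened sequence.
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)                 # (L, 1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))
    emb = torch.zeros(length, d)
    emb[:, 0::2] = torch.sin(pos * div)
    emb[:, 1::2] = torch.cos(pos * div)
    return emb                                                                   # (L, d)

class EncoderPart(nn.Module):
    def __init__(self, c_in=1024, d=256, n_layers=6, n_heads=8):
        super().__init__()
        # 1x1 conv bottleneck: reduce the channel number from C to d
        self.bottleneck = nn.Conv2d(c_in, d, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, fz, fx):
        # fz: (B, C, Hz/s, Wz/s) template features, fx: (B, C, Hx/s, Wx/s) search features
        tz = self.bottleneck(fz).flatten(2).transpose(1, 2)   # (B, Lz, d)
        tx = self.bottleneck(fx).flatten(2).transpose(1, 2)   # (B, Lx, d)
        seq = torch.cat([tz, tx], dim=1)                      # (B, Lz + Lx, d)
        pos = sinusoidal_embedding(seq.size(1), seq.size(2)).to(seq)
        # Self-attention mixes template and search-region features with global context.
        return self.encoder(seq + pos)                        # (B, Lz + Lx, d)

# Example with feature maps shaped like the backbone sketch above
encoder = EncoderPart()
memory = encoder(torch.randn(1, 1024, 8, 8), torch.randn(1, 1024, 20, 20))
print(memory.shape)  # torch.Size([1, 464, 256])
```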
Decoder
<aside>
💡 Learning a target query embedding to predict the spatial position of the target object.
</aside>
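A minimal sketch of the decoder, assuming a single learned target query that attends to the encoder output via PyTorch's `nn.TransformerDecoder`; the prediction head that turns the query output into a bounding box is omitted, and the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class DecoderPart(nn.Module):
    def __init__(self, d=256, n_layers=6, n_heads=8):
        super().__init__()
        # One learned query embedding that attends to the encoder memory and
        # gathers information about the target's spatial position.
        self.query = nn.Embedding(1, d)
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, memory):
        # memory: (B, Lz + Lx, d) output of the encoder
        B = memory.size(0)
        tgt = self.query.weight.unsqueeze(0).expand(B, -1, -1)  # (B, 1, d)
        return self.decoder(tgt, memory)                        # (B, 1, d)

# Example: memory shaped like the encoder output from the sketch above
decoder = DecoderPart()
out = decoder(torch.randn(1, 464, 256))
print(out.shape)  # torch.Size([1, 1, 256]); a box-prediction head would consume this
```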