Main contributions
- Propose a new Transformer architecture dedicated to visual tracking.
- The whole method is end-to-end and does not need any postprocessing steps such as the cosine window or bounding box smoothing, which largely simplifies existing tracking pipelines.
- The proposed trackers achieve state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed.
A Simple Baseline Based on Transformer
![Untitled](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/d73459ed-1629-4fab-a072-67886bf50628/Untitled.png)
- The baseline method only considers spatial information and achieves impressive performance.
Backbone
<aside>
💡 Feature extraction from search region and template
</aside>
- Used a vanilla ResNet for feature extraction (a minimal sketch follows this list)
- The last stage and the fully-connected layers are removed
- The input of the backbone is a pair of images
- Target object: $z \in \mathbb{R}^{3\times H_z \times W_z}$
- Current frame: $x\in \mathbb{R}^{3\times H_x \times W_x}$
- After passing through the backbone, the template $z$ and the search image $x$ are mapped to two feature maps
- $f_z\in \mathbb{R}^{C\times {H_z \over s}\times {W_z \over s}}$, $f_x\in \mathbb{R}^{C\times {H_x \over s}\times {W_x \over s}}$
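A minimal sketch of this feature-extraction step, assuming a torchvision ResNet-50 truncated after `layer3` (so $C = 1024$ and stride $s = 16$); the class name `TrackerBackbone` and the input sizes in the example are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torchvision

class TrackerBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weights for the sketch; a pretrained ResNet would be used in practice.
        resnet = torchvision.models.resnet50(weights=None)
        # Keep conv1 ... layer3; drop the last stage (layer4), avgpool and fc.
        self.body = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )

    def forward(self, z, x):
        # z: (B, 3, Hz, Wz) template, x: (B, 3, Hx, Wx) search region
        fz = self.body(z)   # (B, C, Hz/s, Wz/s) with C = 1024, s = 16
        fx = self.body(x)   # (B, C, Hx/s, Wx/s)
        return fz, fx

# Example: 128x128 template and 320x320 search region (sizes are illustrative)
backbone = TrackerBackbone()
fz, fx = backbone(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 320, 320))
print(fz.shape, fx.shape)  # torch.Size([1, 1024, 8, 8]) torch.Size([1, 1024, 20, 20])
```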
Encoder
<aside>
💡 The self-attention module inside the encoder learns the relationships among all elements of the input features.
</aside>
- Preprocessing before feeding into the encoder (see the sketch after this list)
    - Used a bottleneck layer to reduce the channel number from $C$ to $d$
    - The feature maps are flattened and concatenated along the spatial dimension.
        → Shape after flatten + concat: $\left({H_z \over s}{W_z \over s}+{H_x \over s}{W_x \over s}\right) \times d$
    - Add sinusoidal positional embeddings because the original Transformer is permutation-invariant.
- The encoder captures the feature dependencies among all elements in the sequence and reinforces the original features with global contextual information.
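A minimal sketch of the bottleneck, flatten/concat, and encoder steps, assuming PyTorch's `nn.TransformerEncoder` with $d = 256$ and 6 layers. The simplified 1D sinusoidal embedding added once to the input is used here purely for illustration; the names `EncoderPart` and `sinusoidal_embedding` are assumptions, not the paper's code.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(length, d):
    # Simplified 1D sinusoidal positional embedding over the flattened sequence.
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)                 # (L, 1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))
    emb = torch.zeros(length, d)
    emb[:, 0::2] = torch.sin(pos * div)
    emb[:, 1::2] = torch.cos(pos * div)
    return emb                                                                   # (L, d)

class EncoderPart(nn.Module):
    def __init__(self, c_in=1024, d=256, n_layers=6, n_heads=8):
        super().__init__()
        # 1x1 conv bottleneck: reduce the channel number from C to d
        self.bottleneck = nn.Conv2d(c_in, d, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, fz, fx):
        # fz: (B, C, Hz/s, Wz/s) template features, fx: (B, C, Hx/s, Wx/s) search features
        tz = self.bottleneck(fz).flatten(2).transpose(1, 2)   # (B, Lz, d)
        tx = self.bottleneck(fx).flatten(2).transpose(1, 2)   # (B, Lx, d)
        seq = torch.cat([tz, tx], dim=1)                      # (B, Lz + Lx, d)
        pos = sinusoidal_embedding(seq.size(1), seq.size(2)).to(seq)
        # Self-attention mixes template and search-region features with global context.
        return self.encoder(seq + pos)                        # (B, Lz + Lx, d)

# Example with feature maps shaped like the backbone sketch above
encoder = EncoderPart()
memory = encoder(torch.randn(1, 1024, 8, 8), torch.randn(1, 1024, 20, 20))
print(memory.shape)  # torch.Size([1, 464, 256])
```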
Decoder
<aside>
💡 Learning a target query embedding to predict the spatial position of the target object.
</aside>
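A minimal sketch of the decoder, assuming a single learned target query that attends to the encoder output via PyTorch's `nn.TransformerDecoder`; the prediction head that turns the query output into a bounding box is omitted, and the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class DecoderPart(nn.Module):
    def __init__(self, d=256, n_layers=6, n_heads=8):
        super().__init__()
        # One learned query embedding that attends to the encoder memory and
        # gathers information about the target's spatial position.
        self.query = nn.Embedding(1, d)
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, memory):
        # memory: (B, Lz + Lx, d) output of the encoder
        B = memory.size(0)
        tgt = self.query.weight.unsqueeze(0).expand(B, -1, -1)  # (B, 1, d)
        return self.decoder(tgt, memory)                        # (B, 1, d)

# Example: memory shaped like the encoder output from the sketch above
decoder = DecoderPart()
out = decoder(torch.randn(1, 464, 256))
print(out.shape)  # torch.Size([1, 1, 256]); a box-prediction head would consume this
```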