Main contributions

A Simple Baseline Based on Transformer


Backbone

<aside> 💡 Extracts features from both the search region and the template.

</aside>

Encoder

<aside> 💡 The self-attention module inside the encoder learns the relationships among the input features.

</aside>
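
As a rough illustration of what "learning relationships among input features" means, here is a minimal, dependency-free sketch of single-head scaled dot-product self-attention. It assumes identity query/key/value projections (a real encoder learns these) and uses three made-up 2-D tokens standing in for backbone features; the name `self_attention` is hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (assumption for brevity).

    Every token acts as query, key, and value at once.
    Returns the updated tokens and the attention weight matrix.
    """
    d = len(tokens[0])
    weights = []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted mix of all input tokens.
    outputs = [
        [sum(w * v[i] for w, v in zip(row, tokens)) for i in range(d)]
        for row in weights
    ]
    return outputs, weights


# Toy features: tokens 0 and 1 are similar, token 2 is different.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
```

In this toy input, token 0 puts more weight on the similar token 1 than on the dissimilar token 2, which is the kind of pairwise relationship the encoder captures.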

Decoder

<aside> 💡 Learns query embeddings that are used to predict the spatial position of the target object.

</aside>
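
To make the role of a query embedding concrete, the sketch below lets one query vector attend over a toy 2x2 grid of encoder features. Everything here is an assumption for illustration: the projections are identity, the query is hand-picked rather than learned, and the position read-out (the attention-weighted mean of the token coordinates) is a hypothetical stand-in, not necessarily the prediction head used by the method in these notes.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, features):
    """One query vector attends over encoder features (identity projections)."""
    d = len(query)
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(d) for feat in features
    ]
    weights = softmax(scores)
    attended = [
        sum(w * feat[i] for w, feat in zip(weights, features)) for i in range(d)
    ]
    return attended, weights


def expected_position(weights, coords):
    """Hypothetical read-out: attention-weighted mean of the token coordinates."""
    x = sum(w * cx for w, (cx, _) in zip(weights, coords))
    y = sum(w * cy for w, (_, cy) in zip(weights, coords))
    return x, y


# Toy 2x2 grid of encoder features; only the token at (x=1, y=0) matches the query.
features = [[0.0, 0.0], [4.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
coords = [(0, 0), (1, 0), (0, 1), (1, 1)]
query = [4.0, 4.0]  # stands in for a learned query embedding

attended, weights = cross_attention(query, features)
position = expected_position(weights, coords)
```

Because the query matches only one token, nearly all attention mass lands on it and the read-out position falls very close to that token's grid coordinate.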