In this blog post, we review the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", available at

  1. Abstract


In computer vision, attention is usually either applied alongside a CNN or used to replace some components of a CNN while keeping its overall structure in place. This work instead applies a pure Transformer directly to a sequence of image patches for the classification task.
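To make the idea of a "sequence of image patches" concrete, here is a minimal sketch of how an image could be cut into non-overlapping square patches, with each patch flattened into one vector (one "word" of the sequence). The function name `image_to_patches`, the nested-list image representation, and the toy 4x4 input are assumptions made for illustration and are not taken from the paper's code. The sketch covers only the patch-splitting step, not the Transformer itself.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Returns a list of patches in row-major order (left to right, top to bottom).
    Each patch is a flat list of patch_size * patch_size * C values.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            # Walk the pixels of this patch row by row and append their channels.
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])
            patches.append(patch)
    return patches


# Hypothetical toy input: a 4x4 image with 3 channels, pixel value = [row, col, 0].
toy_image = [[[r, c, 0] for c in range(4)] for r in range(4)]

# With 2x2 patches, the 4x4 image becomes a sequence of (4/2) * (4/2) = 4 patches,
# each holding 2 * 2 * 3 = 12 values.
toy_patches = image_to_patches(toy_image, patch_size=2)
```

The same arithmetic applies at the patch size named in the paper's title: an image whose height and width are both multiples of 16 yields (H / 16) * (W / 16) patches, so the sequence length grows with image resolution.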