Skip to primary navigation
Skip to content
Skip to footer
Jason Zhao's Blog
Categories
Tags
About
Toggle search
Toggle menu
An Image is Worth 32 Tokens for Reconstruction and Generation (Titok)
Contribution
First to propose using 1D tokenization
previous work (e.g. CLIP, DINO…) output 1D tokens for classification but yet still capture task specific features
Method
ViT encoder
Vector Quantizer
ViT decoder
Enter your search term...