[Submitted on 9 Mar 2021 (v1), last revised 30 Jun 2021 (this version, v2)]
Abstract: We investigate the capability of a transformer pretrained on natural language
to generalize to other modalities with minimal finetuning — in particular,
without finetuning of the self-attention and feedforward layers of the residual
blocks. We co
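
To make the setup concrete, here is a minimal sketch of what freezing the self-attention and feedforward layers of a pretrained language model might look like, assuming a Hugging Face GPT-2 backbone. The checkpoint name, the added input projection, the classification head, and the learning rate are all illustrative assumptions, not details given in this excerpt.

```python
import torch
from transformers import GPT2Model

# Assumption: use GPT-2 as the pretrained natural-language backbone.
backbone = GPT2Model.from_pretrained("gpt2")

# Freeze the self-attention and feedforward (MLP) sublayers of every
# residual block, per the abstract's "without finetuning" condition.
for block in backbone.h:
    for param in block.attn.parameters():
        param.requires_grad = False
    for param in block.mlp.parameters():
        param.requires_grad = False

# Hypothetical modality-specific layers: an input projection for the new
# modality and a small classification head (sizes chosen arbitrarily here).
input_proj = torch.nn.Linear(16, backbone.config.n_embd)
output_head = torch.nn.Linear(backbone.config.n_embd, 10)

# Only the still-trainable backbone parameters (e.g. embeddings, layer norms)
# and the new layers are passed to the optimizer.
trainable = [p for p in backbone.parameters() if p.requires_grad]
trainable += list(input_proj.parameters()) + list(output_head.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

This sketch only reflects which parts are frozen; exactly which remaining parameters are finetuned is not specified in the excerpt above.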