[Submitted on 9 Mar 2021 (v1), last revised 30 Jun 2021 (this version, v2)]

Abstract: We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We co

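As a rough illustration of the setup the abstract describes (not the authors' code), the sketch below loads a GPT-2 model pretrained on natural language and freezes the self-attention and feedforward weights inside each residual block, so that only a small set of parameters remains trainable for a new modality. The input projection, classifier head, patch dimension, and class count are hypothetical placeholders chosen here for illustration.

```python
# Minimal sketch, assuming the HuggingFace `transformers` package and PyTorch.
import torch
import torch.nn as nn
from transformers import GPT2Model

gpt2 = GPT2Model.from_pretrained("gpt2")  # transformer pretrained on natural language

# Freeze the self-attention and feedforward (MLP) sub-layers of every residual block,
# mirroring the "without finetuning of the self-attention and feedforward layers" setup.
for block in gpt2.h:
    for p in block.attn.parameters():
        p.requires_grad = False
    for p in block.mlp.parameters():
        p.requires_grad = False

# Hypothetical task-specific layers for a non-language modality:
# a linear projection of input tokens (e.g. flattened image patches) into the
# transformer's embedding space, and a linear classification head.
input_proj = nn.Linear(64, gpt2.config.n_embd)   # 64 = example patch dimension
classifier = nn.Linear(gpt2.config.n_embd, 10)   # 10 = example number of classes

# Only the still-trainable transformer parameters plus the new layers are optimized.
trainable = [p for p in gpt2.parameters() if p.requires_grad]
trainable += list(input_proj.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-3)

# Forward pass on a dummy batch: 16 sequences of 8 "patch" tokens.
x = torch.randn(16, 8, 64)
hidden = gpt2(inputs_embeds=input_proj(x)).last_hidden_state
logits = classifier(hidden[:, -1])  # classify from the final token's hidden state
```

Which parameters stay trainable in the actual paper (e.g. layer norms or positional embeddings) is not stated in this excerpt; the sketch only fixes the attention and feedforward weights, as the abstract specifies.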