arXiv:2303.14123v1 [cs.CV] 24 Mar 2023

Semantic Prompt for Few-Shot Image Recognition

Wentao Chen1,2*, Chenyang Si3*, Zhang Zhang2,4, Liang Wang2,4, Zilei Wang1, Tieniu Tan1,2,4
1 University of Science and Technology of China
2 Center for Research on Intelligent Perception and Computing, NLPR, CASIA
3 Nanyang Technological University, Singapore
4 University of Chinese Academy of Sciences
wentao.chen@cripac.ia.ac.cn, chenyang.si.mail@gmail.com, zzhang@nlpr.ia.ac.cn
* Equal contribution

Abstract

Few-shot learning is a challenging problem because only a few examples are provided to recognize a new class. Several recent studies exploit additional semantic information, e.g., text embeddings of class names, to address the issue of rare samples by combining semantic prototypes with visual prototypes. However, these methods still suffer from spurious visual features learned from the rare support samples, resulting in limited benefits. In this paper, we propose a novel Semantic Prompt (SP) approach for few-shot learning. Instead of naively exploiting semantic information to remedy classifiers, we explore leveraging semantic information as prompts to tune the visual feature extraction network adaptively. Specifically, we design two complementary mechanisms to insert semantic prompts into the feature extractor: one enables the interaction between semantic prompts and patch embeddings along the spatial dimension via self-attention, and the other supplements visual features with the transformed semantic prompts along the channel dimension. By combining these two mechanisms, the feature extractor attends better to class-specific features and obtains more generalized image representations from merely a few support samples. Through extensive experiments on four datasets, the proposed approach achieves promising results, improving the 1-shot learning accuracy by 3.67% on average.

1. Introduction

Few-shot learning (FSL) [21] is a fundamental and challenging task that remains largely unsolved, as it aims to recognize a new class from only a few samples. To address this problem, most effective FSL approaches leverage prior knowledge learned from a large labeled base dataset and encode it either as a set of initial network parameters [12, 37, 42] or as a fixed embedding function shared by all classes [16, 45, 46, 49]. As the labeled images of novel classes are scarce, a straightforward alternative is to use auxiliary information from other modalities, e.g., natural language, to assist in learning new concepts, which has been extensively studied in zero-shot learning [13, 26, 40, 43]. These methods usually use textual embeddings directly as the image classifiers for novel classes.

Figure 1. (Panels: the input image of a 'unicycle', its attention map, and the semantic prompt-guided feature extractor f with a text encoder g fed "A unicycle is a vehicle with only one wheel...".) Given only one image of a new class 'unicycle', the feature extractor is easily confused by spurious features, such as the rider on the unicycle, and fails to obtain generalized image representations for the new class. In this paper, we propose Semantic Prompt, a new method to condition the feature extraction on rich semantic prior knowledge, so that the feature extractor captures the intrinsic class-specific features of the novel class.
Following this idea, a recent FSL study [52] proposes to infer textual prototypes from class names and combine them with the visual prototypes (i.e., classifiers) extracted from the rare support images. Others [32, 53] improve on this work by introducing more sophisticated textual prototype predictors (e.g., a Graph Convolutional Network) or by producing more accurate textual prototypes with the help of large-scale pre-trained language models. In spite of their success, most of the above methods, which directly infer class prototypes from textual features, ignore the information gap between textual and visual features. Specifically, the textual features may encode the semantic relationship between a novel class and known classes, but they fail to provide the exact discriminative visual features of the new class because they lack interaction with the underlying visual representations. As a result, the rich semantic information yields only limited benefit for recognizing novel classes when it is directly injected into classifiers. Moreover, with only limited support images, the learned visual features still suffer from spurious features, such as background clutter, and struggle to produce an accurate class prototype. For example, as illustrated in Figure 1, given one support image of a novel class 'unicycle', the feature extractor may capture image features containing both unicycles and other distractors, like riders and tile roofs, and fail to recognize the unicycle in other environments. In fact, the human perceptual system exhibits a unique mechanism, called cognitive penetrability [30], which uses linguistic prior knowledge to tune ongoing visual perceptual processing toward category-relevant stimulus features, promoting the learning of novel objects. Hence, it is necessary to develop a new architecture that effectively leverages textual information to remedy the defective representations caused by rare samples.

In this paper, we propose Semantic Prompt, a novel approach that leverages textual information of class names to significantly improve the representation ability of visual features for few-shot learning. Instead of directly inferring prototypes from textual features, we explore leveraging the textual features as semantic prompts to adaptively tune the feature extraction network for the rare support samples. As shown in Figure 1, with the guidance of semantic prompts, the feature extractor is expected to capture the intrinsic class-specific features of the novel class rather than other background clutter. Moreover, the advent of large-scale training has produced a cornucopia of powerful Natural Language Processing (NLP) models, such as BERT [9] and GPT [36], which make it straightforward to extract rich textual information from class names. Through the interaction between semantic prompts and visual features, such semantically rich representations have strong potential to provide the feature extractor with additional discriminative visual features about the new class, and subsequently to produce more generalized class prototypes. To condition the visual feature extraction on semantic prompts, we propose two complementary mechanisms that inject semantic information into the feature extractor, allowing the interaction between semantic prompts and visual features along the spatial and the channel dimensions, respectively.
Specifically, to facilitate the interaction along the spatial dimension, we extend the image patch sequence with semantic prompts and feed them into a Transformer encoder. Through self-attention layers, the semantic prompts can inform the feature extractor to attend to the class-specific features while suppressing other distractors. For the interaction along the channel dimension, we first concatenate the semantic prompts with the visual context extracted from all patches, and then feed them into an MLP module. The resulting feature vector is added to each patch token to modulate and augment the visual features channel by channel. By combining the two interaction mechanisms, the proposed Semantic Prompt (SP) approach can effectively leverage the textual information in class names to boost FSL. Through comprehensive experiments on four benchmarks, SP delivers consistent performance improvements across different types of text encoders and architecture designs, demonstrating its strong generality for the FSL problem.

In summary, our contributions are three-fold:

• We propose a novel Semantic Prompt approach to leveraging textual information in class names for few-shot image recognition. It is inspired by the top-down cognitive penetrability effect in human perception and adaptively tunes the feature extraction to class-specific features according to the semantic prompts.

• To condition visual feature extraction on semantic prompts, we propose two complementary mechanisms that inject semantic prompts into the visual feature extractor, allowing interaction along the spatial and the channel dimensions, respectively.

• The proposed method achieves remarkable performance on four FSL benchmarks, improving the FSL accuracy by 3.67% on average under the challenging 1-shot setting.

2. Related work

Few-shot learning. FSL aims to recognize novel classes given only a few examples for each class. Previous work usually adopts a meta-learning paradigm, in which a learner is trained on a sequence of few-shot training tasks (named episodes) sampled from a large base dataset in order to rapidly adapt to unseen testing tasks. In particular, optimization-based methods [12, 37, 42] aim to learn a set of optimal initial parameters shared by all tasks with fast adaptation ability. Metric learning-based methods [16, 45, 46, 49] learn a fixed embedding function, which maps input images into a low-dimensional embedding space and classifies unlabeled queries according to their distances to the support samples, e.g., Euclidean distance [45], cosine similarity [31], and Earth Mover's Distance [56].

Few-shot learning with language. To leverage additional information from other modalities (especially language) to help recognize novel classes, a line of recent studies [3, 24, 32, 52] proposes to integrate both visual features and auxiliary text features to represent a novel class. For example, Xing et al. [52] propose an adaptive fusion mechanism to combine a visual prototype with a semantic prototype obtained from the word embedding of the class label. Peng et al. [32] adopt a Graph Convolutional Network [58] as the predictor to incorporate additional knowledge from a knowledge graph. Yan et al. [54] propose a word vector-guided attention mechanism to obtain label prototypes for the multi-label few-shot learning problem.
Different from previous work that leverages semantic information at the level of classifiers or class prototypes, we explore the auxiliary information as a kind of semantic prompt to enhance the feature extraction for the limited support samples.

Transformer and prompt-based learning. The Transformer is a general network architecture for NLP tasks [5, 9, 36, 55], and has also demonstrated great potential for computer vision tasks [11, 28, 44, 59]. Specifically, Dosovitskiy et al. [11] propose a simple Vision Transformer (ViT) architecture that regards image patches as a sequence and feeds them into a Transformer encoder to extract visual features. Due to the limited inductive bias in its architecture design, the Transformer usually requires a lot of data to learn a new task. To address this problem, prompt-based methods [5, 34] have been proposed to adapt a pre-trained language model to downstream tasks in a data-efficient way. For example, Brown et al. [5] wrap the input sentence with several hand-crafted prompt words, which inform the model of the task prior knowledge and steer the model's behavior toward the desired mode. Other studies [23, 25, 57] propose to replace the discrete prompt words with continuous prompt vectors, which are easier to optimize. Recently, Tsimpoukelli et al. [48] propose a cross-modal prompt approach, which regards image features as prompts for language inputs to perform multimodal few-shot learning. In this paper, we propose to regard textual features as semantic prompts for image inputs, which can tune the ongoing visual feature extraction toward class-specific features and facilitate learning novel classes from few examples. To the best of our knowledge, this is the first work to adopt semantic features as prompts to tune visual feature extractors for few-shot learning.

3. Problem formulation

The FSL problem is usually defined as an N-way K-shot classification task, where a model should classify a query sample $x^q$ from the query set $Q$ into one of $N$ classes in $C_{novel}$, based on a few labeled examples $\{(x_i^s, y_i^s)\}_{i=1}^{N \times K}$ from the support set $S$. Since it is very difficult to train a model from scratch with the small support set $S$, a large labeled dataset $D_{base}$ is provided to pre-train the model before performing few-shot learning. Previous work usually adopts a meta-training strategy [49] to split the base dataset into multiple N-way K-shot episodes. Each episode also contains a support set and a query set, mimicking the few-shot learning problem encountered during testing. Note that the base classes $C_{base}$ do not overlap with the novel classes, i.e., $C_{base} \cap C_{novel} = \emptyset$. Therefore, the model is expected to acquire the ability to generalize to unseen classes after meta-training.

Variant: In most previous work, the image label $y$ is represented as a one-hot vector, e.g., $y = [0, 1, 0, 0, ...]$. However, this representation erases the semantic relationships among object concepts and ignores the valuable linguistic information contained in the textual labels. In this paper, we retain the text labels (e.g., 'cat', 'dog') besides the one-hot labels in order to extract semantics from them. We denote by $y^{text}$ the text label to distinguish it from the one-hot label $y$.
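To make the episode structure concrete, the following is a minimal sketch (not the authors' code) of how an N-way K-shot episode with a query set could be sampled from a labeled pool while keeping the text label alongside the integer label, as described above. The function name `sample_episode` and the `(image, class_name)` data layout are illustrative assumptions.

```python
# Illustrative N-way K-shot episode sampling; not the released implementation.
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15):
    """dataset: list of (image, class_name) pairs drawn from C_base or C_novel."""
    by_class = defaultdict(list)
    for image, class_name in dataset:
        by_class[class_name].append(image)

    classes = random.sample(list(by_class.keys()), n_way)
    support, query = [], []
    for label, name in enumerate(classes):
        images = random.sample(by_class[name], k_shot + n_query)
        # keep both the integer label y and the text label y_text
        support += [(img, label, name) for img in images[:k_shot]]
        query += [(img, label, name) for img in images[k_shot:]]
    return support, query
```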
4. Method

Following [6], our approach consists of two training stages. In the first stage, we pre-train a feature extractor f by classifying all images in the base set $D_{base}$. In the second stage, we fine-tune f with Semantic Prompt (SP) under the meta-learning paradigm, such that f acquires the ability to extract generalized, class-relevant visual features in data-scarce scenarios.

4.1. Pre-training

Learning a general feature extractor is the key to transferring knowledge to downstream learning tasks [15, 19, 35], including few-shot learning [47]. Given the labeled base dataset $D_{base}$, we adopt a simple supervised learning paradigm to learn the feature extractor. A linear classification head $[\bm{W}, \bm{b}]$ is added on top of the feature extractor, mapping the input feature vector $f(\bm{x})$ into one of the base classes. We jointly train the feature extractor and the classification head by minimizing the standard cross-entropy loss:

\mathcal {L}_{pre} = \frac {1}{|D_{base}|} \sum _{(\bm {x},y)\in D_{base}} - \log \frac {\exp (\bm {W}_y^T f(\bm {x}) + \bm {b}_y)}{\sum _i \exp (\bm {W}_i^T f(\bm {x}) + \bm {b}_i)}, (1)

where $\bm{W}_i$, $\bm{b}_i$ denote the classifier weight and the bias for class $i$.

Backbone: To facilitate the subsequent interaction between visual features and semantic prompts, we adopt the Vision Transformer as the image feature extractor f. Specifically, an input image $\bm{x} \in \mathbb{R}^{H \times W \times C}$ is first divided into a sequence of M image patches $X = \{\bm{x}_p^1, \bm{x}_p^2, ..., \bm{x}_p^M\}$, where $\bm{x}_p^i \in \mathbb{R}^{P \times P \times C}$ is an image patch and P is the patch size. Each patch is then mapped into an embedding vector, to which a learnable position embedding is added. The pre-processed image patches for the Transformer input can be written as $\bm{Z}_0 = [\bm{z}_0^1, \bm{z}_0^2, ..., \bm{z}_0^M]$, where $\bm{z}_0^i \in \mathbb{R}^{C_z}$ is the patch token at position i and $C_z$ is the number of channels of each token.

The patch tokens are fed into L Transformer layers to extract visual features, each of which consists of multi-head self-attention (MSA), an MLP block, Layernorm (LN), and residual connections. (Please refer to the appendix for more details.) At the top layer L, we average all embedding vectors in the sequence to obtain the extracted image feature:

f(\bm {x}) = \frac {1}{M} \sum _{i=1}^M \bm {z}_L^i, \label {eq:pooling} (2)

where $\bm{z}_L^i$ is the i-th embedding vector at layer L. Note that self-attention has quadratic computation cost with respect to the sequence length. To reduce this cost, we adopt the Visformer [7], a variant of the original ViT [11], in our implementation, which replaces the first seven Transformer layers with convolutional blocks and adopts pooling between stages to reduce the sequence length.
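The pre-training stage thus reduces to standard supervised classification. Below is a hedged PyTorch sketch of Eq. 1 together with the mean pooling of Eq. 2, assuming a backbone that returns a sequence of patch tokens of shape (B, M, C_z); `PretrainModel` and `pretrain_step` are illustrative names, not the released implementation.

```python
# Sketch of the pre-training objective (Eq. 1) and mean-pooled feature (Eq. 2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PretrainModel(nn.Module):
    def __init__(self, backbone, feat_dim, num_base_classes):
        super().__init__()
        self.backbone = backbone                             # f: image -> (B, M, C_z)
        self.head = nn.Linear(feat_dim, num_base_classes)    # linear head [W, b]

    def forward(self, images):
        tokens = self.backbone(images)                       # (B, M, C_z)
        feats = tokens.mean(dim=1)                           # Eq. 2: average over patch tokens
        return feats, self.head(feats)

def pretrain_step(model, images, labels, optimizer):
    _, logits = model(images)
    loss = F.cross_entropy(logits, labels)                   # Eq. 1: standard cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```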
4.2. Semantic Prompt

Figure 2. Framework of the proposed Semantic Prompt approach. (Panels: a pre-trained text encoder maps "A photo of a unicycle." to a prompt vector, which interacts with the patch tokens through spatial interaction, i.e., a projector followed by dot-product attention and a weighted sum inside the Transformer layers, and through channel interaction, i.e., an MLP over the averaged patch tokens concatenated with the projected prompt.) The support image is split into small patches and fed into Transformer layers to extract visual features, which, however, may contain both class-specific features and other clutter. To address this problem, we leverage textual features extracted from class names as semantic prompts to adaptively tune the visual feature extraction. The semantic prompts can interact with visual features along the spatial and the channel dimensions, and guide the feature extractor to capture the intrinsic discriminative features of the new class.

After pre-training on the base dataset, the feature extractor f can extract substantial visual features from input images. However, due to the semantic shift between the base dataset and novel classes, the feature extractor is limited in its ability to generalize to novel concepts with only a few labeled examples, especially when spurious correlations appear in novel class images [3, 50]. For example, given an image of an unseen bird standing in a tree, the model may use both bird features and other visual features (e.g., leaves, twigs) to represent the concept of the bird, and fail to recognize the bird in other environments. To mitigate this problem, we explore additional semantic information as prompts to guide the visual feature extractor to obtain intrinsic and discriminative class prototypes from rare support samples, so that query images can be classified easily according to their distances to these prototypes. Specifically, textual class names are adopted as prior knowledge for novel classes, due to their strong ability to describe semantics. Moreover, we use NLP models with large-scale pre-training [33, 35, 38] to extract textual features; the prior knowledge accumulated in these pre-trained models benefits textual feature extraction from class names.

To accommodate the model to semantic prompts, we adopt the meta-training strategy [49] to fine-tune the feature extractor together with the semantic prompts on a series of training episodes. The framework of our approach is illustrated in Figure 2. Specifically, given a support image $\bm{x}^s$ in a training episode, we feed its class name $y^{text}$ into a pre-trained language model $g(\cdot)$ to extract the semantic feature $g(y^{text})$. The semantic feature is used to modulate the feature extraction for the rare support samples. We denote by $f_g(\bm{x}^s) = f(\bm{x}^s \,|\, g(y^{text}))$ the conditional feature extraction process, which is described in the following sections. The obtained support features are averaged within each class to compute class prototypes. Let $\bm{p}_i$ denote the prototype of class i; then

\bm {p}_i = \frac {1}{K} \sum _{j=1}^K f_g(\bm {x}^s_j), (3)

where $\bm{x}^s_j$ is the j-th support image of class i. During meta-training, we freeze the text encoder $g(\cdot)$ and fine-tune the other parameters by maximizing the feature similarities between query samples and their prototypes with a cross-entropy loss:

\mathcal {L}_{meta} = - {\mathbb {E}}_{S,Q} {\mathbb {E}}_{x^q} \log \frac {\exp (s(f(\bm {x}^q), \bm {p}_{y^q})/\tau )}{\sum _{i=1}^N \exp (s(f(\bm {x}^q), \bm {p}_i)/\tau )}, \label {eq:L_meta} (4)

where s denotes the cosine similarity, $\bm{p}_{y^q}$ is the prototype of class $y^q$, and $\tau$ is a temperature hyper-parameter.
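For concreteness, the prototype computation of Eq. 3 and the cosine-similarity meta loss of Eq. 4 can be sketched as follows. This is an illustrative PyTorch snippet, assuming the support features are already extracted with the prompt-conditioned extractor f_g and arranged as an (N, K, C) tensor.

```python
# Sketch of Eq. 3 (class prototypes) and Eq. 4 (meta-training loss).
import torch
import torch.nn.functional as F

def meta_loss(support_feats, query_feats, query_labels, tau=0.2):
    """support_feats: (N, K, C); query_feats: (Q, C); query_labels: (Q,) in [0, N)."""
    prototypes = support_feats.mean(dim=1)                        # Eq. 3: (N, C)
    sims = F.cosine_similarity(query_feats.unsqueeze(1),          # cosine similarity s(., .)
                               prototypes.unsqueeze(0), dim=-1)   # (Q, N)
    return F.cross_entropy(sims / tau, query_labels)              # Eq. 4 with temperature tau
```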
4.2.1 Interaction on the spatial dimension

We first take inspiration from prompt methods in NLP [5, 34] and concatenate prompt vectors with the input sequence, feeding them together into the Transformer layers. Given the semantic feature $g(y^{text})$ and the input sequence of patch embeddings $\bm{Z}_{l-1} = [\bm{z}_{l-1}^1, \bm{z}_{l-1}^2, ..., \bm{z}_{l-1}^M] \in \mathbb{R}^{M \times C_z}$ at layer l, we obtain a new sequence $\hat{\bm{Z}}_{l-1} \in \mathbb{R}^{(M+1) \times C_z}$ by extending $\bm{Z}_{l-1}$ with the projected semantic feature:

\hat {\bm {Z}}_{l-1} = [\bm {z}^0, \bm {z}^1_{l-1},...,\bm {z}^M_{l-1}], (5)

where $\bm{z}^0 = h_s(g(y^{text})) \in \mathbb{R}^{C_z}$ is the projected semantic embedding for spatial interaction and $h_s$ is the projector that keeps the dimension of the semantic embedding the same as that of the patch embeddings. The extended sequence $\hat{\bm{Z}}_{l-1}$ is then fed into the remaining Transformer layers, whose multi-head self-attention (MSA) modules allow the interaction between semantic prompts and patch tokens along the spatial dimension. Specifically, letting $\hat{\bm{Z}}_{l-1}$ be the input sequence to an MSA module at layer l, MSA first maps each token into three vectors, $\bm{q}, \bm{k}, \bm{v} \in \mathbb{R}^{N_h \times (M+1) \times C_h}$, with a linear projection parameterized by $\bm{W}_{qkv}$:

[\bm {q}, \bm {k}, \bm {v}] = \hat {\bm {Z}}_{l-1} \bm {W}_{qkv}, (6)

where $N_h$ is the number of heads and $C_h$ is the number of channels per head. It then computes the attention weights $\bm{A} \in \mathbb{R}^{N_h \times (M+1) \times (M+1)}$ by taking the inner product between $\bm{q}$ and $\bm{k}$ and performing softmax along the spatial dimension:

\bm {A} = softmax(\bm {q}\bm {k}^T/C_h^{\frac {1}{4}}). (7)

The attention weights are used to select and aggregate information from different positions. The final output is obtained by concatenating the outputs of all heads and performing a linear projection parameterized by $\bm{W}_{out}$:

MSA(\hat {\bm {Z}}_{l-1}) = (\bm {A}\bm {v})\bm {W}_{out}. (8)

4.2.2 Interaction on the channel dimension

Besides the spatial interaction via MSA, we propose another interaction mechanism that modulates and augments visual features channel by channel according to the input semantic prompts. Given the input sequence of patch embeddings $\bm{Z}_{l-1} = [\bm{z}_{l-1}^1, \bm{z}_{l-1}^2, ..., \bm{z}_{l-1}^M] \in \mathbb{R}^{M \times C_z}$ at layer l, we first obtain a global visual context vector $\bm{z}^c_{l-1} \in \mathbb{R}^{C_z}$ by averaging all patch tokens:

\bm {z}^c_{l-1} = \frac {1}{M} \sum _{i=1}^M \bm {z}^i_{l-1}. (9)

The visual context $\bm{z}^c_{l-1}$ is then concatenated with the projected semantic vector $\bm{z}^0 = h_c(g(y^{text})) \in \mathbb{R}^{C_z}$ and fed into a 2-layer MLP module to obtain a modulating vector $\bm{\beta}_{l-1} \in \mathbb{R}^{C_z}$:

\bm {\beta }_{l-1} = \sigma (\bm {W}_2\ \sigma (\bm {W}_1\ [\bm {z}^0;\bm {z}^c_{l-1}]+\bm {b}_1)+\bm {b}_2), (10)

where $\bm{W}_1, \bm{b}_1, \bm{W}_2, \bm{b}_2$ are the parameters of the MLP module, $\sigma$ is the sigmoid activation function, and $h_c$ is the projector for the channel interaction. We finally add the modulating vector to all patch tokens so that it can tune the visual features at each channel. The modulated sequence $\tilde{\bm{Z}}_{l-1} \in \mathbb{R}^{M \times C_z}$ can be written as:

\tilde {\bm {Z}}_{l-1} = [\bm {z}^i_{l-1}+\bm {\beta }_{l-1}], \quad i=1,2,...,M. (11)
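The two mechanisms can be summarized in a single module. The following is a simplified, illustrative sketch rather than the authors' implementation: a stock `nn.TransformerEncoderLayer` stands in for the remaining Visformer layers, the prompt token is inserted at one layer only, and the output is pooled over patch tokens, whereas the paper also studies other pooling strategies (see Table 6).

```python
# Hedged sketch of spatial (Eqs. 5-8) and channel (Eqs. 9-11) interaction.
# Shapes: patches (B, M, C_z); text_feat (B, C_t) from a frozen text encoder.
import torch
import torch.nn as nn

class SemanticPrompt(nn.Module):
    def __init__(self, c_z, c_text, num_heads=6):
        super().__init__()
        self.h_s = nn.Linear(c_text, c_z)      # projector for spatial interaction
        self.h_c = nn.Linear(c_text, c_z)      # projector for channel interaction
        self.block = nn.TransformerEncoderLayer(c_z, num_heads, batch_first=True)
        self.mlp = nn.Sequential(              # 2-layer MLP with sigmoid, as in Eq. 10
            nn.Linear(2 * c_z, c_z), nn.Sigmoid(),
            nn.Linear(c_z, c_z), nn.Sigmoid())

    def forward(self, patches, text_feat):
        # Spatial interaction: prepend the projected prompt token and let
        # self-attention mix it with every patch token (Eqs. 5-8).
        z0 = self.h_s(text_feat).unsqueeze(1)             # (B, 1, C_z)
        z_hat = self.block(torch.cat([z0, patches], 1))   # (B, M+1, C_z)
        patches = z_hat[:, 1:]                            # keep the patch tokens

        # Channel interaction: global context + projected prompt -> beta,
        # added to every patch token to modulate each channel (Eqs. 9-11).
        context = patches.mean(dim=1)                                     # Eq. 9
        beta = self.mlp(torch.cat([self.h_c(text_feat), context], -1))    # Eq. 10
        return patches + beta.unsqueeze(1)                                # Eq. 11
```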
5. Experiments

5.1. Datasets and implementation details

miniImageNet and tieredImageNet. The miniImageNet dataset was proposed in [49] to benchmark the few-shot learning problem. It contains a subset of 100 classes from the ImageNet [41] dataset, where 64 classes are used as base classes for pre-training and meta-training, 16 classes are used for validation, and 20 classes are used for testing. The tieredImageNet dataset [39] is also derived from ImageNet and contains more classes: 351 classes for training, 97 classes for validation, and 160 classes for testing. The semantic difference between base classes and novel classes in tieredImageNet is much larger than in miniImageNet.

CIFAR-FS and FC100. These two datasets are derived from the CIFAR-100 [20] dataset with different partition modes. CIFAR-FS [22] randomly splits the 100 classes into 64 training classes, 16 validation classes, and 20 testing classes. In contrast, FC100 [31] divides classes according to their semantic superclasses, where 60 classes from 20 superclasses are used for training, 20 classes from 4 superclasses for validation, and 20 classes from 4 superclasses for testing. The large semantic gap makes FC100 more difficult than CIFAR-FS.

Method | Backbone | Params/FLOPs | miniImageNet 5-way 1-shot | miniImageNet 5-way 5-shot | tieredImageNet 5-way 1-shot | tieredImageNet 5-way 5-shot
LEO [42] | WRN-28-10 | 36.5M/3.7×10^10 | 61.76±0.08 | 77.59±0.12 | 66.33±0.05 | 81.44±0.09
CC+rot [14] | WRN-28-10 | 36.5M/3.7×10^10 | 62.93±0.45 | 79.87±0.33 | 70.53±0.51 | 84.98±0.36
Align [1] | WRN-28-10 | 36.5M/3.7×10^10 | 65.92±0.60 | 82.85±0.55 | 74.40±0.68 | 86.61±0.59
MetaOptNet [22] | ResNet-12 | 12.5M/3.5×10^9 | 62.64±0.61 | 78.63±0.46 | 65.99±0.72 | 81.56±0.53
Meta-Baseline [6] | ResNet-12 | 12.5M/3.5×10^9 | 63.17±0.23 | 79.26±0.17 | 68.62±0.27 | 83.74±0.18
DeepEMD [56] | ResNet-12 | 12.5M/3.5×10^9 | 65.91±0.82 | 82.41±0.56 | 71.16±0.87 | 86.03±0.58
RE-Net [17] | ResNet-12 | 12.5M/3.5×10^9 | 67.60±0.44 | 82.58±0.30 | 71.61±0.51 | 85.28±0.35
TPMM [51] | ResNet-12 | 12.5M/3.5×10^9 | 67.64±0.63 | 83.44±0.43 | 72.24±0.70 | 86.55±0.63
SetFeat [2] | ResNet-12 | 12.5M/3.5×10^9 | 68.32±0.62 | 82.71±0.46 | 73.63±0.88 | 87.59±0.57
SUN [10] | Visformer-S | 12.4M/1.7×10^8 | 67.80±0.45 | 83.25±0.30 | 72.99±0.50 | 86.74±0.33
KTN [32] | ResNet-12 | 12.5M/3.5×10^9 | 61.42±0.72 | 74.16±0.56 | - | -
AM3 [52] | ResNet-12 | 12.5M/3.5×10^9 | 65.30±0.49 | 78.10±0.36 | 69.08±0.47 | 82.58±0.31
TRAML [24] | ResNet-12 | 12.5M/3.5×10^9 | 67.10±0.52 | 79.54±0.60 | - | -
DeepEMD-BERT [53] | ResNet-12 | 12.5M/3.5×10^9 | 67.03±0.79 | 83.68±0.65 | 73.76±0.72 | 87.51±0.75
Pre-train (Ours) | Visformer-T | 10.0M/1.3×10^9 | 65.16±0.44 | 81.22±0.32 | 72.38±0.50 | 86.74±0.34
SP-CLIP (Ours) | Visformer-T | 10.0M/1.3×10^9 | 72.31±0.40 | 83.42±0.30 | 78.03±0.46 | 88.55±0.32
SP-SBERT (Ours) | Visformer-T | 10.0M/1.3×10^9 | 70.70±0.42 | 83.55±0.30 | 73.31±0.50 | 88.56±0.32
SP-GloVe (Ours) | Visformer-T | 10.0M/1.3×10^9 | 70.81±0.42 | 83.31±0.30 | 74.68±0.50 | 88.64±0.31

Table 1. Comparison with previous work on miniImageNet and tieredImageNet. Methods in the top block do not use semantic information, and methods in the middle block leverage semantic information from class names [24, 32, 52] or descriptions [53]. Accuracies are reported with 95% confidence intervals.

Text encoders. To extract rich semantic features from class names, we adopt three types of text encoders, i.e., CLIP [35], SBERT [38], and GloVe [33], which are pre-trained on large-scale corpora and are publicly available. For CLIP, we only use its text encoder and extend the input class name with a text template: "A photo of a {class name}." For SBERT and GloVe, we directly feed class names into the encoders and average the output word vectors when a name contains multiple words.

Implementation details. We adopt Visformer-Tiny [7] as the feature extractor and resize the input image to 224×224 by default. Other input resolutions are validated in Section 5.3.5. Images are augmented with RandomResizedCrop, RandAug [8], and RepeatAug [4]. During pre-training, we use the AdamW optimizer [29] with a learning rate of 5e-4 and a weight decay of 5e-2. We pre-train the model for 800 epochs on miniImageNet, CIFAR-FS, and FC100, and for 300 epochs on tieredImageNet. During meta-training, we reduce the learning rate of the feature extractor to 1e-6 and set the learning rate of the projectors to 5e-4. The model is meta-trained for 100 epochs on all datasets. The hyper-parameter τ is set to 0.2 according to validation accuracy. We conduct experiments on a TITAN Xp server, and training can be done with one GPU. During evaluation, we randomly sample 2,000 test episodes from the novel classes. For 1-shot learning, we use the cosine classifier for prediction as in Eq. 4. For 5-shot learning, we adopt logistic regression classifiers with random crop augmentation. We finally report the average accuracy with 95% confidence intervals.
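As an example of the text-encoder side, the sketch below extracts a semantic prompt for a class name with the public openai/CLIP package and the template described above. The specific CLIP variant ("ViT-B/32") is an assumption for illustration; the resulting feature is kept frozen during meta-training as described.

```python
# Sketch of semantic-prompt extraction with the public CLIP text encoder.
# Assumes the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # only the text encoder is used

@torch.no_grad()
def class_name_to_prompt(class_name):
    tokens = clip.tokenize(f"A photo of a {class_name}.").to(device)
    return model.encode_text(tokens).float()      # (1, 512) semantic feature g(y_text)

text_feat = class_name_to_prompt("unicycle")      # frozen during meta-training
```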
5.2. Comparison with the state-of-the-art

To evaluate the effectiveness of our approach, we conduct extensive experiments on four datasets and compare the results with previous state-of-the-art methods in Table 1 and Table 2. Compared with previous methods that leverage semantic information (KTN [32], AM3 [52], TRAML [24], DeepEMD-BERT [53]), our method improves 1-shot accuracy by 5.21% on miniImageNet and by 4.27% on tieredImageNet. DeepEMD-BERT achieves better 5-shot accuracy than ours on miniImageNet, but requires multiple forward passes and an additional inner optimization step to obtain reliable local feature similarities. Note that previous methods usually adopt CNNs as the backbone, except the recently proposed SUN [10], which also adopts a Visformer backbone. Nevertheless, our method outperforms SUN by 2.46% on average over three datasets.

When using different text encoders to extract semantic features, the proposed SP presents consistent improvements over the pre-training baseline. Specifically, SP with CLIP achieves better 1-shot accuracy than SP with SBERT or GloVe, probably because CLIP's multi-modal pre-training aligns semantic embeddings better with visual concepts. For 5-shot, the performance differences shrink, as model performance is dominated by visual features when support images are sufficient. In the following experiments, we use CLIP as the default text encoder.

Method | Backbone | Params/FLOPs | CIFAR-FS 5-way 1-shot | CIFAR-FS 5-way 5-shot | FC100 5-way 1-shot | FC100 5-way 5-shot
PN+rot [14] | WRN-28-10 | 36.5M/3.7×10^10 | 69.55±0.34 | 82.34±0.24 | - | -
Align [1] | WRN-28-10 | 36.5M/3.7×10^10 | - | - | 45.83±0.48 | 59.74±0.56
ProtoNet [45] | ResNet-12 | 12.5M/3.5×10^9 | 72.2±0.7 | 83.5±0.5 | 37.5±0.6 | 52.5±0.6
MetaOptNet [22] | ResNet-12 | 12.5M/3.5×10^9 | 72.6±0.7 | 84.3±0.5 | 41.1±0.6 | 55.5±0.6
MABAS [18] | ResNet-12 | 12.5M/3.5×10^9 | 73.51±0.92 | 85.49±0.68 | 42.31±0.75 | 57.56±0.78
Distill [47] | ResNet-12 | 12.5M/3.5×10^9 | 73.9±0.8 | 86.9±0.5 | 44.6±0.7 | 60.9±0.6
RE-Net [17] | ResNet-12 | 12.5M/3.5×10^9 | 74.51±0.46 | 86.60±0.32 | - | -
infoPatch [27] | ResNet-12 | 12.5M/3.5×10^9 | - | - | 43.8±0.4 | 58.0±0.4
SUN [10] | Visformer-S | 12.4M/1.7×10^8 | 78.37±0.46 | 88.84±0.32 | - | -
Pre-train (Ours) | Visformer-T | 10.0M/1.3×10^9 | 71.99±0.47 | 85.98±0.34 | 43.77±0.39 | 59.48±0.39
SP-CLIP (Ours) | Visformer-T | 10.0M/1.3×10^9 | 82.18±0.40 | 88.24±0.32 | 48.53±0.38 | 61.55±0.41
SP-SBERT (Ours) | Visformer-T | 10.0M/1.3×10^9 | 81.32±0.40 | 88.31±0.32 | 47.03±0.40 | 61.03±0.40
SP-GloVe (Ours) | Visformer-T | 10.0M/1.3×10^9 | 81.62±0.41 | 88.32±0.32 | 46.69±0.41 | 61.18±0.41

Table 2. Comparison with previous work on CIFAR-FS [22] and FC100 [31].

Aug | SI | CI | Mini | Tiered | CIFAR-FS | FC100
× | × | × | 61.96 | 71.91 | 68.84 | 40.78
✓ | × | × | 65.15 | 72.38 | 71.99 | 43.77
✓ | ✓ | × | 71.59 | 76.20 | 81.19 | 47.83
✓ | × | ✓ | 70.48 | 77.62 | 79.80 | 47.10
✓ | ✓ | ✓ | 72.31 | 78.03 | 82.18 | 48.53

Table 3. Ablation study on four datasets under the 1-shot setting. SI means spatial interaction, and CI means channel interaction.

5.3. Model analysis

5.3.1 Ablation study

The ablation study results are shown in Table 3. By extending the standard RandomResizedCrop with RandAug and RepeatAug, the 1-shot accuracy of the pre-trained feature extractor is improved by 2.45% on average over the four datasets. To validate the effectiveness of SP, we fine-tune the feature extractor with three different interaction settings: SI (spatial interaction), CI (channel interaction), and SI+CI. As shown in Table 3, both SI and CI are very effective, improving the average 1-shot accuracy on the four datasets by 5.89% and 5.43%, respectively.
Furthermore, by combining them, the 1-shot accuracy is further improved on all four datasets. These results indicate that the proposed SP is an effective approach to leveraging semantic information for few-shot learning.

5.3.2 Layer selection

Theoretically, the semantic prompt in this work can be inserted into the feature extractor at any layer. However, we find that the layer selection has a significant impact on performance. In Figure 3, we can see that inserting prompts at higher layers improves accuracy, while inserting prompts at lower layers leads to a performance drop. Considering that prompt vectors are class-specific, these results indicate that class-specific features should be extracted at higher network layers, while features at lower layers should be shared among classes. When looking into the performance of each layer, we can see that while the optimal layer selection varies slightly across datasets, SP at all layers of the third stage improves accuracy consistently. To simplify the architecture design, we choose layer3-2 as the default in our experiments.

Figure 3. Accuracy vs. the layer at which prompts are inserted. We report 5-way 1-shot accuracy (%) on the validation sets of miniImageNet and CIFAR-FS along the meta-training process. The feature extractor has three stages with multiple Transformer layers in each stage.

5.3.3 The backbone and classifier architectures

In Table 4, we re-implement three baseline methods with the same Visformer backbone as ours, and compare the results with different backbones under the miniImageNet 1-shot setting. It can be seen that simply replacing ResNet-12 with Visformer does not bring significant improvement. In contrast, using semantic prompts improves 1-shot performance over these baselines when equipped with the same Visformer backbone. In Table 5, we compare the LR and NN classifiers over all datasets. The simple NN classifier performs as well as the LR classifier for 1-shot learning, while the LR classifier benefits from more training examples and outperforms the NN classifier by 0.53% for 5-shot learning.

Backbone | ProtoNet [45] | MetaOptNet [22] | Meta-Baseline [6] | Ours
ResNet-12 | 63.28 | 63.29 | 64.36 | -
Visformer-T | 63.16 | 64.39 | 63.32 | 72.31

Table 4. Comparison with different backbones.

Classifier | Mini 1-shot | Mini 5-shot | Tiered 1-shot | Tiered 5-shot | CIFAR-FS 1-shot | CIFAR-FS 5-shot | FC100 1-shot | FC100 5-shot
NN | 72.31 | 82.86 | 78.03 | 87.74 | 82.18 | 88.04 | 48.53 | 61.10
LR | 72.37 | 83.42 | 78.11 | 88.64 | 82.17 | 88.24 | 48.61 | 61.55

Table 5. Comparison of classifiers. NN: cosine-distance nearest prototype classifier. LR: linear logistic regression classifier.

Setting | 1-shot | 5-shot
Projector: Linear | 72.31 | 83.42
Projector: MLP | 72.70 | 83.56
Pooling: Head | 66.48 | 72.70
Pooling: Patches | 72.29 | 83.39
Pooling: All | 72.31 | 83.42

Table 6. Choice of the projector, and the pooling strategy for the output sequence. 'Head' means selecting the output at the position of the prompt vector; 'Patches' means averaging the output features of all patches; 'All' means averaging all feature vectors in the output sequence.

Input size | Stem | MiniImageNet | TieredImageNet | CIFAR-FS | FC100
224×224 | Ks=7, Stride=2 | 72.31±0.40 | 78.03±0.46 | 82.18±0.40 | 48.53±0.38
84×84 | Ks=7, Stride=2 | 68.09±0.38 | 72.14±0.47 | 77.26±0.42 | 46.44±0.40
84×84 | Ks=3, Stride=1 | 72.16±0.40 | 77.28±0.46 | 82.00±0.41 | 48.52±0.40

Table 7. The effect of input size and stem design. 'Ks' means the kernel size of the first convolution layer (stem), and 'Stride' means its stride. 5-way 1-shot accuracy is reported on four datasets with 95% confidence intervals.
5.3.4 Projector structure and pooling strategy

As shown in Table 6, the projector design has little effect on performance: both linear and MLP projectors work well, with the MLP having a slight advantage. In contrast, the pooling strategy has a much larger effect. When adopting the 'Head' strategy, both 1-shot and 5-shot accuracies are very poor, indicating that the output at the position of the prompt vector easily overfits to semantic features and neglects the rich visual features in the image patches. Averaging over all output features addresses this problem and achieves better results.

5.3.5 Image size and stem design

In Table 7, we experiment with a smaller input size, 84×84, to validate the influence of image size. It can be seen that directly changing the input size to 84×84 leads to an evident performance drop on all datasets. We suppose that this is because the kernel size and the stride of the stem are too large to capture detailed visual features when the input image is small. To address this problem, we reduce the kernel size and the stride of the stem accordingly. After this change, the 1-shot performance under 84×84 improves significantly and becomes comparable to the 224×224 resolution on all datasets.

5.3.6 Visualization

In Figure 4, we visualize the attention maps by computing the dot product between the output feature and the feature vector at each location. It can be seen that the visual features of the pre-training baseline are cluttered with background information, whereas our method focuses on semantic-level visual features according to the given text prompt. For example, given the text prompt of harvestman, the model attends to the features of the harvestman rather than the spider web or background clutter.

Figure 4. Visualization of attention maps when prompting with different class labels. (Panels: the input image containing a harvestman and a spider web, the pre-training baseline, and our model prompted with 'harvestman' and with 'spider web'.)

6. Conclusion

In this paper, we propose a novel Semantic Prompt (SP) approach for FSL, which adaptively tunes the feature extraction with the semantic features derived from class names. The proposed approach is evaluated on four benchmark datasets and achieves significant improvements over previous methods. More in-depth analysis demonstrates that SP encourages the model to extract more class-specific features and is robust to different text encoders and model designs.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grants 61721004, 61976214, 62076078, 62176246 and the National Key R&D Program of China (2022ZD0117901).

References

[1] Arman Afrasiyabi, Jean-François Lalonde, and Christian Gagné. Associative alignment for few-shot image classification. In ECCV, 2020. 6, 7
[2] Arman Afrasiyabi, Hugo Larochelle, Jean-François Lalonde, and Christian Gagné. Matching feature sets for few-shot image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9014–9024, 2022. 6
[3] Afra Feyza Akyürek, Ekin Akyürek, Derry Wijaya, and Jacob Andreas. Subspace regularizers for few-shot class incremental learning. In International Conference on Learning Representations, 2022. 2, 4
[4] Maxim Berman, Hervé Jégou, Andrea Vedaldi, Iasonas Kokkinos, and Matthijs Douze. Multigrain: a unified image embedding for classes and instances.
arXiv preprint arXiv:1902.05509, 2019. 6 [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 3, 5 [6] Yinbo Chen, Zhuang Liu, Huijuan Xu, Trevor Darrell, and Xiaolong Wang. Meta-baseline: exploring simple metalearning for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9062–9071, 2021. 3, 6, 8 [7] Zhengsu Chen, Lingxi Xie, Jianwei Niu, Xuefeng Liu, Longhui Wei, and Qi Tian. Visformer: The vision-friendly transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 589–598, 2021. 4, 6 [8] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020. 6 [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 2, 3 [10] Bowen Dong, Pan Zhou, Shuicheng Yan, and Wangmeng Zuo. Self-promoted supervision for few-shot transformer. arXiv preprint arXiv:2203.07057, 2022. 6, 7 [11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3, 4 [12] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Modelagnostic meta-learning for fast adaptation of deep networks. In ICML, 2017. 1, 2 [13] Yanwei Fu, Tao Xiang, Yu-Gang Jiang, Xiangyang Xue, Leonid Sigal, and Shaogang Gong. Recent advances in zeroshot recognition: Toward data-efficient understanding of visual content. IEEE Signal Processing Magazine, 35(1):112– 125, 2018. 1 [14] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Boosting few-shot visual learning with self-supervision. In ICCV, 2019. 6, 7 [15] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020. 3 [16] Hongwei Huang, Zhangkai Wu, Wenbin Li, Jing Huo, and Yang Gao. Local descriptor-based multi-prototype network for few-shot learning. PR, 116:107935, 2021. 1, 2 [17] Dahyun Kang, Heeseung Kwon, Juhong Min, and Minsu Cho. Relational embedding for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8822–8833, 2021. 6, 7 [18] Jaekyeom Kim, Hyoungseok Kim, and Gunhee Kim. Modelagnostic boundary-adversarial sampling for test-time generalization in few-shot learning. In ECCV, 2020. 7 [19] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. In European conference on computer vision, pages 491–507. Springer, 2020. 3 [20] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2019. 5 [21] Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. 
One shot learning of simple visual concepts. In CogSci, 2011. 1 [22] Kwonjoon Lee, Subhransu Maji, A. Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In CVPR, 2019. 5, 6, 7, 8 [23] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021. 3 [24] Aoxue Li, Weiran Huang, Xu Lan, Jiashi Feng, Zhenguo Li, and Liwei Wang. Boosting few-shot learning with adaptive margin loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12576– 12584, 2020. 2, 6 [25] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021. 3 [26] Yanan Li, Donghui Wang, Huanhang Hu, Yuetan Lin, and Yueting Zhuang. Zero-shot recognition using dual visualsemantic mapping paths. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3279–3287, 2017. 1 [27] Chen Liu, Yanwei Fu, Chengming Xu, Siqian Yang, Jilin Li, Chengjie Wang, and Li Zhang. Learning a few-shot embedding model with contrastive learning. In AAAI, 2021. 7 [28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021. 3 [29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 6 [30] Martin Maier and Rasha Abdel Rahman. No matter how: Top-down effects of verbal and semantic category knowledge on early visual perception. Cognitive, Affective, & Behavioral Neuroscience, 19(4):859–876, 2019. 2 [31] Boris Oreshkin, Pau Rodrı́guez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. Advances in neural information processing systems, 31, 2018. 2, 5, 7 [32] Zhimao Peng, Zechao Li, Junge Zhang, Yan Li, Guo-Jun Qi, and Jinhui Tang. Few-shot image recognition with knowledge transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 441–449, 2019. 1, 2, 3, 6 [33] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014. 4, 6 [34] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019. 3, 5 [35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 3, 4, 6 [36] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. 2, 3 [37] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017. 1, 2 [38] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019. 4, 6 [39] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B. Tenenbaum, Hugo Larochelle, and Richard S. Zemel. 
Meta-learning for semi-supervised fewshot classification. In ICLR, 2018. 5 [40] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In International conference on machine learning, pages 2152–2161. PMLR, 2015. 1 [41] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015. 5 [42] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2019. 1, 2, 6 [43] Yutaro Shigeto, Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, and Yuji Matsumoto. Ridge regression, hubness, and zero-shot learning. In Joint European conference on machine learning and knowledge discovery in databases, pages 135–151. Springer, 2015. 1 [44] Chenyang Si, Weihao Yu, Pan Zhou, Yichen Zhou, Xinchao Wang, and Shuicheng YAN. Inception transformer. In Advances in Neural Information Processing Systems, 2022. 3 [45] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017. 1, 2, 7, 8 [46] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018. 1, 2 [47] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? In European Conference on Computer Vision, pages 266–282. Springer, 2020. 3, 7 [48] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021. 3 [49] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NeurIPS, 2016. 1, 2, 3, 4, 5 [50] Zhenhailong Wang, Hang Yu, Manling Li, Han Zhao, and Heng Ji. Model-agnostic multitask fine-tuning for few-shot vision-language transfer learning. arXiv preprint arXiv:2203.04904, 2022. 4 [51] Jiamin Wu, Tianzhu Zhang, Yongdong Zhang, and Feng Wu. Task-aware part mining network for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8433–8442, 2021. 6 [52] Chen Xing, Negar Rostamzadeh, Boris Oreshkin, and Pedro O O Pinheiro. Adaptive cross-modal few-shot learning. Advances in Neural Information Processing Systems, 32, 2019. 1, 2, 3, 6 [53] Kun Yan, Zied Bouraoui, Ping Wang, Shoaib Jameel, and Steven Schockaert. Aligning visual prototypes with bert embeddings for few-shot learning. In Proceedings of the 2021 International Conference on Multimedia Retrieval, pages 367–375, 2021. 1, 6 [54] Kun Yan, Chenbin Zhang, Jun Hou, Ping Wang, Zied Bouraoui, Shoaib Jameel, and Steven Schockaert. Inferring prototypes for multi-label few-shot image classification with word vector guided attention. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3):2991–2999, Jun. 2022. 3 [55] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32, 2019. 3 [56] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. 
Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12203–12213, 2020. 2, 6 [57] Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, and Huajun Chen. Differentiable prompt makes pre-trained language models better fewshot learners. arXiv preprint arXiv:2108.13161, 2021. 3 [58] Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. Graph convolutional networks: a comprehensive review. Computational Social Networks, 6(1):1–23, 2019. 3 [59] Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Sercan Ö Arik, and Tomas Pfister. Nested hierarchical transformer: Towards accurate, data-efficient and interpretable visual understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3417–3425, 2022. 3