OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network

The advancement of object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision. This work introduces OmDet, a novel language-aware object detection architecture, and an innovative training mechanism that harnesses continual learning and multi-dataset vision-language pre-training. Leveraging natural language as a universal knowledge representation, OmDet accumulates a "visual vocabulary" from diverse datasets, unifying the task as a language-conditioned detection framework. Our multimodal detection network (MDN) overcomes the challenges of multi-dataset joint training and generalizes to numerous training datasets without manual label taxonomy merging. We demonstrate superior performance of OmDet over strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding, achieving state-of-the-art results. Ablation studies reveal the impact of scaling the pre-training visual vocabulary, indicating a promising direction for further expansion to larger datasets. The effectiveness of our deep fusion approach is underscored by its ability to learn jointly from multiple datasets, enhancing performance through knowledge sharing.


Introduction
Object detection (OD) is one of the monumental tasks in computer vision (CV). Classical OD research has been focusing on improving the detector net-many small domain-specific datasets is much cheaper than creating a single large-vocabulary large dataset (Gupta et al., 2019).
On the other hand, joint training from multiple OD datasets with different labels faces two key technical challenges: (1) taxonomy conflict: each OD dataset is annotated with its own pre-defined labels, and a classic detector uses a fixed Softmax layer to classify object types (Ren et al., 2015). Such a design rules out learning from different label sets or dynamically adapting to new classes. (2) fore/background inconsistency: since the label sets differ, an object proposal may be considered foreground in dataset A but background in dataset B. For example, an object "cat" is annotated in dataset A, but not in dataset B. Our study shows that this greatly hurts the multi-dataset performance of classic detectors, since the RPN head is confused by the conflicting ground truth.
To address the above challenges, this work proposes a novel vision-language model, OmDet, for open-vocabulary object detection and phrase grounding. The main architectural novelty of OmDet is its latent query-centric fusion module, which combines information from visual and text features, together with the proposed training mechanism, which can easily accumulate knowledge from OD/grounding datasets from various domains. Two versions of OmDet are pre-trained: OmDet V1, which is purely pre-trained on a large number of OD datasets (more than 100 domains), and OmDet V2, which is additionally pre-trained on visual grounding data (Kamath et al., 2021).
The proposed method is evaluated on three downstream tasks: object detection in the wild (ODinW) (Li et al., 2022a), open-vocabulary detection, and phrase grounding (Plummer et al., 2015). Results show that OmDet is able to outperform all prior art, including the powerful GLIP (Li et al., 2022b) that is pre-trained on much larger datasets. Moreover, comprehensive model analysis is conducted to better understand the strengths and limitations of OmDet. We conduct a controlled study on joint training from four diverse datasets (COCO, Pascal VOC, and Wider Face/Pedestrian), and results show that our method is not only able to learn from all datasets without suffering from label and localization conflicts, but also achieves stronger performance than single-dataset detectors because knowledge is shared among tasks. Also, we show that accumulating multiple datasets to expand to large-vocabulary OD learning is an effective method to boost OmDet's zero/few-shot ability as well as parameter-efficient training performance (e.g., prompt tuning).
In summary, the contributions of this paper are fourfold:
• We present OmDet, a novel language-aware OD architecture with a Multimodal Detection Network (MDN) that can learn from any number of OD and grounding datasets.
• Experiments show OmDet's state-of-the-art performance on the well-known ODinW, open-vocabulary detection and phrase grounding benchmarks.
• Experiments confirm the effectiveness of the proposed multi-dataset training by solving the label difference and fore/background inconsistency challenges.
• Experiments show that by scaling up visual vocabulary size via multi-dataset training, one can improve zero/few-shot and parameter-efficient fine-tuning.

Vision-Language Pre-training
One of the most studied topics of VLP is pre-training on massive image-text pair data. Recent advances in self-supervised learning have enabled models to learn rich representations from large-scale unlabeled data.
For example, CLIP (Radford et al., 2021a) learns to predict which text matches which image, resulting in a versatile model that can perform well on various vision tasks without task-specific supervision. ALIGN (Jia et al., 2021) further scales up CLIP by using a noisy dataset of over one billion image alt-text pairs. However, these models mainly focus on vision-based tasks and neglect the interaction between multiple modalities during pre-training. To address this limitation, several studies propose to learn joint multi-modal representations of image content and natural language for vision+language tasks (such as VQA and visual reasoning). Among them, OSCAR (Li et al., 2020), UNITER (Chen et al., 2020) and VILLA (Gan et al., 2020) adopt a two-stage approach: they first use an object detector (e.g., Faster R-CNN (Zhang et al., 2021)) to extract vision features, then apply a multi-layer transformer (Vaswani et al., 2017) to the concatenation of the visual features and text features to learn joint embeddings.
Some studies propose to model visual input without relying on pre-trained object detectors. For instance, SOHO (Huang et al., 2021) uses a visual dictionary to extract compact image features from a whole image, which enables 10 times faster inference than region-based methods. Similarly, ViLT (Kim et al., 2021) employs a vision transformer (Dosovitskiy et al., 2020) to capture long-range dependencies over a sequence of fixed-size non-overlapping image patches, without using convolutional visual features.

Object Detection
Object detection, one of the predominant tasks in computer vision, aims to detect bounding boxes and classes of object instances. It has evolved significantly through a large body of research in recent years. There are two major categories of detectors: two-stage and one-stage methods. Two-stage methods consist of a region proposal network (RPN) and a region-wise classifier. Classic models include R-CNN (Girshick et al., 2014), Fast R-CNN (Girshick, 2015) and Faster R-CNN (Ren et al., 2015). One-stage methods eliminate the RPN stage and directly make final object predictions on the visual feature maps. Well-known systems include SSD (Liu et al., 2016), YOLO (Redmon et al., 2016) and RetinaNet (Lin et al., 2017b). Recently, end-to-end detectors such as DETR (Carion et al., 2020) have proposed to formulate the object detection task as a set prediction task. However, object detection is often formulated as a closed-set problem with fixed and predefined classes and cannot handle object detection in the wild. To overcome the closed-set limitation, more realistic scenarios such as Multi-Dataset Object Detection (MDOD) and Open-Vocabulary Object Detection (OVOD) have attracted considerable attention.
Multi-Dataset Object Detection: MDOD focuses on increasing the number of detectable object classes by training a single detector using multiple datasets. Traditional closed-set object detection demands training detectors on datasets with full annotations, and adding a new dataset means costly extra human annotations. Research on MDOD attempts to bypass the closed-set limitation, so that a single detector is able to incrementally add object classes by adding new datasets with new classes. Yao et al. (2020) propose an MDOD framework with a preprocessed hybrid dataset and a dataset-aware focal loss. Zhao et al. (2020) design a conflict-free loss to avoid the ambiguity between positive and negative samples. Detection Hub (Meng et al., 2022) unifies multiple datasets with a query-based object detector with natural language embedding.
Open-Vocabulary Object Detection: OVOD, a more ambitious goal beyond the closed-set problem, refers to the capability of training only on annotated datasets and generalizing to unseen novel classes. Recently, OVOD has made substantial progress with the utilization of multi-modal vision-language pre-trained models (Li et al., 2022b; Zhou et al., 2022b; Kamath et al., 2021). RegionCLIP (Zhong et al., 2022) generates pseudo-labels for region-text pairs from caption datasets to perform regional vision-language pre-training and transfer to OVOD. ViLD (Gu et al., 2021) proposed a two-stage open-vocabulary detector, which distills embeddings from the teacher model CLIP (Radford et al., 2021b) or ALIGN (Jia et al., 2021). With inspiration from CoOp (Zhou et al., 2022a), DetPro (Du et al., 2022) introduces a technique to learn continuous prompt embeddings that improves the performance of ViLD. OWL-ViT (Minderer et al., 2022) transfers the pre-trained image-text model to object detection by adding downstream detection heads and fine-tuning on OD datasets.
Object Detection as Grounding: Phrase grounding refers to the process of identifying the relationship between individual phrases within a sentence and specific objects or regions depicted in an image (Kamath et al., 2021; Deng et al., 2021). GLIP (Li et al., 2022b) proposed that object detection can be viewed as a special case of phrase grounding. The authors of GLIP concatenate object types as a single string and ask the model to ground objects to word spans. This setup enables unified modeling between phrase grounding and object detection, and the resulting system achieves strong performance in long-tail object detection and zero-shot detection.
Unlike previous grounding-based methods, the proposed method is designed to learn from an arbitrary number of object detection (OD) datasets and does not necessarily need to be trained on grounding data. This ability is valuable for real-world scenarios, e.g., creating a multi-task OD model that simultaneously learns from many independent OD datasets.

Our Approach
Before getting into the details of the proposed system, we first define the problem formulation. OmDet is designed for language-conditioned detection. Let $V$ be a large vocabulary of object types that OmDet can potentially detect. A task $T = \{w_1, w_2, \ldots, w_k\}$ is a set of $k$ object types that the model should detect in its forward pass, where $w_i \in V$. Note that the size of $T$ can vary dynamically from 1 to $K$, where $K$ is the maximum supported number of object types in a single inference run. For the visual grounding setting, $T$ is the query sentence that contains $K$ word tokens. Meanwhile, let $L$ be a set of natural language labels. In the object detection case, $L = T$. For the grounding case, $L$ is the set of entities that appear in the caption $T$. Then, given an input image $x$, a task $T$, and a label set $L$, the model is expected to detect all objects mentioned in $T$ from $x$. Since $T$ and $L$ are not fixed, an ideal model can dynamically adapt its detection targets conditioned on the task.

Model Architecture
Following the above design principle, we introduce OmDet, a task-conditioned detection network that can learn from infinite combinations of tasks. It is composed of a vision backbone, a task encoder, a label encoder, and a multimodal detection network. The overall structure is illustrated in Fig. 1. The following describes each component in detail.
Vision Backbone Starting from the initial image $x_{\mathrm{img}} \in \mathbb{R}^{3 \times H_0 \times W_0}$ (with 3 color channels), let the vision encoder $f_v$ be a conventional Convolutional Neural Network (CNN) (Liu et al., 2022) or a Vision Transformer (e.g., Swin Transformer (Liu et al., 2021)). The vision encoder generates a lower-resolution visual feature map $f \in \mathbb{R}^{C \times H \times W}$ at each output layer. Then a Feature Pyramid Network (FPN) (Lin et al., 2017a) is used to aggregate information from top to bottom and output a set of visual feature maps $\{P_2, P_3, P_4, P_5\}$.
Task Encoder and Label Encoder The term "task" refers to a natural language query designed to expand various text-aware vision tasks (e.g., "Detect objects: {the specified list of objects that we aim to identify}"). The term "label" refers to the language phrase output that is intended for detection purposes. The task set $T = \{w_1, w_2, \ldots, w_k\} \in \mathbb{R}^{k \times V}$ is a set of natural language words. A task encoder $f_t$ or a label encoder $f_l$ is a transformer model that encodes the task set $T$ as a natural language sentence and outputs a set of contextual word embeddings, where $d$ is the contextual word embedding dimension size. We use pre-trained transformer-based language models, e.g., CLIP (Radford et al., 2021a), to initialize the task and label encoders.
Multimodal Detection Network We deploy deep fusion to combine information from the image and the current task early on, in order to achieve strong performance. Inspired by the Sparse-RCNN (Sun et al., 2021) network design, we developed an iterative query-based fusion mechanism that fuses text features and visual features into latent queries. Figure 3 illustrates the differences between our method and prior art. Let $Q \in \mathbb{R}^{N \times d}$ be a fixed small set of learnable proposal features, where $N$ denotes the number of proposal features. It is a set of high-dimensional (e.g., $d = 256$) latent features that capture the rich information of a potential instance by combining data from the vision backbone and the contextual task embedding from the task encoder. Also, let $B \in \mathbb{R}^{N \times 4}$ be a set of learnable proposal boxes assigned one-to-one to the proposal features. Then, given the FPN output and the task/label encoder outputs, the initial MDN block operates on the proposal features $Q$, the proposal boxes $B$, the task embedding $T_i$ at iteration $i$, and the label embedding $L$.
Note that MDN blocks can be stacked to iteratively refine the output, as in Sparse-RCNN, with the key differences that $T_i$ is fused with the proposal features before the Dynamic Convolution layer and that $T_i$ is itself iteratively updated at each run of the MDN block. This enables the network to learn to adjust the task embedding and the proposal embedding jointly and to adapt both the object localization and classification heads conditioned on the given task. Figure 2 shows the process by which MDN first combines information between latent queries and language embeddings via MHSA, and then infuses visual features with DynamicConv. Note that we can easily adapt MDN to other query-based detectors such as DETR (Carion et al., 2020), in which the DynamicConv operation is replaced by a CrossAttention module.
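The query-text fusion step can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes that task tokens and latent queries are simply concatenated and passed through a single-head self-attention with identity projections (a real MHSA layer uses learned projections and several heads), and it omits the DynamicConv step on RoI features. The helper names `self_attention` and `fuse_task_and_queries` are hypothetical.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (simplified)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def fuse_task_and_queries(task_emb, queries):
    """Attend jointly over [task tokens; latent queries], then split them again.

    Both the task embedding and the queries are updated, mirroring the idea
    that the task embedding is refined together with the proposal features.
    """
    mixed = self_attention(task_emb + queries)
    k = len(task_emb)
    return mixed[:k], mixed[k:]

random.seed(0)
d, k, n = 8, 3, 5  # toy sizes: embedding dim, task tokens, latent queries
task_emb = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

for _ in range(2):  # two stacked blocks
    task_emb, queries = fuse_task_and_queries(task_emb, queries)
    # A full block would now pool RoI features for each proposal box and
    # apply DynamicConv (or cross-attention) to the queries; omitted here.

print(len(task_emb), len(queries), len(queries[0]))
```

Because the queries attend to the task tokens, changing the task changes the query features that later drive both localization and classification, which is the property the deep fusion design relies on.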
With the utilization of deep fusion between image features and the task embedding in MDN, the challenge of fore/background inconsistency is solved. Other models (Zhou et al., 2022b; Minderer et al., 2022) try to solve the fore/background inconsistency by training a perfect RPN to find all possible objects, which is hard to achieve. Our method applies deep fusion at an early stage to make the model aware of what counts as foreground or background according to the task embedding, so that it can properly switch fore/background among different tasks. To handle the taxonomy conflict, the label encoder is applied to obtain the text embedding of the target label, and the label embedding is then passed to the classification stage to eliminate naming differences. The taxonomy conflict is solved by projecting the target label into the embedding space, since the same object under different names will map to nearby embeddings.
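The label-embedding idea can be sketched as follows. The snippet is a toy illustration, assuming classification by cosine similarity between a proposal feature and label embeddings; the 3-dimensional vectors are made up for the example (a real system would obtain them from the label encoder), and `classify` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(query_feature, label_embeddings):
    """Return the best-matching label name and all similarity scores."""
    scores = {name: cosine(query_feature, emb) for name, emb in label_embeddings.items()}
    return max(scores, key=scores.get), scores

# Made-up embeddings: two datasets name the same concept differently,
# but the two names lie close together in the embedding space.
dataset_a = {"person": [0.9, 0.1, 0.0], "cat": [0.0, 1.0, 0.1]}
dataset_b = {"pedestrian": [0.85, 0.15, 0.05], "face": [0.1, 0.0, 1.0]}

query = [1.0, 0.2, 0.0]  # hypothetical proposal feature for a standing human
best_a, _ = classify(query, dataset_a)
best_b, _ = classify(query, dataset_b)
print(best_a, best_b)
```

The same proposal feature resolves to "person" under one label set and to "pedestrian" under the other, so no manual merging of the two taxonomies is needed.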

Model Training
Set Prediction Loss We apply a set prediction loss (Carion et al., 2020) to the fixed-size set of classification and box-coordinate predictions. The set-based loss produces an optimal bipartite matching between predictions and ground truth objects using the Hungarian algorithm. The matching cost is defined as follows:
$$\mathcal{L} = \lambda_{cls} \cdot \mathcal{L}_{cls} + \lambda_{L1} \cdot \mathcal{L}_{L1} + \lambda_{giou} \cdot \mathcal{L}_{giou}$$
Here $\mathcal{L}_{cls}$ is the focal loss (Lin et al., 2017b) between the predicted classifications and the ground truth category labels, and $\mathcal{L}_{L1}$ and $\mathcal{L}_{giou}$ are the L1 loss and the generalized IoU loss (Carion et al., 2020) between the normalized center coordinates, height and width of the predicted boxes and those of the ground truth boxes, respectively. $\lambda_{cls}$, $\lambda_{L1}$ and $\lambda_{giou}$ are the coefficients of each component. The training loss is the same as the matching cost, except that it is computed only on matched pairs. The final loss is the sum over all pairs, normalized by the number of objects inside the training batch.
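The matching step can be illustrated with a small standard-library sketch. Several things here are assumptions rather than details from the text: the coefficient values are placeholders, the classification term is a simplified focal-style cost on the target-class probability, and the optimal assignment is found by exhaustive search over permutations, which stands in for the Hungarian algorithm and is only practical for tiny sets.

```python
import itertools
import math

def to_xyxy(box):
    """Convert a normalized (cx, cy, w, h) box to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def giou(box_a, box_b):
    """Generalized IoU of two (cx, cy, w, h) boxes with positive area."""
    ax0, ay0, ax1, ay1 = to_xyxy(box_a)
    bx0, by0, bx1, by1 = to_xyxy(box_b)
    inter = max(0.0, min(ax1, bx1) - max(ax0, bx0)) * max(0.0, min(ay1, by1) - max(ay0, by0))
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull

def pair_cost(pred, gt, lam_cls=2.0, lam_l1=5.0, lam_giou=2.0, alpha=0.25, gamma=2.0):
    """Weighted sum of classification, L1 and GIoU costs (placeholder weights)."""
    p = pred["probs"][gt["label"]]
    cost_cls = -alpha * (1.0 - p) ** gamma * math.log(p + 1e-8)
    cost_l1 = sum(abs(a - b) for a, b in zip(pred["box"], gt["box"]))
    cost_giou = 1.0 - giou(pred["box"], gt["box"])
    return lam_cls * cost_cls + lam_l1 * cost_l1 + lam_giou * cost_giou

def match(preds, gts):
    """Return (pred_index, gt_index) pairs of the minimum-cost one-to-one assignment."""
    best = min(
        itertools.permutations(range(len(preds)), len(gts)),
        key=lambda perm: sum(pair_cost(preds[i], gts[j]) for j, i in enumerate(perm)),
    )
    return [(i, j) for j, i in enumerate(best)]

preds = [
    {"box": (0.5, 0.5, 0.2, 0.2), "probs": [0.1, 0.8]},
    {"box": (0.2, 0.2, 0.1, 0.1), "probs": [0.9, 0.05]},
    {"box": (0.8, 0.8, 0.1, 0.1), "probs": [0.3, 0.3]},
]
gts = [
    {"box": (0.21, 0.2, 0.1, 0.1), "label": 0},
    {"box": (0.5, 0.52, 0.2, 0.2), "label": 1},
]
print(match(preds, gts))
```

Predictions left unmatched (here the third one) are treated as background when the training loss is computed on the matched pairs.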
Task-Sampling Strategy For object detection datasets, in order to simulate a diverse set of tasks for meta-learning during training and also to force the model to condition its output on a given task, a novel task sampling strategy is used during training.
1. Let the maximum size of a given task be K. For an image x from a dataset d in the mini-batch, we first sample k ∈ [1, K] from a uniform distribution.
2. Let the number of unique object types in x be m. If m > k, then only a random subset of k object types is kept and the extra annotations are removed for this mini-batch. If m < k, then additional negative object types are randomly selected from the vocabulary V of dataset d.
3. The model is trained with the above-sampled task and ground truth annotations.
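The three steps above can be sketched in a few lines of Python. The function name `sample_task`, the shuffling of the final task, and the handling of a vocabulary that is too small to supply enough negatives are assumptions made for the illustration, not details stated in the text.

```python
import random

def sample_task(image_labels, vocabulary, max_task_size, rng=random):
    """Sample a task (list of object types) for one image.

    image_labels: set of object types annotated in the image.
    vocabulary: all object types of the image's source dataset.
    Returns (task, kept_labels); annotations whose type is not in
    kept_labels would be dropped for this mini-batch.
    """
    k = rng.randint(1, max_task_size)      # step 1: k ~ Uniform{1, ..., K}
    present = sorted(image_labels)
    if len(present) > k:                   # step 2, m > k: keep a random subset
        kept = rng.sample(present, k)
        negatives = []
    else:                                  # step 2, m <= k: pad with negatives
        kept = present
        pool = [w for w in vocabulary if w not in image_labels]
        negatives = rng.sample(pool, min(k - len(kept), len(pool)))
    task = kept + negatives
    rng.shuffle(task)                      # assumption: order carries no meaning
    return task, set(kept)

vocab = ["cat", "dog", "car", "bus", "bird", "boat", "chair", "cup", "kite", "tie"]
rng = random.Random(0)
task, kept = sample_task({"cat", "dog", "car"}, vocab, max_task_size=5, rng=rng)
print(task, kept)
```

Step 3 then trains on the image with `task` as the language condition and only the annotations whose type is in `kept`.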
With the above method, each image in every mini-batch will have a different set of tasks to learn from. When we learn from a large-vocabulary object detection dataset, e.g., LVIS, which contains 1200 unique object types, the number of unique task combinations reaches 1.34E43, a very large number. Experiments show that the proposed training strategy serves the purpose well and yields models that perform task-conditioned object detection.
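The order of magnitude quoted above can be checked with a binomial coefficient. The task size of 20 used below is an assumption: it is the value for which the number of ways to choose object types out of 1200 comes out at roughly 1.34E43, but the text does not state it explicitly.

```python
import math

n_types = 1200   # unique object types, as quoted for LVIS in the text
task_size = 20   # assumption: chosen because it reproduces the quoted figure

count = math.comb(n_types, task_size)
print(f"{count:.2e}")
```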
When learning from a phrase grounding dataset, the task T is simply the corresponding caption of the image. The label set L is the set of entities that appear in the caption. However, since there are only a few entities in each caption, learning L cls becomes too easy. Therefore, we randomly select other entities from the dataset to create a label set of up to K classes to increase the difficulty of learning. Later experiments show that this method is effective in improving performance on phrase grounding.

Comparison to Grounding-based Method
Our proposed architecture, the Multimodal Detection Network, has several strengths over traditional approaches that directly fuse text and vision features. Instead, our model fuses latent queries with text features, leading to the following advantages:
Deep fusion for any query-based OD: Early VLP work, e.g., ViLD (Gu et al., 2021) and Detic (Zhou et al., 2022b), uses shallow fusion for object detection, i.e., text embeddings are used only for classification, which cannot solve fore/background conflicts. Meanwhile, prior deep fusion models, e.g., MDETR (Kamath et al., 2021) and GLIP (Li et al., 2022b), use specialized cross-attention architectures to fuse the text and visual features. Our method can be applied to any query-based OD architecture, e.g., DETR or Sparse-RCNN, without the need for model changes.
Inference speed and performance: Visual grounding models such as MDETR (Kamath et al., 2021) and TransVG (Deng et al., 2021) encode one class at a time for OD and suffer from slow inference speed, e.g., 10 s/image for MDETR. Also, MDETR uses a transformer to fuse images with text, which cannot scale up to multi-scale features due to the complexity of self-attention. Our method deals with fixed-size latent queries, which are independent of the visual features. Thus, our method is able to predict many classes with a significant speed-up and on-par or better performance.

Implementation Details
We implement OmDet with the following settings. For text embeddings, the CLIP-B/16 text encoder (Radford et al., 2021b) is used throughout the study. We did not use a prompt template as in Gu et al. (2021), i.e., encoding object names in a template such as "a photo of {}". This is because preliminary studies show no major difference between using and not using the prompt template. Furthermore, the preliminary study also suggests there are no significant differences between using single-modal language models, e.g., BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019), and multimodal language models, e.g., CLIP. We suspect this is because object detection does not involve complex language understanding.
The task and label encoders share the same text encoder. On top of the text encoder, two independent Transformer layers (Vaswani et al., 2017) are used to provide dedicated encodings for the task input and the label input. Our study shows that this set encoding improves OmDet's performance.
For visual backbones, both Swin Transformers (Liu et al., 2021) and ConvNeXt (Liu et al., 2022) are used in the experiments. A standard FPN (Lin et al., 2017a) is used to extract a four-level feature map from the visual encoders. Both backbones are pre-trained on ImageNet 21K data (Ridnik et al., 2021). Preliminary studies found that ConvNeXt usually performs on par with or better than Swin Transformers. Therefore, we use ConvNeXt as the default choice.
Lastly, the MDN network utilizes MHSA to fuse information from the visual input and the text input into latent queries. We equip MDN with 300 latent queries and use ROIAlignV2 (He et al., 2017) as the ROI Pooler to extract region features from the visual backbone. Six sequential MDN blocks are cascaded to create the final bounding box and classification predictions.

Large-scale Pre-training
Two versions of large-scale pre-training are conducted.
Large-scale OD Pre-training (OmDet V1): In this setting, we accumulate a large number (104) of object detection datasets for pre-training to show that OmDet is able to accumulate knowledge from many OD datasets without suffering from fore/background and label inconsistency challenges. Pre-training datasets include COCO (Lin et al., 2014), Object365 (Shao et al., 2019), LVIS (Gupta et al., 2019), PhraseCut (Wu et al., 2020) and Roboflow 100 (Ciaglia et al., 2022).
Large-scale OD & Grounding Pre-training (OmDet V2): In the second version, we exclude any images related to the COCO and LVIS datasets from pre-training, since we will test zero-shot performance on these two datasets. In addition to large-scale OD multi-dataset pre-training, OmDet is able to horizontally expand to non-OD types of training data. Specifically, we include the GoldG grounding dataset curated by Kamath et al. (2021), which includes 1.3M pairs of image-caption data with grounded entities. Data details are described in
Model Training: For OmDet models, the initial learning rate is 5e-5 and it decays by a factor of 0.1 at 70% and 90% of the total iteration steps. A ConvNeXt Base backbone is used with a 6-layer MDN head. The batch size is 40, the maximum number of detections per image is 300, and K is set to 80. All of the proposed models are pre-trained for 36 epochs on an NVIDIA A100 GPU cluster and then fine-tuned on the downstream data.
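The step-decay schedule described above can be written as a short function. This is a sketch under the assumption that the rate is multiplied by 0.1 at each of the two milestones and that no warm-up is used; the function name and the total step count in the example are illustrative only.

```python
def learning_rate(step, total_steps, base_lr=5e-5, milestones=(0.7, 0.9), gamma=0.1):
    """Step decay: multiply base_lr by gamma once per milestone already reached."""
    passed = sum(step >= int(m * total_steps) for m in milestones)
    return base_lr * gamma ** passed

total = 1000  # illustrative number of training steps
for step in (0, 699, 750, 950):
    print(step, learning_rate(step, total))
```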

Downstream Tasks
We focus on three types of downstream tasks for evaluation:
Object Detection in the Wild: Object detection in the wild tests a model's ability to adapt to various domains with drastically different label sets. ELEVATER (Li et al., 2022a) is a new object detection benchmark that is composed of 35 diverse, challenging real-world domains with full-shot, few-shot and zero-shot training settings. Note that there are two variations of the data used in prior work (Li et al., 2022b). The full version includes 35 domains, which we will refer to as ODinW35, and the second version only includes 13 out of the 35 domains, which we will refer to as ODinW13. The evaluation metric is AP.
Open-vocabulary Object Detection: Open-vocabulary detection tests a model's ability to recognize a large number of objects with types that are not included in the training data. Zero-shot performance on COCO (Lin et al., 2014), LVIS (Gupta et al., 2019), and ODinW (Li et al., 2022a) is commonly used as the benchmark. The evaluation metric is AP.
Phrase Grounding: For phrase grounding on Flickr30k (Plummer et al., 2015), we do not further fine-tune the model after grounding pre-training, and directly evaluate it on the Recall@1, Recall@5 and Recall@10 metrics.
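For readers unfamiliar with the metric, a minimal sketch of Recall@k for phrase grounding follows. It assumes the common convention that a phrase counts as correctly grounded if any of its top-k ranked boxes overlaps the ground truth box with IoU of at least 0.5; the threshold and the one-box-per-phrase simplification are assumptions, not details given in the text.

```python
def iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_predictions, gt_boxes, k, iou_threshold=0.5):
    """Fraction of phrases whose top-k boxes contain at least one hit."""
    hits = sum(
        any(iou(box, gt) >= iou_threshold for box in ranked[:k])
        for ranked, gt in zip(ranked_predictions, gt_boxes)
    )
    return hits / len(gt_boxes)

# Two phrases, each with boxes sorted by descending confidence.
predictions = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],   # correct at rank 1
    [(50, 50, 60, 60), (5, 5, 15, 14)],   # correct only at rank 2
]
ground_truth = [(0, 0, 10, 10), (5, 5, 15, 15)]

print(recall_at_k(predictions, ground_truth, k=1))
print(recall_at_k(predictions, ground_truth, k=5))
```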
We provide the detailed settings of the baseline models used in our experiments, including information on the fusion type (deep vs. shallow), backbones, number of parameters and pre-training data (Table 3).

Results on Object Detection in the Wild
In our evaluation on ODinW, we compared the zero-shot, few-shot, and full-shot scores of GLIP-Tiny, DyHead-Tiny, DINO Swin-Tiny, and OmDet (Table 4). When compared to state-of-the-art models that were trained with the same backbone size, our OmDet V1-T model achieved the highest AP scores on few-shot and full-shot evaluations. Furthermore, we trained OmDet V1-T and OmDet V1-B under the same settings, but with different backbones. OmDet V1-B outperformed all other models and achieved state-of-the-art results on zero-shot, few-shot, and full-shot evaluations. Note that the Wiktionary definition of each word is used as a knowledge source for OmDet V1-B in the zero-shot setting.

Results on Open-Vocabulary Detection
We evaluated the zero-shot performance of different models on several open-vocabulary detection datasets, including COCO Val, LVIS MiniVal, ODinW13, and ODinW35 (Table 5). Our proposed models, OmDet V1-B and OmDet V2-B, achieved competitive results on the evaluated datasets. Specifically, OmDet V2-B significantly outperformed all other models on all datasets, achieving 9 points higher AP than the previous state-of-the-art model (GLIP-B) with the same base backbone on COCO Val. Moreover, our model showed exceptional performance on rare objects in LVIS MiniVal, outperforming the previous state-of-the-art model OWL-B by about 7 points (27.92 vs. 20.8). These results demonstrate the effectiveness of our proposed models for open-vocabulary detection tasks, particularly on rare object detection where data scarcity is an issue. Accordingly, our model achieved the highest performance on ODinW, a dataset that contains a large number of rare objects and reflects real-world applications. In the terms APr, APc, and APf, the letters r, c, and f denote rare, common, and frequent, respectively.

Results on Phrase Grounding
Our model achieved competitive performance on phrase grounding, with Recall@1, Recall@5, and Recall@10 scores of 85.1, 96.2, and 97.5, respectively. Its performance is only slightly lower than that of the state-of-the-art model, GLIP-B, at Recall@1 (85.1 vs. 85.7). This is due to our approach of using fixed-size latent queries independently of the context to reduce complexity. However, we achieved higher scores than GLIP-B at Recall@5 and Recall@10 with significantly more efficient computation than the traditional approaches.
These results demonstrate the effectiveness of OmDet for the task of phrase grounding as well, and highlight the importance of incorporating latent query deep fusion in both object detection and phrase grounding.

Ablation and Analysis
To further investigate the behavior of our proposed method, we conducted several follow-up studies: 1) an analysis of the efficacy of deep fusion, 2) an analysis of the effect of pre-training, and 3) visualizations of language-aware object detection. We verify that the proposed deep fusion mechanism can effectively learn from multiple object detection datasets without suffering from the task conflict challenge. Additionally, we show the benefit of the proposed multi-dataset training over single-dataset training.

Analyze the Efficacy of Deep Fusion
For MDOD, we follow the experimental setting of Yao et al. (2020) and choose COCO (Lin et al., 2014), Pascal VOC (Everingham et al., 2010), WIDER FACE (Yang et al., 2016) and WIDER Pedestrian (Loy et al., 2019) as joint-training datasets. Note that COCO is the largest dataset with 118K images, while the other three datasets are almost 10 times smaller. Also, the COCO dataset has a diverse set of categories that cover the classes in Pascal VOC and WIDER Pedestrian. WIDER FACE is the only dataset that has a "face" class. Therefore, these four datasets serve as a good test bed for the MDOD study. Three models serve as baselines. First, we include Sparse R-CNN (Sun et al., 2021) as a baseline due to its strong performance and its similarity to OmDet in terms of model structure. Then we create OmDet-Single and OmDet-Shallow for the ablation study.
• OmDet-Single: To compare the performance on single datasets with Sparse R-CNN, we train OmDet on the four datasets separately. Since each model is trained on a single dataset only, it cannot benefit from the proposed multi-dataset training.
• OmDet-Shallow: We also train an OmDet-Shallow model by removing the task encoder and MDN, which reduces the localization network to Sparse R-CNN, and only utilize the language features for the final label classification. This is similar to previous work such as Detic (Zhou et al., 2022b).
We use an ImageNet (Deng et al., 2009) pre-trained Swin Transformer Tiny (Liu et al., 2021) as the visual backbone and CLIP ViT-B/16 as the language encoder for OmDet. The same Swin Transformer is used as the backbone for Sparse R-CNN. All models are trained for 12 epochs. The initial learning rate is set to 5e-5 for OmDet and 2.5e-5 for Sparse R-CNN.
OmDet vs. Sparse R-CNN on a Single Dataset: First, we demonstrate the validity of our framework on classic OD tasks.
As shown in Table 7, OmDet-Single gets higher AP scores on COCO, PASCAL VOC and WIDER FACE than Sparse R-CNN under the same training setting, and is only about 1 point lower on WIDER Pedestrian. These results show that our model, with its novel language-aware OD architecture, maintains the same or better performance on single-dataset OD using the same number of trainable parameters. However, since it is only trained with a small vocabulary, it does not have open-vocabulary or few-shot capability.
OmDet vs. OmDet-Single: The only difference between OmDet and OmDet-Single is that OmDet is jointly trained on all four datasets by utilizing the proposed language-aware OD architecture. Table 7 shows that the AP scores of OmDet are significantly higher than those of OmDet-Single on PASCAL VOC (+15.52 AP), WIDER FACE (+7.33 AP), and WIDER Pedestrian (+12.85 AP). Moreover, OmDet shows better performance than Sparse R-CNN on all datasets. These results confirm that OmDet is capable of multi-dataset training by solving taxonomy conflicts and fore/background inconsistency. Moreover, knowledge sharing in joint training improves the overall detection performance, especially for the datasets with fewer training samples.
OmDet vs. OmDet-Shallow: Lastly, an ablation study is used to verify the contribution of the proposed MDN block. OmDet's fusion mechanism is deep, since the task embedding is combined with visual features early on and influences both localization and classification. On the other hand, OmDet-Shallow's fusion mechanism is shallow, i.e., it only utilizes the label embedding in the final layer of object classification.
Table 7 shows that OmDet performs better and that OmDet-Shallow only partially solves the MDOD challenges. OmDet-Shallow achieves good performance on PASCAL VOC, WIDER FACE, and WIDER Pedestrian compared with OmDet-Single. This is because OmDet-Shallow resolves the taxonomy conflict challenge and enables semantic sharing among object label embeddings, similar to Detic (Zhou et al., 2022b).
However, OmDet-Shallow fails on COCO with a low AP score, since it cannot resolve the fore/background inconsistency challenge. Since COCO has 80 categories, far more than the other three datasets, many of its objects are considered background in the other three datasets. Therefore, the low AP is caused by incorrectly detecting COCO objects as background. We visualize outputs of OmDet-Shallow and OmDet on COCO images in Figure 4, which confirms our hypothesis that OmDet-Shallow treats many objects that are not in Pascal VOC and Wider Face/Pedestrian as background. For example, although OmDet-Shallow correctly detects all pedestrians in the last image, it misses the object "Skis". Unlike OmDet-Shallow, OmDet benefits from deep fusion and detects the objects in all images correctly.
The aim of this setup (Table 8) is to examine the relationship between the number of visual concepts in the pre-training data and the performance of the model on downstream tasks under various fine-tuning settings. Note that we used OmDet ConvNeXt-T as the backbone architecture for our ablation studies.
The effectiveness of Zero/Few-Shot: As shown in Table 8, adding more pre-training datasets yields significant improvement in zero-shot settings. Specifically, adding the Objects365 dataset gives an absolute gain of 3.7 points in average mAP. Surprisingly, adding LVIS to the pre-training data hurts performance by 1.1 points. We speculate that the drop is due to the noisy and incomplete annotations of the LVIS dataset. Adding the GCC dataset to the pre-training corpora yields another large gain, bringing the zero-shot performance to 16.0 (compared to 9.8 for OmDet-C). There are several promising directions to further improve the zero-shot performance of OmDet, including unfreezing the text encoder during pre-training and incorporating phrase grounding data with contextual text information in pre-training. We leave them to future research.
Parameter-efficient Fine-tuning: As large-scale pre-trained models grow significantly larger, e.g., to more than 1B parameters, the cost of fine-tuning (FT) the entire model becomes prohibitive for low-end GPUs. Parameter-efficient fine-tuning is designed to alleviate this challenge by tuning only a very small proportion of the model. In this paper, we explore two options: Head-only Tuning and Prompt Tuning.
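As a rough illustration of what "tuning only a very small proportion of the model" means, the sketch below computes the fraction of parameters left trainable when every other group is frozen. The group names and sizes are hypothetical placeholders, not OmDet's actual parameter counts, and which groups each tuning mode unfreezes is an assumption here.

```python
def trainable_fraction(param_counts, trainable_prefixes):
    """Fraction of parameters left trainable when all other groups are frozen."""
    total = sum(param_counts.values())
    if total == 0:
        raise ValueError("model has no parameters")
    prefixes = tuple(trainable_prefixes)
    trainable = sum(
        count for name, count in param_counts.items() if name.startswith(prefixes)
    )
    return trainable / total


# Hypothetical parameter groups and sizes, for illustration only.
param_counts = {
    "backbone": 28_000_000,
    "mdn_head": 9_000_000,
    "task_embedding": 200_000,
}

head_only = trainable_fraction(param_counts, ["mdn_head"])
prompt_only = trainable_fraction(param_counts, ["task_embedding"])
print(f"head-only: {head_only:.1%}, prompt: {prompt_only:.1%}")
```

In a real training loop the same prefix test would decide which parameters receive gradients; the bookkeeping above only shows how small the tuned share can be.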
Experimental results show that large-scale multi-dataset pre-training is crucial for successful parameter-efficient fine-tuning (Table 8). For Head-only FT, the performance drop relative to full-model tuning is reduced from 11.3% for OmDet-C to only 6.1% for OmDet. The same trend is observed for Prompt FT, where the drop relative to full-model tuning is reduced from 65.9% for OmDet-C to 45.5% for OmDet. Figure 5 also visualizes the trend of AP vs. the vocabulary size in pre-training (log scale). A clear upward trend can be observed as more visual concepts are included during pre-training. This suggests that: (1) multi-dataset pre-training enables the accumulation of a large number of visual concepts, which leads to a stronger backbone that extracts general-purpose visual features (supported by the head-only FT results);
(2) diversity in language is crucial for successful prompt tuning, such that the entire model output can be controlled by the task embedding alone (less than 1% of the parameters of the entire model).
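The performance-drop percentages quoted above are stated relative to full-model tuning. A minimal sketch of that arithmetic follows, assuming the drop is the AP lost expressed as a percentage of the full-model AP (the text does not spell out the formula); the AP values are made up and are not taken from Table 8.

```python
def relative_drop(full_model_ap: float, peft_ap: float) -> float:
    """Percentage of full-model AP lost by a parameter-efficient method."""
    if full_model_ap <= 0:
        raise ValueError("full_model_ap must be positive")
    return 100.0 * (full_model_ap - peft_ap) / full_model_ap


# Hypothetical AP values for illustration only.
full_ap = 50.0
head_only_ap = 45.0
print(f"{relative_drop(full_ap, head_only_ap):.1f}%")  # prints "10.0%"
```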

Visualization of Language-Aware Detection
Lastly, we present qualitative visualizations to showcase the effectiveness of the proposed language-aware object detection model in accurately localizing and labeling objects based on natural language inputs (Figure 6). Given different input tasks, e.g., [Sandwich, Tobacco Pipe] vs. [Lighter, Bottle], OmDet dynamically adapts its object localization and classification to the given task. Figure 6 also visualizes the intermediate output at each stage of the MDN block. We found that the model learns to place its proposal boxes over the whole image at the initial stage and then narrows its focus to the objects of interest within 2-3 steps, in a top-down search manner. The later stages continue to refine the boxes and confidence scores (e.g., fewer duplicated bounding boxes and more certain confidence).
Table 10: Inference Speed on LVIS datasets, 12K labels
Table 10 presents a comparison of inference speeds across different models. The visual grounding model MDETR encodes one class at a time for OD and suffers from slow inference speed, e.g., 10s/img. In addition, MDETR uses a transformer to fuse image with text, which cannot scale up to multi-scale features due to the complexity of self-attention. GLIP instead combines all labels into a single descriptive sentence. While this is effective in certain contexts, it encounters limitations on datasets with an extensive label set, such as LVIS. The slowdown arises because concatenating many labels into one long sentence creates unnecessary links between the labels, which complicates detection and reduces speed. Our MDN operates on fixed-size latent queries, which are independent of visual features. Thus, our method can predict many classes with a significant speed-up and on-par or better performance.
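As a rough illustration of why label-set size matters for inference cost, the sketch below counts language-conditioned forward passes per image under two schemes: one class per pass, as described for MDETR above, and a fixed number of labels per pass. The vocabulary size, the chunk size, and the function names are assumptions made for illustration; the text does not state how many labels any of the compared models process per pass.

```python
import math


def passes_one_class_at_a_time(num_labels: int) -> int:
    """One forward pass per label, so cost grows linearly with the vocabulary."""
    return num_labels


def passes_chunked(num_labels: int, labels_per_pass: int) -> int:
    """Forward passes needed when a fixed number of labels is handled per pass."""
    if labels_per_pass <= 0:
        raise ValueError("labels_per_pass must be positive")
    return math.ceil(num_labels / labels_per_pass)


num_labels = 1000       # hypothetical vocabulary size
labels_per_pass = 80    # hypothetical chunk size

print(passes_one_class_at_a_time(num_labels))
print(passes_chunked(num_labels, labels_per_pass))
```

The pass count is only one factor in the measured speed; the per-pass cost of fusing text with image features also differs between the architectures discussed above.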

Different Iterative Fusion
We explored the influence of the number of iterations on the multimodal detection network, which is based on iterative fusion, using the ODinW datasets. Our investigation varied the number of heads in the ConvNeXt-B architecture, specifically analyzing the performance impact of 1, 3, and 6 heads. The results are summarized in Table 11. We observe a substantial improvement when increasing the number of heads from 1 to 3, indicating that additional iterations enhance the network's capability to discern complex patterns in data. The improvement continues, though at a reduced pace, when expanding from 3 to 6 heads. This suggests that iterative fusion helps in handling complex scenes, although the gains diminish as more iterations are added.
In Table 12, we present a quantitative comparison between our proposed model, OmDet, and several state-of-the-art object detection models, namely ViLD (Gu et al., 2021), CORA (Wu et al., 2023b), and BARON (Wu et al., 2023a). The evaluation is conducted on the COCO and LVIS benchmarks, using the open-vocabulary setting of ViLD. The evaluation metrics are AP50 novel and AP50 base for COCO, and APr (the AP of rare categories) for LVIS. AP50 novel measures performance on novel categories, which are not seen during training, while AP50 base measures detection on the base categories, which are present in the training dataset. APr reports performance on the 337 rare categories of LVIS, which were not part of the training categories. Together, these metrics assess the model across different open-vocabulary detection scenarios and dataset characteristics.

Conclusion
This work advances zero/few-shot OD via continual pre-training on a large number of OD datasets by solving two key technical challenges: taxonomy conflict and fore/background inconsistency. OmDet introduces a novel multimodal detection network that fuses natural language prompts with visual features for language-augmented object detection. Our results confirm the efficacy of OmDet for multi-dataset learning and large-scale pre-training as a foundation model. OmDet achieved state-of-the-art performance on OV-COCO, with an AP50 novel score of 75.17 and an AP50 base score of 70.79, and an APr score of 24.65 on OV-LVIS, significantly surpassing competing models such as BARON and CORA. We also show that enlarging the vocabulary size via multi-dataset pre-training effectively improves zero/few-shot learning and parameter-efficient fine-tuning. OmDet achieved state-of-the-art performance on 35 downstream tasks from ODinW. Future research will focus on improving OmDet by exploring better text prompt encoding methods and pre-training strategies that improve zero-shot detection and prompt-tuning performance.

Figure 1: Overview of OmDet Architecture. The proposed Multimodal Detection Network iteratively fuses vision and language features into latent queries for object detection.

Figure 2: Network architecture for the Multimodal Detection Network (MDN), simplified here for illustration purposes.

Figure 3: Comparison with other frameworks. (a) Shallow fusion that only utilizes text information for object classification. (b) Deep fusion that fuses visual and text features in the backbone before entering the object detection head. (c) Deep latent fusion (ours) utilizes latent queries to fuse multimodal information, enabling adaptation to any query-based OD architecture.

Figure 5: Vocabulary size used in pre-training vs. the AP score of fine-tuning on ODinW with head-only and prompt tuning. The x-axis is in log scale.

Figure 6: Illustration of language-aware OD, where a single model can generalize (without fine-tuning) to any input tasks on the fly in the form of natural language.

Data details are described in Table 1.
Table 1: Pre-train data used in large-scale OD pre-training, resulting in OmDetV1.

Table 3: Baseline models and their training setup.

Table 4: Comparison between OmDetV1 and other models on average AP of zero-shot, few-shot (3-shot), and full-shot on ODinW35.

Table 6: Zero-shot Performance on Flickr30K val for Phrase Grounding.

Table 7: MDOD training results on four datasets. OmDet is able to resolve task conflict issues in MDOD and achieves higher performance compared to single-dataset models.

Table 8: Average AP of zero-shot, full-model, head-only, and prompt fine-tuning on 35 downstream tasks in ODinW. The gray text shows the performance drop of parameter-efficient tuning compared to full-model tuning.

Table 9: Average AP of full-model fine-tuning on 35 downstream tasks in ODinW for Small-shot, Medium-shot, and Big-shot tasks.
The 35 downstream tasks in ODinW come with different training data sizes, varying from only 17 training images to more than 32K training images. Therefore, we divide the 35 tasks into three categories: (1) Small-shot (8 tasks): tasks with fewer than 200 training images; (2) Medium-shot (13 tasks): tasks with between 200 and 2000 training images; (3) Big-shot (14 tasks): tasks with more than 2000 training images. Results with full-model fine-tuning are summarized in Table 9. They show that large-scale multi-dataset pre-training is particularly effective for small-shot and medium-shot tasks with limited in-domain training data. For small-shot datasets in particular, OmDet outperforms OmDet-C by 10.99 absolute AP points. For Big-shot tasks, by contrast, the advantages of pre-training become less evident.
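The three size categories described above can be written as a simple bucketing rule. This is a minimal sketch: the thresholds follow the text, treating both 200 and 2000 as medium-shot is an assumption about the boundaries, and the task names and image counts are hypothetical placeholders.

```python
def size_bucket(num_train_images: int) -> str:
    """Assign a task to a size category based on its number of training images."""
    if num_train_images < 200:
        return "small-shot"
    if num_train_images <= 2000:
        return "medium-shot"
    return "big-shot"


def group_tasks(train_sizes):
    """Group task names by size category."""
    groups = {"small-shot": [], "medium-shot": [], "big-shot": []}
    for task, num_images in train_sizes.items():
        groups[size_bucket(num_images)].append(task)
    return groups


# Hypothetical tasks and sizes, for illustration only.
train_sizes = {"task_a": 17, "task_b": 850, "task_c": 32_000}
print(group_tasks(train_sizes))
```

Averaging the per-task AP inside each group would then give per-category figures of the kind reported in Table 9.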

Table 11: Comparison of AP and Inference Speed with Different Iterative Numbers
Comparison with State-of-the-Art Methods on Open-Vocabulary Benchmarks

As illustrated in Table 12, OmDet significantly outperforms the other models across all three metrics. With an AP50 novel score of 75.17, an AP50 base score of 70.79, and an APr of 24.65, OmDet demonstrates superior detection capabilities for both novel and base objects.
Table 12: Comparative analysis of recent object detection models on OV-COCO and OV-LVIS.
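The novel/base split behind the AP50 novel and AP50 base scores can be sketched as averaging per-category AP50 separately over the two groups of categories. This is a minimal illustration, not the official evaluation code: the function name, the category names, and the AP values are hypothetical.

```python
def mean_ap_by_split(per_class_ap50, novel_classes):
    """Average per-class AP50 separately over novel and base categories."""
    novel = [ap for name, ap in per_class_ap50.items() if name in novel_classes]
    base = [ap for name, ap in per_class_ap50.items() if name not in novel_classes]

    def mean(values):
        # NaN signals an empty split rather than a score of zero.
        return sum(values) / len(values) if values else float("nan")

    return {"novel": mean(novel), "base": mean(base)}


# Hypothetical per-class AP50 values, for illustration only.
per_class_ap50 = {"person": 60.0, "car": 70.0, "umbrella": 40.0, "cat": 50.0}
novel_classes = {"umbrella", "cat"}
print(mean_ap_by_split(per_class_ap50, novel_classes))
# prints {'novel': 45.0, 'base': 65.0}
```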