A Proposal-Free One-Stage Framework for Referring Expression Comprehension and Generation via Dense Cross-Attention

Abstract:

Referring Expression Comprehension (REC) and Generation (REG) have become central tasks in visual reasoning, since they are essential steps for many vision-and-language tasks such as visual question answering and visual dialogue. However, they have not been widely adopted in downstream tasks, mainly for the following reasons: 1) mainstream two-stage methods rely on additional annotations or off-the-shelf detectors to generate proposals, which heavily degrades the generalization ability of models and leads to inevitable error accumulation; 2) although one-stage strategies for REC have been proposed, these methods depend on many hyper-parameters (such as anchors) to generate bounding boxes. In this paper, we present a proposal-free one-stage (PFOS) framework that can directly regress the region of interest from the image or generate an unambiguous description in an end-to-end manner. Instead of following the dominant two-stage fashion, we take the dense grid of an image as input to a cross-attention transformer that learns multi-modal correspondences. The final bounding box or sentence is predicted directly from the image, without anchor selection or the computation of visual differences. Furthermore, we extend the traditional two-stage listener-speaker framework to joint training under a one-stage learning paradigm. Our model achieves state-of-the-art performance in both accuracy and speed for comprehension, and competitive results for generation.
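To make the proposal-free comprehension idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: the module names, dimensions, and the mean-pool regression head are illustrative assumptions. Flattened dense grid features act as queries that cross-attend to word features through a transformer, and a small head regresses a single normalized box, with no anchors or proposals anywhere in the pipeline.

```python
import torch
import torch.nn as nn

class PFOSSketch(nn.Module):
    """Illustrative sketch: dense grid features cross-attend to word
    features, then a small head regresses one box directly."""

    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        # Decoder layer = self-attention over the grid + cross-attention
        # from grid queries to word-feature memory.
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.cross_attn = nn.TransformerDecoder(layer, num_layers)
        # Direct regression head: predict normalized (cx, cy, w, h),
        # so no anchor hyper-parameters are needed.
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),
        )

    def forward(self, grid_feats, word_feats):
        # grid_feats: (B, H*W, d_model) flattened backbone features
        # word_feats: (B, L, d_model) embedded expression tokens
        fused = self.cross_attn(grid_feats, word_feats)  # multi-modal grid
        pooled = fused.mean(dim=1)                       # global context
        return self.box_head(pooled)                     # (B, 4) box

model = PFOSSketch()
boxes = model(torch.randn(2, 196, 256), torch.randn(2, 12, 256))
print(boxes.shape)  # torch.Size([2, 4])
```

The key point of the sketch is the direct regression: the fused grid is pooled and mapped straight to box coordinates, so neither anchor selection nor proposal ranking is involved.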