Abstract:
Context: Users and developers use bug tracking systems to report errors that occur during the development and testing of software. Manually identifying duplicate reports is a tedious task, especially for software with large bug repositories. In this context, automatic duplicate detection becomes a necessary task that can help prevent repeatedly fixing the same bug.
Objective: In this article, we propose BERT-MLP, a novel approach to duplicate bug report detection (DBRD) built on a pretrained language model, bidirectional encoder representations from transformers (BERT), with the aim of improving the detection rate compared to existing works.
Method: Our approach considers only the unstructured data of bug reports. These are fed into the BERT model in order to learn the contextual relationships between words. BERT's output is then fed into a multilayer perceptron (MLP) classifier, which forms our base DBRD model.
Results: Our approach was evaluated on three projects: Mozilla Firefox, Eclipse Platform, and Thunderbird. It achieved an accuracy of 92.11%, 94.08%, and 89.03% on Mozilla, Eclipse, and Thunderbird, respectively. We compared BERT-MLP with a dual-channel convolutional neural network (DC-CNN) model and with other pretrained models, including RoBERTa and Sentence-BERT. Results showed that BERT-MLP outperformed the second-best performing models (DC-CNN and Sentence-BERT) by 12% in accuracy on Eclipse and by 9% on both Mozilla and Thunderbird.
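To make the described architecture concrete, the following is a minimal sketch (not the authors' implementation) of a BERT encoder whose pooled output feeds a small MLP binary classifier, assuming the Hugging Face transformers library and PyTorch. The model name, layer sizes, and the sentence-pair encoding of a candidate report pair are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of a BERT-MLP duplicate bug report classifier.
    # Assumptions (not from the paper): bert-base-uncased, a 256-unit
    # hidden layer, and sentence-pair encoding of two report texts.
    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizer

    class BertMlpDBRD(nn.Module):
        def __init__(self, hidden: int = 256):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            # MLP head on top of BERT's 768-dim pooled [CLS] representation.
            self.mlp = nn.Sequential(
                nn.Linear(self.bert.config.hidden_size, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 2),  # duplicate / non-duplicate
            )

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            return self.mlp(out.pooler_output)

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertMlpDBRD()
    # Encode a candidate pair of reports as one sequence (sentence-pair input).
    enc = tokenizer("App crashes on startup", "Crash when launching the app",
                    return_tensors="pt", truncation=True, padding=True)
    logits = model(enc["input_ids"], enc["attention_mask"])

In this sketch the two reports of a candidate pair are packed into a single BERT input so the encoder can attend across both texts; the MLP then scores the pair as duplicate or non-duplicate.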