Abstract:
Text classification is the task of automatically assigning a text to a category of a given classification system based on its content. It typically involves several steps, including word segmentation, feature selection, weight calculation, and classification performance evaluation. Among these, feature selection is a key step: it identifies the terms most relevant to the content of a text and therefore strongly affects classification accuracy. Text classification is an important module in text processing and is widely applied in areas such as spam filtering, news classification, sentiment classification, and part-of-speech tagging. This paper proposes a method for extracting feature words based on the chi-square statistic. Because a word may carry different information when it appears alone than when it appears together with another word, we classify texts using both single words and double words (pairs of words) as features at the same time. We evaluated the method in experiments with the classical Naive Bayes and Support Vector Machine classification algorithms. A comparison and analysis of the experimental results demonstrates the efficiency of the method.
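To make the idea concrete, the following is a minimal sketch of chi-square scoring over single-word and double-word features. It is an illustration under stated assumptions, not the implementation used in the paper: it assumes the standard 2x2 contingency form of the chi-square statistic, treats a double word as two adjacent tokens, and assumes texts are already segmented. The function names (`extract_features`, `chi_square_scores`, `top_features`) and the toy data are hypothetical.

```python
from collections import Counter


def extract_features(tokens):
    """Return the set of single words and adjacent word pairs in a tokenized text.

    Assumption: a "double word" is a pair of adjacent tokens.
    """
    unigrams = set(tokens)
    bigrams = {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
    return unigrams | bigrams


def chi_square_scores(docs, labels, target):
    """Score every feature against one class with the 2x2 chi-square statistic.

    chi2(t, c) = N * (A*D - C*B)^2 / ((A+C) * (B+D) * (A+B) * (C+D))
    A: documents in c containing t      B: documents outside c containing t
    C: documents in c without t         D: documents outside c without t
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    in_class = Counter()   # A for each feature
    out_class = Counter()  # B for each feature
    for tokens, y in zip(docs, labels):
        for feature in extract_features(tokens):
            if y == target:
                in_class[feature] += 1
            else:
                out_class[feature] += 1

    scores = {}
    for feature in set(in_class) | set(out_class):
        a = in_class[feature]
        b = out_class[feature]
        c = n_target - a
        d = (n - n_target) - b
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # A zero denominator means the feature or class is constant: no signal.
        scores[feature] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def top_features(docs, labels, target, k):
    """Return the k highest-scoring features, ties broken alphabetically."""
    scores = chi_square_scores(docs, labels, target)
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


# Hypothetical toy corpus of already-segmented texts.
docs = [
    ["cheap", "pills", "online"],
    ["cheap", "pills", "now"],
    ["meeting", "notes", "online"],
    ["meeting", "agenda", "now"],
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi_square_scores(docs, labels, "spam")
selected = top_features(docs, labels, "spam", k=3)
```

The statistic is symmetric, so a feature that occurs only outside the target class (here `meeting`) scores as high as one that occurs only inside it. The selected features would then be weighted and passed to a classifier such as Naive Bayes or a Support Vector Machine.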