Abstract:
With the development of sensor technology, multivariate time series classification has become an essential task in time series data mining. Multivariate time series arise throughout daily life, in areas such as finance, weather, and healthcare. Meanwhile, Transformers have achieved excellent results on NLP and CV tasks. The Vision Transformer (ViT) attains excellent results compared with state-of-the-art convolutional networks when pre-trained on large amounts of data and transferred to multiple small- and medium-sized image recognition benchmarks, while requiring substantially fewer computing resources. At the same time, multi-modal approaches can extract richer features, and related research has developed rapidly. In this work, we propose a multi-modal fusion transformer for time series classification. We use the Gramian Angular Field (GAF) to convert each time series into a 2D image, extract features from the 1D time series and the 2D image separately with CNNs, and fuse them. Finally, the fused output of the transformer encoder is fed into a ResNet for classification. We conduct extensive experiments on twelve time series datasets; compared with several baselines, our model achieves higher accuracy.
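The Gramian Angular Field encoding mentioned above can be illustrated compactly. The following is a minimal sketch (not the authors' implementation) of the Gramian Angular Summation Field variant, assuming the standard formulation: the series is rescaled to [-1, 1], mapped to angles via arccos, and the image entry (i, j) is cos(phi_i + phi_j).

```python
# Minimal GASF sketch: turn a 1D time series into a 2D image.
# This is an illustrative assumption of the standard GAF definition,
# not the code used in the paper.
import numpy as np

def gramian_angular_field(x: np.ndarray) -> np.ndarray:
    """Encode a 1D time series as a GASF image of shape (len(x), len(x))."""
    # Rescale the series to [-1, 1] so arccos is well defined.
    x_min, x_max = x.min(), x.max()
    x_scaled = 2 * (x - x_min) / (x_max - x_min) - 1
    x_scaled = np.clip(x_scaled, -1.0, 1.0)  # guard against rounding error
    # Polar encoding: each value becomes an angle.
    phi = np.arccos(x_scaled)
    # GASF entry (i, j) = cos(phi_i + phi_j).
    return np.cos(phi[:, None] + phi[None, :])

if __name__ == "__main__":
    # Example: encode a short sine wave into a 64x64 image.
    series = np.sin(np.linspace(0, 4 * np.pi, 64))
    image = gramian_angular_field(series)
    print(image.shape)  # (64, 64)
```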