An Experimental Analysis of Optimal Hybrid Word Embedding Methods for Text Classification Using a Movie Review Dataset

Alagarsamy, Sandhya; James, Visumathi; Raj, Raja Soosaimarian Peter

doi:10.1590/1678-4324-2022210830

Brazilian Archives of Biology and Technology

Article - Engineering, Technology and Techniques • Braz. arch. biol. technol. 65 • 2022 • https://doi.org/10.1590/1678-4324-2022210830

HIGHLIGHTS

Hybrid Word Embedding (HWE) models are proposed for efficient text classification.
The data sparsity issue is reduced by using the WordNet repository with the proposed model.
The optimal model is derived from a performance evaluation of the models.

Abstract

Today, a wealth of data is produced over the internet from multiple sources, giving rise to the term big data, and much of it is text. This work classifies a movie review dataset using Hybrid Word Embedding (HWE) models and derives the optimal text classification model. In text processing, efficient handling of the words and sentences in a document plays a vital role. Traditional methods such as Bag of Words (BoW) do not capture semantic correlation among words. Further, the words in a document are not always processed in order, so certain words are not processed at all, which creates data sparsity problems. To overcome the data sparsity problem, the proposed work applies hybrid word embedding using the WordNet repository. The hybrid model is built with three word embedding methods, namely an embedding layer, Word2Vec and GloVe, each in combination with a deep learning Convolutional Neural Network (CNN). The results obtained on the movie review dataset were compared and the optimal classification model was identified. The evaluation metrics include Log loss, Area Under Curve (AUC), Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), Mean Absolute Error (MAE), Error Rate (ERR), Matthews Correlation Coefficient (MCC), Training Accuracy, Test Accuracy, Precision, Recall and F1 score. The experimental results show that Word2Vec with CNN is the optimal hybrid word embedding model for classifying the chosen movie review dataset.
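The abstract does not spell out how WordNet is combined with the embeddings. One plausible reading, sketched here purely as an assumption, is that a word missing from the embedding vocabulary is replaced by a synonym that does have a vector, so fewer tokens end up as all-zero rows. The `SYNONYMS` table is a hypothetical stand-in for a WordNet lookup, and the two-dimensional vectors are toy values, not trained Word2Vec or GloVe weights.

```python
from typing import Dict, List, Optional

# Toy 2-d vectors; hypothetical values, not trained Word2Vec or GloVe weights.
EMBEDDINGS: Dict[str, List[float]] = {
    "good": [0.9, 0.1],
    "bad": [-0.8, 0.2],
    "movie": [0.1, 0.7],
}

# Hypothetical stand-in for a WordNet synonym lookup.
SYNONYMS: Dict[str, List[str]] = {
    "film": ["picture", "movie"],
    "fine": ["good"],
}


def lookup(word: str, use_synonyms: bool = True) -> Optional[List[float]]:
    """Return the vector for `word`, or for its first synonym that has one."""
    if word in EMBEDDINGS:
        return EMBEDDINGS[word]
    if use_synonyms:
        for candidate in SYNONYMS.get(word, []):
            if candidate in EMBEDDINGS:
                return EMBEDDINGS[candidate]
    return None  # still out of vocabulary


def embed(tokens: List[str], dim: int = 2) -> List[List[float]]:
    """Map tokens to vectors; unresolved tokens become zero vectors."""
    vectors = []
    for token in tokens:
        vector = lookup(token)
        vectors.append(vector if vector is not None else [0.0] * dim)
    return vectors


def coverage(tokens: List[str], use_synonyms: bool) -> float:
    """Fraction of tokens that resolve to an embedding vector."""
    hits = sum(lookup(t, use_synonyms) is not None for t in tokens)
    return hits / len(tokens)


review = ["fine", "film", "zzz", "good"]
print(coverage(review, use_synonyms=False))
print(coverage(review, use_synonyms=True))
```

In this toy setup the synonym fallback raises the share of tokens that receive a vector, which is the sense in which a lexical resource can reduce sparsity before the vectors are passed to a CNN.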

Keywords:
Hybrid Word Embedding; Natural Language Processing; Deep Neural Network; Text Classification; CNN.

Reference	Proposed Methodology	Finding	Limitation
Weston and Collobert (2008)	Convolutional neural network architectures were proposed for performing NLP tasks.	The text classification process can be performed using a CNN.	Word embedding techniques were not used.
Kim (2014)	Suggested implementing feature extraction techniques along with deep neural network methods.	A simple CNN with word2vec and slight modification of the hyperparameters yields optimal classification on NLP tasks.	A single-layer CNN is used, which will not perform efficiently on huge datasets.
Le and Mikolov (2014)	Proposed an unsupervised algorithm named paragraph vector to overcome the drawbacks of bag-of-words.	Paragraph vector works more efficiently than bag-of-words in text classification.	Implementation of paragraph vector is expensive.
Pennington et al. (2014)	Proposed a global regression model that combines two models, namely global matrix factorization and the local context window method.	The vector space produced by the model has meaningful substructures.	The model's performance varies with the number of negative samples.
Bozyigit et al. (2015)	Five classifiers and two feature selection methods for text classification were evaluated on a news dataset.	The best classification accuracy is obtained using this combination on the dataset.	The classification accuracy is not achieved for large datasets.
Ming and Xianchun (2016)	Proposed the doc2vec model, a combination of word2vec and a clustering algorithm, to express the information in a document.	The TF-IDF algorithm is used along with word2vec to form document vectors.	A single-layer CNN is used, which holds good for small documents; multiple layers for handling large documents are missing.
Andrei and Radu (2017)	Proposed a new text classification approach: clustering-based word embeddings using k-means.	The proposed work provides better results than the bag-of-words approach.	Alternatives to the k-means algorithm could be implemented to yield better results.
Hughes et al. (2017)	Proposed sentence-level classification using a deep convolutional neural network.	A multilayer deep convolutional neural network generates optimal features to represent semantics.	This approach has a scalability issue; it works only for small datasets.
Kilimci et al. (2018)	Different word embeddings and ensemble learning for classifiers are proposed for text classification.	The use of heterogeneous ensembles with word embeddings and deep learning enhances text classification.	Selecting the appropriate ensemble technique to yield optimal accuracy is challenging.
Roger et al. (2019)	Word embedding models along with machine learning models are proposed for hierarchical text classification.	Word2vec, GloVe and fastText proved to be the best classification models.	Hierarchical text classification becomes complex when handling real-time continuous data.
Yao et al. (2019)	A graph convolutional neural network is proposed for text classification.	A single text graph is built for the word corpus, and a Text Graph Convolutional Network built on it yields better results.	This approach lowers the training percentage in the dataset.
Albalawi et al. (2021)	Deep learning models such as BiLSTM with word embeddings are compared with traditional machine learning models on health-related tweets from social media.	The classification accuracy is higher with the deep learning model than with ML models.	Advanced deep learning techniques such as autoencoders may perform better than the approaches used.
Guilherme et al. (2021)	An embedding technique (Distance-based Vector Embedding) based on Logistic Markov Embedding is proposed.	The scalability issue is addressed using the proposed model along with a negative sampling approach.	The work is limited to machine learning approaches; deep learning techniques were not implemented.
Moreo et al. (2021)	The proposed word-class embedding methods were merged with pre-trained word embeddings for solving NLP tasks.	The proposed work enhances deep learning training and multiclass classification.	This approach is not suitable for binary classification.
Pittaras et al. (2021)	Semantics were extracted for each word, and then the word2vec embedding model was applied.	Applying semantics yields better performance on text classification.	The classification model's computational complexity is increased.

Word Embedding Model Name	Batch Size = 5	Batch Size = 10	Learning Rate = 0.3	Learning Rate = 0.5	Dropout Rate = 0.2	Dropout Rate = 0.5
Embedding Layer + CNN	1.36	1.65	1.27	1.25	2.87	2.41
Word2Vec + CNN	0.33	0.35	0.30	0.27	0.44	0.35
GloVe + CNN	0.45	0.50	0.41	0.36	1.06	0.72

Word Embedding Model Name	Batch Size = 5	Batch Size = 10	Learning Rate = 0.3	Learning Rate = 0.5	Dropout Rate = 0.2	Dropout Rate = 0.5
Embedding Layer + CNN	0.75	0.71	0.72	0.66	0.73	0.68
Word2Vec + CNN	0.98	0.97	0.95	0.91	0.97	0.95
GloVe + CNN	0.81	0.89	0.92	0.85	0.87	0.81

Word Embedding Model Name	Batch Size = 5	Batch Size = 10	Learning Rate = 0.3	Learning Rate = 0.5	Dropout Rate = 0.2	Dropout Rate = 0.5
Embedding Layer + CNN	0.79	0.72	0.79	0.76	0.68	0.66
Word2Vec + CNN	0.88	0.87	0.97	0.95	0.98	0.97
GloVe + CNN	0.81	0.84	0.87	0.84	0.89	0.88

Word Embedding Model Name	Batch Size = 5	Batch Size = 10	Learning Rate = 0.3	Learning Rate = 0.5	Dropout Rate = 0.2	Dropout Rate = 0.5
Embedding Layer + CNN	0.658	0.618	0.667	0.651	0.622	0.617
Word2Vec + CNN	0.985	0.961	0.983	0.979	0.969	0.968
GloVe + CNN	0.861	0.838	0.841	0.837	0.830	0.827

Word Embedding Model Name	Batch Size = 5	Batch Size = 10	Learning Rate = 0.3	Learning Rate = 0.5	Dropout Rate = 0.2	Dropout Rate = 0.5
Embedding Layer + CNN	0.691	0.687	0.670	0.666	0.751	0.750
Word2Vec + CNN	0.935	0.930	0.899	0.897	0.962	0.950
GloVe + CNN	0.818	0.815	0.873	0.838	0.891	0.817

Word Embedding Model Name	Batch Size = 5	Batch Size = 10	Learning Rate = 0.3	Learning Rate = 0.5	Dropout Rate = 0.2	Dropout Rate = 0.5
Embedding Layer + CNN	96.5%	96%	94.5%	94.2%	89.9%	89.7%
Word2Vec + CNN	99%	98.5%	98.9%	98.5%	98.5%	98.1%
GloVe + CNN	95.5%	94%	96.1%	95.9%	96.2%	96%

Word Embedding Model Name	Batch Size = 5	Batch Size = 10	Learning Rate = 0.3	Learning Rate = 0.5	Dropout Rate = 0.2	Dropout Rate = 0.5
Embedding Layer + CNN	84%	82.5%	85.7%	85.2%	87.2%	86.8%
Word2Vec + CNN	98.7%	98.5%	97.9%	97.4%	98.1%	97.6%
GloVe + CNN	92.5%	92%	91.7%	91.1%	89.9%	89.8%

Word Embedding Model Name	Batch Size = 5	Batch Size = 10	Learning Rate = 0.3	Learning Rate = 0.5	Dropout Rate = 0.2	Dropout Rate = 0.5
Embedding Layer + CNN	0.69	0.72	0.65	0.66	0.59	0.51
Word2Vec + CNN	0.18	0.21	0.22	0.29	0.28	0.19
GloVe + CNN	0.46	0.49	0.51	0.53	0.39	0.31

Word Embedding Model Name	Batch Size = 5	Batch Size = 10	Learning Rate = 0.3	Learning Rate = 0.5	Dropout Rate = 0.2	Dropout Rate = 0.5
Embedding Layer + CNN	0.68	0.61	0.66	0.61	0.62	0.67
Word2Vec + CNN	0.95	0.91	0.93	0.99	0.96	0.98
GloVe + CNN	0.81	0.83	0.81	0.83	0.80	0.82

Batch Size	Embedding Layer + CNN: Training Accuracy (%)	Word2Vec + CNN: Training Accuracy (%)	GloVe + CNN: Training Accuracy (%)
1	94.845	97.631	90.928
2	97.196	98.721	98.193
3	95.941	99.156	89.154
4	89.158	98.943	93.721
5	96.522	99.066	95.599
6	97.490	98.721	98.963
7	88.921	97.911	93.113
8	95.810	99.167	92.992
9	89.733	98.922	90.851
10	96.231	98.515	94.532
Average	94.12%	98.64%	93.7%
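The per-model averages in the table above can be recomputed from the ten rows with a few lines of Python. This is only a sketch: the dictionary simply transcribes the tabulated training accuracies, and the name `TRAINING_ACCURACY` is chosen here for illustration. Means recomputed from the three-decimal entries can differ from the reported Average row in the second decimal place, for example if the reported figures were rounded from unrounded measurements.

```python
from statistics import mean
from typing import Dict, List

# Training accuracy (%) for batch sizes 1..10, transcribed from the table.
TRAINING_ACCURACY: Dict[str, List[float]] = {
    "Embedding Layer + CNN": [94.845, 97.196, 95.941, 89.158, 96.522,
                              97.490, 88.921, 95.810, 89.733, 96.231],
    "Word2Vec + CNN": [97.631, 98.721, 99.156, 98.943, 99.066,
                       98.721, 97.911, 99.167, 98.922, 98.515],
    "GloVe + CNN": [90.928, 98.193, 89.154, 93.721, 95.599,
                    98.963, 93.113, 92.992, 90.851, 94.532],
}


def column_means(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each model's accuracy over all batch sizes."""
    return {model: mean(values) for model, values in table.items()}


means = column_means(TRAINING_ACCURACY)
best_model = max(means, key=means.get)  # model with the highest mean accuracy

for model, value in means.items():
    print(f"{model}: {value:.2f}%")
print("Highest average:", best_model)
```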

Word Embedding Models	Training Accuracy
Embedding Layer + CNN	96%	82.5%	30
Word2Vec + CNN	99.5%	98.5%	10
GloVe + CNN	96%	92%	28

Word Embedding Models	Precision	Recall	F score
Embedding Layer + CNN	0.868	0.887	0.875
Word2Vec + CNN	0.901	0.905	0.901
GloVe + CNN	0.874	0.886	0.877
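For reference, precision, recall and F score, together with the error rate and Matthews Correlation Coefficient named in the abstract, can all be derived from the four confusion-matrix counts of a binary classifier. The sketch below uses the standard textbook definitions; the label vectors are made-up illustrative data, not results from the paper, and the paper's own figures may have been averaged differently (for example per class).

```python
import math
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, F1, error rate and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    error_rate = (fp + fn) / len(pairs)

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"precision": precision, "recall": recall, "f1": f1,
            "error_rate": error_rate, "mcc": mcc}


# Made-up labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

The zero-denominator guards return 0.0 rather than raising, which is a convenience choice for this sketch and not something the source specifies.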

Instituto de Tecnologia do Paraná - Tecpar, Rua Prof. Algacyr Munhoz Mader, 3775 - CIC, 81350-010 Curitiba PR Brazil, Tel.: +55 41 3316-3052/3054, Fax: +55 41 3346-2872
E-mail: babt@tecpar.br

*Correspondence: sandhyalagar@gmail.com (S.A.).