Abstract
Natural language processing (NLP) provides a framework for large-scale text analysis. One common processing method uses vector space models (VSMs), which embed word attributes, called features, into high-dimensional vectors. Comprehensive VSMs are trained on sources such as the Google News archive. A thesaurus, a collection of semantically related words, can be created for a particular root word using cosine similarity within a given VSM. Many methods have been developed to reduce the complexity of these models by retaining useful semantic information while discarding non-informative features. One such method, variance thresholding, keeps features whose variance exceeds a manually determined threshold, providing greater differentiation between words for classification purposes. Our research developed a dimensionality-reducing methodology called dynamic variance thresholding (DyVaT). DyVaT reduces the specificity of word embeddings by retaining low-variance features instead, allowing for a broader thesaurus that preserves semantic similarity. The dynamic variance threshold, which determines how many low-variance features are retained, is selected automatically using the kneedle algorithm, improving on results obtained with manually set thresholds. Our test case for examining the effectiveness of DyVaT in creating a contextual thesaurus is the visual, auditory, and kinesthetic learning-style context. We conclude that DyVaT is a valid method for generating loosely connected word collections, with potential uses in NLP classification or clustering tasks.
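The sketch below is a minimal illustration of the pipeline the abstract describes, not the authors' implementation. It retains the low-variance dimensions of an embedding matrix, places the cutoff at the knee of the sorted per-dimension variance curve, and builds a cosine-similarity thesaurus in the reduced space. The knee finder here is a simple max-distance-to-chord heuristic standing in for the kneedle algorithm, and `embeddings` (a words-by-dimensions matrix) and `vocab` (the matching word list) are assumed inputs, e.g. vectors taken from a pretrained word2vec model.

```python
import numpy as np

def knee_index(y: np.ndarray) -> int:
    """Index of the knee of a sorted 1-D curve: the point farthest from
    the straight line joining the curve's endpoints (a simple stand-in
    for the kneedle algorithm)."""
    x = np.linspace(0.0, 1.0, len(y))
    yn = (y - y.min()) / (y.max() - y.min() + 1e-12)  # normalize to [0, 1]
    # perpendicular distance (up to a constant factor) from the chord
    dist = np.abs((yn[-1] - yn[0]) * x - yn + yn[0])
    return int(np.argmax(dist))

def dyvat_reduce(embeddings: np.ndarray) -> np.ndarray:
    """Dynamic variance thresholding: retain only the low-variance
    dimensions, with the cutoff index chosen at the knee of the
    ascending per-dimension variance curve."""
    variances = embeddings.var(axis=0)
    order = np.argsort(variances)         # dimensions, ascending variance
    k = knee_index(variances[order])      # dynamically chosen cutoff
    return embeddings[:, order[: k + 1]]  # keep the low-variance block

def thesaurus(root: str, vocab: list[str], reduced: np.ndarray, top_n: int = 10):
    """Rank vocabulary words by cosine similarity to `root` in the
    reduced embedding space."""
    root_vec = reduced[vocab.index(root)]
    sims = reduced @ root_vec / (
        np.linalg.norm(reduced, axis=1) * np.linalg.norm(root_vec) + 1e-12
    )
    ranked = np.argsort(-sims)            # most similar first; ranked[0] is root
    return [(vocab[i], float(sims[i])) for i in ranked[1 : top_n + 1]]
```

Because DyVaT keeps the low-variance (less word-specific) dimensions, the neighbor lists returned by `thesaurus` are deliberately broader than those obtained from the full embedding space, matching the abstract's goal of a loosely connected contextual thesaurus.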
| Original language | English |
| --- | --- |
| Article number | 118157 |
| Journal | Expert Systems with Applications |
| Volume | 208 |
| DOIs | |
| State | Published - 1 Dec 2022 |
| Externally published | Yes |
Keywords
- Dimensionality reduction
- Feature selection
- Learning styles
- Machine learning
- Natural language processing
ASJC Scopus subject areas
- General Engineering
- Computer Science Applications
- Artificial Intelligence