The purpose of Natural Language Processing project was to develop a model that could process the text of news articles and classify them into categories based on their content.
A business case for this task is that an NLP model can make the classification of articles and more efficient than manually classifying them.
The data consists of several thousand articles published by the news agency Reuters in 1987. Articles seem to be about commodities and related business topics. They were classified into 135 topics that in the data are represented by keywords such as gold, linseed, and rubber.
Some articles have been assigned more than one topic, while some haven’t been assigned any topic.
For this project, I focused on the topic earn, short hand for articles about earnings.
-
The data was provided as 22 files in the
sgm
format. Each file contained 1,000 text articles marked up with HTML tags. -
I used Beautiful Soup to assist with reading the files and extracting the article content and their assigned topics.
-
Because for the purpose of this project, I was only interested in the earn topic, all topics were reassigned as either:
- earn – meaning it was at least one of the assigned topics
- other – meaning earn was not one of the assigned topics
- blank – meaning no topic was assigned
There were 22,000 articles included in the dataset. Of the three topic categories:
- 18.3% - earn
- 34.2% - other
- 47.5% - blank
- Tokenized the text – split articles into lists of lowercase words, removed numbers, stripped punctuation but reserved angle-bracketed company symbols (“< CH >”)
- Removed stopwords like “the” and “if”
- Lemmatized words (“banks” → “bank”)
- Separated data into training set to train model and testing set to verify performance
- Vectorized data – transformed words into their numerical frequencies
- Ran five different baseline models (Multinomial Naïve Bayes, Random Forest, Decision Tree, KNN, XGBoost)
- In scoring, ssed recall as the primary metric since the classes were imbalanced, and false positives weren’t concerning
- Looked for similar results between training and test results to minimize overfitting
- Built pipelines to experiment with various parameters for:
- two vectorization techniques (Count Vectorizer and TF-IDF)
- Decision Tree, which was the best baseline model
- Experimented with variations on the dataset’s topics:
- earn, other, and blank (3 topics)
- earn and other, merging other and blank together (2 topics)
- earn and other, dropping all articles categorized as blank (2 topics)
-
The “winning” Decision Tree model, using the TF-IDF vectorizer, classified topics 100% correctly with minimal overfitting
-
The blank topics were eliminated entirely for this model.
This project only classified articles into two topics, earn and other. Ideally, a model could classify articles into any topic.
The README included with the dataset had questions about why so many articles were unclassified, questions which I share. It seems many of these articles could have had topics, and assigning them topics would improve data quality and help with modeling.