Text classification with sklearn


  • Classify news articles
  • Learn the basics of natural language processing
  • Build models using sklearn and choose the best one
  • Use sklearn's Pipeline class
In this post we'll classify news articles into different categories. First, download the dataset from http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip and extract it; the code below expects the articles under ./bbc/. The dataset consists of 2225 documents in 5 categories: business, entertainment, politics, sport, and tech.

Let's import the necessary libraries and functions.

[pre class="brush:python"]
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_files

DATA_DIR = "./bbc/"
[/pre]

We'll use the load_files function, which loads text files and uses the subfolder names as categories. Our dataset already has its articles organized into one folder per category. After loading the data, we'll also check how many articles there are per category.

[pre class="brush:python"]
data = load_files(DATA_DIR, encoding="utf-8", decode_error="replace")
# calculate count of each category
labels, counts = np.unique(data.target, return_counts=True)
# convert data.target_names to np array for fancy indexing
labels_str = np.array(data.target_names)[labels]
print(dict(zip(labels_str, counts)))
[/pre]

{'tech': 401, 'sport': 511, 'business': 510, 'entertainment': 386, 'politics': 417}

Each category has a different number of articles. However, the dataset does not look too imbalanced, so the model should be able to learn every category properly.

Data preparation

Now we'll split the data into training and test sets and then print the first 80 characters of some samples.

[pre class="brush:python"]
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
list(t[:80] for t in X_train[:10])
[/pre]

Before we go further, let's quickly go through the steps of a common natural language processing pipeline:
  • Tokenize, i.e. split the text into words
  • Convert the letters to a single case, either upper or lower
  • Remove stopwords, e.g. "the", "an", "with"
  • Perform stemming or lemmatization to reduce inflected words to their stem, e.g. transportation -> transport, transported -> transport, and other inflected forms
  • Vectorize the result (count, binary, or TF-IDF)

Many libraries already exist to perform all of the steps mentioned above. 
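
To make the steps above concrete, here is a minimal pure-Python sketch of the first four steps followed by a simple count vectorization. The stopword list and the suffix rules are tiny hypothetical stand-ins invented for this illustration; a real stemmer or lemmatizer from an NLP library is far more careful.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "an", "a", "with", "was", "by"}

# Toy suffix rules standing in for a real stemmer (hypothetical, not Porter).
SUFFIXES = ("ation", "ed", "ing", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stopwords, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [toy_stem(t) for t in tokens]


tokens = preprocess("The goods were transported with an improved transportation network")
# Count vectorization: how often each stem occurs in the document.
counts = Counter(tokens)
print(counts)
```

Both "transported" and "transportation" end up as the same stem, so they are counted as one feature.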
The data is in textual format, so we cannot use it as it is; we need to convert it to a numerical format. A very common method, among others, is to calculate the TF-IDF matrix. TF stands for term frequency: how many times a term (word) appears in a document. IDF stands for inverse document frequency, which measures how important a word is; in simple terms, it gives more weight to words that are rare across the documents than to common ones. Once we have both TF and IDF, we simply multiply them together to obtain the TF-IDF value.
tfidf(t, d, D) = tf(t, d) * idf(t, D)
where
t is a term
d is a document
D is the set of all documents
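
As a worked example of the formula, the sketch below computes TF-IDF by hand on three made-up tokenized documents. It assumes the plain textbook definition idf(t, D) = log(N / df), where N is the number of documents and df is the number of documents containing t; sklearn's TfidfVectorizer uses a smoothed variant and normalizes each row, so its numbers will differ.

```python
import math

# Three made-up, already tokenized documents (illustration only).
docs = [
    ["market", "shares", "rise"],
    ["market", "film", "award"],
    ["match", "film", "award"],
]


def tf(term, doc):
    """Raw term frequency: how many times the term occurs in the document."""
    return doc.count(term)


def idf(term, docs):
    """Textbook inverse document frequency: log(N / document frequency)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


rare = tfidf("shares", docs[0], docs)    # "shares" occurs in 1 of 3 documents
common = tfidf("market", docs[0], docs)  # "market" occurs in 2 of 3 documents
print(rare, common)
```

Both terms occur once in the first document, yet the rarer one gets the larger weight, which is exactly the effect described above.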
For details about TF-IDF check 
[pre class="brush:python"]
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", max_features=1000, decode_error="ignore")
vectorizer.fit(X_train)
[/pre]
We used TfidfVectorizer to calculate TF-IDF. When initializing the vectorizer, we passed stop_words="english", which tells sklearn to discard commonly occurring English words. We also set max_features to 1000, so the vectorizer builds a vocabulary of the top 1000 words (by frequency). This means that each text in our dataset will be converted to a vector of size 1000.
Next, we call the fit function to "train" the vectorizer, i.e. to learn the vocabulary and the IDF weights. The transform function then converts a list of texts into a TF-IDF matrix. We can also use another function called fit_transform, which is equivalent to:

[pre class="brush:python"]
vectorizer.fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)
[/pre]

Important: We should use only the training data to fit the vectorizer. Otherwise it is cheating, because information from the test set leaks into the features.
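
The split between fitting and transforming can be seen on a made-up toy corpus. This is only a sketch: the sentences and variable names are hypothetical and not part of the BBC dataset. The vocabulary is learned from the training texts alone, and words that occur only in the test text are ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
train_texts = ["the market fell sharply", "the team won the match"]
test_texts = ["the market reacted to the election"]

toy_vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training texts only.
train_matrix = toy_vectorizer.fit_transform(train_texts)
# Reuse the fitted vocabulary; unseen words such as "election" are dropped.
test_matrix = toy_vectorizer.transform(test_texts)

print(sorted(toy_vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, which is what lets a model trained on one be applied to the other.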

Build model

We'll create a simple naive Bayes model first.

[pre class="brush:python"]
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

cls = MultinomialNB()
# transform the list of texts to TF-IDF before passing it to the model
cls.fit(vectorizer.transform(X_train), y_train)

# evaluate on the held-out test set, using the same fitted vectorizer
y_pred = cls.predict(vectorizer.transform(X_test))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
[/pre]

             precision    recall  f1-score   support

          0       0.95      0.95      0.95       123
          1       0.99      0.94      0.96       100
          2       0.92      0.96      0.94        95
          3       0.97      1.00      0.98       115
          4       0.97      0.94      0.96       124

avg / total       0.96      0.96      0.96       557

About 96% accuracy! Not bad. Let's see if we can find a better model. We'll train several models using sklearn Pipelines. A Pipeline lets us chain the steps a model needs to do its task. In our case, we need to convert the raw texts into a vectorized format and then pass them to the model, and a Pipeline allows us to group these related steps. We can treat a Pipeline object as a model itself, i.e. we can call its fit and predict functions.
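
As a small illustration of treating a Pipeline as a model, here is a sketch on a handful of made-up sentences (the texts, labels, and step names are hypothetical). The raw strings go straight into fit and predict, and the vectorization happens inside the pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training sentences and labels, for illustration only.
texts = [
    "the team won the match",
    "the striker scored a goal",
    "shares fell on the stock market",
    "the bank raised interest rates",
]
labels = ["sport", "sport", "business", "business"]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # step 1: raw text -> TF-IDF features
    ("nb", MultinomialNB()),       # step 2: classifier on those features
])

# fit and predict accept raw strings; no manual transform call is needed.
toy_pipeline.fit(texts, labels)
prediction = toy_pipeline.predict(["the team scored a late goal"])
print(prediction)
```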

For this demo, we'll create four different pipelines, combining two vectorizers (CountVectorizer and TfidfVectorizer) with two classifiers (SGDClassifier and SVC, a support vector classifier). Then, using the cross_val_score function, we'll train each model twice (2-fold cross-validation on the training set) and record its mean accuracy. We'll choose the highest-scoring model, train it on the full training set, and then evaluate it on the test set.

[pre class="brush:python"]
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import cross_val_score

# linear classifier trained with SGD,
# with either pure counts or tfidf features
sgd = Pipeline([
    ("count_vectorizer", CountVectorizer(stop_words="english", max_features=3000)),
    ("sgd", SGDClassifier(loss="modified_huber")),
])
sgd_tfidf = Pipeline([
    ("tfidf_vectorizer", TfidfVectorizer(stop_words="english", max_features=3000)),
    ("sgd", SGDClassifier(loss="modified_huber")),
])

# support vector classifier with a linear kernel, same two feature types
svc = Pipeline([
    ("count_vectorizer", CountVectorizer(stop_words="english", max_features=3000)),
    ("linear_svc", SVC(kernel="linear")),
])
svc_tfidf = Pipeline([
    ("tfidf_vectorizer", TfidfVectorizer(stop_words="english", max_features=3000)),
    ("linear_svc", SVC(kernel="linear")),
])

all_models = [
    ("sgd", sgd),
    ("sgd_tfidf", sgd_tfidf),
    ("svc", svc),
    ("svc_tfidf", svc_tfidf),
]

# 2-fold cross-validation on the training data only; keep the mean accuracy
unsorted_scores = [(name, cross_val_score(model, X_train, y_train, cv=2).mean()) for name, model in all_models]
# sort from the highest to the lowest mean accuracy
scores = sorted(unsorted_scores, key=lambda x: -x[1])
print(scores)
[/pre]

[('svc_tfidf', 0.973026575899821), ('svc', 0.95623710562069142), ('sgd_tfidf', 0.95384189603985314), ('sgd', 0.93645074796385619)]

The support vector classifier with TF-IDF features scored the highest mean accuracy, about 97%. Let's train it and evaluate it on the test set.

[pre class="brush:python"]
model = svc_tfidf
# the pipeline vectorizes the raw texts itself
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
[/pre]

             precision    recall  f1-score   support

          0       0.99      0.94      0.97       141
          1       0.98      1.00      0.99        96
          2       0.96      0.99      0.98        99
          3       0.97      1.00      0.99       114
          4       0.98      0.97      0.98       107

avg / total       0.98      0.98      0.98       557

98% accuracy! Unlike before, we don't have to vectorize the documents manually before passing them to the model, since we have defined the vectorization step in the pipeline itself.
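
The reports above label the rows 0 to 4, which is hard to read. One possible improvement is the target_names parameter of classification_report. The sketch below uses made-up labels and predictions, and assumes the category order is alphabetical by folder name, which is how load_files assigns the integer labels; in the code of this post, data.target_names should hold the same list.

```python
from sklearn.metrics import classification_report

# Assumed order: load_files sorts the category folders alphabetically.
target_names = ["business", "entertainment", "politics", "sport", "tech"]

# Made-up true labels and predictions, for illustration only.
y_true_toy = [0, 1, 2, 3, 4, 3]
y_pred_toy = [0, 1, 2, 3, 4, 0]

report = classification_report(y_true_toy, y_pred_toy, target_names=target_names)
print(report)
```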