Building Multi-Label Text Classifiers for arXiv Paper Abstract Dataset
Improving Paper Submission Systems
Paper submission systems require users to upload their paper titles and abstracts and then specify the subject areas their papers best belong to. arXiv is a free distribution service and an open-access archive for 1,950,165 scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It is mostly used by academics to upload their papers. The dataset provides more than 38,000 unique paper titles along with their summaries and subject areas. It was uploaded just a few days ago (as of writing this blog), and its collection process is publicly available.
It would be interesting if submission systems like arXiv could suggest viable subject areas for a submitted paper. Our task is to build a text classifier that predicts the subject areas from a paper's abstract and title.
The complete code is available for download.
Enough talk, let’s start coding…
Data wrangling
First things first, let's import the necessary libraries and the dataset.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import ast
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import re
import sys
import warnings
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
papers = pd.read_csv('/kaggle/input/arxiv-paper-abstracts/arxiv_data.csv')
papers.head()
The titles and summaries are the independent variables and the terms are the dependent variable. This is a multi-label classification problem, so each paper can carry several terms at once, e.g. cs.CV, cs.AI, and so on. The terms column is read from the CSV as a string that merely looks like a Python list, so we first parse it into a real list with the literal_eval function from the ast module. literal_eval safely evaluates an expression node or a string containing a Python literal or container display.
# .copy() avoids pandas' SettingWithCopyWarning when the column is reassigned below
description_category = papers[['terms','summaries','titles']].copy()
# Parse each string such as "['cs.CV', 'cs.LG']" into an actual Python list
description_category['terms'] = description_category['terms'].apply(ast.literal_eval)
description_category.head()
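To make the parsing step concrete, here is a minimal sketch on a single made-up cell. The sample string is an assumption that mirrors how a list column typically looks after a CSV round trip; it is not taken from the dataset.

```python
import ast

# Hypothetical raw cell: the CSV stores the list as plain text, not as a list.
raw_terms = "['cs.CV', 'cs.LG']"

# literal_eval only parses Python literals (strings, numbers, lists, dicts, ...),
# so it will not execute function calls hidden in the text the way eval would.
parsed_terms = ast.literal_eval(raw_terms)

print(type(raw_terms).__name__)     # the raw value is a str
print(type(parsed_terms).__name__)  # the parsed value is a list
print(parsed_terms)
```

Once every cell is a real list, the terms can be counted, spread across columns, or encoded as 0/1 indicators.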
After initial exploration, we found that a paper carries at most 11 labels/terms in our dataset. We will spread each paper's list of terms across 11 columns, one term per column.
columns = ['category_1', 'category_2', 'category_3',
'category_4', 'category_5', 'category_6',
'category_7', 'category_8', 'category_9',
'category_10', 'category_11']
cat = pd.DataFrame(description_category['terms'].to_list(), columns = columns)
cat
Now let's convert these values into numerical values. We create one column per distinct label and fill it with 0 by default; a 1 will later mark every label a paper carries.
# Collect the unique labels of every category column, column by column
genres = np.concatenate([cat[col].unique() for col in columns])
# Drop duplicates while keeping first-seen order, then discard the None placeholders
genres = list(dict.fromkeys(genres))
genres = [x for x in genres if x is not None]
cat = pd.concat([cat,pd.DataFrame(columns = list(genres))])
cat.fillna(0, inplace = True)
cat.head()
The code below sets the matching label column to 1 for every term a paper carries, looping over all eleven category columns.
for col in columns:
    for row, genre in enumerate(cat[col]):
        if genre != 0:  # 0 marks an empty slot after fillna
            cat.loc[row, genre] = 1
description_category_new = pd.concat([description_category['titles'],description_category['summaries'],
cat.loc[:,"cs.CV":]],
axis=1)
description_category_new.head()
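The wrangling above boils down to a multi-hot encoding: one row per paper, one 0/1 column per label. The following pure-Python sketch shows the same idea on three made-up papers. It is an illustration under assumed toy data, not the code used in this post.

```python
# Toy term lists standing in for the parsed `terms` column (made-up values).
paper_terms = [
    ["cs.CV", "cs.LG"],
    ["stat.ML"],
    ["cs.LG", "stat.ML", "cs.AI"],
]

# Collect the distinct labels in first-seen order, using the dict.fromkeys trick.
labels = list(dict.fromkeys(term for terms in paper_terms for term in terms))

# One row per paper, one 0/1 entry per label.
multi_hot = [[1 if label in terms else 0 for label in labels] for terms in paper_terms]

print(labels)
for row in multi_hot:
    print(row)
```

If scikit-learn is already in use, its MultiLabelBinarizer class produces this kind of matrix directly from a list of label lists, which may be a simpler route than looping over the rows.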
Data Visualization
After initial data wrangling, we move on to data visualization.
bar_plot = pd.DataFrame()
bar_plot['cat'] = description_category_new.columns[2:]
bar_plot['count'] = description_category_new.iloc[:,2:].sum().values
bar_plot.sort_values(['count'], inplace=True, ascending=False)
bar_plot.reset_index(inplace=True, drop=True)
bar_plot.head()
threshold = 1000
# Keep only the labels that appear in more than `threshold` papers
main_categories = bar_plot[bar_plot['count'] > threshold]
categories = main_categories['cat'].values
categories = np.append(categories, 'Others')
not_category = []
description_category_new['Others'] = 0
for i in description_category_new.columns[2:]:
    if i not in categories:
        # Flag every paper that carries this rare label under 'Others'
        description_category_new.loc[description_category_new[i] == 1, 'Others'] = 1
        not_category.append(i)
description_category_new.drop(not_category, axis=1, inplace=True)
This keeps the top four categories, i.e. the labels that appear in more than 1,000 papers, and folds all other labels into a single Others column.
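The thresholding logic can be sketched without pandas as follows. The papers, labels, and the threshold of 1 are made up so that the example stays small; the post itself uses a threshold of 1000.

```python
from collections import Counter

# Toy label sets and a toy threshold (both made up for illustration).
paper_terms = [
    {"cs.CV", "cs.LG"},
    {"cs.CV"},
    {"cs.CV", "eess.IV"},
    {"cs.LG", "math.OC"},
]
threshold = 1  # keep labels that occur in more than `threshold` papers

# Count in how many papers each label occurs.
counts = Counter(term for terms in paper_terms for term in terms)
frequent = {term for term, n in counts.items() if n > threshold}

# Keep the frequent labels and add a single "Others" flag
# to every paper that carried at least one rare label.
collapsed = [
    (terms & frequent) | ({"Others"} if terms - frequent else set())
    for terms in paper_terms
]

for terms in collapsed:
    print(sorted(terms))
```

Collapsing the long tail this way keeps the output layer small, at the cost of no longer being able to predict the rare subject areas individually.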
most_common_cat = pd.DataFrame()
most_common_cat['cat'] = description_category_new.columns[2:]
most_common_cat['count'] = description_category_new.iloc[:,2:].sum().values
most_common_cat.sort_values(['count'], inplace=True, ascending=False)
most_common_cat.reset_index(inplace=True, drop=True)
most_common_cat.head()
plt.figure(figsize=(15,8))
sns.set(font_scale = 1.5)
sns.set_style('whitegrid')
pal = sns.color_palette("Blues_r", len(most_common_cat))
rank = most_common_cat['count'].argsort().argsort()
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=most_common_cat['cat'], y=most_common_cat['count'], palette=np.array(pal[::-1])[rank])
plt.axhline(threshold, ls='--', c='red')
plt.title("Most common categories", fontsize=24)
plt.ylabel('Number of papers', fontsize=18)
plt.xlabel('terms', fontsize=18)
plt.xticks(rotation='vertical')
plt.show()
The plot shows the most common categories of papers alongside the number of papers. We can see that cs.CV dominates the categories.
rowSums = description_category_new.iloc[:,2:].sum(axis=1)
multiLabel_counts = rowSums.value_counts()
sns.set(font_scale = 1.5)
sns.set_style('whitegrid')
plt.figure(figsize=(10,6))
# x: how many terms a paper carries, y: how many papers carry that many terms
sns.barplot(x=multiLabel_counts.index, y=multiLabel_counts.values)
plt.title("Number of terms per paper", fontsize=24)
plt.ylabel('Number of papers', fontsize=18)
plt.xlabel('Number of terms', fontsize=18)
plt.show()
Data Preprocessing
In this step, we convert the abstracts (summaries) to lowercase, expand contractions, strip punctuation and non-alphabetic characters, remove stop words, and finally stem the remaining words. The details of each function can be found in the complete code.
description_category_new['summaries'] = description_category_new['summaries'].str.lower()
description_category_new['summaries'] = description_category_new['summaries'].apply(decontract)
description_category_new['summaries'] = description_category_new['summaries'].apply(cleanPunc)
description_category_new['summaries'] = description_category_new['summaries'].apply(keepAlpha)
description_category_new['summaries'] = description_category_new['summaries'].apply(removeStopWords)
description_category_new['summaries'] = description_category_new['summaries'].apply(stemming)
description_category_new.head()
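The helper functions are not shown in this post, so here is a rough sketch of what they might look like. The function names come from the calls above, but every function body and the tiny stop-word list are assumptions made for illustration; the complete code may differ.

```python
import re

# Tiny stand-in stop-word list; the post imports NLTK's English stop words instead.
STOP_WORDS = {"a", "an", "the", "is", "are", "we", "do", "not", "of", "in", "to", "and"}

def decontract(text):
    """Expand a few common English contractions."""
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)
    return text

def cleanPunc(text):
    """Replace punctuation characters with spaces."""
    return re.sub(r"[^\w\s]", " ", text)

def keepAlpha(text):
    """Keep lowercase alphabetic tokens only, joined by single spaces."""
    return " ".join(re.findall(r"[a-z]+", text))

def removeStopWords(text):
    """Drop every token that appears in STOP_WORDS."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

# Apply the steps in the same order as the pipeline above.
sample = "We don't use 3 GPUs; the model's fast!".lower()
for step in (decontract, cleanPunc, keepAlpha, removeStopWords):
    sample = step(sample)
print(sample)
```

Stemming is left out of this sketch; the imports at the top of the post suggest that NLTK's SnowballStemmer handles that step.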
Deep Learning Models
We will now apply some basic deep learning models. The classical machine learning part is not discussed in this blog, but you can find it in the complete code.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=5000, lower=True)
tokenizer.fit_on_texts(description_category_new['summaries'])
sequences = tokenizer.texts_to_sequences(description_category_new['summaries'])
x = pad_sequences(sequences, maxlen=200)
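To see what the tokenizer and the padding step do to the text, here is a simplified pure-Python imitation. It is not the Keras implementation, only a sketch of the idea under its default settings: words are ranked by frequency, only indices below num_words are kept, and sequences are padded and truncated at the front. The helper names and the toy texts are made up.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequence(text, word_index, num_words, maxlen):
    """Map words to indices below num_words, then left-truncate and left-pad."""
    seq = [word_index[w] for w in text.split() if word_index.get(w, num_words) < num_words]
    seq = seq[-maxlen:]                     # keep the last maxlen tokens
    return [0] * (maxlen - len(seq)) + seq  # 0 is reserved for padding

texts = ["graph neural network", "neural network train", "network"]
word_index = build_word_index(texts)
padded = [to_padded_sequence(t, word_index, num_words=4, maxlen=4) for t in texts]

print(word_index)
for row in padded:
    print(row)
```

In the post, num_words=5000 and maxlen=200 play the same roles: rare words are dropped and every abstract becomes a fixed-length vector of 200 integers.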
After importing the necessary libraries for deep learning and tokenizing our abstracts, we split the data into train and test sets.
# `seeds` is defined in the complete code (not shown in this post)
X_train, X_test, y_train, y_test = train_test_split(
    x,
    description_category_new[description_category_new.columns[2:]],
    test_size=0.3,
    random_state=seeds[4])
We applied one basic deep neural network model and one convolutional neural network model. The AUC and validation accuracy are reported in the table below.
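The model code is not shown in this post, so the following is not the author's architecture. A common setup for multi-label problems is to give the network one sigmoid output per label and train it with binary cross-entropy, so that each label is decided independently. The sketch below illustrates only that output stage with made-up label names, scores, and ground truth.

```python
import math

def sigmoid(z):
    """Squash a raw score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one paper."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

labels = ["cs.CV", "cs.LG", "stat.ML", "cs.AI", "Others"]  # illustrative names
logits = [2.0, 0.5, -1.0, -3.0, 0.0]  # made-up raw scores for one abstract
y_true = [1, 1, 0, 0, 0]              # made-up ground truth

# Each label gets its own probability; they do not need to sum to 1.
probs = [sigmoid(z) for z in logits]
predicted = [label for label, p in zip(labels, probs) if p > 0.5]

print(predicted)
print(round(binary_cross_entropy(y_true, probs), 4))
```

Unlike a softmax, which forces the labels to compete for a single winner, independent sigmoids let a paper belong to several subject areas at once. The 0.5 cut-off is a convention and can be tuned per label.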
The initial results are quite encouraging. The dataset was uploaded just a few days ago (as of writing this blog) and provides a great opportunity for learning NLP skills.
I hope you learned something valuable from this blog.
Until next time, Happy Coding…