Analytics Vidhya

Analytics Vidhya is a community of Generative AI and Data Science professionals. We are building the next-gen data science ecosystem

Building Multi-Label Text Classifiers for arXiv Paper Abstract Dataset

Adeel
5 min read · Oct 4, 2021

Paper submission systems require users to upload their paper titles and abstracts and then specify the subject areas their papers best belong to. arXiv is a free distribution service and an open-access archive for 1,950,165 scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It is mostly used by academics to upload their papers. The arXiv paper abstracts dataset provides more than 38,000 unique paper titles along with their summaries and subject areas. The dataset was uploaded just a few days ago (as of writing this blog), and its collection process is publicly available.

It would be interesting if submission systems like arXiv could suggest viable subject areas for each submitted paper. Our task is to build a text classifier that predicts the subject areas from paper abstracts and titles.

The complete code is available for download.

Enough talk, let’s start coding…

Data wrangling

First things first, let's import the necessary libraries and the dataset.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import ast
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import re
import sys
import warnings
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
papers = pd.read_csv('/kaggle/input/arxiv-paper-abstracts/arxiv_data.csv')
papers.head()
Figure 1: First five rows of the dataset

The titles and summaries are the independent variables and the terms are the dependent variable. This is a multi-label classification problem, so each paper's terms can hold several values, e.g. cs.CV, cs.AI, and so on. In the CSV, each terms entry is stored as a string that looks like a Python list, so we first convert it into an actual list with the literal_eval function. literal_eval safely evaluates an expression node or a string containing a Python literal or container display.

# .copy() avoids pandas' SettingWithCopyWarning when the column is reassigned below
description_category = papers[['terms','summaries','titles']].copy()
# Parse each string such as "['cs.CV', 'cs.LG']" into a real Python list
description_category['terms'] = description_category['terms'].apply(ast.literal_eval)
description_category.head()
Figure 2: Dataset view
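
To make the literal_eval step concrete, here is a minimal sketch on a made-up string (the value of `raw` is illustrative, not taken from the dataset). It also shows why literal_eval is considered safe: anything that is not a plain literal is rejected rather than executed.

```python
import ast

# Illustrative value: a list of labels stored as a string, as it would appear in a CSV cell
raw = "['cs.CV', 'cs.LG']"

# literal_eval turns the string into a real Python list
parsed = ast.literal_eval(raw)
print(type(parsed).__name__, parsed)

# Expressions that are not plain literals (e.g. function calls) raise ValueError
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print("call expression rejected:", rejected)
```

Because only literals are accepted, this is a safer choice than eval for parsing list-like strings read from a file.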

After initial exploration, we found that a paper carries at most 11 labels/terms in our dataset. We will spread each paper's labels across 11 columns, one label per column.

columns = ['category_1', 'category_2', 'category_3',
'category_4', 'category_5', 'category_6',
'category_7', 'category_8', 'category_9',
'category_10', 'category_11']
cat = pd.DataFrame(description_category['terms'].to_list(), columns = columns)
cat
Figure 3: Converting labels into columns

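Now let's convert these values into numerical values. We add one indicator column per distinct label, initialised to 0 (the None placeholders are also replaced with 0), and later set it to 1 wherever a paper carries that label.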

# Gather the unique labels of each category column, in column order
genres = np.concatenate([cat[col].unique() for col in columns])
# Drop duplicates while preserving first-seen order, then drop the None placeholder
genres = list(dict.fromkeys(genres))
genres = [x for x in genres if x is not None]

cat = pd.concat([cat,pd.DataFrame(columns = list(genres))])
cat.fillna(0, inplace = True)
cat.head()
Figure 4: Label converted to zero

The code below handles category_1; repeat it for the remaining category columns (category_2 to category_11). The complete code covers all of them.

# Set the indicator column to 1 for every label that appears in category_1
row = 0
for genre in cat['category_1']:
    if genre != 0:
        cat.loc[row, genre] = 1
    row = row + 1
description_category_new = pd.concat([description_category['titles'],description_category['summaries'],  
cat.loc[:,"cs.CV":]],
axis=1)
description_category_new.head()
Figure 5: Label converted to zero and one
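
The column-by-column loop above produces a multi-hot matrix (one 0/1 column per label). As a sketch of an alternative, and not the approach used in this post, the same kind of matrix can be built in one step with pandas string methods. The `terms` values below are made up for illustration.

```python
import pandas as pd

# Illustrative label lists, one per paper (not taken from the dataset)
terms = pd.Series([["cs.CV", "cs.LG"], ["cs.LG", "stat.ML"], ["cs.CV"]])

# Join each list into a '|'-separated string, then expand it into indicator columns
multi_hot = terms.str.join("|").str.get_dummies(sep="|")
print(multi_hot)
```

This relies on the labels never containing the separator character, which holds for arXiv category codes such as cs.CV.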

Data Visualization

After initial data wrangling, we move on to data visualization.

bar_plot = pd.DataFrame()
bar_plot['cat'] = description_category_new.columns[2:]
bar_plot['count'] = description_category_new.iloc[:,2:].sum().values
bar_plot.sort_values(['count'], inplace=True, ascending=False)
bar_plot.reset_index(inplace=True, drop=True)
bar_plot.head()
Figure 6: Category vs Count
threshold = 1000
# Keep only the categories that appear in more than `threshold` papers
main_categories = bar_plot[bar_plot['count'] > threshold]
categories = main_categories['cat'].values
categories = np.append(categories, 'Others')
not_category = []
description_category_new['Others'] = 0

for i in description_category_new.columns[2:]:
    if i not in categories:
        # .loc avoids chained assignment, which may silently fail to update the frame
        description_category_new.loc[description_category_new[i] == 1, 'Others'] = 1
        not_category.append(i)

description_category_new.drop(not_category, axis=1, inplace=True)

We keep the categories that appear in more than 1,000 papers (the top four in this dataset) and fold all other labels into the Others column.
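
The grouping of rare labels can be checked on a toy example. This is a minimal sketch with made-up data and a toy threshold of 2 instead of 1,000; the column names are illustrative.

```python
import pandas as pd

# Toy multi-hot label matrix: 4 papers, 4 labels (made-up values)
labels = pd.DataFrame({
    "cs.CV":   [1, 1, 1, 0],
    "cs.LG":   [1, 0, 1, 1],
    "cs.RO":   [0, 1, 0, 0],
    "eess.IV": [0, 0, 0, 1],
})
threshold = 2  # toy value; the post uses 1000

# Labels that do not exceed the threshold are considered rare
counts = labels.sum()
rare = counts[counts <= threshold].index

# A paper gets Others = 1 if it carries at least one rare label
labels["Others"] = labels[rare].max(axis=1)
labels = labels.drop(columns=rare)
print(labels)
```

Taking the row-wise maximum over the rare columns gives the same result as setting Others to 1 label by label, without an explicit loop.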

most_common_cat = pd.DataFrame()
most_common_cat['cat'] = description_category_new.columns[2:]
most_common_cat['count'] = description_category_new.iloc[:,2:].sum().values
most_common_cat.sort_values(['count'], inplace=True, ascending=False)
most_common_cat.reset_index(inplace=True, drop=True)
most_common_cat.head()
Figure 7: Category vs Count
plt.figure(figsize=(15,8))
sns.set(font_scale = 1.5)
sns.set_style('whitegrid')


pal = sns.color_palette("Blues_r", len(most_common_cat))
rank = most_common_cat['count'].argsort().argsort()

# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.barplot(x=most_common_cat['cat'], y=most_common_cat['count'], palette=np.array(pal[::-1])[rank])
plt.axhline(threshold, ls='--', c='red')
plt.title("Most common categories", fontsize=24)
plt.ylabel('Number of papers', fontsize=18)
plt.xlabel('terms', fontsize=18)
plt.xticks(rotation='vertical')

plt.show()

The plot shows the most common categories of papers alongside the number of papers. We can see that cs.CV dominates the categories.

Figure 8: Most common categories
rowSums = description_category_new.iloc[:,2:].sum(axis=1)
multiLabel_counts = rowSums.value_counts()
sns.set(font_scale = 1.5)
sns.set_style('whitegrid')
plt.figure(figsize=(10,6))

sns.barplot(x=multiLabel_counts.index, y=multiLabel_counts.values)
plt.title("Number of terms per paper", fontsize=24)
plt.ylabel('Number of papers', fontsize=18)
plt.xlabel('Number of categories', fontsize=18)

plt.show()
Figure 9: Number of terms per paper

Data Preprocessing

In this step, we convert the abstracts/summaries to lowercase, expand contractions, strip punctuation and non-alphabetic characters, remove stop words, and finally stem the remaining words. The details of each function can be found in the complete code.

description_category_new['summaries'] = description_category_new['summaries'].str.lower()
description_category_new['summaries'] = description_category_new['summaries'].apply(decontract)
description_category_new['summaries'] = description_category_new['summaries'].apply(cleanPunc)
description_category_new['summaries'] = description_category_new['summaries'].apply(keepAlpha)
description_category_new['summaries'] = description_category_new['summaries'].apply(removeStopWords)
description_category_new['summaries'] = description_category_new['summaries'].apply(stemming)
description_category_new.head()
Figure 10: Dataset view
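
The helper functions used above are not shown in this post. The following is a minimal sketch of what decontract, cleanPunc, keepAlpha and removeStopWords might look like, assuming they do what their names suggest; the bodies and the tiny STOP_WORDS set are assumptions, not the original implementation. In practice the stop word list would come from NLTK's stopwords corpus, and stemming would apply SnowballStemmer('english').stem to each word.

```python
import re

# Hypothetical stand-in for NLTK's English stop word list
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to",
              "and", "we", "it", "this", "that", "for", "on", "with"}

def decontract(text):
    """Expand common English contractions (assumed behaviour)."""
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)
    return text

def cleanPunc(text):
    """Replace punctuation with spaces and collapse repeated whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def keepAlpha(text):
    """Keep alphabetic characters only, dropping digits and symbols."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def removeStopWords(text):
    """Drop every word found in STOP_WORDS."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

sample = "we don't use 3 layers; it's a cnn-based model."
cleaned = removeStopWords(keepAlpha(cleanPunc(decontract(sample))))
print(cleaned)
```

The order matters: contractions are expanded before punctuation is stripped, otherwise the apostrophes they depend on would already be gone.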

Deep Learning Models

We will now apply basic deep learning models. The machine learning part is not discussed in this blog; it is covered in the complete code.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=5000, lower=True)
tokenizer.fit_on_texts(description_category_new['summaries'])
sequences = tokenizer.texts_to_sequences(description_category_new['summaries'])
x = pad_sequences(sequences, maxlen=200)
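
To see what the padding step does to sequences of different lengths, here is a NumPy sketch that mimics the default behaviour of pad_sequences (padding='pre', truncating='pre'). The pad_pre function is a hypothetical helper written for illustration; it is not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Hypothetical re-implementation of pad_sequences' defaults:
    pad on the left, and truncate from the left when a sequence is too long."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]          # keep only the last maxlen tokens
        out[i, -len(trunc):] = trunc   # right-align; padding fills the left side
    return out

# One sequence shorter than maxlen, one longer
padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8]], maxlen=4)
print(padded)
```

With maxlen=200 as in the post, abstracts longer than 200 tokens keep only their last 200 tokens, and shorter ones are left-padded with zeros.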

After importing the necessary deep learning libraries and tokenizing our abstracts, we split the data into train and test sets.

# `seeds` is not defined in this excerpt; it comes from the complete code
X_train, X_test, y_train, y_test = train_test_split(x,
    description_category_new[description_category_new.columns[2:]],
    test_size=0.3,
    random_state=seeds[4])

We applied one basic deep neural network model and one convolutional neural network model. Their AUC and validation accuracy are shown in the table below.

Figure 11: AUC and Validation accuracy
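
The model code is not shown in this post, so the following is not the author's architecture. It is a NumPy sketch of the output stage that multi-label classifiers commonly use: one sigmoid per label (instead of a softmax across labels) trained with binary cross-entropy. The logits and targets are made-up numbers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up raw model outputs (logits) for 2 papers and 3 labels
logits = np.array([[2.0, -1.0, 0.5],
                   [-3.0, -0.2, 4.0]])
# Made-up ground truth: a paper may carry several labels at once
y_true = np.array([[1, 0, 1],
                   [0, 1, 1]])

# Each label gets an independent probability, so a row need not sum to 1
probs = sigmoid(logits)
preds = (probs >= 0.5).astype(int)

# Binary cross-entropy averaged over every (paper, label) pair
bce = -np.mean(y_true * np.log(probs) + (1 - y_true) * np.log(1 - probs))
print(preds)
print(round(float(bce), 4))
```

Because the probabilities are independent, several labels can be predicted for the same paper, which is what the multi-label setting requires.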

The initial results are quite encouraging. The dataset was uploaded just a few days ago (as of writing this blog) and provides a great opportunity for learning NLP skills.

I hope you learned something valuable from this blog.
Until next time, Happy Coding…

Published in Analytics Vidhya


Written by Adeel

Machine Learning Researcher and NLP Engineer. For more details visit my personal website:
