TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Data Transformation Methods: Deep Neural Networks for Tabular Data

Adeel
5 min read · Oct 18, 2021



Deep learning has enjoyed plenty of success in recent times, particularly with homogeneous data sets: its performance on classification and data generation for images and audio is outstanding. Tabular data sets, however, remain an unconquered castle for deep neural network models.

Tabular data is heterogeneous, typically mixing dense numerical and sparse categorical features. In addition, the correlations among features are weaker than the spatial or semantic relationships found in image or speech data.

Nevertheless, heterogeneous data is used ubiquitously in many critical applications.

Challenges of Learning with Tabular Data

Tabular, or heterogeneous, data differs from homogeneous data in that it contains a variety of attribute types, including both continuous and categorical attributes. Several research challenges explain why deep learning models cannot achieve the same level of predictive quality on it.

Inappropriate Training Data

Data quality is one of the major issues with tabular data sets: they frequently contain missing values and outliers, and are often small relative to the high-dimensional feature vectors generated from them.

Complex Irregular Spatial Dependencies

In tabular data sets, variables rarely exhibit spatial correlation; the dependencies among them are mostly complex and irregular. Methods that exploit spatial structure, such as convolutional neural networks, therefore struggle to model tabular data.

Extensive Preprocessing

Another important challenge is the conversion of categorical attributes in tabular data. Typically a one-hot encoding scheme is used, but it produces a sparse matrix and can trigger the curse of dimensionality. In addition, data augmentation is very hard to apply to tabular data. This blog focuses on this transformation challenge.

Model Sensitivity

Compared to tree-based methods, deep neural networks are extremely sensitive to small changes in the input data, as they tend to learn high-curvature decision boundaries. Deep learning models also have many more hyperparameters than tree-based methods, which makes them computationally expensive to tune.

Data Transformation Models

Many deep learning methods for tabular data begin by transforming the data. The central difficulty is the presence of categorical attributes, since neural networks accept only real-valued inputs; a method is therefore required to convert these categories into numeric form. Two families of techniques are used for this purpose: deterministic and automatic.

Deterministic Techniques

Label (or ordinal) encoding is the most widely used deterministic technique: every category is mapped to an integer. For example, {‘Mango’, ‘Pineapple’} is encoded as {0, 1}. This, however, introduces an artificial order that is not meaningful to a neural network.

Another way is one-hot encoding: in our example, “Mango” becomes (1, 0) and “Pineapple” becomes (0, 1). This method leads to sparse feature vectors and can trigger the curse of dimensionality.

Binary encoding is another mechanism used to encode categories. Extending our example, suppose we add a third fruit, “Banana”. The categories are first numbered and then written in binary: (01), (10), (11). In this case, only on the order of log₂(number of categories) new columns are needed.

Another widely used method is leave-one-out encoding, proposed by Micci-Barreca (2001) [1]. Every category is replaced by the mean of the target variable over that category, with the current row excluded from the calculation to avoid overfitting. The CatBoost algorithm uses this approach as well.

Another strategy, known as hash-based encoding, transforms each category into a fixed-size value through a hash function. The hash function is deterministic, so the same category always maps to the same value.

Automatic Encoding

Several automatic encoding methods learn to encode categorical attributes. The VIME approach proposed by Yoon et al. [2] is one of them. VIME trains an encoder to determine which values in a sample have been corrupted. A corrupted sample is created by a mask generator, which produces a binary mask vector and combines it with an input sample. Note that the input sample is drawn from the unlabeled data set.

The authors made sure that the corrupted sample remains tabular and similar to the input distribution. The corrupted features are then passed to an encoder, which generates a feature representation. A feature vector estimator and a mask vector estimator then recover the original features and the mask, respectively.
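A simplified sketch of the VIME-style corruption step described above (my own simplification, not the authors' code): sample a binary mask, then replace masked entries with values drawn from each feature's empirical marginal distribution, so corrupted rows stay close to the input distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # stand-in for unlabeled tabular data
p_mask = 0.3                    # probability of corrupting an entry

# Binary mask vector from the mask generator.
mask = rng.binomial(1, p_mask, size=X.shape)

# Shuffle each column independently to sample from its marginal.
X_bar = np.stack([rng.permutation(X[:, j]) for j in range(X.shape[1])],
                 axis=1)

# Corrupted sample: masked entries come from X_bar, the rest from X.
X_tilde = mask * X_bar + (1 - mask) * X
```

The encoder is then trained to reconstruct both `mask` and the original `X` from `X_tilde`.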

SuperTML by Sun et al. [3] is another method for automatically encoding tabular data, this time using convolutional neural networks. SuperTML converts each row into a visual format, i.e. a 2D matrix rendered as a black-and-white image. These images are then fed into fine-tuned 2D CNN models for classification. The process handles categorical data and missing values in tabular data automatically. A similar approach is followed by Zhu et al. [4], who also convert tabular data into images to make use of convolutional neural networks.

Figure from the paper by Sun et al. [3]

Data transformation is one of the main challenges in modeling tabular data. There have been significant advances in the field, spanning both deterministic and automatic methods, yet new ideas in this area remain in high demand.

Thanks for reading my article. Until next time…

Happy Reading!

Credits:

The content is inspired by the paper (2021) by Vadim et al.

References:

[1] Daniele Micci-Barreca, (2001), ACM SIGKDD Explorations Newsletter 3

[2] Jinsung Yoon, Yao Zhang, James Jordon, and Mihaela van der Schaar, Vime: (2020), Advances in Neural Information Processing Systems

[3] Sun, Baohua, Lin Yang, Wenhan Zhang, Michael Lin, Patrick Dong, Charles Young, and Jason Dong, (2019) In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops

[4] Zhu, Yitan, Thomas Brettin, Fangfang Xia, Alexander Partin, Maulik Shukla, Hyunseung Yoo, Yvonne A. Evrard, James H. Doroshow, and Rick L. Stevens, (2021) Nature Scientific reports
