Is Graph Machine Learning the New Cryptocurrency Police?

15 min read · Apr 2, 2025

Testing the limits of recognizing illicit Bitcoin transactions with machine learning on graphs. Could it be useful in real life?

Authors: Medič Aljaž, Mislej Nina, Sabotič Luka and Stepančič Aleks
Project: Stanford CS224W course project
Code:

Introduction

Graphs occur everywhere. One of the first decision-making tools taught to children is the decision tree. Later on, we are prompted to draw mind maps. With a little observation, we can see that social circles are also graphs. Roads and cities can be graphs, molecules can be graphs, and even the papers cited in this article form a graph. [3]

Figure 1: Examples of graphs we all know from everyday life.

So what is a graph exactly? We can imagine it as a set of nodes connected by edges. Both can carry information, for example, gas consumption or the time needed to travel from one city to another. Networks can have many different qualities: they can be directed or undirected, connected or disconnected, bipartite, and so on. These traits are represented by node, edge, and graph features. [3]

Graph theory started developing long before computers, let alone cryptocurrency, existed. While many theoretical questions in the field remain open, scientists eventually started using computers to solve certain graph problems, such as finding the shortest or cheapest path between two nodes.

What comes next? In the modern day and age, we are amazed by what artificial intelligence can do: art crafted by a computer, whole conversations with our refrigerators, face recognition to unlock our phones, and so much more! Image recognition uses matrices as input, and speech recognition manipulates sequences. Researchers in this field asked a wonderful question: what would happen if we trained our models on graphs? While it took some adjusting and changes, we now have many different ML models for such tasks, and we put them to the test to show just how useful they can be. [3]

Bitcoin Transactions

A Bitcoin transaction is the transfer of value from one Bitcoin address to another. While understanding the mechanism in detail is not crucial, it does explain why we tackle this problem in the first place. First, the sender initiates a new transaction, specifying the recipient, the amount sent, and other information. The transaction is signed with the sender's private key, producing a signature that is later used for verification. The transaction is then broadcast to the Bitcoin network, and this is where graphs enter the equation: when a transaction is broadcast, it is picked up and forwarded by multiple nodes. Miners on the network must verify it by solving complex mathematical problems before the funds go through. Only then is the transaction added to the blockchain and considered confirmed. [5]

Dataset

A few years back, Mark Weber and colleagues published an interesting article that provides the Elliptic Data Set, a graph of Bitcoin transactions. The nodes represent transactions and the edges the flow of Bitcoin between them. The graph is not connected but can be viewed as a set of connected directed graphs, each in a different timeframe. Each time step captures about 3 hours of continuous network traffic, and the steps are evenly spaced two weeks apart. Altogether there are 49 timeframes, 203,769 nodes, and 234,355 edges. [6]

Figure 2: Per-subgraph statistics: the number of nodes, the number of edges, the clustering coefficient, the class averages, and the percentage of illicit nodes in each subgraph.

There are 3 different labels for the nodes: a transaction can be licit, illicit, or unknown. Each node also has 166 features. Let's pick one transaction and look at its features. The first 94 are local features obtained at initialization. Within each graph, every node has a neighborhood of nodes one hop away; the number of such neighbors is the node's degree. To represent the graph even better, it helps to learn neighborhood properties as well, so the remaining features are aggregations over these neighbors, such as the maximum, minimum, standard deviation, and correlation coefficients of the neighboring local features. [6]
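To make the idea of aggregated features concrete, here is a minimal sketch of how such neighbor aggregates could be computed. The exact pipeline used to produce the dataset is not public, so the function below is purely illustrative (and omits the correlation coefficients for brevity):

import numpy as np

def aggregate_neighbor_features(local_feats: np.ndarray, neighbors: dict) -> np.ndarray:
    """For each node, summarize the local features of its one-hop neighbors.

    local_feats: (num_nodes, num_local_feats) matrix of local features
    neighbors:   node id -> list of neighbor node ids
    """
    aggregated = np.zeros((local_feats.shape[0], 3 * local_feats.shape[1]))
    for node, nbrs in neighbors.items():
        if not nbrs:
            continue
        nbr_feats = local_feats[nbrs]            # (degree, num_local_feats)
        aggregated[node] = np.concatenate([
            nbr_feats.max(axis=0),               # per-feature maximum
            nbr_feats.min(axis=0),               # per-feature minimum
            nbr_feats.std(axis=0),               # per-feature standard deviation
        ])
    return aggregated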

Figure 3: Example of one of the graphs in the dataset. Green nodes are licit transactions and blue ones illicit. For better visualization, the unlabeled nodes are not shown; the full graph can be seen in our Google Colab.

Graph Machine Learning Models

Basics Overview

Note: this is a very broad topic, so at least a little background knowledge is required to follow the next few segments.

Graphs are often denoted as a set of sets G = { V, E, X, Y }.
- V is the set of nodes or vertices.
- E is the set of all edges in the graph.
- X is a matrix whose rows are the feature vectors of the nodes.
- Y is a vector that assigns each node to one of the graph's classes.

Furthermore, there is some other important notation we should use:
- A is the adjacency matrix; it holds information about the graph's connectivity. Another way to store this data is with an edge index.
- D is a diagonal matrix whose elements are the node degrees.
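To make this notation concrete, here is how a tiny graph with all of these components could be represented in PyTorch Geometric, the library we use later in this post (the numbers are made up):

import torch
from torch_geometric.data import Data

# A toy graph with |V| = 3 nodes and |E| = 2 directed edges.
# edge_index is the "edge index" alternative to the adjacency matrix A:
# column i holds the (source, target) pair of edge i.
edge_index = torch.tensor([[0, 1],
                           [1, 2]], dtype=torch.long)

x = torch.tensor([[1.0, 0.5],    # X: one feature vector per node
                  [0.3, 0.8],
                  [0.9, 0.1]])

y = torch.tensor([0, 1, 2])      # Y: a class label per node

graph = Data(x=x, edge_index=edge_index, y=y)
print(graph)  # Data(x=[3, 2], edge_index=[2, 2], y=[3])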

Now we are presented with a problem: given some dataset, can we train a model to predict certain graph qualities? Different tasks require different techniques. One could focus on link prediction or, looking at the broader picture, even graph classification. An example of the former is predicting the side effects of drugs when paired together. [3]

One of the most common tasks in this field is node classification, and it is the task relevant to our problem, since we are trying to predict which nodes are licit and which illicit. We are solving an inductive problem, so we assume all of the nodes in the test graph are unlabeled. To frame our problem better, we asked a couple of important questions before starting our research. [3]

Figure 4: This is a subgraph of the graph in the first picture, but this time we are viewing the unlabeled nodes as well as the labeled ones. In real life, our graph would have all of the nodes unlabeled.

Do all neighbors hold the same weight? Are illicit nodes perhaps more important than licit ones? We ended up with two strategies. First, we treat all neighboring nodes as equal, using a Graph Convolutional Network (GCN). We then compare the result with a Graph Attention Network (GAT) and see which performs better; the latter assumes not all nodes are equally important and assigns them different learned weights. [1, 2, 3]

But how should we measure our classification performance? There are many measures, and we opted for the F1 score. In our particular situation, the worst possible prediction is a false negative: predicting a node to be licit when it is actually illicit is far worse than the other way around, so this score works best in this scenario. [8]

Figure 5: Different performance measures and what they are based on. We can see that F1 takes both recall and precision into consideration.
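For reference, here is a minimal sketch of how the F1 score is computed from raw prediction counts:

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model that misses many illicit nodes (false negatives) is punished:
print(f1_score(tp=50, fp=10, fn=40))  # ~0.667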

GCN and GAT

Each node of the graph is represented as a feature vector. Every one of these is then updated by aggregating information from its neighboring nodes. This process is called graph convolution and is largely inspired by the convolution in traditional convolutional neural networks (CNNs). [1]

This model is just one of many in the graph neural network (GNN) family. They all generate embeddings that are later used for downstream tasks: if two embeddings are close in the embedding space, the corresponding nodes in the graph are similar. [1, 2]

The reason we need these, rather than classic CNNs, is that graphs have no inherent order. A graph's node ordering is arbitrary, so the model's output must not depend on how the graph happens to be drawn or stored. [3]

Figure 6: 3 different visualizations of the same graph from Figure 3.

To overcome this problem, let's look at the basic GNN structure. Each node gets its own computational graph, and we calculate its embedding in 3 crucial steps: first, each neighboring node sends a message; the messages are then aggregated; and finally, the embedding of our node is updated. [1]

Figure 7: The process of one embedding layer for node A. Each of the nodes in the graph goes through this. [1]

More specifically, these are the 3 steps in a GCN:
1. Sending messages: neighboring nodes send their embeddings.
2. Aggregating the messages: taking the average of all received embeddings.
3. Updating: applying a trainable transformation and a nonlinearity to produce the new embedding.

Figure 8: All 3 steps written together. In matrix form: H^(l+1) = σ(D⁻¹ A H^(l) W^(l)), where W^(l) is the trainable matrix at layer l and σ is a nonlinearity that adds expressiveness. [1]
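In code, this matrix form fits in a few lines. Below is a minimal sketch of a single GCN layer in plain PyTorch; note that PyTorch Geometric's GCNConv, which we use later, implements a refined variant with symmetric normalization.

import torch

def gcn_layer(A: torch.Tensor, H: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """One GCN layer: H' = sigma(D^-1 A H W), mean-aggregating neighbor messages."""
    deg = A.sum(dim=1).clamp(min=1)          # node degrees (avoid division by zero)
    D_inv = torch.diag(1.0 / deg)            # D^-1
    return torch.relu(D_inv @ A @ H @ W)     # aggregate, transform, apply nonlinearity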

The only difference with GAT is that instead of taking a plain average, we weight each message with an attention coefficient, which is also learned. This way the model can pay more attention to the important bits of the training data. In the picture above, K layers are used, K being the last one: we aggregate information from up to K hops away, so only nodes within a path of length K or less can influence the final embedding. [1]

The last important bit of knowledge is how to train the attention weights. Using an attention mechanism e, we calculate e_AB, the importance of node A's message to node B from Figure 7. [1]

Figure 9: The e in this picture is our attention mechanism LeakyReLU.

Afterward, we use the softmax function to normalize the weights, so that the attention weights over all of a node's neighbors sum to 1. One could also optimize the output using a multi-head attention algorithm, which computes several embeddings instead of just one and then takes their average (the nonlinearity is applied to the final result). [1]
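Here is a minimal sketch of these two steps, computing raw attention scores with LeakyReLU and normalizing them with softmax, for a single node's neighborhood (the variable names are ours):

import torch
import torch.nn.functional as F

def attention_coefficients(h_target, h_neighbors, W, a):
    """e_ij = LeakyReLU(a^T [W h_i || W h_j]), then softmax over the neighbors j.

    h_target:    (d,) embedding of the node being updated
    h_neighbors: (num_nbrs, d) embeddings of its neighbors
    W:           (d, d') shared linear transform
    a:           (2 * d',) attention vector
    """
    z_i = h_target @ W                              # (d',)
    z_j = h_neighbors @ W                           # (num_nbrs, d')
    concat = torch.cat(
        [z_i.expand(z_j.size(0), -1), z_j], dim=1)  # (num_nbrs, 2 * d')
    e = F.leaky_relu(concat @ a, negative_slope=0.2)
    return F.softmax(e, dim=0)                      # weights sum to 1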

Code Review

The code segments shown here are only the important parts that we thought could use some explaining. The whole program is available in our Google Colab. It is built on PyTorch Geometric, a module on top of PyTorch that provides a set of tools for deep learning on graphs.

First, let's see how to interpret the data. It comes in 3 files: one documents the 3 possible classes, the second holds all of the features, and the last gives us the edge index.

Figure 10: This is an example of how the first 5 features look for the first and the last graph.
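Here is a sketch of how the three files can be read with pandas; the file names below follow the publicly distributed version of the Elliptic dataset, so adjust the paths to your copy:

import pandas as pd

# Node labels: "1" = illicit, "2" = licit, "unknown" = unlabeled
classes = pd.read_csv("elliptic_txs_classes.csv")

# Features per node; the first column is the transaction id,
# the second the time step (1-49), followed by the feature values
features = pd.read_csv("elliptic_txs_features.csv", header=None)

# Edge index: one (source, target) transaction-id pair per row
edges = pd.read_csv("elliptic_txs_edgelist.csv")

print(classes["class"].value_counts())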

We then used PyTorch Geometric's dataset implementation for our graphs. Index 0 represents the illicit label, 1 the licit label, and 2 all of the unlabeled nodes. There are some interesting statistics to be seen in our Colab, so we recommend you check it out! We used 39 of the graphs as the training set, on which we trained the model, and the other 10 as the test set.
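A hedged sketch of that split, assuming graphs is a list of 49 per-timestep Data objects ordered by time step (building them requires remapping transaction ids to local node indices; see our Colab for the details):

from torch_geometric.loader import DataLoader

# First 39 time steps for training, the remaining 10 for testing
train_graphs, test_graphs = graphs[:39], graphs[39:]

train_loader = DataLoader(train_graphs, batch_size=8, shuffle=True)
test_loader = DataLoader(test_graphs, batch_size=8)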

Then we implemented our models. We use PyTorch Geometric's implementations of the convolution layers. This works for both GCN and GAT because we have a separate configuration part where we define the parameters and decide which convolution model to use.

# Model definition
from typing import Dict

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv, BatchNorm

# Just like Euler did not choose the letter e because it is
# the first letter of his name, we didn't name our model
# after our initials either …
class SM2GNN(torch.nn.Module):
    def __init__(self,
                 in_channels: int,
                 hidden_channels: int,
                 out_channels: int,
                 conv_model: type[nn.Module],
                 nonlinearity: nn.Module,
                 num_hidden: int = 3,
                 conv_args: Dict = {},
                 dropout: float = 0.5):
        super(SM2GNN, self).__init__()

        self.convs = nn.ModuleList()
        self.bns = nn.ModuleList()

        # Input layer: feature vectors -> hidden embeddings
        self.convs.append(
            conv_model(in_channels, hidden_channels, **conv_args))
        self.bns.append(BatchNorm(hidden_channels))

        # Hidden layers
        for _ in range(num_hidden):
            self.convs.append(
                conv_model(hidden_channels, hidden_channels, **conv_args))
            self.bns.append(BatchNorm(hidden_channels))

        # Output layer: hidden embeddings -> class scores
        self.convs.append(
            conv_model(hidden_channels, out_channels, **conv_args))
        self.bns.append(BatchNorm(hidden_channels))

        self.dropout = dropout
        self.nonlinearity = nonlinearity

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.bns:
            bn.reset_parameters()

    def forward(self, x, edge_index, return_embeddings: bool = False):
        # All layers except the last: convolve, activate, dropout, normalize
        for i in range(len(self.convs) - 1):
            x = self.convs[i](x, edge_index)
            x = self.nonlinearity(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
            x = self.bns[i](x)

        # Skip the classification head when only embeddings are needed
        if not return_embeddings:
            x = self.convs[-1](x, edge_index)
            x = F.softmax(x, dim=1)
        return x

Interpretation of the parameters:
- in_channels: the dimension (shape) of the input feature vectors
- hidden_channels: the dimension of all embeddings except the last one
- out_channels: the dimension of the final embedding
- conv_model: decides whether we use GAT or GCN
- num_hidden: the number of hidden convolution layers
- dropout: the fraction of units dropped during training to prevent overfitting
- nonlinearity: the activation function of choice

The bns part is batch normalization, added for better performance. The GCN and GAT process described in the theory section above can be seen in the forward step.

Our base model configuration, stored as a simple JSON file, uses the following parameter values (conv_model could also be GCNConv). The number n in GATConv(n) is the number of attention heads.

{
    "in_channels": -1,
    "hidden_channels": 8,
    "out_channels": 2,
    "num_hidden": 2,
    "conv_model": "GATConv(4)",
    "conv_args": {},
    "dropout": 0.5,
    "nonlinearity": "leaky_relu(0.2)"
}
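To illustrate how such a configuration entry could be turned into a model, here is a hypothetical sketch of parsing the conv_model string; the real ModelConfig.create() in our Colab may differ in the details.

from functools import partial

import torch
from torch_geometric.nn import GCNConv, GATConv

def parse_conv_model(name: str):
    """Map a configuration string like 'GATConv(4)' to a layer constructor."""
    if name == "GCNConv":
        return GCNConv
    if name.startswith("GATConv("):
        heads = int(name[len("GATConv("):-1])
        # concat=False averages the heads, keeping the output dimension fixed
        return partial(GATConv, heads=heads, concat=False)
    raise ValueError(f"Unknown conv model: {name}")

model = SM2GNN(in_channels=-1, hidden_channels=8, out_channels=2,
               conv_model=parse_conv_model("GATConv(4)"),
               nonlinearity=torch.nn.LeakyReLU(0.2), num_hidden=2)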

Then we implemented our train and test functions. First, we train the model. The function iterates over a fixed number of epochs specified in the training configuration, which also defines the optimizer and the loss function. We process the training set in smaller batches and train the model only on the labeled nodes.

import numpy as np
from tqdm import trange
from torch_geometric.loader import DataLoader

# TrainConfig, device, ID_UNLABELED, and log are defined earlier in our Colab
def train(model: SM2GNN, data_loader: DataLoader, config: TrainConfig,
          shortlog: bool = False):
    best_tuple = (-1, np.inf)  # (best epoch, lowest loss) seen so far
    dataset = data_loader.dataset
    loss_fn = config.get_loss()
    optimizer = config.get_optimizer_for(model)
    model.to(device)
    model.reset_parameters()
    model.train()
    for epoch in trange(config.num_epoch, unit="Epochs", desc="Training"):
        epoch_loss = 0

        for batch in data_loader:
            optimizer.zero_grad()

            batch.to(device)
            out = model(batch.x.float(), batch.edge_index)
            # Compute the loss on the labeled nodes only
            label_mask = (batch.y != ID_UNLABELED)
            loss = loss_fn(out[label_mask],
                           batch.y[label_mask])
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item() * batch.num_graphs
        avg_epoch_loss = epoch_loss / len(dataset)
        log(epoch, avg_epoch_loss, best_tuple, one_line=shortlog)
    print(f"Epoch: {epoch}, Loss: {avg_epoch_loss:.4f}")

A quick glance at some of the components:
- loss_fn: the loss function, which quantifies how well the model's predictions match the true labels of the training data. Here we account for the unbalanced dataset by weighting each class; nodes marked illicit get a higher weight (a sketch of such weighting follows below).
- optimizer: the algorithm that optimizes the weights and biases used to transform the input data into output predictions.
- label_mask: used because we decided to train the model on the labeled nodes only.
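One plausible way to realize the class weighting mentioned above; how the single class_weight value from our configuration maps to per-class weights is our guess, not necessarily the exact Colab code.

import torch
import torch.nn as nn

# Hypothetical mapping: weight w for the illicit class (index 0)
# and 1 - w for the licit class (index 1)
w = 0.786
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([w, 1.0 - w]))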

Our base training configuration follows the same JSON pattern, specifying the number of epochs, batch size, optimizer, learning rate, weight decay, loss function, and class weight; an example of these fields appears in the hyperparameter search results below.

We optimized training with repeated k-fold cross-validation. Instead of training and validating the model just once, we split the training set into k equally sized parts. The model is then trained on k-1 parts and validated on the remaining one. This process is repeated k times, with each part, or fold, used once as the validation set. This reduces the impact of randomness in the train-validation split and makes the evaluation more reliable. This is the algorithm:

import numpy as np
from torch.utils.data import SubsetRandomSampler
from sklearn.model_selection import RepeatedKFold
from torch_geometric.loader import DataLoader

def kfold(model, config, train_dataset,
          params=None, n_splits=5, batch_size=32):
    history = {'valid_f1': []}

    splits = RepeatedKFold(n_splits=n_splits, n_repeats=3)

    folds = splits.split(np.arange(len(train_dataset)))
    for fold, (train_idx, val_idx) in enumerate(folds):
        # Sample the training and validation folds from the same dataset
        train_sampler = SubsetRandomSampler(train_idx)
        valid_sampler = SubsetRandomSampler(val_idx)
        train_loader = DataLoader(train_dataset,
                                  batch_size=batch_size, sampler=train_sampler)
        valid_loader = DataLoader(train_dataset,
                                  batch_size=batch_size, sampler=valid_sampler)

        train(model, train_loader, config)
        f1 = test(model, valid_loader)
        history["valid_f1"].append(f1)

    avg_test_f1 = np.mean(history["valid_f1"])
    print(f"Average F1: {avg_test_f1} for params: {params}")
    return avg_test_f1


kfold(model, config, train_dataset, n_splits=5)

- train_dataset: the 39 graphs we used for training
- n_splits: the number of folds
- n_repeats: the number of repetitions of the whole process

This method is very stable in terms of variance, and even more so given good hyperparameters. Because hyperparameters are fixed rather than learned from data, we have to choose them ourselves.

How did we do that? First, we trained our base model with the configuration above. Then we used Optuna, an open-source Python library that implements the process of finding the set of hyperparameters that maximizes the model's performance on a given metric, in our case the F1 score from k-fold cross-validation. We used the Tree-structured Parzen Estimator (TPE) algorithm for the search.

It decides whether it is better to use GCN or GAT, which nonlinearity to use, the number of convolutional layers, the dropout value, and the dimensions of our embeddings. It also tunes some training parameters, for example, the weight decay for our optimizer.

# Hyperparameter search
import optuna
from optuna.samplers import TPESampler


def objective(trial):
    global X_train
    model_config = ModelConfig(
        hidden_channels=trial.suggest_int('hidden_channels', 2, 6),
        num_hidden=trial.suggest_int('num_hidden', 1, 4),
        conv_model=trial.suggest_categorical(
            'conv_model', ['GATConv(1)', 'GATConv(2)', 'GATConv(3)',
                           'GATConv(4)', 'GCNConv']),
        conv_args={},
        dropout=trial.suggest_float('dropout', 0.1, 0.5),
        nonlinearity=trial.suggest_categorical(
            'nonlinearity', ['relu', 'leaky_relu(0.2)', 'sigmoid'])
    )
    config = TrainConfig(
        num_epoch=64,
        batch_size=8,
        test_split=0.2,  # Not used in kfold
        optimizer_name='adam',
        lr=trial.suggest_float('lr', 1e-4, 1e-1, log=True),
        weight_decay=trial.suggest_float('weight_decay', 1e-5, 1e-1, log=True),
        loss_name='cross_entropy',
        class_weight=trial.suggest_float('class_weight', 0.6, 0.9, log=True)
    )
    model = model_config.create()
    return kfold(model, config, X_train, n_splits=5)


(X_train, _), (loader_X_train, loader_X_test) = split_and_create_loaders(config, dataset)
study = optuna.create_study(direction='maximize', sampler=TPESampler())

try:
    study.optimize(objective, n_trials=5, show_progress_bar=True)
except KeyboardInterrupt:
    pass
except Exception as e:
    print(e)

- trial: Optuna initializes a trial, runs the training as described above, and uses the result to pick better parameters for the next trial
- n_trials: the number of trials to run during optimization

Here is an example of the parameters Optuna found for our model. The first block is the model configuration and the second specifies the training parameters:

{
    "in_channels": -1,
    "hidden_channels": 6,
    "out_channels": 2,
    "num_hidden": 1,
    "conv_model": "GATConv(3)",
    "conv_args": {},
    "dropout": 0.122052642841159,
    "nonlinearity": "relu"
}
{
    "num_epoch": 64,
    "batch_size": 8,
    "test_split": 0.2,
    "random_state": null,
    "optimizer_name": "adam",
    "lr": 0.023517346536394833,
    "weight_decay": 1.2773107092679232e-05,
    "loss_name": "cross_entropy",
    "class_weight": 0.786639525314248
}

We validate and test our model in our test function. The implementation is very standard and can, of course, be reviewed in our linked code.
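For completeness, here is a minimal sketch of what such a test function might look like; device and ID_UNLABELED are assumed to be defined as in the train function, and our actual version additionally tracks per-graph statistics.

import torch
from sklearn.metrics import f1_score

@torch.no_grad()
def test(model, data_loader):
    """Compute the F1 score over the labeled nodes of all batches."""
    model.eval()
    preds, labels = [], []
    for batch in data_loader:
        batch.to(device)
        out = model(batch.x.float(), batch.edge_index)
        mask = batch.y != ID_UNLABELED          # evaluate labeled nodes only
        preds.append(out[mask].argmax(dim=1).cpu())
        labels.append(batch.y[mask].cpu())
    # Treat the illicit class (index 0) as the positive class
    return f1_score(torch.cat(labels), torch.cat(preds), pos_label=0)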

Results

First, we can see from the configuration above that a GAT with 3 attention heads was the optimal model for maximizing our F1 score. It is important to note that the score comes from k-fold cross-validation, which makes it more reliable. The dropout is relatively small, and the result suggests using a single hidden layer with embeddings of dimension 6.

Now let's evaluate our model. We obtained a best F1 score of 0.8331 and an accuracy of 0.9315. The final score is calculated over all prediction outcomes: we saved all true positives, true negatives, false positives, and false negatives, and computed the score from those counts. Importantly, the model never interacted with our test set before this evaluation.

Figure 11: Accuracy and F1 score for all 10 graphs in our test set.

As an example, we can label the unlabeled nodes of one graph and visualize it. The blue nodes are illicit and the green ones licit.

Figure 12: This is the smallest of the graphs in our dataset. While the coloring may be inaccurate, because this graph could have been in the training set, we think it visualizes the result very well.

As the last part of the evaluation, we look at the node embeddings we generated. As discussed earlier, if two embeddings are close, the corresponding nodes are similar. Orange represents the illicit nodes and blue the licit ones.

Figure 13: The embeddings in 2 dimensions for all of the 10 graphs in the test set.
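Here is a sketch of how such a 2-D view can be produced using the model's return_embeddings flag; t-SNE is our choice of dimensionality reduction here, and other methods would work just as well.

import matplotlib.pyplot as plt
import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def plot_embeddings(model, data):
    """Project the hidden embeddings of labeled nodes to 2-D and color by class."""
    model.eval()
    emb = model(data.x.float(), data.edge_index, return_embeddings=True)
    mask = data.y != ID_UNLABELED
    emb_2d = TSNE(n_components=2).fit_transform(emb[mask].cpu().numpy())
    colors = ["tab:orange" if y == 0 else "tab:blue" for y in data.y[mask]]
    plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=colors, s=5)
    plt.show()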

Conclusion

This is just one example of how intriguing the results of machine learning on graphs can be. We had an incompletely labeled dataset, yet we still decided to try to label a previously unseen graph, and we obtained relatively good scores.

We also want to point out the interesting and challenging parts of improving our model. In the beginning, we started with pretty bad scores, most of them under 0.5. Those results came from using only GCN layers, paying no attention to the importance of individual nodes. Our output improved slightly when we implemented GAT. A major improvement in the stability of the model came when we implemented k-fold cross-validation, and the next big leap in prediction quality came from finding the right hyperparameters.

This was definitely an interesting experience for all of us, and I think every member of our team can say we learned a lot during this project. If you happen to be eager to learn more about this topic, we definitely recommend diving into the unknown and trying it out yourself!

References

[1] Graph Neural Networks:

[2] Node Embeddings:

[3] Intro:

[4] Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics. 2018 Jul 1.

[5] "How does Bitcoin work?" Bitcoin. Date accessed: March 25, 2023.

[6] Elliptic.

[7] Label Propagation:

[8] Korstanje, Joos. "The F1 Score." Medium, Towards Data Science. Date published: 31 Aug. 2021.

[9] Figure 1: Social Network Analytics.
