Bitcoin Research with a Transaction Graph Dataset

Abstract

Bitcoin, introduced in 2008 by Satoshi Nakamoto, established a new digital economy in which value can be stored and transferred in a fully decentralized manner, without a central authority. This paper introduces a large-scale dataset in the form of a transaction graph representing transactions between Bitcoin users, along with a set of tasks and baselines.

The graph includes 252 million nodes and 785 million edges.

Each node and edge is timestamped, providing temporal context. For supervised tasks, we provide:

  1. A 33,000-node set labeled by entity type
  2. About 100,000 Bitcoin addresses (101,186 in the released files) labeled with entity names and types

To our knowledge, this is the largest publicly available Bitcoin transaction dataset, and it is designed to overcome the limitations of existing datasets. We trained several graph neural network models to predict node labels, establishing baselines for future research. Several use cases illustrate the dataset's applicability beyond Bitcoin analysis. All data and source code are publicly available to ensure reproducibility.

Background & Summary

Bitcoin is a digital economy in which value can be stored and transferred without a central authority.

Although Bitcoin's transaction data is public, high-quality public datasets for researchers remain scarce. Existing datasets have the following limitations:

  1. Elliptic datasets:

    • Limited to binary classification (licit/illicit)
    • Primarily useful for money laundering research
  2. Address-only datasets:

    • Require researchers to construct graphs
    • Demand specialized Bitcoin knowledge

Our dataset addresses these gaps by providing a ready-built transaction graph with multi-class entity labels and temporal information.

Methods

Graph Construction

Data Extraction

We ran a Bitcoin Core node to download the complete transaction ledger and parsed the first 700,000 blocks of the blockchain.
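A Bitcoin Core node exposes block data through its JSON-RPC interface. The sketch below only builds the request bodies for the `getblockhash` and `getblock` RPC methods and does not contact a node. The helper names (`rpc_payload`, `block_hash_requests`, `block_request`) are illustrative, and whether the authors parsed blocks through RPC or read them from disk is not stated in the text.

```python
import json


def rpc_payload(method, params, request_id=0):
    """Serialize one JSON-RPC request body for a Bitcoin Core node."""
    return json.dumps(
        {"jsonrpc": "1.0", "id": request_id, "method": method, "params": params}
    )


def block_hash_requests(first_height, last_height):
    """Yield one getblockhash request per height in [first_height, last_height]."""
    for height in range(first_height, last_height + 1):
        yield rpc_payload("getblockhash", [height], request_id=height)


def block_request(block_hash):
    """Request a block with fully decoded transactions (verbosity 2)."""
    return rpc_payload("getblock", [block_hash, 2])


if __name__ == "__main__":
    # Heights 0..699999 would cover the first 700,000 blocks.
    for body in block_hash_requests(0, 2):
        print(body)
```

Each body would be sent as an authenticated HTTP POST to the node's RPC port; that transport step is left out so the sketch stays runnable without a node.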

Node Definition

Nodes represent clusters of locking scripts, identified using heuristics from previous research.
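The text does not name the heuristics used. As an illustration only, the sketch below applies the widely used common-input-ownership heuristic, under which all locking scripts spent together in one transaction are assumed to belong to the same entity. Clusters are merged with a union-find structure; the function and class names are hypothetical.

```python
class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited element at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_scripts(transactions):
    """Group locking scripts that appear together as inputs.

    transactions: iterable of lists, each holding the locking scripts
    spent by one transaction (assumed input format).
    """
    uf = UnionFind()
    for input_scripts in transactions:
        for script in input_scripts:
            uf.find(script)  # register scripts that are never merged
        for other in input_scripts[1:]:
            uf.union(input_scripts[0], other)
    clusters = {}
    for script in list(uf.parent):
        clusters.setdefault(uf.find(script), set()).add(script)
    return list(clusters.values())


if __name__ == "__main__":
    print(cluster_scripts([["a", "b"], ["b", "c"], ["d"]]))
```

Scripts `a`, `b`, and `c` end up in one cluster because `b` links the first two transactions, while `d` stays alone.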

Edge Creation

Edges represent value transfers between nodes, calculated using:

value_received = (output_value - input_value)
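One way to read this formula is per node and per transaction: the value a node receives is what the transaction's outputs pay to it minus what it contributed as inputs. The sketch below follows that reading and assumes that all inputs of a transaction belong to a single sending node, as they would under a common-input clustering heuristic. The tuple layout and function names are illustrative, not the dataset's schema.

```python
from collections import defaultdict


def value_received(inputs, outputs):
    """Net value per node: outputs paid to it minus inputs it spent.

    inputs, outputs: lists of (node_id, value_in_satoshi) pairs.
    """
    net = defaultdict(int)
    for node, value in outputs:
        net[node] += value
    for node, value in inputs:
        net[node] -= value
    return dict(net)


def transaction_edges(inputs, outputs):
    """Return (sender, receiver, value) edges for one transaction."""
    senders = {node for node, _ in inputs}
    if len(senders) != 1:
        raise ValueError("this sketch assumes exactly one sending node")
    (sender,) = senders
    net = value_received(inputs, outputs)
    # Change returned to the sender lowers its net spend and creates no edge.
    return [
        (sender, node, value)
        for node, value in net.items()
        if node != sender and value > 0
    ]


if __name__ == "__main__":
    # A spends 100, pays 60 to B, gets 35 back as change; 5 is the fee.
    print(transaction_edges([("A", 100)], [("B", 60), ("A", 35)]))
```

In this example the only edge is from `A` to `B` with value 60, and `A` has a net value of -65 (the payment plus the fee).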

We excluded:

Attributes

Both nodes and edges include temporal attributes based on block indices. Key attributes:

Node Attributes       Edge Attributes
Transaction counts    Transaction value
Degree metrics        Block index
Cluster properties    Direction

Node Labeling

We identified and labeled entities across 10 categories:

  1. Individual
  2. Mining
  3. Exchange
  4. Marketplace
  5. Gambling
  6. Bet
  7. Faucet
  8. Mixer
  9. Ponzi
  10. Ransomware

Labeling Pipeline

Our multi-source approach included:

  1. BitcoinTalk Forum:

    • 14 million messages analyzed
    • Addresses extracted from posts and profiles
  2. ChatGPT Analysis:

    • Entity identification from context
    • Transaction pattern matching
  3. Additional Sources:

    • Exchange-provided addresses
    • Ransomware datasets
    • Government sanctions lists (OFAC SDN)
    • Mining signatures

This produced 101,186 labeled addresses and 33,000 labeled nodes.
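Extracting addresses from forum posts can be approximated with a regular expression. The sketch below matches the textual shape of legacy Base58 addresses (starting with 1 or 3) and native SegWit Bech32 addresses (starting with bc1). It does not verify checksums, so it may return strings that are not valid addresses, and it is not necessarily the pattern used in the authors' pipeline.

```python
import re

# Base58 excludes 0, O, I and l; the Bech32 data part excludes 1, b, i and o.
ADDRESS_RE = re.compile(
    r"\b(?:[13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[ac-hj-np-z02-9]{11,71})\b"
)


def extract_addresses(text):
    """Return the distinct address-like strings found in text, sorted."""
    return sorted(set(ADDRESS_RE.findall(text)))


if __name__ == "__main__":
    post = (
        "Donations: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa or "
        "bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq, thanks! Ignore 1abc."
    )
    for address in extract_addresses(post):
        print(address)
```

A production pipeline would normally add a checksum validation step after the regular expression to discard false positives.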

Data Records

The complete dataset includes:

  1. BitcoinTalk Threads:

    • 546,440 threads
    • 14 million messages
    • JSON format
  2. Labeled Addresses:

    • 101,186 addresses
    • Entity type and source
    • CSV format
  3. Graph Database:

    • PostgreSQL format
    • 252 million nodes
    • 785 million edges
  4. Supplemental Files:

    • Alternative source data
    • Labeling documentation
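As a rough illustration of working with the labeled-address CSV, the sketch below counts addresses per entity type using only the standard library. The column names (`address`, `entity_type`, `source`) and the sample rows are placeholders invented for this example; the released file's actual header should be checked before use.

```python
import csv
import io
from collections import Counter


def count_by_entity_type(csv_file, type_column="entity_type"):
    """Count rows per entity type in an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    return Counter(row[type_column] for row in reader)


if __name__ == "__main__":
    # Placeholder rows; real addresses and sources come from the dataset.
    sample = io.StringIO(
        "address,entity_type,source\n"
        "addr_1,Exchange,forum\n"
        "addr_2,Mining,signature\n"
        "addr_3,Exchange,forum\n"
    )
    print(count_by_entity_type(sample))
```

With the real file, `open(path, newline="")` would replace the in-memory sample.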

Technical Validation

We validated dataset quality through node classification tasks using:

  1. Graph Neural Networks:

    • GCN
    • GraphSAGE
    • GAT
    • GIN
  2. Traditional Model:

    • Gradient Boosting Classifier

Results:

Model        Macro-F1 Score
GAT          0.64
GIN          0.63
GraphSAGE    0.62
GCN          0.60
GBC          0.57

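All four graph neural networks outperform the gradient boosting baseline (macro-F1 of 0.60 to 0.64 versus 0.57), which suggests that the graph structure carries useful signal for entity classification.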
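Macro-F1 is the unweighted mean of the per-class F1 scores, so every entity category counts equally regardless of how many labeled nodes it has. The sketch below computes it from scratch for a toy prediction; the labels are made up for illustration and are not taken from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = ["Exchange", "Exchange", "Mining", "Mixer"]
    predicted = ["Exchange", "Mining", "Mining", "Mixer"]
    print(round(macro_f1(truth, predicted), 4))
```

In this toy case the per-class F1 scores are 2/3 for Exchange, 2/3 for Mining, and 1 for Mixer, giving a macro-F1 of 7/9, while plain accuracy would be 3/4.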

Usage Notes

Database Restoration

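The PostgreSQL database can be restored with pg_restore. The recommended command below runs four parallel jobs (-j 4), reads a directory-format archive (-Fd), and skips restoring object ownership (-O). Replace username, dbname, and dataset with your own user, target database, and archive path:

pg_restore -j 4 -Fd -O -U username -d dbname dataset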

Potential Use Cases

  1. Entity interaction analysis
  2. Temporal network evolution
  3. Comparative network studies
  4. Pre-training for financial networks
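Because every edge carries a block index, temporal network evolution can be studied by taking cumulative snapshots of the graph. The sketch below counts the nodes and edges visible up to a series of block-index checkpoints. The `(source, target, block_index)` tuple layout is an assumption made for this example, not the database schema.

```python
def snapshot(edges, max_block):
    """Return (nodes, edges) restricted to edges with block_index <= max_block."""
    kept = [(src, dst) for src, dst, block in edges if block <= max_block]
    nodes = {node for pair in kept for node in pair}
    return nodes, kept


def growth_curve(edges, checkpoints):
    """Return (checkpoint, node_count, edge_count) for each checkpoint."""
    edges = list(edges)
    curve = []
    for block in checkpoints:
        nodes, kept = snapshot(edges, block)
        curve.append((block, len(nodes), len(kept)))
    return curve


if __name__ == "__main__":
    toy_edges = [("a", "b", 1), ("b", "c", 5), ("a", "c", 9)]
    for row in growth_curve(toy_edges, [1, 5, 9]):
        print(row)
```

On the full graph the same filter would be pushed into a SQL query on the block-index column rather than applied in memory.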

FAQ

Q: How does this dataset compare to existing Bitcoin datasets?
A: Our dataset is significantly larger and more comprehensive than alternatives like Elliptic datasets, with detailed entity typing and temporal information.

Q: What computational resources are required?
A: The full database requires ~120GB storage. We recommend a server with 32GB+ RAM for efficient processing.

Q: Can this dataset be used for non-Bitcoin research?
A: Yes, the methodologies and graph structures are applicable to other transaction networks and financial systems.

Q: How frequently is the dataset updated?
A: Currently this represents a static snapshot, but the methodology could be applied to ongoing data collection.

Q: What license applies to this dataset?
A: The dataset is released under CC-BY 4.0, allowing both academic and commercial use with attribution.