Bitcoin Research with a Transaction Graph Dataset

Abstract

Bitcoin, proposed in 2008 by Satoshi Nakamoto and launched in January 2009, established a new digital economy where value can be stored and transferred in a fully decentralized manner, eliminating the need for a central authority. This paper introduces a large-scale dataset in the form of a transaction graph representing Bitcoin user transactions, along with a set of tasks and baselines.

The graph includes:

252 million nodes
785 million edges
Coverage of nearly 13 years
670 million transactions

Each node and edge is timestamped, providing temporal context. For supervised tasks, we provide:

A 33,000-node set labeled by entity type
About 100,000 Bitcoin addresses labeled with entity names and types

This represents the largest publicly available Bitcoin transaction dataset, designed to overcome limitations of existing datasets. We trained various graph neural network models to predict node labels, establishing baselines for future research. Several use cases demonstrate the dataset's applicability beyond Bitcoin analysis. All data and source code are publicly available to ensure reproducibility.

Background & Summary

Bitcoin represents a groundbreaking digital economy where value can be stored and transferred without a central authority. Key figures:

Daily users: 270,000 (2023)
Annual transaction volume: $8.6 trillion (2023)
Research interest: more than 30,000 papers per year

Although Bitcoin's transaction data is public, few high-quality public datasets are available to researchers. Existing datasets have the following limitations:

Elliptic datasets:
- Limited to binary classification (licit/illicit)
- Primarily useful for money laundering research
Address-only datasets:
- Require researchers to construct graphs
- Demand specialized Bitcoin knowledge

Our dataset addresses these gaps by providing:

Direct graph representation
Detailed entity typing
Temporal information
Broad applicability

Methods

Graph Construction

Data Extraction

We ran a Bitcoin Core node to download the complete transaction ledger, then parsed the first 700,000 blocks of the blockchain.

Node Definition

Nodes represent clusters of locking scripts, identified using heuristics from previous research. Key aspects:

874 million scripts analyzed
252 million script clusters identified
Each cluster assigned a unique integer alias
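
The text does not spell out which clustering heuristics were used. As an illustration only, the sketch below assumes the widely used common-input-ownership heuristic (all input scripts of one transaction belong to the same owner) and merges scripts with a union-find structure before assigning integer aliases. The names `UnionFind` and `cluster_scripts`, and the `(input_scripts, output_scripts)` transaction layout, are hypothetical.

```python
class UnionFind:
    """Disjoint-set structure over hashable items (here: locking scripts)."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        # Register unseen items as their own root.
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited item straight at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_scripts(transactions):
    """Map each script to an integer cluster alias.

    `transactions` is an iterable of (input_scripts, output_scripts) pairs,
    each a list of script identifiers (hypothetical layout).
    Assumed heuristic: all input scripts of one transaction share an owner.
    """
    uf = UnionFind()
    for input_scripts, output_scripts in transactions:
        for script in list(input_scripts) + list(output_scripts):
            uf.find(script)  # every script gets at least a singleton cluster
        for script in input_scripts[1:]:
            uf.union(input_scripts[0], script)

    alias_of_root = {}
    alias_of_script = {}
    for script in list(uf.parent):
        root = uf.find(script)
        # Hand out aliases 0, 1, 2, ... in order of first appearance.
        alias = alias_of_root.setdefault(root, len(alias_of_root))
        alias_of_script[script] = alias
    return alias_of_script
```

At the scale reported here (874 million scripts), an in-memory dictionary like this would likely need to be replaced by an array-based or on-disk structure; the sketch only shows the merging logic.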

Edge Creation

Edges represent value transfers between nodes, calculated using:

value_received = (output_value - input_value)

We excluded:

CoinJoin transactions (privacy-focused)
Colored coin transactions (non-Bitcoin assets)
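
One possible reading of the value formula above is a per-cluster net amount: what a cluster receives in a transaction's outputs minus what it spends in the inputs, so that change returned to the sender is not counted as a transfer. The sketch below follows that reading. The function names, the `(cluster_id, value)` tuples, and the rule of skipping transactions without exactly one sending cluster are assumptions made for illustration, not the authors' procedure.

```python
from collections import defaultdict


def net_value_received(inputs, outputs):
    """Return {cluster_id: output_value - input_value} for one transaction.

    `inputs` and `outputs` are lists of (cluster_id, value_in_satoshis).
    """
    net = defaultdict(int)
    for cluster, value in outputs:
        net[cluster] += value
    for cluster, value in inputs:
        net[cluster] -= value
    return dict(net)


def transaction_edges(inputs, outputs):
    """Turn one transaction into (sender, receiver, value) edges.

    Assumption: only transactions with exactly one net-sending cluster
    are converted; anything else is skipped as ambiguous.
    """
    net = net_value_received(inputs, outputs)
    senders = [cluster for cluster, value in net.items() if value < 0]
    if len(senders) != 1:
        return []
    sender = senders[0]
    return [(sender, cluster, value) for cluster, value in net.items() if value > 0]


# Cluster 1 spends 100,000 satoshis: 60,000 go to cluster 2,
# 39,000 come back as change, and 1,000 is the miner fee.
edges = transaction_edges(
    inputs=[(1, 100_000)],
    outputs=[(2, 60_000), (1, 39_000)],
)
```

In this example the only edge is from cluster 1 to cluster 2 with value 60,000; the change output cancels against the input and the fee appears on no edge.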

Attributes

Both nodes and edges include temporal attributes based on block indices. Key attributes:

Node Attributes	Edge Attributes
Transaction counts	Transaction value
Degree metrics	Block index
Cluster properties	Direction
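
Because every edge carries a block index, a temporal snapshot of the graph can be taken by filtering on a block range and recomputing degree metrics on the result. The following is a minimal sketch under an assumed edge layout of `(source, target, value, block_index)`; the actual column names in the database may differ.

```python
from collections import Counter


def snapshot(edges, first_block, last_block):
    """Keep edges whose block index lies in [first_block, last_block]."""
    return [edge for edge in edges if first_block <= edge[3] <= last_block]


def degrees(edges):
    """Count incoming and outgoing edges per node."""
    in_degree, out_degree = Counter(), Counter()
    for source, target, _value, _block in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


# Hypothetical edges: (source, target, value in satoshis, block index).
all_edges = [
    (1, 2, 60_000, 100),
    (2, 3, 10_000, 150),
    (1, 3, 5_000, 400),
]
early = snapshot(all_edges, 0, 200)
in_degree, out_degree = degrees(early)
```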

Node Labeling

We identified and labeled entities across 10 categories:

Individual
Mining
Exchange
Marketplace
Gambling
Bet
Faucet
Mixer
Ponzi
Ransomware

Labeling Pipeline

Our multi-source approach included:

BitcoinTalk Forum:
- 14 million messages analyzed
- Addresses extracted from posts and profiles
ChatGPT Analysis:
- Entity identification from context
- Transaction pattern matching
Additional Sources:
- Exchange-provided addresses
- Ransomware datasets
- Government lists (SDN)
- Mining signatures

This produced 101,186 labeled addresses and 33,000 labeled nodes.
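
The address-extraction step for forum posts can be approximated with a regular expression. The sketch below is an assumption about how such extraction might look, not the pipeline's actual code: it matches the shape of legacy Base58 addresses and `bc1` Bech32 addresses without validating checksums, so a real pipeline would need an additional validation pass.

```python
import re

# Approximate shapes only; neither pattern verifies the address checksum.
LEGACY = r"[13][a-km-zA-HJ-NP-Z1-9]{25,34}"  # Base58, no 0, O, I, l
BECH32 = r"bc1[ac-hj-np-z02-9]{11,71}"       # Bech32 charset, no 1, b, i, o
ADDRESS_RE = re.compile(rf"\b(?:{LEGACY}|{BECH32})\b")


def extract_addresses(text):
    """Return candidate Bitcoin addresses in order of first appearance."""
    found = []
    for match in ADDRESS_RE.findall(text):
        if match not in found:
            found.append(match)
    return found


post = (
    "Tips welcome at 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa, or use "
    "bc1qar0srrr7xfkvy5l643lydnw9re59gtzzwf5mdq. "
    "Again: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"
)
candidates = extract_addresses(post)
```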

Data Records

The complete dataset includes:

BitcoinTalk Threads:
- 546,440 threads
- 14 million messages
- JSON format
Labeled Addresses:
- 101,186 addresses
- Entity type and source
- CSV format
Graph Database:
- PostgreSQL format
- 252 million nodes
- 785 million edges
Supplemental Files:
- Alternative source data
- Labeling documentation

Technical Validation

We validated dataset quality through node classification tasks using:

Graph Neural Networks:
- GCN
- GraphSAGE
- GAT
- GIN
Traditional Model:
- Gradient Boosting Classifier

Results:

Model	Macro-F1 Score
GAT	0.64
GIN	0.63
GraphSAGE	0.62
GCN	0.60
GBC	0.57

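With macro-F1 scores between 0.57 and 0.64 across the 10 entity categories, these baselines show that the graph carries usable signal for entity classification while leaving clear room for improvement.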
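
For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so rare entity types count as much as common ones. The sketch below computes it in plain Python; the function name and the toy labels are illustrative and not taken from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


y_true = ["exchange", "exchange", "mining", "mixer"]
y_pred = ["exchange", "mining", "mining", "mixer"]
score = macro_f1(y_true, y_pred)
```

Here the per-class F1 scores are 2/3 for exchange, 2/3 for mining, and 1 for mixer, so the macro-F1 is 7/9, roughly 0.78.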

Usage Notes

Database Restoration

The PostgreSQL database can be restored using pg_restore with recommended settings:

pg_restore -j 4 -Fd -O -U username -d dbname dataset
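
When the restore is scripted, the same command can be assembled as an argument list. The sketch below is one way to do this; `pg_restore_command` is a hypothetical helper, and `username`, `dbname`, and `dataset` are the placeholders from the command above. It assumes the third flag is `-O` (`--no-owner`), since `pg_restore` has no `-0` option.

```python
import shlex


def pg_restore_command(dump_dir, dbname, username, jobs=4):
    """Build the pg_restore argument list for a directory-format dump."""
    return [
        "pg_restore",
        "-j", str(jobs),  # number of parallel restore jobs
        "-Fd",            # archive is in directory format
        "-O",             # --no-owner: do not restore object ownership
        "-U", username,   # database user to connect as
        "-d", dbname,     # target database (must already exist)
        dump_dir,
    ]


command = pg_restore_command("dataset", "dbname", "username")
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` once PostgreSQL is installed and the target database has been created.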

Potential Use Cases

Entity interaction analysis
Temporal network evolution
Comparative network studies
Pre-training for financial networks

FAQ

Q: How does this dataset compare to existing Bitcoin datasets?
A: Our dataset is significantly larger and more comprehensive than alternatives like Elliptic datasets, with detailed entity typing and temporal information.

Q: What computational resources are required?
A: The full database requires ~120GB storage. We recommend a server with 32GB+ RAM for efficient processing.

Q: Can this dataset be used for non-Bitcoin research?
A: Yes, the methodologies and graph structures are applicable to other transaction networks and financial systems.

Q: How frequently is the dataset updated?
A: The dataset is currently a static snapshot, but the same methodology could be applied to ongoing data collection.

Q: What license applies to this dataset?
A: The dataset is released under CC-BY 4.0, allowing both academic and commercial use with attribution.