Abstract
Bitcoin, introduced in 2008 by Satoshi Nakamoto, established a new digital economy where value can be stored and transferred in a fully decentralized manner, eliminating the need for a central authority. This paper introduces a large-scale dataset in the form of a transaction graph representing Bitcoin user transactions, along with a set of tasks and baselines.
The graph includes:
- 252 million nodes
- 785 million edges
- Coverage of nearly 13 years
- 670 million transactions
Each node and edge is timestamped, providing temporal context. For supervised tasks, we provide:
- A 33,000-node set labeled by entity type
- Nearly 100,000 Bitcoin addresses labeled with entity names and types
This represents the largest publicly available Bitcoin transaction dataset, designed to overcome limitations of existing datasets. We trained various graph neural network models to predict node labels, establishing baselines for future research. Several use cases demonstrate the dataset's applicability beyond Bitcoin analysis. All data and source code are publicly available to ensure reproducibility.
Background & Summary
Bitcoin represents a groundbreaking digital economy where value can be stored and transferred without central authority. Key characteristics include:
- Daily users: 270,000 (2023)
- Annual transaction volume: $8.6 trillion (2023)
- Research interest: 30,000+ annual papers
Although Bitcoin's transaction data is public, high-quality public datasets for researchers remain scarce. Existing datasets have the following limitations:
Elliptic datasets:
- Limited to binary classification (licit/illicit)
- Primarily useful for money laundering research
Address-only datasets:
- Require researchers to construct graphs
- Demand specialized Bitcoin knowledge
Our dataset addresses these gaps by providing:
- Direct graph representation
- Detailed entity typing
- Temporal information
- Broad applicability
Methods
Graph Construction
Data Extraction
We established a Bitcoin Core node to download the complete transaction ledger, parsing the first 700,000 blocks of the blockchain.
Node Definition
Nodes represent clusters of locking scripts, identified using heuristics from previous research. Key aspects:
- 874 million scripts analyzed
- 252 million script clusters identified
- Each cluster is assigned a unique integer alias
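The text does not say which clustering heuristics were used. As an illustration only, the sketch below assumes the widely used common-input-ownership heuristic (scripts spent together as inputs of one transaction are taken to belong to the same entity) and implements it with a union-find structure. The transaction layout (dicts with "inputs" and "outputs" lists of script strings) and the function names are hypothetical, not the dataset's actual pipeline.

```python
class UnionFind:
    """Disjoint-set structure over hashable items (here: locking scripts)."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        # Register unseen items as their own root.
        self.parent.setdefault(item, item)
        # Path halving keeps the trees shallow.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_scripts(transactions):
    """Map every script to an integer cluster alias.

    Assumption: only the common-input-ownership heuristic is applied.
    `transactions` is a hypothetical iterable of dicts with "inputs" and
    "outputs" lists of script strings.
    """
    uf = UnionFind()
    for tx in transactions:
        inputs = tx["inputs"]
        # Make sure every script is known, even if it is never merged.
        for script in inputs + tx["outputs"]:
            uf.find(script)
        # Merge all input scripts of the transaction into one cluster.
        for other in inputs[1:]:
            uf.union(inputs[0], other)

    aliases = {}  # cluster root -> integer alias
    result = {}   # script -> integer alias
    for script in list(uf.parent):
        root = uf.find(script)
        result[script] = aliases.setdefault(root, len(aliases))
    return result
```

Calling `cluster_scripts` on two transactions that share an input script places all of their input scripts in one cluster, while output scripts that are never co-spent keep their own aliases.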
Edge Creation
Edges represent value transfers between nodes, calculated using:
value_received = output_value - input_value
We excluded:
- CoinJoin transactions (privacy-focused)
- Colored coin transactions (non-Bitcoin assets)
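To make the value formula concrete, the following minimal sketch computes, for a single transaction, the net value each node receives as the sum of its outputs minus the sum of its inputs. The input layout (lists of `(script, value)` pairs in satoshi) and the `script_to_node` mapping are assumptions made for illustration. How net values are then turned into directed edges is not described here, so the sketch stops at the per-node balance.

```python
from collections import defaultdict


def net_value_received(tx_inputs, tx_outputs, script_to_node):
    """Return {node: output_value - input_value} for one transaction.

    tx_inputs and tx_outputs are hypothetical lists of (script, value) pairs,
    with values in satoshi. script_to_node maps each script to its cluster alias.
    """
    net = defaultdict(int)
    for script, value in tx_outputs:
        net[script_to_node[script]] += value  # value flowing into the node
    for script, value in tx_inputs:
        net[script_to_node[script]] -= value  # value flowing out of the node
    return dict(net)
```

Nodes with a positive balance are net receivers and nodes with a negative balance are net senders. The balances of one transaction sum to minus the mining fee, because the fee is the part of the inputs that no output claims.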
Attributes
Both nodes and edges include temporal attributes based on block indices. Key attributes:
| Node Attributes | Edge Attributes |
|---|---|
| Transaction counts | Transaction value |
| Degree metrics | Block index |
| Cluster properties | Direction |
Node Labeling
We identified and labeled entities across 10 categories:
- Individual
- Mining
- Exchange
- Marketplace
- Gambling
- Bet
- Faucet
- Mixer
- Ponzi
- Ransomware
Labeling Pipeline
Our multi-source approach included:
BitcoinTalk Forum:
- 14 million messages analyzed
- Addresses extracted from posts and profiles
ChatGPT Analysis:
- Entity identification from context
- Transaction pattern matching
Additional Sources:
- Exchange-provided addresses
- Ransomware datasets
- Government lists (SDN)
- Mining signatures
This produced 101,186 labeled addresses and 33,000 labeled nodes.
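The address extraction step from forum text could look roughly like the sketch below. It is not the authors' pipeline: the regular expression is a simplified assumption that matches legacy Base58 addresses (starting with 1 or 3) and Bech32 addresses (starting with bc1) by shape only, without verifying checksums, so a real pipeline would need an additional validation step.

```python
import re

# Simplified shape-based patterns (assumption, no checksum validation):
# - legacy Base58 addresses start with 1 or 3 and avoid the characters 0, O, I, l
# - Bech32 addresses start with bc1 and avoid the characters 1, b, i, o
ADDRESS_PATTERN = re.compile(
    r"\b(?:[13][1-9A-HJ-NP-Za-km-z]{25,34}|bc1[02-9ac-hj-np-z]{11,71})\b"
)


def extract_addresses(text):
    """Return candidate Bitcoin addresses found in text, deduplicated in order."""
    return list(dict.fromkeys(ADDRESS_PATTERN.findall(text)))
```

Each candidate would still have to be checked against the blockchain and linked to an entity, which is the role of the other labeling steps listed above.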
Data Records
The complete dataset includes:
BitcoinTalk Threads:
- 546,440 threads
- 14 million messages
- JSON format
Labeled Addresses:
- 101,186 addresses
- Entity type and source
- CSV format
Graph Database:
- PostgreSQL format
- 252 million nodes
- 785 million edges
Supplemental Files:
- Alternative source data
- Labeling documentation
Technical Validation
We validated dataset quality through node classification tasks using:
Graph Neural Networks:
- GCN
- GraphSAGE
- GAT
- GIN
Traditional Model:
- Gradient Boosting Classifier
Results:
| Model | Macro-F1 Score |
|---|---|
| GAT | 0.64 |
| GIN | 0.63 |
| GraphSAGE | 0.62 |
| GCN | 0.60 |
| GBC | 0.57 |
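All four graph neural networks outperform the gradient boosting baseline (0.60 to 0.64 versus 0.57 macro-F1), which suggests that the graph structure carries useful signal for entity classification.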
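For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so rare entity types count as much as frequent ones. The sketch below is a plain-Python illustration of that definition, not the evaluation code used for the reported results. The label strings in the usage note are hypothetical examples.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with true labels `["exchange", "exchange", "mining", "mixer"]` and predictions `["exchange", "mining", "mining", "mixer"]`, the per-class F1 scores are 2/3, 2/3, and 1, giving a macro-F1 of 7/9.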
Usage Notes
Database Restoration
The PostgreSQL database can be restored using pg_restore with recommended settings:
pg_restore -j 4 -Fd -O -U username -d dbname dataset
Potential Use Cases
- Entity interaction analysis
- Temporal network evolution
- Comparative network studies
- Pre-training for financial networks
FAQ
Q: How does this dataset compare to existing Bitcoin datasets?
A: Our dataset is significantly larger and more comprehensive than alternatives like Elliptic datasets, with detailed entity typing and temporal information.
Q: What computational resources are required?
A: The full database requires ~120 GB of storage. We recommend a server with at least 32 GB of RAM for efficient processing.
Q: Can this dataset be used for non-Bitcoin research?
A: Yes, the methodologies and graph structures are applicable to other transaction networks and financial systems.
Q: How frequently is the dataset updated?
A: Currently this represents a static snapshot, but the methodology could be applied to ongoing data collection.
Q: What license applies to this dataset?
A: The dataset is released under CC-BY 4.0, allowing both academic and commercial use with attribution.