Our datasets

We offer the data both in raw format (archives of random torrent files) and as pre-processed hashdb databases. Make sure to run "7z x" to extract the archives so the folder structure stays intact; some file systems have trouble with a large number of files in a single directory. Also note that hashdb removes duplicate entries for chunk hashes: you only get one InfoHash per chunk hash from the hashdb.
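The deduplication caveat is worth keeping in mind when interpreting lookup results. Conceptually, the database behaves like a map that keeps only the first InfoHash seen for each chunk hash (this sketch is an illustration of the behavior, not hashdb's actual implementation):

```python
# Conceptual model of hashdb's duplicate handling (illustrative only):
# importing the same chunk hash twice keeps only the first InfoHash.
db = {}

def import_entry(chunk_hash, info_hash):
    # setdefault only stores the value if the key is not present yet,
    # so later duplicates are silently dropped.
    db.setdefault(chunk_hash, info_hash)

import_entry("aa" * 32, "infohash-1")
import_entry("aa" * 32, "infohash-2")  # duplicate chunk hash, ignored
```

So a chunk that appears in many torrents will only ever point you at one of them.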

All data is licensed under GPL-2. If you use it for a paper, please cite us (BibTeX at the bottom).

Torrent Archive

4.68 million torrent files (SHA256): 122 GB compressed, 155 GB raw

Processing steps used:

  • Sort torrent files by chunk size using torrentSort0r.py
  • Remove all uncommon chunk sizes
  • Create hashdb databases using hashdb create -b CHUNKSIZE NAME
    • Example: hashdb create -b 16384 peekaTorrent_hashdb_InfoHash_16k.hdb
  • Populate hashdb using hash_extract0rINFOHash.py
    • Example: python hash_extract0rINFOHash.py 16384/ | hashdb import_tab peekaTorrent_hashdb_InfoHash_16k.hdb/ -
  • Use the data with bulk_extractor and its hashdb plugin during analysis
    • Example: TBD
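The helper scripts above are not reproduced here, but the first step, sorting torrents by chunk size, boils down to reading each file's piece length (and, for the databases, its InfoHash) from the bencoded metadata. A minimal sketch of that parsing (function names are ours, not from torrentSort0r.py):

```python
import hashlib

def bdecode(data, i=0):
    """Minimal bencode decoder; returns (value, next_index)."""
    c = data[i:i + 1]
    if c == b'i':                          # integer: i<digits>e
        end = data.index(b'e', i)
        return int(data[i + 1:end]), end + 1
    if c == b'l':                          # list: l<items>e
        i, out = i + 1, []
        while data[i:i + 1] != b'e':
            v, i = bdecode(data, i)
            out.append(v)
        return out, i + 1
    if c == b'd':                          # dict: d<key><value>...e
        i, out = i + 1, {}
        while data[i:i + 1] != b'e':
            k, i = bdecode(data, i)
            start = i
            v, i = bdecode(data, i)
            out[k] = v
            if k == b'info':
                # remember the raw byte span of the info dict; the
                # InfoHash is the SHA-1 over exactly these bytes
                out[b'_info_span'] = (start, i)
        return out, i + 1
    colon = data.index(b':', i)            # string: <length>:<bytes>
    n = int(data[i:colon])
    return data[colon + 1:colon + 1 + n], colon + 1 + n

def torrent_summary(raw):
    """Return (piece_length, infohash_hex) for raw .torrent bytes."""
    meta, _ = bdecode(raw)
    start, end = meta[b'_info_span']
    infohash = hashlib.sha1(raw[start:end]).hexdigest()
    return meta[b'info'][b'piece length'], infohash
```

Sorting the archive then just means bucketing each .torrent by the returned piece length (16384, 32768, ...), which matches the per-chunk-size databases listed below.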

hashdb Datasets with InfoHash as output

chunk size 16k (SHA256), 2.9 GB compressed, 6 GB raw

chunk size 32k (SHA256), 6.2 GB compressed, 13 GB raw

chunk size 64k (SHA256), 10 GB compressed, 21 GB raw

chunk size 128k (SHA256), 13 GB compressed, 27 GB raw

chunk size 256k, part 1 (SHA256), 40 GB compressed, 84 GB raw

chunk size 256k, part 2 (SHA256), 40 GB compressed, 84 GB raw

chunk size 512k (SHA256), 32 GB compressed, 66 GB raw

chunk size 1024k (SHA256), 26 GB compressed, 56 GB raw

chunk size 2048k (SHA256), 20 GB compressed, 41 GB raw

chunk size 4096k (SHA256), 26 GB compressed, 56 GB raw

chunk size 8192k (SHA256), 7 GB compressed, 15 GB raw

chunk size 16m (SHA256), 1.1 GB compressed, 2.3 GB raw