The keys I understand: t + 32-byte hash.
My problem is the values. I understand from sources such as "What are the keys used in the blockchain levelDB (ie what are the key:value pairs)?" that each value should encode three fields: dat file number, block offset, and tx offset within the block.
But I've noticed that the values have different sizes, between 5 and 10 bytes across the first thousand entries, so I'm not sure how to decode them into those three fields. Are those fields simply 3 varint values?
Here's my Plyvel code that prints out the lengths using plyvel==1.5.1, Bitcoin Core v26.0.0 on Ubuntu 23.10:
#!/usr/bin/env python3

import struct
import plyvel

def decode_varint(data):
    """
    https://github.com/alecalve/python-bitcoin-blockchain-parser/blob/c06f420995b345c9a193c8be6e0916eb70335863/blockchain_parser/utils.py#L41
    """
    assert(len(data) > 0)
    size = int(data[0])
    assert(size <= 255)

    if size < 253:
        return size, 1

    if size == 253:
        format_ = '<H'
    elif size == 254:
        format_ = '<I'
    elif size == 255:
        format_ = '<Q'
    else:
        # Should never be reached
        assert 0, "unknown format_ for size : %s" % size

    size = struct.calcsize(format_)
    return struct.unpack(format_, data[1:size+1])[0], size + 1

ldb = plyvel.DB('/home/ciro/snap/bitcoin-core/common/.bitcoin/indexes/txindex/', compression=None)
i = 0
for key, value in ldb:
    if key[0:1] == b't':
        txid = bytes(reversed(key[1:])).hex()
        print(i)
        print(txid)
        print(len(value))
        print(value.hex(' '))
        value = bytes(reversed(value))
        file, off = decode_varint(value)
        blk_off, off = decode_varint(value[off:])
        tx_off, off = decode_varint(value[off:])
        print((txid, file, blk_off, tx_off))
        print()
        i += 1
but it eventually blows up at:
131344
ec4de461b0dd1350b7596f95c0d7576aa825214d9af0e8c54de567ab0ce70800
7
42 ff c0 43 8b 94 35
Traceback (most recent call last):
  File "/home/ciro/bak/git/bitcoin-strings-with-txids/./tmp.py", line 39, in <module>
    blk_off, off = decode_varint(value[off:])
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ciro/bak/git/bitcoin-strings-with-txids/./tmp.py", line 29, in decode_varint
    return struct.unpack(format_, data[1:size+1])[0], size + 1
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
struct.error: unpack requires a buffer of 8 bytes
So I wonder if I guessed the format wrong, or if it's just a bug in my code.
Going by https://en.bitcoin.it/wiki/Protocol_documentation#Variable_length_integer I would decode:
42 ff c0 43 8b 94 35
manually as:
- 42
- ff: expect 8 bytes next
- c0 43 8b 94 35: only 5 bytes left, blowup
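The manual walk-through above can be reproduced in a few lines. This is only a sketch of the wiki-style variable length integer (a prefix byte of 0xfd, 0xfe or 0xff announces 2, 4 or 8 payload bytes); `compact_size_payload` is a hypothetical helper name, not part of plyvel or Bitcoin Core.

```python
def compact_size_payload(data: bytes, pos: int) -> int:
    """Payload bytes that follow the prefix byte at `pos` (wiki-style varint)."""
    return {0xFD: 2, 0xFE: 4, 0xFF: 8}.get(data[pos], 0)

value = bytes.fromhex("42 ff c0 43 8b 94 35")
pos = 0
for field in ("file", "blk_off", "tx_off"):
    need = compact_size_payload(value, pos)
    left = len(value) - pos - 1  # bytes remaining after the prefix byte
    if need > left:
        print(f"{field}: prefix {value[pos]:#04x} needs {need} bytes, only {left} left")
        break
    print(f"{field}: prefix {value[pos]:#04x}, {need} payload byte(s)")
    pos += 1 + need
```

Under this reading the second field already runs out of bytes, which suggests the values are not encoded this way.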
I also tried reversing the value:
value = bytes(reversed(value))
but then it blows up very early, so that is definitely wrong.
I also tried ignoring the error to see if there were others, but there were hundreds of them, so something is definitely wrong with my method.
Update
Given Antoine's answer, I've updated my code to:
#!/usr/bin/env python3

import struct
import plyvel

def decode_varint(data):
    i = 0
    ret = 0
    while True:
        b = data[i]
        ret += b & 0x7F
        if b & 0x80:
            b += 1
        else:
            return (ret, i + 1)
        ret <<= 7
        i += 1

ldb = plyvel.DB('/home/ciro/snap/bitcoin-core/common/.bitcoin/indexes/txindex/', compression=None)
i = 0
nerrs = 0
for key, value in ldb:
    if key[0:1] == b't':
        if i % 1000000 == 0:
            print(i//1000000)
        txid = bytes(reversed(key[1:])).hex()
        print(value.hex(' '))
        total_off = 0
        file, off = decode_varint(value)
        total_off += off
        blk_off, off = decode_varint(value[total_off:])
        total_off += off
        tx_off, off = decode_varint(value[total_off:])
        total_off += off
        assert total_off == len(value)
        print((txid, file, blk_off, tx_off))
        i += 1
print(f'nerrs: {nerrs}')
and all length asserts seem to pass which is promising. However, when I tried to verify the first few values:
93 1a 96 8c e7 74 a7 bb 0a
('d176d4960a78b41971f9d19207b59af6584b16ef323de55e983aec0100000000', 2458, 46347252, 646538)
93 1a aa d0 f0 72 86 90 01
('4d19a5ba6a5455a6d4d46d6999fb0a4a7b29e603fe229cd4f1a34d0300000000', 2458, 89405554, 100353)
93 1e d4 cf 18 82 be 76
('ba75d6ac075959632d9e012f0321c367a54e9f2bde2e2bebce6baa0300000000', 2462, 1386392, 40822)
93 18 a8 96 8e 5a 8a d3 14
('7df98bab9d89fc14d0dec365fa124b3e112017c6b8f56f64cf159e0600000000', 2456, 84248410, 174484)
I didn't find a match between:
bitcoin-core.cli getrawtransaction d176d4960a78b41971f9d19207b59af6584b16ef323de55e983aec0100000000
tail -c+$((46347252 + 646538 - 1)) ~/snap/bitcoin-core/common/.bitcoin/blocks/blk02458.dat | head -c 2000 | xxd -p | tr -d '\n'
bitcoin-core.cli getrawtransaction 4d19a5ba6a5455a6d4d46d6999fb0a4a7b29e603fe229cd4f1a34d0300000000
tail -c+$((89405554 + 100353 - 1)) ~/snap/bitcoin-core/common/.bitcoin/blocks/blk02458.dat | head -c 2000 | xxd -p | tr -d '\n'
bitcoin-core.cli getrawtransaction ba75d6ac075959632d9e012f0321c367a54e9f2bde2e2bebce6baa0300000000
tail -c+$((1386392 + 40822 - 1)) ~/snap/bitcoin-core/common/.bitcoin/blocks/blk02462.dat | head -c 2000 | xxd -p | tr -d '\n'
so either my parsing or my checks are wrong somehow.
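One detail that can skew checks like these: `tail -c +N` counts bytes from 1, so `tail -c +$((off - 1))` starts two bytes before the zero-based offset `off`. A rough way to sidestep that arithmetic is to seek in Python. This is only a sketch: `read_at` is a hypothetical helper, and the demo runs on a throwaway file rather than a real blk*.dat.

```python
import os
import tempfile

def read_at(path, offset, length):
    """Return `length` bytes starting at zero-based `offset` in `path`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Demo on a small temporary file holding the bytes 00..0f.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(16)))
demo = read_at(tmp.name, 10, 4)
os.unlink(tmp.name)
print(demo.hex(" "))  # 0a 0b 0c 0d
```

For a real check, the same call would take a blk*.dat path, the candidate offset, and a length such as 2000.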
I searched the block files for the bytes that getrawtransaction gave me for d176d4960a78b41971f9d19207b59af6584b16ef323de55e983aec0100000000 using bgrep (https://unix.stackexchange.com/questions/223078/best-way-to-grep-a-big-binary-file/758528#758528) and found them at:
~/snap/bitcoin-core/common/.bitcoin/blocks/blk02586.dat: 02ed92ce
and offset 0x02ed92ce = 49124046, so these would be the expected correct values for this one.
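For comparison, here is a sketch of a decoder that follows the VarInt loop in Bitcoin Core's serialize.h: the accumulator is shifted and OR-ed with the low 7 bits, and when the continuation bit is set it is the accumulator, not the byte, that gets incremented. It also assumes that the tx offset is counted from the end of the 80-byte block header. The names `read_varint`, `decode_tx_pos` and `BLOCK_HEADER_SIZE` are illustrative only.

```python
def read_varint(data, pos=0):
    """Decode one Bitcoin Core style VarInt; return (value, next_pos)."""
    n = 0
    while True:
        b = data[pos]
        pos += 1
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:
            n += 1  # increment the accumulator, not the byte just read
        else:
            return n, pos

def decode_tx_pos(value):
    """Split a txindex value into (file, blk_off, tx_off)."""
    file, pos = read_varint(value)
    blk_off, pos = read_varint(value, pos)
    tx_off, pos = read_varint(value, pos)
    assert pos == len(value)
    return file, blk_off, tx_off

BLOCK_HEADER_SIZE = 80  # assumption: tx_off is relative to the end of the header

value = bytes.fromhex("93 1a 96 8c e7 74 a7 bb 0a")
file, blk_off, tx_off = decode_tx_pos(value)
print(file, blk_off, tx_off)                 # 2586 48460916 663050
print(blk_off + BLOCK_HEADER_SIZE + tx_off)  # 49124046, i.e. 0x02ed92ce
```

Under those assumptions the first value decodes to blk02586.dat and a tx position of 49124046, which agrees with what bgrep found.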