Below is my logic for parsing transactions from Bitcoin blk files:

  1. Read the version as a uint32.
  2. Read the next 2 bytes. If they are 0x00, 0x01, the transaction is SegWit.
  3. Otherwise, read a varint at that position to get the input count.
  4. If the input count is 0, read one byte and check whether it is 0x01.
  5. If it is not 0x01, error out, as I do not know what kind of transaction has 0 inputs.
  6. Otherwise, read a varint to get the input count.
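
A minimal Python sketch of the steps above, assuming the stream is positioned at the start of a transaction. The names `read_compact_size` and `read_tx_prefix` are illustrative and not taken from any existing parser. A zero in the input-count position is treated as the SegWit marker byte.

```python
import io
import struct


def read_compact_size(stream):
    """Read a Bitcoin compact-size unsigned integer (the "varint")."""
    first = stream.read(1)[0]
    if first < 0xFD:
        return first
    if first == 0xFD:
        return struct.unpack("<H", stream.read(2))[0]
    if first == 0xFE:
        return struct.unpack("<I", stream.read(4))[0]
    return struct.unpack("<Q", stream.read(8))[0]


def read_tx_prefix(stream):
    """Return (version, is_segwit, input_count), leaving the stream at the first input."""
    version = struct.unpack("<I", stream.read(4))[0]
    count = read_compact_size(stream)
    is_segwit = False
    if count == 0:
        # A zero here is the SegWit marker, not a real input count.
        flag = stream.read(1)[0]
        if flag != 0x01:
            raise ValueError(f"unexpected SegWit flag 0x{flag:02x}")
        is_segwit = True
        count = read_compact_size(stream)
    return version, is_segwit, count


# Hand-written prefixes for illustration (not taken from the block in question).
legacy_prefix = io.BytesIO(bytes.fromhex("01000000" "01"))
segwit_prefix = io.BytesIO(bytes.fromhex("02000000" "0001" "01"))
print(read_tx_prefix(legacy_prefix))
print(read_tx_prefix(segwit_prefix))
```

This only covers the start of a transaction; it does not read the witness data that follows the outputs of a SegWit transaction.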

I know that the blk?????.dat files are not in any particular order, but I did not hit the error in step 5 while parsing blocks in the first 1000 files.

The above logic seems flawed, as I am not able to read the block successfully. Can someone who has done this explain what I am missing?

One more question: is there a way to identify whether a block in a blk file is orphaned and should be ignored?

Raw data of block - link. It has ~1400 transactions

Antoni
Ankit
  • 1
    A transaction cannot have no inputs. Can you post the specific transaction data that is causing the error? Segwit activated at block 477,120, so it's almost certainly not that. – meshcollider Dec 30 '21 at 01:45
  • 1
    Some clarification about terms used for blocks by Pieter Wuille: https://bitcoin.stackexchange.com/a/5869/ –  Dec 30 '21 at 09:53
  • 1
    The concept of orphan blocks, as it existed in pre-2013 Bitcoin Core (0.9 and below), just doesn't exist anymore: blocks without known parent are just never downloaded. So if someone is talking about orphan blocks, they're almost certainly referring to what I called extinct blocks in my answer there, and yes, they are stored. I maintain my preference that these things shouldn't be called orphan blocks, because there is no orphaning going on, but discussing semantics of words isn't going to lead anywhere. – Pieter Wuille Dec 30 '21 at 16:52
  • @PieterWuille, does this mean that blk files can contain blocks that have to be ignored, and if so, what is the logic for identifying them? Also, if that is the case, what is the purpose of the rev files? – Ankit Dec 30 '21 at 17:09
  • 1
    Yes, the blk.dat files may contain blocks that are not part of the current chain. This is not self-apparent; which blocks do belong to the main chain is stored in the block index database, not the block files themselves. The rev.dat files are needed to "undo" the effect of a block, in case a reorganization happens. – Pieter Wuille Dec 30 '21 at 17:13
  • Quick question: while parsing blk*.dat files, do I need to look into the rev files as well? None of the open source parsers I found look at rev files. – Ankit Dec 30 '21 at 17:17
  • I have added the raw block data. It has a lot of transactions, and I get an error while parsing the second one, as the input count is 0 and the SegWit flag is also false. – Ankit Dec 30 '21 at 17:18
  • 1
    Whether you need to look into the rev*.dat files depends on what information you're interested in; the answer is quite possibly "no". It's hard to say what you're doing wrong when trying to parse files without seeing the code you're using... – Pieter Wuille Dec 30 '21 at 17:26
  • 1
    @PieterWuille This answer from 2018 states the difference between stale and orphan blocks: https://bitcoin.stackexchange.com/a/73193/ If what I call an orphan block is not stored by a node, the term would certainly matter, and there are a lot of other places that mention stale vs orphan. Recently the wrong term was used by Bitmex, which led to a lot of controversy as well. –  Dec 30 '21 at 18:53
  • My goal is to pull all data from the BTC blockchain into an RDBMS so I can run analytics over it. The best way to do that is to read the raw files. An example of such analytics: calculating address balances. – Ankit Dec 30 '21 at 19:51
  • @Prayank let's take the discussion about orphan block terminology elsewhere, but I believe stale blocks are the same as what Pieter calls "extinct". – meshcollider Dec 31 '21 at 06:17
  • @Ankit You're going to need to explain the exact algorithm (or better, share the code you're using) for us to be able to help you. This is too high level to see where the error could be. – Pieter Wuille Jan 02 '22 at 16:01
  • The block you have provided is for block 485735 in the main chain. It is completely reasonable that you would not run into any segwit parsing issues in the first 1000 blk.dat files as those would likely contain blocks prior to segwit's activation. I would guess that you have some mistake in the parsing of the actual witness data of segwit transactions that results in your code reading the wrong data for subsequent transactions. But without being able to see your code or the exact algorithm, I cannot say for sure. – Ava Chow Jan 02 '22 at 19:29

3 Answers

2

It is probable that the issue you are facing is the result of incorrectly parsing, or failing to parse, the witness data. In a segwit transaction, the witness data for each input is included after the outputs and before the lock time. The witness for an input is an array of byte arrays: a compact size unsigned int gives the number of items, and each item is in turn prefixed with its own compact size length. There is one witness for each input, and the witness section as a whole has no length prefix.
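
As a rough sketch of that layout in Python, the witness section could be read like this once the outputs have been consumed. The helper names `read_compact_size` and `read_witness` are hypothetical, and the code assumes the caller already knows that the transaction is segwit and how many inputs it has.

```python
import io


def read_compact_size(stream):
    """Read a compact size unsigned int (1, 3, 5 or 9 bytes)."""
    first = stream.read(1)[0]
    if first < 0xFD:
        return first
    width = {0xFD: 2, 0xFE: 4, 0xFF: 8}[first]
    return int.from_bytes(stream.read(width), "little")


def read_witness(stream, input_count):
    """Read one witness per input; returns a list (per input) of lists of bytes."""
    witnesses = []
    for _ in range(input_count):
        item_count = read_compact_size(stream)
        items = []
        for _ in range(item_count):
            length = read_compact_size(stream)
            items.append(stream.read(length))
        witnesses.append(items)
    return witnesses


# A coinbase-style witness (one item of 32 zero bytes) followed by a zero lock time.
tail = io.BytesIO(bytes.fromhex("01" "20" + "00" * 32 + "00000000"))
witness = read_witness(tail, input_count=1)
lock_time = int.from_bytes(tail.read(4), "little")
print(len(witness[0][0]), lock_time)
```

Note that the loop runs exactly `input_count` times because there is no separate count for the witness section.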

Given that you hit a parsing issue on the second transaction of the given block, I am inclined to believe that the root cause is simply that the witness data is not parsed at all. The coinbase transaction of a block containing segwit transactions must include a witness consisting of a single 32-byte array of zeros for the coinbase input. So it is likely that you are mistakenly parsing these zeros as fields of the next transaction. Given how many of them there are, that would explain why you see 0 inputs for the second transaction, and also a byte that is not 0x01 where the segwit flag should be.
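
To illustrate that failure mode, here is a small Python sketch using synthetic bytes (not the block from the question). It assumes a parser that goes straight from the outputs to the lock time, skipping the witness; all variable names are made up for the example.

```python
import io

# Bytes that follow the outputs of a segwit coinbase transaction:
# witness (1 item, 0x20 = 32 zero bytes), then the 4-byte lock time.
coinbase_tail = bytes.fromhex("01" "20" + "00" * 32 + "00000000")
# Start of a hypothetical next transaction: version 1, one input.
next_tx_head = bytes.fromhex("01000000" "01")

stream = io.BytesIO(coinbase_tail + next_tx_head)

# A parser that forgets the witness takes the first 4 bytes as the lock time...
wrong_lock_time = int.from_bytes(stream.read(4), "little")
# ...and then starts "the next transaction" in the middle of the witness zeros.
wrong_version = int.from_bytes(stream.read(4), "little")
wrong_input_count = stream.read(1)[0]
wrong_flag = stream.read(1)[0]

print(hex(wrong_lock_time), wrong_version, wrong_input_count, wrong_flag)
```

The misaligned reads yield an input count of 0 and a flag byte of 0x00, which matches the symptom described in the question.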

Ava Chow
2

After reading the comments on your question, I think you are not parsing the witness data, and are starting to parse a new transaction when you are actually still reading witness data.

In a SegWit transaction, the varint that tells you how many inputs there are also tells you how many pieces of witness data there are: one for each input. Each piece of witness data starts with a varint giving the number of witness elements for that input. P2PKH or other non-SegWit inputs in a SegWit transaction simply have a 00 there to indicate that there are no witness elements for that input. (They use the scriptSig for unlocking.)
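
A minimal Python sketch of a whole-transaction parser that follows this layout, assuming the stream starts at a transaction boundary. The function names and the dictionary keys are hypothetical, and the example transaction is hand-built: it is not valid on any chain and only exercises the byte layout, including a non-SegWit input with a 00 witness.

```python
import io
import struct


def read_compact_size(s):
    """Read a compact-size unsigned integer (the "varint")."""
    first = s.read(1)[0]
    if first < 0xFD:
        return first
    width = {0xFD: 2, 0xFE: 4, 0xFF: 8}[first]
    return int.from_bytes(s.read(width), "little")


def parse_transaction(s):
    """Parse one transaction and leave the stream at the start of the next one."""
    version = struct.unpack("<I", s.read(4))[0]
    input_count = read_compact_size(s)
    segwit = False
    if input_count == 0:
        # Marker byte 0x00 must be followed by flag byte 0x01.
        if s.read(1) != b"\x01":
            raise ValueError("zero inputs but no SegWit flag")
        segwit = True
        input_count = read_compact_size(s)

    inputs = []
    for _ in range(input_count):
        prev_txid = s.read(32)
        prev_index = struct.unpack("<I", s.read(4))[0]
        script_sig = s.read(read_compact_size(s))
        sequence = struct.unpack("<I", s.read(4))[0]
        inputs.append((prev_txid, prev_index, script_sig, sequence))

    outputs = []
    for _ in range(read_compact_size(s)):
        value = struct.unpack("<q", s.read(8))[0]
        script_pubkey = s.read(read_compact_size(s))
        outputs.append((value, script_pubkey))

    witnesses = []
    if segwit:
        # One witness per input; the input count doubles as the witness count.
        for _ in range(input_count):
            item_count = read_compact_size(s)
            items = []
            for _item in range(item_count):
                items.append(s.read(read_compact_size(s)))
            witnesses.append(items)

    lock_time = struct.unpack("<I", s.read(4))[0]
    return {"version": version, "segwit": segwit, "inputs": inputs,
            "outputs": outputs, "witnesses": witnesses, "lock_time": lock_time}


raw = bytes.fromhex(
    "02000000" "0001" "02"                               # version, marker+flag, 2 inputs
    + "11" * 32 + "00000000" "01" "51" "ffffffff"        # input 0: uses a scriptSig
    + "22" * 32 + "01000000" "00" "ffffffff"             # input 1: empty scriptSig
    + "01" "e803000000000000" "01" "51"                  # 1 output
    + "00"                                               # witness for input 0: no elements
    + "02" "02" "aabb" "01" "cc"                         # witness for input 1: 2 elements
    + "00000000"                                         # lock time
)
stream = io.BytesIO(raw)
tx = parse_transaction(stream)
print(tx["witnesses"])
```

After the call, `stream` is positioned right after the lock time, so the next call would start at the following transaction.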

If this didn't help you, please post your code. Without it, I don't think anybody can help you any further.

Antoni
1

Some time ago I wrote a script that supports parsing witness data from the raw database. As an example, I parsed the file blk01896.dat (from my local database) with my blockchain parser. The first block in that file is 0000000000000000000D9C6917E3E865812D419648349D85C40C5A8302842D79.

The first transaction in that block, BC1A08C90D142E4B475E844D16CDD796C2CFEAD15DC2D4C666131EB308CB739A, has the following parsing results:

transactionVersionNumber = 00000002
Witness activated >>
Inputs count = 01
TX from hash = 0000000000000000000000000000000000000000000000000000000000000000
N output = FFFFFFFF
Input script = 032D470904E181F45D455530322F4254432E434F4D2FFABE6D6D9832F682523CD8FEA4AF9A980D6CAA10B5847FD6CC8D8BE3C9BB5EBB89547F9908000000CEEED33D2425EE84AC19040000000000
sequenceNumber = FFFFFFFF
Outputs count = 3
Value = 000000004B116E87
Output script = 001497CFC76442FE717F2A3F0CC9C175F7561B661997
Value = 0000000000000000
Output script = 6A24AA21A9EDB6B23A8ABBC43D9D9F3005646F9C8FF7307722698A360142E6C40D593FF1931C
Value = 0000000000000000
Output script = 6A2952534B424C4F434B3AF7AD9B2A11A8CE87CF52B80DAF26B1F1AD45FE3CF1BAF3DB9A457874001CADBE
Witness 0 0 32 0000000000000000000000000000000000000000000000000000000000000000
Lock time = 00000000
TX hash = BC1A08C90D142E4B475E844D16CDD796C2CFEAD15DC2D4C666131EB308CB739A

The second transaction EDDCB4AAD8DDC54E2A1B6FCD216CBF6C3BA2D4879AEE3570A0289C499BD8EF1B:

transactionVersionNumber = 00000001
Inputs count = 01
TX from hash = 428AB83113FD33C26020635FC0DCC1B275E6E18DC1E3B4AB8D81610155149968
N output = 000000C2
Input script = 473044022047118E23692DAD799096751526BE098B81328A1763E26FE0AB56CC88691699D002205BF4469E44C975B5DE3971243E5B4925CF5C9A658B049AEA9993202232B88914014104FCF07BB1222F7925F2B7CC15183A40443C578E62EA17100AA3B44BA66905C95D4980AEC4CD2F6EB426D1B1EC45D76724F26901099416B9265B76BA67C8B0B73D
sequenceNumber = FFFFFFFF
Outputs count = 3
Value = 000000000000088E
Output script = 76A914FA0692278AFE508514B5FFEE8FE5E97732CE066988AC
Value = 0000000000000000
Output script = 6A146F6D6E69000000000000000300000000000000E9
Value = 000000000000021C
Output script = A914DE9F32C7F3CBFDCCDF6BF98F596F9B3F830DA56987
Lock time = 00000000
TX hash = EDDCB4AAD8DDC54E2A1B6FCD216CBF6C3BA2D4879AEE3570A0289C499BD8EF1B

The fifth transaction B0E255196D16F244BDF552100C052F84DF304BF405CA2FCCB1FF55CB400A8CB2:

transactionVersionNumber = 00000002
Witness activated >>
Inputs count = 01
TX from hash = C453DD9343B17E9D728F46EA550831334AD342B5AEA2FDBD1E57CF73A97EC280
N output = 00000001
Input script = 160014369C1A58C231BA8311E26B8DDE1624138F68E980
sequenceNumber = FEFFFFFF
Outputs count = 2
Value = 000000000005E498
Output script = 76A914D5AA6FF082BED43D64D890B9AEFFE67C872D0AFE88AC
Value = 00000000046F8F55
Output script = A914FC249B6158D6CC6A6AD1FB3EE71747B86DF76D1187
Witness 0 0 71 012231AA44DEAA9B391EBC81A7419B93CBB0F50292AF9445D0A1AFB114FB5E1927200232BAF193AF7273F3EE1EFDCA873D3BB524E11B39259AAFCBA7B03E2A30DCC76820024430
Witness 0 1 33 FD1F21A7C45348728989BF8CE4C9CF1B17496DD41B834297D1AA53A3E22CF85303
Lock time = 0009472C
TX hash = B0E255196D16F244BDF552100C052F84DF304BF405CA2FCCB1FF55CB400A8CB2

The first and fifth transactions have witness data; the second does not. That is the difference. Good luck!

P.S. You can check the section of the parsing script that handles transactions to see the differences (which bytes signal witness data and where the witness fields are).

Denis Leonov