Dec 20, 2015

What goes into an IPFS multihash (in case of a file)

I started fooling around with a simple python multihash library, and I wanted to verify that it produced the same multihash as the go-ipfs reference implementation.

I created a hashme.txt with the following content: Hash me!
Then I ran “ipfs add” on it to get a multihash:

> ipfs add hashme.txt
added QmNdTrvTHNM4tXjeoX1HD554eMTweTN91hf2ih6KKwXppo hashme.txt

I assumed that this hash would match the return value from the python multihash library when calling it like this:

>>> someMultihashLib.gen_hash(SHA_256, 'Hash me!')

It did not. My assumption was quite naive.
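For reference, here is a minimal sketch of what that naive computation amounts to, written for Python 3 with only the standard library. The helper names (b58encode, sha256_multihash) are mine and not part of any multihash library; the sketch assumes the usual multihash layout of one byte for the hash function code (0x12 for SHA-256), one byte for the digest length, then the digest, all base58-encoded.

```python
import hashlib

# Bitcoin-style base58 alphabet, which IPFS uses for these hashes.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Base58-encode a byte string (hypothetical helper)."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    # Each leading zero byte is represented by a leading '1'.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def sha256_multihash(data: bytes) -> str:
    """SHA-256 multihash: <0x12><digest length><digest>, base58-encoded."""
    digest = hashlib.sha256(data).digest()
    return b58encode(bytes([0x12, len(digest)]) + digest)


# Hash of the bare file content, with no wrapping of any kind.
naive = sha256_multihash(b"Hash me!")
print(naive)
```

The printed value starts with Qm like every SHA-256 multihash, but it covers only the eight content bytes, so it is not the hash that “ipfs add” reported.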

I wanted to see what the block (referenced by the QmNd… hash) actually contained, so I used ipfs block get to get the raw content of the block into a file:

> ipfs block get QmNdTrvTHNM4tXjeoX1HD554eMTweTN91hf2ih6KKwXppo > block.raw

Then I checked the content of this file from python:

>>> open('block.raw').read()
'\n\x0e\x08\x02\x12\x08Hash me!\x18\x08'

Okay, the “Hash me!” text is there as expected, but there is also some other stuff.

At this point, after some light code browsing, I found merkledag.proto, which I downloaded. So IPFS uses protobuf for serialization. I got myself a protoc executable and put it on my path. This is actually my first time using protobuf.

I assumed that the encapsulation could be a PBNode type from the previously found merkledag.proto file (it being a PBLink, which references something else, wouldn’t make much sense, and this .proto file only contains these 2 types), so here is my first decoding attempt of block.raw:

> protoc --decode=PBNode merkledag.proto < block.raw
Data: "\010\002\022\010Hash me!\030\010"

It’s… um… a partial success. It could make sense. Instances of the PBNode type can only contain some number of Links (of type PBLink) and Data (of type bytes), and this one only contains data, but it is still encapsulated in something.

I tried the --decode_raw switch, and it seems that protoc has decided by some heuristic that the bytes of Data are not just plain bytes, but something that it “understands”. The outer “1” key identifies PBNode’s Data. The inner 1, 2, 3 key-value pairs are a mystery for now.

> protoc --decode_raw < block.raw
1 {
  1: 2
  2: "Hash me!"
  3: 8
}
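The same raw decoding can be reproduced by hand, which makes the key numbers less mysterious. Below is a rough Python 3 sketch of a protobuf wire-format reader that handles only the two wire types present in this block (0 = varint, 2 = length-delimited). The function names are my own, and this is in no way a replacement for protoc.

```python
def read_varint(buf, pos):
    """Read a base-128 varint starting at pos; return (value, new position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def decode_fields(buf):
    """Split a serialized message into (field number, value) pairs."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited: bytes, string or nested message
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type %d is not handled in this sketch" % wire_type)
        fields.append((field_number, value))
    return fields


# The content of block.raw, as printed earlier.
block = b"\n\x0e\x08\x02\x12\x08Hash me!\x18\x08"

outer = decode_fields(block)        # [(1, b'\x08\x02\x12\x08Hash me!\x18\x08')]
inner = decode_fields(outer[0][1])  # [(1, 2), (2, b'Hash me!'), (3, 8)]
print(outer)
print(inner)
```

The leading '\n' is byte 0x0a, which is field 1 with wire type 2, and the '\x0e' after it is the length of the nested payload (14 bytes).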

So I assumed the content of the Data field is also serialized with protobuf, but some other .proto file. I searched the go-ipfs codebase for .proto files, and the one that looked most promising was unixfs.proto.

It makes perfect sense since I actually added a file. This .proto file contains a Data type, which has the following fields:

  • Type (key #1; the value “2” stands for File)
  • Data (key #2; yup, it’s “Hash me!”)
  • filesize (key #3; and yes, the “Hash me!” string is actually 8 bytes long!)
  • and some other fields that are irrelevant for this post…

So, to summarize: when calling the “ipfs add” command, the data is first wrapped in the Data type from unixfs.proto, then the result is wrapped in a PBNode from merkledag.proto.
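To check this summary end to end, the two layers can be rebuilt by hand and compared against the hash from “ipfs add”. This is only a sketch in Python 3 under a few assumptions: the file is small enough to fit in a single block, the hash is a SHA-256 multihash (0x12, 0x20, then the digest), and the field numbers are the ones seen in the raw decoding. All helper names are made up for this example.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def encode_varint(n):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_field(number, value):
    """Serialize a field with wire type 0 (varint)."""
    return encode_varint((number << 3) | 0) + encode_varint(value)


def bytes_field(number, payload):
    """Serialize a field with wire type 2 (length-delimited)."""
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload


def b58decode(text):
    """Decode a base58 string into bytes (hypothetical helper)."""
    n = 0
    for ch in text:
        n = n * 58 + B58_ALPHABET.index(ch)
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    pad = len(text) - len(text.lstrip("1"))  # leading '1's are zero bytes
    return b"\x00" * pad + body


content = b"Hash me!"

# Inner layer: unixfs Data with Type = 2 (File), Data = content, filesize = 8.
unixfs_data = (varint_field(1, 2)
               + bytes_field(2, content)
               + varint_field(3, len(content)))

# Outer layer: PBNode whose Data field (key #1) holds the unixfs message.
block = bytes_field(1, unixfs_data)
print(repr(block))

# Split the multihash reported by "ipfs add" into its parts.
multihash = b58decode("QmNdTrvTHNM4tXjeoX1HD554eMTweTN91hf2ih6KKwXppo")
function_code, digest_length, digest = multihash[0], multihash[1], multihash[2:]

# If the summary above is right, this comparison should print True.
print(hashlib.sha256(block).digest() == digest)
```

The rebuilt block is byte-for-byte the content of block.raw, so the multihash is computed over the wrapped bytes rather than over the file content alone.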

In hindsight, I should have used “ipfs block put” to verify the python multihash implementation. The output of “ipfs block put hashme.txt” matches the return value from the python library. Still, this was an interesting exploration.

Update: this post contains some inaccuracies and/or omissions, such as: protoc does not use heuristics, and data may be chunked when it is longer.