How to reverse engineer a proprietary data file format (e.g. Smartboard Notebook)?

Question

How should I begin trying to reverse engineer this file format? The only thing I can think of is saving a simple file and then digging in with a hex editor. But since the file format may be some kind of archive, that seems like the wrong approach. I've always been a little interested in the idea of reverse-engineering a file format, but I have never actually attempted it. How should I begin?

In particular, I am interested in Smart Notebook, which loads and saves data in .notebook files. This is an undocumented proprietary file format. SMART is the leading manufacturer of whiteboards, and the format used by its Notebook software is therefore one of the most popular for educational (presentation) content. There is an open standard for whiteboard files, and Open Sankore is an open source program that can open and save them. However, Smart Notebook is not fully compatible with the open whiteboard format, so I would really like to understand the .notebook file format so that I can write software that makes use of it. The open standard's files (.iwb) are zip archives that contain images and SVG data. It occurs to me that .notebook files may also be compressed, or at least contain a number of sub-files (like images and .swf files).

Is it reasonable to believe that a directory structure might be embedded in the .notebook files? — zetavolt, Mar 25 '13 at 22:03
Here is a site with Smart Board files for reference: http://www.jmeacham.com/smart.board.htm — cb88, Mar 25 '13 at 22:24
I don't see any built-in support for .notebook files. Just pdf, iwb, images and ubz, I think it was. If you know of a plugin then perhaps you should list it; otherwise it looks like Sankore does not support .notebook at all. — cb88, Mar 25 '13 at 22:42
@cb88 Sankore does not support .notebook files; as far as I know there is no software that can read .notebook besides Smart Notebook. I feel like .notebook files are the MS Word .doc files of interactive white boards because Smart is the leading software vendor in this space. That's why I want to reverse engineer the format. — Thorn, Mar 25 '13 at 22:58
@zv_ I think it is reasonable to expect some directory structure or at least a way for a notebook file to contain other files. When content is inserted into a notebook file (pictures, audio, Adobe .swf) these become embedded into the page and part of the file. — Thorn, Mar 25 '13 at 23:00
@Thorn I see, I misunderstood what you meant initially by referring to Sankore and then saying "it can open and save them." I thought you were referring to the .notebook files. — cb88, Mar 26 '13 at 13:14
This could be a really useful, top-level question if it were made more abstract: about reversing file formats in general. An answer could then cover common techniques for doing that, such as writing Python (or other) scripts, using advanced hex editors like 010 Editor, fuzzy searching, and binary pattern matching, as well as various statistical tools such as Cantor Dust (https://sites.google.com/site/xxcantorxdustxx/), which is still a prototype. — Anton Kochkov, Mar 26 '13 at 20:38
@Thorn: Did you get the information about the xbk file? Did you get a specification document for it, or did you decode it manually? — , Jul 06 '13 at 06:59
Neither. The format is not officially documented by Smart technologies, but the format is really just a zip file. Looking at some examples was enough to get the gist and since XML is readable, I'm able to save some simple files to better understand the format. The graphics are stored as SVG. — Thorn, Jul 08 '13 at 05:21

score 21 · Answer 1 · answered Mar 25 '13 at 22:56

I downloaded abc chant.notebook from the site cb88 linked to:

$ file "abc chant.notebook"
abc chant.notebook: Zip archive data, at least v2.0 to extract
$ unzip -t "abc chant.notebook" 
Archive:  abc chant.notebook
    testing: images/temp(1).png       OK
    ... about 200 similar lines ...
    testing: attachments/Zachary.JPG   OK
No errors detected in compressed data of abc chant.notebook.
$

It's a valid zip file containing mostly XML and image files. Are the .notebook files you were referring to different from this file? If so, could you upload a sample?
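The same check can be scripted, which helps when there are many sample files to survey. Below is a minimal Python sketch using only the standard zipfile module; the function name summarize_container is made up for illustration, and it assumes nothing about .notebook beyond it being a zip archive.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_container(path):
    """Return (is_zip, Counter of member file extensions) for `path`.

    Hypothetical helper for illustration: it only reports what kinds of
    files an archive holds, it does not interpret them.
    """
    if not zipfile.is_zipfile(path):
        return False, Counter()
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member,
        # or None if every member passes its CRC check.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad_member}")
        extensions = Counter(
            PurePosixPath(name).suffix.lower() or "(none)"
            for name in archive.namelist()
            if not name.endswith("/")  # skip directory entries
        )
    return True, extensions
```

Calling summarize_container("abc chant.notebook") on the file above should report it as a zip and give a count per extension, which is a quick way to see whether XML and images dominate.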

I know this doesn't really go into the process behind reversing a proprietary file format, for which I apologise. Hopefully someone else can provide a more interesting answer in this respect.

Wow - this is much easier than I thought! I must have checked an earlier version of the notebook files: xbk. Those are not zip files, but .notebook is! Somehow I thought I had checked this already and didn't come back to it. — Thorn, Mar 25 '13 at 23:10

score 19 · Accepted Answer · answered Mar 25 '13 at 22:57

Well, obviously the details will depend very much on the particular file format and on what you expect to achieve. However, some steps will largely be the same. Here are some things you could do:

1. Try hard to find all kinds of clues about the format. This can be a small note on some bulletin board or the cached copy of some years-old website that has since vanished. Often the gems won't pop up as top search results when you are looking for something specific enough, so wading through pages of search results can make sense. Also make sure to use tools such as file, which look for magic bytes and can identify things not obvious to the naked eye.
2. Find a proprietary program that uses the format and is able to read/write it (you seem to have that).
   1. Use a trial-and-error technique, such as making distinct changes to the document, saving them, and observing and noting down the differences. AFAIK this is how the MS Office file formats were initially decoded for StarOffice (now OOo and LibreOffice).
   2. Reverse engineer the program itself to find the core routines that read and write the data format.
3. Find an open source program that reads/writes the format in the same way, and read its source.
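To illustrate what tools such as file do when they look for magic bytes, here is a small Python sketch. The signature values are the well-known ones for these formats; the table is deliberately tiny and the names MAGIC_SIGNATURES, identify_magic and identify_file are made up for this example. The real file utility uses a far larger database and more elaborate rules.

```python
# Well-known leading byte signatures; a tiny, illustrative subset only.
MAGIC_SIGNATURES = [
    (b"PK\x03\x04", "ZIP archive (also used by many container formats)"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"%PDF-", "PDF document"),
    (b"\x1f\x8b", "gzip compressed data"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy .doc)"),
]


def identify_magic(data: bytes) -> str:
    """Return a label for the first signature that `data` starts with."""
    for signature, label in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read just enough of the file to compare against the table above."""
    with open(path, "rb") as handle:
        return identify_magic(handle.read(16))
```

An "unknown" result is still useful: it tells you the format is not one of the obvious containers, so the hex editor and the diffing approach become the next step.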

If you understand the language in which the program from option 3 is written, no problem at all. If you don't have such a program, or if you are faced with other challenges, then you have to resort to the good old technique outlined in point 2, patching gaps with pieces you gather with method 1.

Point 2.1 should be obvious: you want to find out how cursive (italic) text is encoded? Type some text, format it, save, and observe the change. Lather, rinse, repeat.
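The "save, change one thing, save again, compare" loop can be partly automated. The following Python sketch reports the byte ranges in which two saved versions differ; differing_ranges is a made-up name and this is only one simple way to do it. It assumes the data is stored uncompressed and that edits overwrite bytes in place: an insertion shifts everything after it, and compression scrambles offsets, so for a zip-based format you would extract the members first and compare those.

```python
def differing_ranges(before: bytes, after: bytes):
    """Return (start, end) offset pairs, end exclusive, where the inputs differ.

    A length difference is reported as a range covering the extra tail.
    """
    ranges = []
    start = None  # start offset of the differing run currently being tracked
    common = min(len(before), len(after))
    for offset in range(common):
        if before[offset] != after[offset]:
            if start is None:
                start = offset
        elif start is not None:
            ranges.append((start, offset))
            start = None
    longest = max(len(before), len(after))
    if start is not None:
        # A run was still open at the end of the shared prefix; extend it
        # over any extra tail so adjacent ranges are merged.
        ranges.append((start, longest))
    elif longest > common:
        ranges.append((common, longest))
    return ranges


if __name__ == "__main__":
    # Toy data standing in for two saves of the same document.
    plain = b"HDR\x00\x00text:hello"
    styled = b"HDR\x00\x01text:hello"
    for start, end in differing_ranges(plain, styled):
        print(f"bytes {start}..{end - 1} changed: "
              f"{plain[start:end].hex()} -> {styled[start:end].hex()}")
```

Keeping a log of which edit in the program produced which changed range is what slowly turns into a description of the format.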

Point 2.2 will take a lot more effort and should probably be used sparingly, mainly to make sure you got the details from 2.1 right.
