Binary Data Definition Language

To parse a file, the system needs to know the file's internal structure. XML uses XML Schema and DTD for this purpose; the Miraplacid Binary DOM library uses Binary Data Definition Language (BDDL) instead. The language was designed with the following requirements in mind:
  • It must be non-procedural, simple and human-readable.
  • The user must be able to describe virtually any file format.
  • The language must support any data structure you are likely to encounter in a file.
The internal structure of a binary file, defined in this language, is the document schema (we call it a "binary data definition" and keep it in files with the .bdd extension).
Top-level BinaryDOM services, such as BinSchemaSet, load document schemas from %CommonAppData%/Miraplacid/BinaryDOM/BinSchemas.

BDDL is a C-like descriptive language. It is expressive enough to define a wide variety of binary document structure elements.
C-style multiline comments enclosed in /* and */ are supported.

File header


At the beginning of a BDDL file, before all other constructs, you must insert the following pseudo-structure:

MiraplacidBinaryDataDefinitionFileInfo {
version <version_number>;
credits <text>;
license <text>;
comments <text>;
}
All fields of this structure are optional.
The version value is used by the schema parser to verify that it can parse this schema file correctly. The current schema version is 2.
All other values are arbitrary text. If a text value contains spaces or punctuation, it must be enclosed in double quotation marks.

Structures, documents and document contents


The main building block of the language is a structure (document). Nested structures define the internal hierarchy of the document. Structures consist of items and rules:

<annotation>
document | struct <name> {
<item>
:
<item>
 [AcceptRule = <expression>];
}
An annotation is a collection of object properties in "name=value" format. Annotation elements work like attributes in .NET.

<annotation> : '[' name[=["]value["]] , name[=["]value["]] , : ']'
Values may be enclosed in double quotation marks. The value section (=["]value["]) is optional; when it is omitted, the name gets the default value "true".
Annotations are stored together with the schema and may be used as a source of additional information about schema elements.
Currently the following "standard" annotations (annotation keys) are defined:
  • helpstring. Defines a help string for a schema object. Can be defined for structures, documents and items. Binary Viewer uses it for tooltips.
  • mime_type. Defines the MIME type of a document. Helps to recognize the document type. The value string contains MIME type values separated by commas.
  • extension. Defines the file extensions of a document. Helps to recognize the document type. The value string contains extension values (without the leading dot, i.e. "jpg" rather than ".jpg") separated by commas.
  • alterview. The value contains the name of a view object that will be used as an alternate view. Can be used with structures, documents and items.
  • pointer. The value contains the name of a directory used in the directories and pointers mechanism. Can be used with single numeric data items only.
  • dataheap. Marks a data item used as a data source in the directories and pointers mechanism. Can be used with structures, documents and items.
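The annotation grammar described above can be illustrated with a small parser. This is only a sketch in Python, not the BinaryDOM schema parser: the function name parse_annotation is hypothetical, and the sketch assumes that comma-separated lists (as in mime_type or extension) are always written inside double quotation marks.

```python
import re

# One annotation element: a name, optionally followed by =value, where the
# value is either a double-quoted string or a bare token without commas.
ELEMENT = re.compile(r'\s*(\w+)\s*(?:=\s*(?:"([^"]*)"|([^,\]\s]+)))?\s*(?:,|$)')


def parse_annotation(text):
    """Parse '[name=value, name="value", name]' into a dict of strings."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("annotation must be enclosed in square brackets")
    body = text[1:-1]
    result = {}
    pos = 0
    while pos < len(body):
        match = ELEMENT.match(body, pos)
        if match is None:
            raise ValueError(f"cannot parse annotation at offset {pos}")
        name, quoted, bare = match.groups()
        if quoted is not None:
            value = quoted
        elif bare is not None:
            value = bare
        else:
            value = "true"  # a name without a value defaults to "true"
        result[name] = value
        pos = match.end()
    return result


sample = parse_annotation('[helpstring="JPEG image", extension="jpg,jpeg", dataheap]')
```

Here sample maps helpstring and extension to their quoted strings, while dataheap, written without a value, receives the default "true".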
An item is a field in a structure, described by its type:

<annotation>
<type>['['dimension']'] name [optional implicit allowempty restrictions {expression}];
Each item may have its own annotation.
Type is the type of the item. It can be a primitive data type, the name of a defined structure (or document), or an enumeration type, defined as a list of types; any type from the list may occur at this place in the structure.
Primitive data types and their .NET equivalents:
  • uint8 or Byte - byte
  • int8 or SByte - sbyte
  • uint16 or UInt16 - ushort
  • int16 or Int16 - short
  • uint32 or UInt32 - uint
  • int32 or Int32 - int
  • uint64 or UInt64 - ulong
  • int64 or Int64 - long
  • uint16be or UInt16BE - ushort big endian
  • int16be or Int16BE - short big endian
  • uint32be or Uint32BE - uint big endian
  • int32be or Int32BE - int big endian
  • uint64be or UInt64BE - ulong big endian
  • int64be or Int64BE - long big endian
  • char or Char - 8-bit character (ANSI)
  • wchar or UChar - 16-bit character (Unicode)
  • float or Float - single
  • double or Double - double
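To make the type list concrete, here is a sketch that decodes these primitive types with Python's struct module. It is an illustration, not BinaryDOM code. Two assumptions are made: names without the "be" suffix are treated as little-endian, and "char" is decoded as Latin-1 (the actual ANSI code page may differ). The function name read_primitive is hypothetical.

```python
import struct

# Assumption: names without the "be" suffix are little-endian. The list above
# states the byte order only for the "be" variants.
CODES = {"uint8": "B", "int8": "b", "uint16": "H", "int16": "h",
         "uint32": "I", "int32": "i", "uint64": "Q", "int64": "q",
         "float": "f", "double": "d"}
ALIASES = {"byte": "uint8", "sbyte": "int8", "uchar": "wchar"}


def read_primitive(type_name, data, offset=0):
    """Decode one value of a primitive type from a bytes object."""
    name = type_name.lower()
    name = ALIASES.get(name, name)
    if name == "char":                       # 8-bit character
        return data[offset:offset + 1].decode("latin-1")
    if name == "wchar":                      # 16-bit character
        return data[offset:offset + 2].decode("utf-16-le")
    big_endian = name.endswith("be")
    base = name[:-2] if big_endian else name
    fmt = (">" if big_endian else "<") + CODES[base]
    return struct.unpack_from(fmt, data, offset)[0]


data = bytes([0x01, 0x02, 0x03, 0x04])
little = read_primitive("uint16", data)      # bytes 01 02 read as 0x0201
big = read_primitive("uint16be", data)       # bytes 01 02 read as 0x0102
```

The same two bytes produce different values depending on whether the plain or the "be" type name is used, which is the only difference between the paired entries in the list.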
An enumeration type declaration resembles alternation in regular expression syntax:

(type1|type2|type3)
An array of an enumeration type means that elements of the listed types may appear here in any order. The data parser or validator tries to accept the listed types from left to right.
If an item declaration contains a dimension, [expression], it defines an array of elements of the given type(s).

All other keywords in the item definition are extra flags and attributes:

optional means that the data item may be absent from a document; its absence is not an error.
implicit applies to user-defined types and arrays of them. All items of a type marked "implicit" are expanded and inserted directly into the parent structure.
allowempty allows structures and arrays to have zero length. If a type or array is marked allowempty and is empty in the document, the reader and validator do not raise an error. Unlike "optional", it adds the empty structure or array to the document object model.
restrictions defines a special expression rule (in curly braces) that is checked against the element. If the rule evaluates to false, an error is reported. When used with arrays, the rule is applied to each array element, not to the array as a whole. If the restriction rule fails on some array element, the array ends there. This makes it possible to define a variable-length array as an unlimited-size array whose restrictions rule checks each element.
When used with enumerated items, including arrays of enumerations, this rule is ignored.
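The way a restrictions rule terminates an unlimited array can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not the parser itself. It assumes that the element which fails the rule is not consumed and stays available for the next item; the text does not state this explicitly. The name read_array is hypothetical.

```python
def read_array(data, offset, restriction, limit=None):
    """Read byte elements until the restriction fails, data ends, or limit is hit.

    Returns (elements, new_offset). The element that fails the restriction is
    not consumed (an assumption of this model).
    """
    elements = []
    while offset < len(data) and (limit is None or len(elements) < limit):
        value = data[offset]
        if not restriction(value):
            break                      # the array ends at the first failure
        elements.append(value)
        offset += 1
    return elements, offset


data = b"hello\x00rest"
# Unlimited array of bytes restricted to non-zero values: a C-style string.
text, end = read_array(data, 0, lambda b: b != 0)
```

In this run the array stops in front of the zero byte, so text holds the five bytes of "hello" and end is the offset of the terminator.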

AcceptRule is a pseudo-item containing an expression that is checked against the structure during parsing or validation. If the expression evaluates to false, the structure is not accepted.
AcceptRule should be used for:
  • Efficient implementation of enumerated type items. If a type in an enumeration has no AcceptRule, it is accepted and further variants are not considered.
  • Building arrays of structures. The array continues as long as each parsed element is accepted. As soon as an array element is not accepted, the array is finished and the parser/validator proceeds with the next item in the data definition.
  • Building arrays of enumerated types. This combines the two options above: the array ends when no variant in the enumeration can be accepted for the current array element.
  • File validation and other validation purposes, for example, to ensure that the "magic" byte in the file header is really 0xAA.
AcceptRule is an optional element.
 
Expressions in AcceptRule, in restrictions, and in array dimensions all share the same syntax, which we call the BinPath expression language.

Namespaces


A namespace describes an external component that can be used as a source of external functions in expressions and as a filter plug-in for alternate views.
The syntax of a namespace object is as follows:

namespace <name> {
    type DotNet|COM|DLL;
    location <location>;
}
Currently, the BinaryDOM library works only with .NET external components, so the type string may be omitted.
Location is the path to the external component; this value may be enclosed in double quotation marks. If the location does not contain a full path, the plug-in is loaded from %CommonAppData%/Miraplacid/BinaryDOM/Plugins.

Dictionaries


Some data formats contain encoded information that cannot be used directly for document parsing and validation.
For example, an MPEG file contains a bitrate flag, where the value 0010 means a 64 kbit bitrate for version 1, layer 1.
We need a table to decode those bits into a meaningful bitrate value.
A dictionary is an array of key-value pairs designed for such cases. The syntax is:

dictionary <name> {
key=value,
...
key=value
}
Keys and values may be enclosed in double quotation marks, which means they can be strings. They can also be numeric values, including negative values (with a leading "-") and floating-point values, depending on the needs of the file format.

Dictionaries are used by the lookup() function.
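As an illustration of what a dictionary is for, the MPEG bitrate case mentioned above can be modelled in Python. The table holds only the first few entries of the MPEG-1 Layer I bitrate index. The lookup function below is a hypothetical stand-in: the signature of the BinPath lookup() function is not described in this section.

```python
# MPEG-1 Layer I bitrate index -> bitrate in kbit/s (first entries only).
MPEG1_LAYER1_BITRATE = {0b0001: 32, 0b0010: 64, 0b0011: 96, 0b0100: 128}


def lookup(dictionary, key, default=None):
    """Model of a dictionary lookup: map an encoded key to its meaning."""
    return dictionary.get(key, default)


header_bits = 0b0010                     # bitrate flag read from a frame header
bitrate = lookup(MPEG1_LAYER1_BITRATE, header_bits)
```

With the flag value 0010 the model returns 64, matching the example given in the text.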

Alternate views


Most binary document formats contain deeply nested structures which, when they are large, require a lot of resources to parse, store and handle.
Some document types are compound, i.e., they contain other documents.
Often we need to view only some part of the nested structures, or just need to verify the document type, without parsing every bit.
Nested documents and nested common data structures can be defined as "data with alternate view".
Alternate views are not parsed by default when a data file is loaded with Binary DOM.
Alternate view syntax:

view <name> {
source <source_structure_name>;
type <item_type>[<dimension_expression>];
ContentEncoding <encoding_handler_specification>;
ConversionRule <rule>;
}
The source item contains the name of the structure that can be used as a source node for the alternate view. A structure or document may have several alternate views; each of them has a source item pointing to that structure. The alterview annotation value on a structure or item contains the name of the default alternate view, which can be used without a name in the alterview expression language function or in the GetAlternateView method of the BinNode class. If a view is the default one, there is no need to specify its source element.
Primitive data types and data arrays can only have the alternate view mentioned in the item annotation.

An alternate view can be of any type: structure, document, primitive type array or primitive type value. It can have a dimension expression just like any other item.

The ContentEncoding feature can be used to decode document data before representing it as an alternate structure, and to encode data when updating document data via the alternate view's document object model. ContentEncoding points to a namespace that references a special filter plug-in assembly.
To develop a ContentEncoding plug-in, build an assembly with a class that implements the IStreamConverter interface and make it available to the main application.
The syntax for a link to a filter is

namespace::name

or 

namespace
where namespace is the name of a namespace defined in the document schema and name is the name of a class in the assembly.
If the class name is omitted, the BinaryDOM library searches for the first suitable class (one that implements the IStreamConverter interface).

A ConversionRule expression selects items from the parent data structure, consolidates them into a single collection and then converts this data into the alternate view. This operation is handy when a compressed stream is split into chunks and those chunks are spread across different data blocks. The rule defines which chunks from the parent structure to merge together to get the compressed stream of the view. The data transformations driven by view features are applied in the following order (each step only if the feature is present):
  1. Execute ConversionRule and get the result. The system assumes that the result is a collection of descendant elements of the current node. When BinaryDOM creates an alternate view, it connects the view to the parent of the current node, so almost any axis can be used in expressions.
    If ConversionRule is not defined for the view, the result is the current node itself.
  2. Convert the result of the previous step to a Stream, call the filter plug-in and store the result of its work in a Stream.
    If this step is not present, the resulting Stream is generated directly from the previous result.
  3. Create a BinNode of the type given by the type item and initialize it from the Stream produced by the previous step.
Source, ContentEncoding and ConversionRule may be omitted from a view definition.
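The three steps above can be sketched as a small pipeline. The Python below is a model under stated assumptions, not the BinaryDOM implementation: the function build_view and the dictionary-based nodes are hypothetical, zlib stands in for a ContentEncoding filter plug-in, and the "type" step is reduced to a plain callable.

```python
import io
import zlib


def build_view(nodes, conversion_rule=None, decode=None, parse=bytes):
    """Model of the three view steps: select, decode, parse."""
    # Step 1: the ConversionRule picks the nodes to merge; default is all.
    selected = [n for n in nodes if conversion_rule is None or conversion_rule(n)]
    stream = io.BytesIO(b"".join(n["data"] for n in selected))
    # Step 2: the optional filter decodes the merged stream.
    if decode is not None:
        stream = io.BytesIO(decode(stream.read()))
    # Step 3: the view type is initialised from the resulting stream.
    return parse(stream.read())


# A compressed payload split into two chunks with an unrelated block between.
original = b"inner document " * 10
payload = zlib.compress(original)
half = len(payload) // 2
blocks = [
    {"kind": "data", "data": payload[:half]},
    {"kind": "crc", "data": b"\x00\x00"},
    {"kind": "data", "data": payload[half:]},
]
view = build_view(blocks, lambda n: n["kind"] == "data", zlib.decompress)
```

The rule keeps only the "data" blocks, the filter decompresses the merged bytes, and the result equals the original payload. Omitting the rule and the filter simply concatenates every block, which mirrors the "feature not present" branches of the list.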

Should the alterview annotation be applied to the item definition, or where the item is used in some outer structure or array? Here is the rule of thumb: for structures and documents, the alterview annotation can be applied to their declaration if this object type is always converted to the same view.
For structures that have a particular view only in a certain context (when the structure is an item of another structure), alterview should be applied where the structure is used.

For arrays of primitive types and for items of primitive types, the alterview annotation should be applied where the item is used.

Alternate views help in the following situations:
  • When a document has a very complex structure, and for your practical purposes it is not necessary to parse the whole structure in depth.
  • When a document contains another whole document inside. It may be convenient to parse the inner document as an array of bytes and convert this array to the inner document only when necessary. For example, a Windows spool file contains EMF files as images of its pages.
  • When a document contains a structure (array) that can be represented as an array of structures whose number is undefined and extends to the end of the enclosing item.
    In this case, only the alterview mechanism allows the file to be parsed correctly, naturally limiting the inner array.
    Since an array of structures is not allowed as the type of a view, the array of structures has to be wrapped in another structure.
    An example of such a tricky file format is AVI.
  • When a file has two structural levels: the outer level is a set of transport-oriented data wrapped around an inner set of structures with the document content. A reader of such a format must receive the "transport" packets, check the CRC, combine the contents of the document, decode the document and do something with it.
    With a view, we can write a ConversionRule that collects the data arrays stripped of outer headers, footers and CRC codes. Our document content is scattered across these arrays. Then we can use a decoding filter plug-in (ContentEncoding) to decode the collected content. If the document format is known, and an alternate view with this document format is added to the binary data definition (.bdd) file, the decoded document can be parsed just as if it had been loaded from a hard drive.
    A good example of a nested format of this kind is the OGG audio file.
Alternate views can be nested.

Directories and pointers support


Some data formats do not have neatly packed data structures. For example, some document headers contain a directory with pointers to locations in the file; a TIFF file, for instance, has a pointer to a directory structure.
These pointers usually reference some structure. The structure type is usually either immediately available or can be deduced from data in the directory element.
When the file body consists of self-descriptive data structures, each with an appropriate header, we can define the body as an array of enumerated types and disregard the pointers. However, if the type and size of data blocks are stored outside these blocks, i.e., in directory elements, we need special BDDL features (directories and annotations) to parse such a file.

To declare a pointer to some data structure in a file, use the pointer annotation attached to a single numeric data item:

[pointer=MainPointer]
uint32 ptr;
The pointer annotation value must match the name of a directory data structure declared somewhere in the BDDL file:

directory <name> {
PointerRule <rule>
type <item_type>[<dimension_expression>];
AcceptRule <rule>
}
PointerRule is an expression that calculates the absolute position of the referenced data. It is needed, for example, when the pointer contains only an offset from the end of the header; in that case, the header size must be added to the value to get the position. In other cases this element is not needed, so it is optional.
AcceptRule is an expression that decides whether the value may be interpreted as a pointer. Sometimes a pointer is not valid; for instance, a pointer equal to 0 is invalid. The engine checks this rule before using the value as a pointer. This parameter is optional.
Type is the type of the referenced data: a structure, a primitive type array or a primitive type value. A dimension expression can be used just like for any other item.

Pointers that reference typed data can be used for the following two purposes:
  • To convert a raw data area into a set of various data structures. This area should be declared as an array of bytes unless there is a strong reason to use some other data type.
    The parser converts this array into a set of data structures according to the types provided by directory specifications. Pointers are allowed to point inside the array.
    Let's say a pointer points to byte 100 and expects to find a structure of size 50 there. The parser splits the source array into 3 elements: a byte array of size 100, a structure of size 50, and a byte array with all the bytes after 150. If a directory-based document does not contain "holes", all automatically created arrays will be converted into specific data structures.
    Structures created during the conversion process may contain pointers and data heaps. As soon as the parser knows where they are and what their type is, it can process them.
  • To convert a common type structure or primitive type nodes into a different data structure.
    This is pointer-driven type refinement: data types are refined with the help of directories. Pointers are not allowed to point inside these nodes, and refined nodes must have the same size as the original nodes.
    New data structures are allowed to have pointers and data heaps.
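The splitting of a raw data area around a pointer target can be modelled as follows. This Python sketch is illustrative only. It assumes that positions are counted from the start of the heap, that the pointer stores an offset from the end of a 16-byte header, and that a zero pointer is invalid; the function names are hypothetical.

```python
def pointer_position(raw_value, header_size):
    """Model of AcceptRule plus PointerRule for one pointer value."""
    if raw_value == 0:                 # AcceptRule: zero is not a valid pointer
        return None
    return raw_value + header_size     # PointerRule: offset -> absolute position


def split_heap(heap, position, size):
    """Split a byte array into (before, structure, after) around a target."""
    if position < 0 or position + size > len(heap):
        raise ValueError("pointer target lies outside the data heap")
    return heap[:position], heap[position:position + size], heap[position + size:]


heap = bytes(200)                              # raw data area of 200 bytes
position = pointer_position(84, header_size=16)
before, target, after = split_heap(heap, position, 50)
```

For a stored value of 84 and a 16-byte header the target position is 100, so the heap falls into pieces of 100, 50 and 50 bytes, which corresponds to the three elements described in the first bullet.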
Nodes intended for conversion must have the dataheap annotation with any non-empty value (an annotation written without a value gets the default value "true"):

[dataheap]
uint8[*] body;

or

[dataheap]
struct Sector {
...
}
As a result of these conversions, the document no longer literally matches the schema: instead of a single array it contains a set of other data structures, or one structure instead of another. However, the Validator will validate the document correctly as long as the appropriate metainformation (the dataheap annotation) is correct.
BinNodes created by the parser during the conversion process will not have a valid SchemaItem property in their SchemaInfo objects.

Forward references in expressions


Because the data parser works in a streaming fashion, expressions evaluated at the parsing stage cannot contain forward references, i.e., they cannot use the following, descendant, descendant_or_self, following_sibling, first_following_sibling or child axes. These expressions are AcceptRule in structures, and dimension and restriction expressions in items.
Other expressions used in a schema are evaluated after the whole document is loaded, so all references are allowed. These are the expressions used in alternate views and directory elements. Any user-defined expressions used in queries are also unrestricted.
The schema checker does not test references exhaustively, and it always allows a forward reference to a following sibling via the parent axis: an expression in item "a" may contain "parent.b" where "b" is placed lower in the parent structure.
If you use such a construct in an expression of the "restricted" type, it will lead to a runtime error.
So, how do you parse documents with forward references in their structure? The common answer is: use the most generic data structures and refine them with alterviews after parsing is complete. The refined structure may contain forward references, which can be resolved via a parent.item reference:

struct SomeOuterStructure {
   [alterview=SomeStructure]
   uint8[31] varArray;
   uint8 count;
}

view SomeStructure {
   type SomeStructure;
}

struct SomeStructure {
   uint8[parent.count] array1;
   uint8[31 - sizeof(array1)] filler;
}
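The two-stage idea behind this schema can be modelled in Python. The sketch is not BinaryDOM code and the function names are hypothetical: the first pass reads the outer structure without any forward reference, and the second pass plays the role of the alternate view, which can use count because the whole document is already loaded.

```python
def parse_outer(data):
    """First pass: no forward references, just 31 raw bytes and a count."""
    if len(data) < 32:
        raise ValueError("SomeOuterStructure needs 32 bytes")
    return {"varArray": data[:31], "count": data[31]}


def refine(outer):
    """Second pass: the view splits varArray using the already parsed count."""
    count = outer["count"]
    if count > 31:
        raise ValueError("count exceeds the size of varArray")
    raw = outer["varArray"]
    # array1 takes 'count' bytes; filler takes the remaining 31 - count bytes.
    return {"array1": raw[:count], "filler": raw[count:]}


data = bytes(range(1, 32)) + bytes([5])   # 31 payload bytes, then count = 5
outer = parse_outer(data)
inner = refine(outer)
```

With count equal to 5, array1 receives the first five bytes and filler the remaining 26, so the two parts always add up to the 31 bytes of varArray.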

Reusable Components


It is possible to reuse common data structures in several document schemas. You can create and reuse libraries of common types and include one document into another. Here is the syntax of the include statement:

include <filename>
The filename may be enclosed in double quotation marks.
If the filename does not contain a full path, the file is loaded from %CommonAppData%/Miraplacid/BinaryDOM/Includes.

Schema compilation


When the BinaryDOM library loads a document schema, it checks the schema for:
  • Syntax of language constructs
  • Existence of referenced structures, documents, views, namespaces
  • Validity of BinPath expressions
  • Validity of reference path elements used in expressions, and validity of function arguments
After compilation, the system performs an optimization step. Here are some important points of the optimization:
  • If a structure contains only a single nested structure, and the inner structure is implicit, the items of the inner structure are included into the outer structure. The AcceptRules of these structures are combined with the && operation. This is a kind of inheritance.
    Please note that in the AcceptRule you must use the inner structure's item names (references) as if they were items of the outer structure.
  • In addition, if the inner structure contains only one item with a restriction rule, and the item containing this structure has its own restrictions, these rules are combined with the && operation.
  • If one structure includes another structure, and the inner structure is implicit and has no AcceptRules, the items of the inner structure are included into the outer structure.

Best practice


Think of the data parsing process as a zipper. You have data, you have a data definition file, and you have a runner that zips them together. The only difference is that on the data definition side some pieces might or might not match the data, and the runner is smart enough to choose the right patches for the data. The runner might succeed or give up, depending on the data and the patches on hand.
To develop fast, efficient and reliable document schemas, please take the following considerations into account:
  • Don't overuse the implicit attribute. It might slow down schema parsing (if the implicit status is not optimized away during compilation), and you might lose important metainformation. For example, a structure name together with char[*] tells us more than just a char[*] string.
  • Use the mandatory attribute only when you have a very regular binary structure and you must have all items even though empty arrays and structures are allowed in that format.
  • Try to avoid items with enumerated types and arrays of enums. Use these constructs only if any value from the enumeration really can appear in any order. Otherwise, try to make the data definition more specific.
  • If a structure item is marked as optional and it is rejected for some reason, the parser/validator will try to fill and validate the next schema item with the same data.
  • A structure/document is rejected if:
    • It did not pass the AcceptRule expression check
    • One or more non-optional items were rejected, or were missing when end of file occurred
    • The structure/document contains no items and is not marked as mandatory
    • The restrictions rule failed in the item representing this structure/document
    Usually, you don't have to verify each nested structure with AcceptRule. Some structures cannot be verified, and that's OK as long as you can verify the outer structure. The outer structure is rejected as soon as any of its non-optional items is rejected.
  • An enumeration item is rejected only after all its variants have failed verification against the current data.
  • An array of primitive types is rejected if it is defined with a limited length and the actual length is not equal to the calculated length (end of file occurred or the restrictions rule failed on some element), or if the array is empty and is not marked as mandatory.
  • Primitive types are rejected if end of file occurred or the restrictions rule failed.
  • If you use enums, take into account that the parser tries the types from left to right in the enumeration list. You have to place the types in the correct order, typically from stricter types (i.e., with a stricter AcceptRule) to less strict ones. For example, a PDF file might contain a Dictionary structure beginning with "<<" or a HexString structure beginning with "<". Obviously, Dictionary must be placed before HexString in the enumeration. A type with no restrictions should be placed last.
  • Arrays of structures or enums of unlimited length [*] end when the next array element is rejected. The parser/validator then tries to fill/analyse the next item after that array. If the array is empty and is not mandatory, an error is raised. However, if the array contains at least one element, the array is accepted and the parser tries the next item.

    How do you troubleshoot a data definition if it fails on or around arrays of structures or enums?
    Here are some tips:
    • Use as small a document as possible; even create a very simple artificial one to isolate the problem.
    • If you have an unlimited array in your data definition file, try temporarily replacing it with a fixed-size array whose size matches the data in your data file, and see whether it helps.
    • Transform your array or enumeration array into a set of specific structures placed directly in the enclosing structure to match your data file, and see whether it helps.
  • When you use unlimited arrays, always think about how they will end. For example, an unlimited array of bytes is read up to the end of file and is accepted successfully. For arrays of structures and enums, the limit is the AcceptRule and restrictions rule applied to each item.
    For arrays of primitive types, it is the restrictions rule applied to each item.
  • If your structure has an AcceptRule, it is not accepted until the AcceptRule has been executed and has succeeded. The AcceptRule is executed when the last item referenced in the expression has been filled with data.
    For example, suppose your structure is included in an enumeration and its AcceptRule refers to several items (say, the first and the last one). The parser pumps data into this structure up to the last item; only then does the AcceptRule reject the structure, and the parser tries the next enum variant, although the structure could have been rejected at the first item. In cases like this, it makes sense to let the parser reject variants as early as possible: apply a restrictions rule to the first item and remove that condition from the structure's AcceptRule.
  • When working with textual data, if possible, use the iswhitespace(), isletter(), etc. functions instead of "manual" comparison.
  • If a function has default parameters, and they are what you need, do not pass arguments.
  • If a document structure is defined with pointers to some locations, it does not necessarily mean that you have to use pointers in the binary data definition file. Sometimes they are provided only for convenience, and the document itself is quite regular. Try to avoid using pointers, because they cause an unnecessary second parsing pass over the document data.
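The advice on ordering enumeration variants can be demonstrated with a small model. The Python below is a sketch, not the BinaryDOM parser: the accept functions stand in for AcceptRule checks on the PDF Dictionary and HexString structures mentioned above, and first_match is a hypothetical name for the left-to-right trial.

```python
def accepts_dictionary(data, pos):
    """Stand-in for an AcceptRule: a Dictionary begins with '<<'."""
    return data.startswith(b"<<", pos)


def accepts_hex_string(data, pos):
    """Stand-in for an AcceptRule: a HexString begins with '<'."""
    return data.startswith(b"<", pos)


def first_match(variants, data, pos=0):
    """Return the name of the first variant whose accept check passes."""
    for name, accepts in variants:
        if accepts(data, pos):
            return name
    return None


good_order = [("Dictionary", accepts_dictionary), ("HexString", accepts_hex_string)]
bad_order = list(reversed(good_order))

data = b"<< /Type /Catalog >>"
right = first_match(good_order, data)   # stricter variant is tried first
wrong = first_match(bad_order, data)    # looser variant shadows the stricter one
```

With the stricter variant first, the dictionary data is recognised as a Dictionary; with the order reversed, the looser "<" check wins and the same data is taken for a HexString.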
