Coding into the Void

A blog that I’ll probably forget about after making thirty-five posts.

Making a Very Bad Data Serialization Language

It was nearing the end of July and I had not been very productive when it came to making games. What should I do? Make a data serialization format, of course!

What is that, you ask? A way of representing values to disk. There are two general types of data format: ones designed more for human editing, like INI, and ones designed more for machine editing, like JSON. It’s a continuum, and JSON isn’t all that arduous to modify by hand.

Why did I make my own? Well, for fun, mostly. I wanted to have files that were very hand-editable, and I considered using INI files, as that’s common in Windows, but the format itself seemed fairly limited in what you could express. I wanted something that I could dump arbitrary dictionaries out to and read back in from, and I would have had to do some manual conversions to handle INI. Things like YAML have load-bearing whitespace, which I do not like.

In the end, no one else’s well-thought-out formats do things in the exact specific way that I want them to do, and I have time, so why not?

I call it Kevin’s Very Bad Data Serialization Language, or KVBDSL for short.1

Here is an example of what a file using KVBDSL would look like:

name: "Kevin Hutchins"
games made: i 38
living: b true
hobbies: [
    "programming"
    "games"
    "something that makes me seem cultured"
    i 313
]
games made: f 30.25
skill levels: {
    programming: i 4
    game creation: i 2
    complacency: i 7
}
# This is a comment.
my first name but vertical: """
    K
    E
    V
    I
    N
"""
"I want to use a : in this key": b true

Generally speaking, entries in KVBDSL take the following format:

<key name>: <value type signifier> <value>

Occasionally taking the format:

<key name>: <value type signifier><value>

This is the top level diagram of a file in KVBDSL:

[Railroad diagram: a file is a newline-separated sequence of entries and comments.]

Design Goals

I had three primary goals when designing KVBDSL:

  1. An extensible and backwards-compatible format.
  2. A generally permissive system, making it hard to have subtle errors that prevent keys from parsing.
  3. Any string value should be representable in both keys and values.

An explicit non-goal was transferability. The expectation is that these files will only be used within a single program: more for configuration than communication.

I wanted to make a format that was extensible without changing how existing files were read or breaking existing systems.2 I also wanted a system that is fairly permissive, making it hard to have subtle errors that prevent keys from loading. An error in a long string would ideally not break parsing.

Part of that extensibility is the potential to add more data types. Data types, when implicitly typed, have the tendency to allow for overlaps.

There’s an issue that pops up in YAML from time to time called the Norway Problem. See, Norway has the country code “NO”. Strings can be specified without quotes. YAML has a series of supported boolean values, including NO to represent false. So if you’re writing a bunch of unquoted country codes, you might get a false value popping up in your array.

In KVBDSL, there is no implicit typing. Every type has its own unique token, with a couple of shorthands for representing strings.

Keys

[Railroad diagram: a key is either a UQKString or a QKString.]

Keys must be strings. They can be either quoted strings or unquoted strings. For parsing, the first non-whitespace character on a line will determine if it’s parsed as an unquoted or a quoted string. If the first non-whitespace character is ", it will be parsed as a quoted string. If it is anything else, it will be parsed as an unquoted string.

Unlike string values, keys do not support multi-line strings.

Unquoted Key Strings

[Railroad diagram: an unquoted key string is any character except :, ", or \n, followed by any number of characters other than : or \n.]

Example:

   string key 52 : i 43

This key will be parsed as string key 52.

An unquoted key string is terminated when it reaches a newline, triggering a parse error, or a :, ending the key. Unquoted strings are trimmed of all leading and trailing whitespace.

Leading whitespace is trimmed to allow for the indenting of keys without any weird parsing side effects. Trailing whitespace is trimmed because I think it’s more likely that people will want to align the : character than to have keys that end with spaces.

No escape characters are supported in this mode,3 which means that unquoted strings cannot contain any of the following:

  • A leading ".
  • A : anywhere in the string.
  • A \n (a newline, not the combination of \ and n).
  • Leading or trailing whitespace.

Any of the above cases should trigger serializers to output keys as quoted strings.
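As a rough illustration of that serializer rule, here is a minimal Python sketch. The function names `key_needs_quotes` and `write_key` are hypothetical and not taken from the actual implementation. The post also does not say how a key that begins with # should be handled, so this sketch ignores that case.

```python
# Escapes a serializer would emit inside a quoted key (hypothetical helper table).
_ESCAPE_OUT = {
    "\\": "\\\\", '"': '\\"', "\b": "\\b", "\f": "\\f",
    "\n": "\\n", "\r": "\\r", "\t": "\\t", "\v": "\\v",
}


def key_needs_quotes(key):
    """Return True if `key` would not survive a round trip as an unquoted key."""
    return (
        key.startswith('"')      # would be read as the start of a quoted key
        or ":" in key            # ':' ends an unquoted key
        or "\n" in key           # a newline is a parse error
        or key != key.strip()    # leading/trailing whitespace gets trimmed
    )


def write_key(key):
    """Emit the key unquoted when possible, falling back to a quoted key."""
    if not key_needs_quotes(key):
        return key
    return '"' + "".join(_ESCAPE_OUT.get(ch, ch) for ch in key) + '"'
```

A key such as `say "hi"` stays unquoted here, since only a leading " forces quoting.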

The goal here was to let most strings be written without quotes, with only outlier cases requiring them. Quotes are both toil to type and something that I can see someone forgetting to close. For hand-written files, I expect the vast majority of keys to be representable as unquoted strings.

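Quoted Key Strings

[Railroad diagram: a quoted key string is an opening ", then any mix of characters other than \ or \n and the escapes \\, \", \b, \f, \n, \r, \t, and \v, then a closing ".]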

Example:

"  A key\n: that is terrible " : i 43

This key will be parsed as   A key\n: that is terrible . Note that in this case, the \n in the example text refers to a \ followed by a n, while in the parsed key it refers to a newline character.

A quoted string is terminated when it reaches an un-escaped " character. If it reaches an unescaped \n, it will fail parsing. The contents of the string are not trimmed.

Escape characters are supported in this mode. The escape characters supported are: \\, \", \b, \f, \n, \r, \t, and \v.

As such, the only thing a quoted string cannot contain is:

  • An unescaped \n newline character.

The expectation, although not requirement, is that quoted strings will be used as a fallback case in the serializer.
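The two key modes can be sketched as a single function that reads a key off the front of a line. This is a minimal Python sketch, not the real parser: `parse_key` is a hypothetical name, and keeping the backslash for unknown escapes in keys is an assumption borrowed from how string values are described.

```python
KEY_ESCAPES = {
    "\\": "\\", '"': '"', "b": "\b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}


def parse_key(line):
    """Split one line into (key, text after the ':'). Raises ValueError on failure."""
    text = line.lstrip()
    if not text.startswith('"'):
        # Unquoted mode: everything up to the first ':' is the key, trimmed.
        key, sep, rest = text.partition(":")
        if not sep:
            raise ValueError("unquoted key reached end of line without ':'")
        return key.strip(), rest
    # Quoted mode: read until the first unescaped '"', resolving escapes.
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == '"':
            rest = text[i + 1:].lstrip()
            if not rest.startswith(":"):
                raise ValueError("expected ':' after quoted key")
            return "".join(out), rest[1:]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            # Assumption: an unknown escape keeps its backslash.
            out.append(KEY_ESCAPES.get(nxt, "\\" + nxt))
            i += 2
            continue
        out.append(ch)
        i += 1
    raise ValueError("quoted key reached end of line without closing quote")
```

Each line is assumed to be passed in without its trailing newline, so running off the end of the text stands in for hitting an unescaped \n.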

Types

[Railroad diagram: a value is one of b bool, i integer, f float, s string, [ \n array \n ], { \n dictionary \n }, QString, or MLString.]

There are six overall types supported in KVBDSL: strings s, booleans b, integers i, floats f, arrays [, and dictionaries {. The code snippet after each type is the token that represents it.

There is no limit on the type length, but lengths of one character were chosen for brevity.

The selection of types in the initial version was chosen to cover all the basic types I use when making a game in Unity. int32 and float are chosen over int64 and double for that reason. There’s no good reason that doubles and int64s aren’t supported other than lack of necessity for support in the first version of a thing that I will probably never use.4

Tokens

There are two kinds of type tokens: whitespace-delineated tokens and substring-delineated tokens. The former, whitespace-delineated tokens, are preferred, as they are less error-prone and easier to reason about when it comes to extensibility.

Whitespace-Delineated Type Tokens

Whitespace-delineated type tokens are, as the name suggests, identified because they have whitespace on both sides. As an example, to add an integer as a value, you’d write key: i 5, with the i signalling the type.

The whitespace after the token is important, as it is required by the parsing step. Otherwise, if I wanted to add an int64 type, it would either accidentally parse to an integer or require complicated ambiguity resolution logic. Adding a new whitespace-delineated type only requires checking if one with the same name already exists (or exists as a substring-delineated type, as below).
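To make the extensibility argument concrete, here is a small Python sketch of splitting a value into its type token and payload. The function name `split_type_token` and the `i64` token are hypothetical, used only for illustration.

```python
def split_type_token(value_text, known_tokens):
    """Split 'i 5' into ('i', '5'). Returns None when the token is unknown."""
    parts = value_text.strip().split(None, 1)  # split on the first run of whitespace
    if not parts or parts[0] not in known_tokens:
        return None
    payload = parts[1] if len(parts) > 1 else ""
    return parts[0], payload
```

Because the whole token up to the whitespace is compared, adding a hypothetical `i64` token does not disturb `i`: `i64 5` can only ever match `i64`, and `i5` matches nothing.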

Substring-Delineated Type Tokens

Whitespace-delineated types are expressive, perfect, unambiguous things. They’re also not elegant. Specifying types, generally, isn’t elegant. That’s why for strings, and only strings, there are a couple of shortcuts.

Substring-delineated types are determined by a substring match against the start of the value itself. When determining the type, substring-delineated types are checked before whitespace-delineated ones, in descending length order. If a token matches the start of the value, the value is interpreted as that type.

As an example, assume you had the substring-delineated token foo. The following keys would match against that type.

key1: foobar
key2: foo bar

For substring-delineated types, the token string should be included while deserializing. Although this can be derived since all type tokens are static, it simplifies the use case to have that information available as part of the string.

As substring-delineated type tokens are aggressive with what they can accept, they prevent any whitespace-delineated tokens both from having the same name and from starting with the substring-delineated token’s value. For instance, in our foo example, foo, foobar, and foobaz would be off the table for whitespace-delineated types.

Currently, substring-delineated types are used in only two cases: """ to designate multi-line strings and " to designate quoted strings. This should show why substring matches are performed in descending length order: otherwise, the " token would claim all strings using the """ token.5

Note that this is only possible because a quoted string (one that starts with ") cannot contain any unescaped "s within it. If a valid quoted string could begin with """, I would have nixed one of the substring-delineated types to avoid ambiguity.
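The matching order described in this section can be sketched in Python as follows. The names `identify_type` and `can_add_whitespace_token` are hypothetical, and the real implementation may be organized differently.

```python
SUBSTRING_TOKENS = ['"', '"""']
WHITESPACE_TOKENS = {"b", "i", "f", "s", "[", "{"}


def identify_type(value_text):
    """Return (token, payload) for a value, or None if no type matches."""
    text = value_text.strip()
    # Substring tokens go first, longest first, so '"""' wins over '"'.
    for token in sorted(SUBSTRING_TOKENS, key=len, reverse=True):
        if text.startswith(token):
            return token, text  # the token stays part of the payload
    parts = text.split(None, 1)
    if parts and parts[0] in WHITESPACE_TOKENS:
        return parts[0], parts[1] if len(parts) > 1 else ""
    return None


def can_add_whitespace_token(name):
    """A new whitespace token may not reuse or start with an existing token."""
    if name in WHITESPACE_TOKENS:
        return False
    return not any(name.startswith(token) for token in SUBSTRING_TOKENS)
```

Sorting by length at lookup time keeps the ordering rule in one place, so a newly added substring token cannot be shadowed by a shorter one by accident.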

Supported Types

Booleans

[Railroad diagram: a bool is either true or false.]

Booleans have the whitespace-delineated type token b. The value can be true or false, case insensitive.

I was moderately tempted to allow other values, like ‘yes’ and ‘no’ or ‘1’ and ‘0’, but I decided to go with the standard values instead. I don’t expect this to be used by anyone who isn’t at least moderately tech-savvy.6
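In Python, that rule is close to a one-liner. This is a sketch with a hypothetical `parse_bool` name. Returning None for anything else is an assumption about how an invalid value would be reported.

```python
def parse_bool(payload):
    """Return True/False for 'true'/'false' in any casing, else None."""
    return {"true": True, "false": False}.get(payload.strip().lower())
```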

Floats

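[Railroad diagram: a float is an optional + or -, optional digits, an optional ., one or more digits, and an optional exponent made of e, an optional sign, and digits.]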

Floating point numbers (not double-precision) have the whitespace-delineated type token f. The decimal separator must be ., not ,. Leading zeroes are permitted. Leading + is permitted. Exponentiation (with sign) is permitted.

This is more permissive than JavaScript, which doesn’t allow multiple leading zeroes, although not more permissive than C#. I aimed to be reasonably permissive in this case, as it’s not immediately obvious why you shouldn’t be able to use multiple zeroes.
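A sketch of a validator built from the float rule in the appendix, in Python. `parse_float` is a hypothetical name. Note that Python's `float` is double precision, so this only checks the syntax and does not reproduce single-precision rounding.

```python
import re

# Mirrors the appendix rule: sign? digits* '.'? digits+ ('e' sign? digits+)?
FLOAT_RE = re.compile(r"[+-]?[0-9]*\.?[0-9]+(e[+-]?[0-9]+)?")


def parse_float(payload):
    """Return the parsed number, or None if the text does not match the grammar."""
    text = payload.strip()
    if FLOAT_RE.fullmatch(text) is None:
        return None
    return float(text)
```

Leading zeroes and a leading + pass, while a comma separator is rejected before `float` ever sees it.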

Integers

[Railroad diagram: an integer is an optional + or - followed by one or more digits.]

Integers (signed, 32-bit) have the whitespace-delineated type token i. They can have a leading - or +. Leading zeroes are permitted. This does not, currently, support any non-base 10 numbers, but nothing precludes adding them in the future, either in their own type (unlikely) or with a 0x prefix (more likely).
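The integer rule can be sketched the same way in Python. `parse_int` is a hypothetical name, and rejecting values outside the signed 32-bit range is an assumption, since the post does not say what happens on overflow.

```python
import re

INT_RE = re.compile(r"[+-]?[0-9]+")
INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1


def parse_int(payload):
    """Return a base-10 integer within the int32 range, else None."""
    text = payload.strip()
    if INT_RE.fullmatch(text) is None:
        return None
    value = int(text)  # Python accepts leading zeroes in strings such as '007'
    if not INT32_MIN <= value <= INT32_MAX:
        return None  # assumption: out-of-range values are rejected
    return value
```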

Strings

[Railroad diagram: a string is a UQString, a QString, or an MLString.]

Everyone loves strings. I like strings so much that I have three7 ways to represent them. Strings have the whitespace-delineated type token s, but you can also specify them with the substring-delineated type tokens " and """ as shorthand for quoted strings and multi-line strings, respectively.

This will feel similar to the keys section, as the behavior is similar.

Unquoted Strings

[Railroad diagram: an unquoted string is any character except " or \n, followed by any number of characters other than \n.]

Example:

key : s This is a test string. "s are okay.

This string will be parsed as This is a test string. "s are okay..

An unquoted string is terminated when it reaches a newline. Unquoted strings are trimmed of all leading and trailing whitespace.

No escape characters are supported in this mode, which means that unquoted strings cannot contain any of the following:

  • A leading ".
  • A \n (a newline, not the combination of \ and n).
  • Leading or trailing whitespace.

Any of the above cases should trigger serializers to output values as either quoted or multi-line strings.

Due to the ambiguity it would cause, unquoted strings have no substring-delineated shorthand.8

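Quoted Strings

[Railroad diagram: a quoted string is an opening ", then any mix of characters other than \ or \n and the escapes \\, \", \b, \f, \n, \r, \t, and \v, ended by a " or a \n.]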

Whitespace-Delineated Example:

key: s "This is a test string with a \" and a \n in it."

Substring-Delineated Example (the s type token omitted):

key: "This is a test string with a \" and a \n in it."

This string will be parsed as This is a test string with a " and a \n in it. Note that in this case, the \n in the parsed string refers to an actual newline character, not a \ followed by an n.

A quoted string is terminated when it reaches an un-escaped " or \n character. The contents of the string are not trimmed.

Escape characters are supported in this mode. The escape characters supported are: \\, \", \b, \f, \n, \r, \t, and \v. If an unescaped \ comes before a character that is not part of an escape sequence, it will not be treated as an error but will instead be parsed as if the \ were escaped. This does mean that adding more escape characters could change the interpretation of existing strings.
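The lenient escape handling can be sketched in Python like this. `unescape` is a hypothetical name, and the function expects the text between the quotes.

```python
ESCAPES = {
    "\\": "\\", '"': '"', "b": "\b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}


def unescape(body):
    """Resolve known escapes; leave a backslash alone before anything else."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body) and body[i + 1] in ESCAPES:
            out.append(ESCAPES[body[i + 1]])
            i += 2
        else:
            out.append(ch)  # includes a stray backslash, kept as-is
            i += 1
    return "".join(out)
```

With this behavior, `C:\games` and `C:\\games` both produce `C:\games`. If `\g` were ever added as an escape, the first spelling would start to mean something else, which is the compatibility risk noted above.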

As such, the only thing a quoted string cannot contain is:

  • An unescaped \n newline character.

The expectation, although not requirement, is that quoted strings will be used as a fallback case in the serializer.

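Multi-Line Strings

[Railroad diagram: a multi-line string is an opening """ and a \n, then any mix of sequences other than \ or """ and the escapes \\, \", \b, \f, \n, \r, \t, \v, and \p, then a closing """.]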

Whitespace-Delineated Example:

key: s """
    This is a test string with a \" and a \t in it.
    It has multiple lines.
    Yay!
    """

Substring-Delineated Example (the s type token omitted):

key: """
    This is a test string with a \" and a \t in it.
    It has multiple lines.
    Yay!
    """

The given string will be parsed as:

This is a test string with a \" and a \t in it.
It has multiple lines.
Yay!

A multi-line string is terminated when it reaches three consecutive unescaped quotation marks. The closing """ can be either on the same line as the final text or on the following one.

The opening """ must be on the same line as the key and cannot have any non-whitespace text after it. This decision was made because text on the opening line leads to non-intuitive behavior for leading whitespace removal.9 Would it ignore the whitespace on the first line? Would it strip it but not count it? Would it count it even though the indentation was different? As a user, it would be hard to reason about and harder to remember.

Escape characters are supported in this mode. The escape characters supported are: \\, \", \b, \f, \n, \r, \t, \v, and \p. If an unescaped \ comes before a character that is not part of an escape sequence, it will not be treated as an error but will instead be parsed as if the \ were escaped. This does mean that adding more escape characters could change the interpretation of existing strings.

Note the \p sequence which is not supported in other strings. See the section on trailing whitespace removal for more information.

Multi-line strings have no support for line continuation, as that would make leading whitespace removal behave unintuitively.

These strings behave largely the same as the Java text block behavior.10

I’m aware of three notable differences:

  1. The handling of trailing whitespace. Java requires users to write escape sequences (such as \s or the octal \040) to keep trailing whitespace. KVBDSL introduces a preservation escape character.
  2. The handling of invalid escape sequences. Java fails the compilation step. KVBDSL accepts the value. Java’s MLS support being behind a compilation step makes being strict the right choice for their scenario, but not for mine.
  3. The handling of newlines around the trailing """. In Java, a """ on a new line adds a newline character. In KVBDSL, it only adds a newline character if there is more than one newline character before the closing sequence. This was to match the opening """ behavior, which does not add a newline.

Leading Whitespace Removal

Any consistent leading whitespace before all non-whitespace lines will be removed.

Example:

key: """
    foo
    bar
    """

Will be parsed as:

foo
bar

To avoid this behavior, place the terminal """ at the start of the line, like so:

key: """
    foo
    bar
"""

The above is parsed as:

    foo
    bar

When evaluating the consistent leading whitespace, only tabs and spaces will be considered. It can be a combination of the two, but no other forms of whitespace are considered. Sorry, non-breaking space.
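Here is a minimal Python sketch of the indentation rule, assuming the function is handed the raw text between the opening """ line and the closing """. The name `strip_common_indent` is hypothetical. It treats the closing line as part of the indentation calculation, which is what makes the two examples above differ, and it drops that line when it holds nothing but whitespace.

```python
import os


def leading_ws(line):
    """Return the run of spaces and tabs at the start of a line."""
    return line[: len(line) - len(line.lstrip(" \t"))]


def strip_common_indent(body):
    """Remove the indentation shared by all non-blank lines and the closing line."""
    lines = body.split("\n")
    # Blank lines are ignored, but the last line (where the closing """ sits) counts.
    significant = [ln for ln in lines[:-1] if ln.strip(" \t")] + [lines[-1]]
    prefix = os.path.commonprefix([leading_ws(ln) for ln in significant])
    stripped = [
        ln[len(prefix):] if ln.startswith(prefix) else ln.lstrip(" \t")
        for ln in lines
    ]
    if not stripped[-1].strip(" \t"):
        stripped.pop()  # closing """ on its own line: drop that line and its newline
    return "\n".join(stripped)
```

Escape handling and trailing-whitespace trimming are left out here, since they are separate steps.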

Newline Removal

The newline introduced after the opening """ is discarded. Similarly, if the closing """ is placed on its own line, that newline will also be discarded.

As such, the following are equivalent:

key1: """
foo
"""
key2: """
foo """

If a leading or trailing newline is desired, add an additional newline:

key1: """

foo

"""

Trailing Whitespace Removal

Any whitespace at the end of a line will be removed, unless it is in the form of escaped whitespace characters. If the trailing whitespace is desired, a \p can be placed after it, which will trigger the preservation of all preceding whitespace. A \ would have felt more natural for me, but due to its use for bash line continuation, a very different behavior, I didn’t use it.11

An example:

key: """
foo     \p
bar  \p  
"""

Would be translated to the string foo     \nbar  , preserving the whitespace before, but not after, the \p.
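That behavior can be sketched per line in Python. The names are hypothetical, and the sketch assumes \p only matters as the last non-whitespace thing on a line, since the post does not describe what a \p in the middle of a line would do.

```python
def trim_line(line):
    """Drop trailing spaces/tabs; a final unescaped \\p keeps what precedes it."""
    trimmed = line.rstrip(" \t")
    if trimmed.endswith("p"):
        # Count the backslashes right before the 'p'; an odd count means \p.
        backslashes = len(trimmed) - 1 - len(trimmed[:-1].rstrip("\\"))
        if backslashes % 2 == 1:
            return trimmed[:-2]  # remove the marker, keep the whitespace before it
    return trimmed


def trim_trailing_whitespace(text):
    """Apply trim_line to every line of a multi-line string body."""
    return "\n".join(trim_line(line) for line in text.split("\n"))
```

Running this before escape resolution means an escaped \t at the end of a line is still just a backslash and a t at that point, so it is not trimmed.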

Arrays

[Railroad diagram: an array body is a newline-separated sequence of values and comments.]

Arrays have the whitespace-delineated type token [. They are terminated by a ], which must be on its own line without any other non-whitespace characters. Putting any non-whitespace characters on the same line as the opening [ will cause the entire value to be discarded. They may contain comments before, between, or after the values (although not on the lines with the opening [ or the closing ]).

Arrays contain a series of newline-separated values. Values contain the type token and the type representation. For example:

key: [
    s string one
    # Comment
    "string two"
    i 5
    f 2.5
    b false
]

Arrays can contain any values, including other arrays or dictionaries:

key: [
    [
        s foo
        s bar
    ]
    {
        key1: b true
        key2: b false
    }
]

Dictionaries

[Railroad diagram: a dictionary body is a newline-separated sequence of entries and comments.]

Dictionaries have the whitespace-delineated type token {. They are terminated by a }, which must be on its own line without any other non-whitespace characters. Putting any non-whitespace characters on the same line as the opening { will cause the entire value to be discarded. They may contain comments before, between, or after the values (although not on the lines with the opening { or the closing }).

Dictionaries contain a series of newline-separated entries. Entries contain the key, a type token, and a type representation. The file as a whole can be thought of as an implicit dictionary.

An example dictionary:

key: {
    s1: s foo
    # Comment
    s2: b false
}

Dictionaries can contain any values, including other dictionaries or arrays:

key: {
    arr1: [
        s foo
        s bar
    ]
    dict1: {
        key1: b true
        key2: b false
    }
}
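Nested arrays and dictionaries lend themselves to a small recursive, line-based reader. The Python sketch below is heavily simplified and is not the real parser: the function names are hypothetical, keys are assumed to be unquoted, quoted strings are taken verbatim without escape handling, multi-line strings are not supported, and skipping blank lines is an assumption.

```python
def parse_value(text, lines, pos):
    """Parse one value; return (value, index of the next unread line)."""
    text = text.strip()
    if text == "[":
        return parse_array(lines, pos)
    if text == "{":
        return parse_dict(lines, pos, closer="}")
    if text.startswith('"'):
        return text[1:-1], pos  # simplification: no escape handling
    token, _, payload = text.partition(" ")
    payload = payload.strip()
    if token == "i":
        return int(payload), pos
    if token == "f":
        return float(payload), pos
    if token == "b":
        return payload.lower() == "true", pos
    if token == "s":
        return payload, pos
    raise ValueError("unknown type token: " + token)


def parse_array(lines, pos):
    """Read values until a line holding only ']'."""
    items = []
    while pos < len(lines):
        line = lines[pos].strip()
        pos += 1
        if line == "]":
            return items, pos
        if not line or line.startswith("#"):
            continue  # comments (and, by assumption, blank lines) are skipped
        value, pos = parse_value(line, lines, pos)
        items.append(value)
    raise ValueError("array was never closed")


def parse_dict(lines, pos=0, closer=None):
    """Read entries until `closer`; with closer=None, read the whole file."""
    entries = {}
    while pos < len(lines):
        line = lines[pos].strip()
        pos += 1
        if closer is not None and line == closer:
            return entries, pos
        if not line or line.startswith("#"):
            continue
        key, _, rest = line.partition(":")  # simplification: unquoted keys only
        value, pos = parse_value(rest, lines, pos)
        entries[key.strip()] = value
    if closer is not None:
        raise ValueError("dictionary was never closed")
    return entries, pos
```

Calling `parse_dict(text.split("\n"))` on a whole file treats the top level as the implicit dictionary described above.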

Comments

[Railroad diagram: a comment is a # followed by any characters except a newline.]

Comments begin with #, like so:

# Valid comment position.
key1: [
    "foo"
    # Valid comment position.
    "bar"
]
key2: {
    key2a: i 5
    # Valid comment position.
    key2b: f 3.5
}

They can be placed before, in between, and after keys at both the top level and inside of dictionaries. They can also be placed before, in between, and after values inside of arrays. They cannot be placed on the same line as an array or dictionary opening or closing bracket, nor can they be placed on the same line as an opening or closing """.

Should You Use This?

No.

I didn’t think very much about the structure of this language. It almost certainly has one or two critical flaws that make it a bad idea to use. The serializer and deserializer also almost certainly have bugs that will munge your data. It would need more tests and more eyes on it, neither of which I am motivated to provide.

Can You Use It?

Sure, why not, as long as it’s not for anything important.

Possible Future Improvements

Typed Arrays/Dictionaries

Arrays and dictionaries could potentially declare types up front, eliminating the need for a type in front of every data value. Sample presentation:

dictKey: { s
    key1: this is a string
    key2: this is another string
    key3: string3!
}

arrayKey: [ i
    1
    5
    3
]

The advantage of that approach is that the serialization library could guarantee correct typing for you, filter out all the values that don’t match the specified type, and reduce the amount of boilerplate when entering lots of values.

The downsides I can see:

  • The type is less visible and the format is a bit strange, so it would obscure types for people who aren’t familiar with the format.
  • If I were to introduce types that required only a type token and no value, like the null type below, it would not naturally fit in this format. To work around it, I would likely need to have type-only values accept a value to distinguish it from an empty line. On the other hand, why have an array full of nulls?

Null Type

What it says on the tin. A way to represent null.

It’d probably look like:

nullKey: null

Something to think about: How would a type token that isn’t followed by any data work with the typed arrays/dictionaries above? It could be that it would optionally accept any newline-terminated string.

Data Blob Type

Currently, there’s no way to just dump a blob of bytes in here without doing some sort of string encoding ahead of time. For an INI-style format that’s designed to be edited by users, that makes sense. If I wanted it to be more useful in terms of serializing data, it could be a good addition.

I’ve given a little bit of thought to it, and I think this is how I’d format it (type key still up in the air):

blobKey: b[] 5 sjdkf

The number immediately following the type key would signal the length of the data blob in bytes. After exactly one space, the parser would read that many bytes. If the declared length is longer than the remainder of the file, it would skip the key instead.
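Since this type does not exist yet, the following Python sketch is purely hypothetical. It assumes the parser hands over the raw bytes that follow the type key and its space, and `parse_blob` is an invented name.

```python
def parse_blob(payload):
    """Read '<length> <bytes...>'; return (blob, remaining bytes) or None to skip."""
    digits, sep, rest = payload.partition(b" ")
    if not sep or not digits.isdigit():
        return None  # malformed length field
    length = int(digits)
    if length > len(rest):
        return None  # declared length runs past the end of the file: skip the key
    return rest[:length], rest[length:]
```

Reading by length rather than by delimiter is what lets the blob contain newlines or quotes, and it is also why a corrupted length would derail everything after it.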

Advantages:

  • Easy to store arbitrary data.

Disadvantages:

  • Makes files less readable.
  • Corruption can quickly result in confusing behavior. (This is also true of arrays, dictionaries, and multi-line strings)

Conclusion

Implementing this was a fun experiment in parsing, although the code ended up being a mess. I don’t know if I’ll end up using this in any of my games, but it was a fun exercise. I’d like to configure some sort of language file to get syntax highlighting in VSCode (or maybe even in Hugo?), but that’s a project for another day.

Appendix

EBNF (Kind of)

These are the rules that I used (along with this site) to generate the railroad diagrams above.

file ::= ((entry | comment) ( '\n' (entry | comment))*)?
comment ::= '#' 'Any character except a newline'*
entry ::= key ':' value

key ::= (UQKString | QKString)
UQKString ::= (('Any character except :, ", or \n')('Any character except : or \n')*)?
QKString ::= '"' (('Any character except \ or \n')|('\' ('\'|'"'|'b'|'f'|'n'|'r'|'t'|'v'))*)* '"'

value ::= ('b' bool | 'i' integer | 'f' float | '[' '\n' array '\n' ']' | '{' '\n' dictionary '\n' '}' | 's' string |  QString | MLString)
bool ::= 'true' | 'false'
integer ::= ('+'|'-')?('0-9')+
float ::= ('+'|'-')? ('0-9')* '.'? ('0-9')+ ('e' ('+'|'-')?('0-9')+)?
array ::= ((value | comment) ( '\n' (value | comment))*)?
dictionary ::= ((entry | comment) ( '\n' (entry | comment))*)?

string ::= UQString | QString | MLString
UQString ::= (('Any character except " or \n')('Any character except \n')*)?
QString ::= '"' (('Any character except \ or \n')|('\' ('\'|'"'|'b'|'f'|'n'|'r'|'t'|'v'))*)* ('"'|'\n')
MLString ::= '"""' '\n' (('Any sequence except \ or """')|('\' ('\'|'"'|'b'|'f'|'n'|'r'|'t'|'v'|'p'))*)* '"""'

  1. Well, shortish.

  2. Although this implementation is not that, due to the lax escape parsing.

  3. The goal is to have this as a short, readable key. Most people won’t use escape characters in serialized keys, so making people pay the cost of escaping a " or \ didn’t seem worth it.

  4. That isn’t true, as of my proofreading pass on this before publishing it. I use it for persisting options in These Endless Plains.

  5. I’ll go into more detail about this in the string type section.

  6. I don’t think it’ll happen at all, in fact.

  7. Or five, depending on how you’re counting.

  8. If it were allowed, i 5 could map to either the integer 5 or the string i 5.

  9. I read the Java text block justification after I was done formulating how I wanted to do it, and they came to the same reasoning for """ requiring a new line. See here for the explanation. Interestingly, the reason why is not explained in the RFC.

  10. See here for the spec.

  11. It also, perhaps, would have felt unintuitive due to the implication that it’s “escaping” the following newline.