This week I learned… Protocol buffers

Soren Hansen
9 min read · Apr 8, 2021


This is the first in a miniseries of articles that will culminate in an introduction to gRPC.

Protocol Buffers (or “protobuf”) was developed at Google to store and exchange data more efficiently than, say, XML.

Protobuf is a way to serialize data. It lets you take your native, complex data types and turn them into a binary format dubbed “messages”.

Protobuf messages have interesting properties:

  • They are backward compatible.
  • They are forward compatible.
  • They are compact (though not compressed).

Unlike XML they are not self-describing. To convert between native data types and protobuf messages you need the protocol specification. Here is an example specification:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

We start by defining our message type, Person. The first two fields of a Person are name and id.

name is a string and id is a 32 bit integer, and they are both required.

required means exactly what it says: Without those fields, the message will be rejected by the protobuf implementation as invalid. The = 1 and = 2 define the fields' ids inside the protobuf messages. As I mentioned above, protobuf messages are not self-describing. If you look at the message in a hex editor, you'll see the values, but not the strings name or id. In XML you would say something like:

<Person>
<name>John Doe</name>
<id>1234</id>
</Person>

Next, we have email. It is an optional field which means exactly what it says.

Then we define an enum. It's analogous to the enum concept provided by many common programming languages. It provides two things:

  • A mapping from a readable string (MOBILE, HOME, WORK) to integer values (0, 1, and 2, respectively).
  • A way to define a narrow set of acceptable integer values for a variable.

Then we have PhoneNumber, a nested message type. It has its own, independent set of fields; their names and ids are allowed to overlap with those of its parent. Notice that the type field has the PhoneType type, which is our enum type. It's set as optional, but has a default as well. So when you load a PhoneNumber, if type isn't set, the protobuf implementation will set it to HOME (the enum value 1).

Finally, we have the phone field of the Person message type. Its type is PhoneNumber which we just defined and it's set as repeated. This means there can be any number (including 0) of them. It is an ordered list.

Types

These are the scalar data types available to you:

  .proto type   Java type    Python 3 type   Go type
------------- ------------ --------------- ----------
double        double       float           *float64
float         float        float           *float32
int32         int          int             *int32
int64         long         int             *int64
uint32        int          int             *uint32
uint64        long         int             *uint64
sint32        int          int             *int32
sint64        long         int             *int64
fixed32       int          int             *uint32
fixed64       long         int             *uint64
sfixed32      int          int             *int32
sfixed64      long         int             *int64
bool          boolean      bool            *bool
string        String       str             *string
bytes         ByteString   bytes           []byte

All of the int, uint, and sint types use variable length encoding.

Base 128 Varints

The variable-length encoding used in protobuf is based on “varints”. Varints let you encode integers into one or more bytes. The larger the value, the more bytes.

A varint consists of a sequence of bytes where each byte, except for the final one, has its most significant bit (msb) set. The unset msb indicates that we’re at the end of the byte sequence.

Because we’re using the msb to indicate continuation, each byte carries 7 bits of data. The first byte in the sequence holds the LEAST significant set of bits, very much how little-endian systems represent multi-byte integers.

Numbers 0–127 are easy. They’re encoded as 0000 0000, 0000 0001, ..., 0111 1111.

Let’s try something more complicated. The decimal number 4960 in binary is:

1001101100000

That’s more than 7 bits, so we split it into chunks of 7 bits each and pad the top chunk with zeros:

1001101100000 ->
0100110 1100000

The LEAST significant group goes first:

0100110 1100000  ->  1100000 0100110

And then we add the leading continuation bit:

11100000 00100110 = e0 26

And we’re done.
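The whole procedure fits in a few lines. Here is a minimal Python sketch of a varint encoder (the function name is my own, not part of any protobuf library), reproducing the worked example above:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        chunk = n & 0x7F           # grab the lowest 7 bits
        n >>= 7
        if n:                      # more bits left: set the continuation msb
            out.append(chunk | 0x80)
        else:                      # final byte: msb stays clear
            out.append(chunk)
            return bytes(out)

print(encode_varint(4960).hex())  # e026
print(encode_varint(127).hex())   # 7f — values 0-127 fit in one byte
```

Note that the least significant 7-bit chunk is emitted first, matching the little-endian-style ordering described above.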

But what about negative numbers?

You may know that negative numbers are usually stored using a mechanism called two’s complement. E.g. a single-byte signed integer type can represent values from -128 to 127. The msb indicates “signedness”: if it’s 0, the value is positive; if it’s 1, the value is negative, and its magnitude is found by inverting the remaining bits and adding 1.

This is why integers sometimes “wrap around” when they grow out of bounds. Take 127 + 1: 0111 1111 + 1 = 1000 0000. The msb is set, which tells us it’s a negative number. The remaining bits (_000 0000) are inverted (_111 1111), we add one (1000 0000 = 128), and we get the final result of -128.

For larger integer types, it works the same way. For a 16 bit integer, the msb can be used to indicate that it’s a negative number, the remaining 15 bits store the value, etc.
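Python integers are arbitrary-precision, so to watch the wrap-around happen you have to mask down to a fixed width yourself. A small sketch (the helper name is mine):

```python
def to_int8(u: int) -> int:
    """Interpret the low 8 bits of u as a two's-complement signed byte."""
    u &= 0xFF
    return u - 256 if u & 0x80 else u

print(to_int8(127 + 1))  # -128: the classic wrap-around
print(to_int8(255))      # -1: all bits set
```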

Remember how these data types are variable length? Encoding 1 as an int64 will take up just a single byte, because we rely on the continuation bit to tell us if there are more bits set. The 64 bit part relates mostly to how the translation between protobuf and your native data types happens. But negative numbers have to have the top bit set. For a 64 bit integer, that's the 64th bit, so even numerically small, negative values will take up all 64 bits. 64 bits split into 7-bit chunks is 10 chunks, i.e. ten bytes on the wire. That's pretty inefficient, so steer clear of int32 and int64 if you're at all likely to have to encode (small) negative numbers.
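You can check the ten-byte claim with a quick sketch: view the negative number through a 64-bit two's-complement mask, then count the 7-bit chunks (the names here are mine):

```python
MASK64 = (1 << 64) - 1  # two's-complement view: -1 -> 0xFFFF...FF

def varint_size(u: int) -> int:
    """Number of bytes a non-negative integer occupies as a varint."""
    size = 1
    while u > 0x7F:
        u >>= 7
        size += 1
    return size

print(varint_size(1))            # 1
print(varint_size(-1 & MASK64))  # 10
```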

The sint data types are different. They use a varint encoding called ZigZag encoding. It ensures that small absolute values yield small encoded values, too. It works like this:

  Encoded value   Original value
--------------- ----------------
  0000 0000          0
  0000 0001         -1
  0000 0010          1
  0000 0011         -2
  0000 0100          2
  0000 0101         -3
  0000 0110          3
  ....             ....
  0111 1110         63
  0111 1111        -64

Setting the top most bit means we’re exceeding the 7 bit chunk size of the varints, so the next few lines are:

  Encoded value         Original value
--------------------- ----------------
  1000 0000 0000 0001     64
  1000 0001 0000 0001    -65
  1000 0010 0000 0001     65
  1000 0011 0000 0001    -66

Remember, least significant chunk goes first.

The ZigZag name is because the encoding zig-zags between positive and negative values.
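ZigZag encoding is just a shift and an XOR. A sketch for 64-bit values (the function names are mine; Python's arithmetic right shift stands in for the sign extension a fixed-width language would do):

```python
def zigzag_encode(n: int) -> int:
    """Map signed to unsigned: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)

print(zigzag_encode(-64))  # 127, i.e. 0111 1111 as in the table above
print(zigzag_encode(64))   # 128, spilling into a second varint byte

# round-trips for the values in the tables
for n in (0, -1, 1, -2, 2, 63, -64, 64, -65, 65, -66):
    assert zigzag_decode(zigzag_encode(n)) == n
```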

Fixed length encodings

fixed32, fixed64 are unsigned, fixed-width integers.

sfixed32 and sfixed64 are signed.

fixed32 and sfixed32 are always encoded as 32 bits.

fixed64 and sfixed64 are always encoded as 64 bits.

These types skip the varint encoding/decoding altogether. That is especially handy if you know you'll often need very large values: a value near the top of the 64-bit range takes ten bytes as a varint but only eight as a fixed64, and fixed-width values are cheaper to encode and decode.
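A sketch comparing the two encodings of a large value (the helper name is mine; `<Q` packs an unsigned 64-bit little-endian integer, which is how protobuf lays out fixed64):

```python
import struct

def varint_size(u: int) -> int:
    """Number of bytes a non-negative integer occupies as a varint."""
    size = 1
    while u > 0x7F:
        u >>= 7
        size += 1
    return size

big = 2**60
print(len(struct.pack("<Q", big)))  # 8: fixed64 is always 8 bytes
print(varint_size(big))             # 9: a varint needs ceil(61/7) = 9 bytes here
```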

Complex data types

We already saw that protobuf provides support for lists by way of the repeated keyword. What about maps?

Protobuf provides that, too! At least sort of.

You can define a map in your .proto file like so:

map<key_type, value_type> map_field = ID;

Remember, though, that protobuf is a serialization format. It’s not involved in adding or removing elements from the map, nor does it provide lookup mechanisms. Under the covers, this means that a map is nothing more than “syntactic sugar”. The above specification is encoded exactly like this:

message MapFieldEntry {
  optional key_type key = 1;
  optional value_type value = 2;
}

repeated MapFieldEntry map_field = ID;

This means protobuf implementations without map support or programming languages without a corresponding concept can still support your data.

Using “foreign” types

In our initial Person example, we saw how nested types worked. You can also use types defined elsewhere by importing other .proto files. At the top of your .proto file, you can add:

import "src/protos/other.proto";

The import process is not recursive. If A.proto imports B.proto which in turn imports C.proto, any type defined in C.proto will not be available in A.proto. This avoids namespace collisions, but occasionally you might want B.proto to "export" types from C.proto. This happens if, e.g., parts of B.proto got split off into C.proto and you want users of B.proto not to have to worry about that fact. In this case, you can have B.proto say:

import public "src/protos/C.proto";

This means that any types imported from C.proto will be included when someone includes B.proto.

Well-known Types

In addition to the basic types provided by protobuf, there are a set of so-called “well-known” types. They’re not first class data types in protobuf, but they’re very commonly needed in information interchange, so they’re provided as a convenience. The most interesting one is a timestamp type. The programming language implementations handle the conversions between the protobuf type and the relevant types for the language.

Compatibility

Remember how I said these messages are backward and forward compatible? Let’s dig into that a little bit, because there are a few caveats.

Suppose you want to add an extra element to our Person message type, a date of birth.

import "google/protobuf/timestamp.proto";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
  optional google.protobuf.Timestamp date_of_birth = 5;
}

An application that has the old .proto file will parse the message, find a field with id = 5 that isn't in its protocol definition, and simply discard it. Conversely, if that application serializes a new Person message, it will be without that field. Since the field is optional, any new application will still parse the message successfully.

Perhaps you can guess one of the main caveats: What happens if date_of_birth was required? Well, then you lose your forward compatibility. The lesson is: Use required sparingly. Required is forever.

Extensibility

You can allow for your message types to be extended. You do this by reserving a number of field id’s for extensions. For example:

// base.proto:
message Foo {
[...]
extensions 100 to 199;
}

This says that id’s from 100 to 199 are reserved for extensions.

In another .proto file, you can import the original .proto and add fields:

// extension.proto:
import "base.proto";
extend Foo {
[...]
optional int32 bar = 126;
}

This adds an optional field named bar of type int32 with id 126.

Note that the way you use these fields in application code is different from how regular fields work. See your language reference for more information: Java, Python, Go.

Encoding

Finally, I want to cover some brief details about how messages are serialized.

Protobuf messages are a series of key/value pairs. When encoded, the keys are the ids assigned in the .proto file; the names only exist in the .proto file and your application code. The values are the fields' data values.

The keys and values are encoded and concatenated to form the message. To allow for backwards compatibility, parsers need to be able to skip over fields they don’t know, so parsing can’t rely solely on the .proto file. To enable this, the key actually carries two values: the id specified in the .proto file and the "wire" type.

The wire types are:

 Type ID   Meaning            Used for
--------- ------------------ --------------------------------------
 0         Varint             int32, int64, uint32, uint64,
                              sint32, sint64, bool, enum
 1         64-bit             fixed64, sfixed64, double
 2         Length-delimited   string, bytes, embedded messages,
                              packed repeated fields
 3         Start group        groups (deprecated)
 4         End group          groups (deprecated)
 5         32-bit             fixed32, sfixed32, float

The key in the stream is a varint with the lowest 3 bits used to indicate the wire type, and the field id in the remaining bits. E.g. a string field with id = 5 and value “Hello, Mastercard!” would be encoded like so:

String: "Hello, Mastercard!"
Wire type: 2
Wire type as a 3-bit binary number: 010
ID: 5
ID in binary: 101
Key becomes: 101 010
This is only 6 bits, so as an encoded varint it fits in a single byte:
0010 1010 = 0x2a

Being a length-delimited field means it starts with a (varint encoded) length and then the specified number of bytes. Strings are encoded in UTF-8.

String: "Hello, Mastercard!"
Length: 18
Length in binary: 10010
Length as varint encoded binary: 0001 0010 = 0x12
String in hex:
48 65 6c 6c 6f 2c 20 4d 61 73 74 65 72 63 61 72 64 21
H  e  l  l  o  ,     M  a  s  t  e  r  c  a  r  d  !
All put together:
2a 12 48 65 6c 6c 6f 2c 20 4d 61 73 74 65 72 63 61 72 64 21
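The whole field can be produced with a couple of helper functions. Note that the key works out to (5 << 3) | 2 = 0x2a. A Python sketch (function names are mine, not a protobuf API):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        chunk = n & 0x7F
        n >>= 7
        if n:
            out.append(chunk | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(chunk)
            return bytes(out)

def encode_string_field(field_id: int, value: str) -> bytes:
    """Encode one string field: key (id + wire type 2), varint length, UTF-8 bytes."""
    key = (field_id << 3) | 2          # wire type 2 = length-delimited
    data = value.encode("utf-8")
    return encode_varint(key) + encode_varint(len(data)) + data

print(encode_string_field(5, "Hello, Mastercard!").hex())
# 2a1248656c6c6f2c204d61737465726361726421
```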

You may wonder about an envelope or some sort of delimiter between messages. There isn’t one. It is your job to do something that makes sense for your particular application to determine where one message ends and the next one starts.
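One common convention for framing is a varint length prefix before each message (Java's protobuf implementation offers this via writeDelimitedTo/parseDelimitedFrom). A self-contained sketch of that convention, with function names of my own choosing:

```python
import io

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        chunk = n & 0x7F
        n >>= 7
        if n:
            out.append(chunk | 0x80)
        else:
            out.append(chunk)
            return bytes(out)

def decode_varint(stream) -> int:
    """Read one varint from a binary stream."""
    result = shift = 0
    while True:
        b = stream.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result
        shift += 7

def write_delimited(stream, payload: bytes) -> None:
    """Write one message, prefixed by its varint-encoded length."""
    stream.write(encode_varint(len(payload)) + payload)

def read_delimited(stream) -> bytes:
    """Read one length-prefixed message."""
    return stream.read(decode_varint(stream))

buf = io.BytesIO()
write_delimited(buf, b"first message")
write_delimited(buf, b"second")
buf.seek(0)
print(read_delimited(buf))  # b'first message'
print(read_delimited(buf))  # b'second'
```

The payloads here would be serialized protobuf messages in practice; the framing itself is entirely up to your application.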

Wrapping up

That’s it! I hope you learned something today and you enjoy the format of the post.
