This week I learned… Protocol buffers
This is the first in a miniseries of articles that will culminate in an introduction to gRPC.
Protocol Buffers (or “protobuf”) was developed at Google to store and exchange data more efficiently than, say, XML.
Protobuf is a way to serialize data. It lets you take your native, complex data types and turn them into a binary format dubbed “messages”.
Protobuf messages have interesting properties:
- They are backward compatible.
- They are forward compatible.
- They are compact (though not compressed).
Unlike XML they are not self-describing. To convert between native data types and protobuf messages you need the protocol specification. Here is an example specification:
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}
We start by defining our message type, Person. The first two fields of a Person are name and id. name is a string and id is a 32-bit integer, and they are both required.

required means exactly what it says: without those fields, the message will be rejected by the protobuf implementation as invalid. The = 1 and = 2 define the field ids inside the protobuf messages. As I mentioned above, protobuf messages are not self-describing. If you look at the message in a hex editor, you'll see the values, but not the strings name or id. In XML you would say something like:
<Person>
<name>John Doe</name>
<id>1234</id>
</Person>
Next, we have email. It is an optional field, which means exactly what it says.

Then we define an enum. It's analogous to the enum concept provided by many common programming languages. It provides two things:

- A mapping from a readable string (MOBILE, HOME, WORK) to integer values (0, 1, and 2, respectively).
- A way to define a narrow set of acceptable integer values for a variable.
Then we have PhoneNumber, a nested message type. It has its own, independent set of fields. Their names and ids are allowed to overlap with those of its parent. Notice that the type field has the PhoneType type, which is our enum type. It's set as optional, but has a default set as well. So when you load a PhoneNumber, if type isn't set, the protobuf implementation will set it to 1 (the value corresponding to HOME).

Finally, we have the phone field of the Person message type. Its type is PhoneNumber, which we just defined, and it's set as repeated. This means there can be any number (including 0) of them. It is an ordered list.
Types
These are the scalar data types available to you:
.proto type   Java type    Python 3 type   Go type
------------- ------------ --------------- ----------
double        double       float           *float64
float         float        float           *float32
int32         int          int             *int32
int64         long         int             *int64
uint32        int          int             *uint32
uint64        long         int             *uint64
sint32        int          int             *int32
sint64        long         int             *int64
fixed32       int          int             *uint32
fixed64       long         int             *uint64
sfixed32      int          int             *int32
sfixed64      long         int             *int64
bool          boolean      bool            *bool
string        String       str             *string
bytes         ByteString   bytes           []byte
All of the int, uint, and sint types use variable-length encoding.
Base 128 Varints
The variable-length encoding used in protobuf is based on “varints”. Varints let you encode integers into one or more bytes: the larger the value, the more bytes.
A varint consists of a sequence of bytes where each byte, except for the final one, has its most significant bit (msb) set. The unset msb indicates that we’re at the end of the byte sequence.
Because we’re using the msb to indicate continuation, each byte carries 7 bits of data. The first byte in the sequence holds the LEAST significant set of bits, very much like how little-endian systems represent multi-byte integers.
Numbers 0–127 are easy. They’re encoded as 0000 0000, 0000 0001, ..., 0111 1111.
Let’s try something more complicated. The decimal number 4960 in binary is:
1001101100000
That’s more than 7 bits, so we split it into chunks of 7 bits each and pad the top chunk with zeros:
1001101100000 ->
0100110 1100000
The LEAST significant group goes first:
0100110 1100000 -> 1100000 0100110
And then we add the leading continuation bit:
11100000 00100110 = e0 26
And we’re done.
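The chunk-and-flag procedure above can be sketched in a few lines of Python (a minimal illustration, not code from any protobuf library):

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        chunk = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(chunk | 0x80)  # more chunks follow: set the msb
        else:
            out.append(chunk)         # final chunk: msb stays clear
            return bytes(out)

print(encode_varint(4960).hex())      # e026, matching the worked example
```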
But what about negative numbers?
You may know that negative numbers are usually stored using a mechanism called two’s complement. E.g. a single-byte signed integer type can represent values from -128 to 127. The most significant bit (msb) indicates “signedness”: if it’s 0, the value is positive; if it’s 1, the value is negative. The specific value is found by inverting the remaining bits and adding 1.
This is why integers sometimes “wrap around” when they grow out of bounds. 127 + 1: 0111 1111 + 1 = 1000 0000. The msb is set, which tells us it’s a negative number. The remaining bits (_000 0000) are inverted (_111 1111), we add one (1000 0000 = 128), and we get the final result of -128.
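You can watch this reinterpretation happen with Python's struct module: pack the bit pattern 1000 0000 as an unsigned byte, then read the very same byte back as a signed one.

```python
import struct

raw = struct.pack('B', 0b1000_0000)   # one byte, unsigned value 128
(signed,) = struct.unpack('b', raw)   # the same byte read as signed int8
print(signed)                         # -128
```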
For larger integer types, it works the same way. For a 16-bit integer, the msb can be used to indicate that it’s a negative number, and the remaining 15 bits store the absolute value, etc.
Remember how these data types are variable length? Encoding 1 as an int64 will take up just a single byte, because we rely on the continuation bit to tell us if there are more bits set. The 64-bit part relates mostly to how the translation between protobuf and your native data types happens. But negative numbers have to have the top bit set. For a 64-bit integer, that's the 64th bit, so even numerically small, negative values will take up 64 bits. 64 bits split into 7-bit chunks is 10 chunks. That's pretty inefficient, so steer clear of int32 and int64 if you're at all likely to have to encode (small) negative numbers.
The sint data types are different. They use a varint encoding called ZigZag encoding. It ensures that small absolute values yield small encoded values, too. It works like this:
Encoded value Original value
--------------- ----------------
0000 0000 0
0000 0001 -1
0000 0010 1
0000 0011 -2
0000 0100 2
0000 0101 -3
0000 0110 3
.... ....
0111 1110 63
0111 1111 -64
Setting the topmost bit means we’re exceeding the 7-bit chunk size of the varints, so the next few lines are:
Encoded value Original value
--------------------- ----------------
1000 0000 0000 0001 64
1000 0001 0000 0001 -65
1000 0010 0000 0001 65
1000 0011 0000 0001 -66
Remember, least significant chunk goes first.
The encoding is called ZigZag because it zig-zags between positive and negative values.
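The mapping in the tables above can be sketched in Python. This is the 64-bit form; Python's arbitrary-precision integers make the arithmetic right shift of a negative number behave like the sign extension the encoding relies on.

```python
def zigzag_encode(n: int) -> int:
    # Interleave: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    # Undo the interleaving: low bit selects the sign.
    return (z >> 1) ^ -(z & 1)

print([zigzag_encode(n) for n in (0, -1, 1, 63, -64, 64)])
# [0, 1, 2, 126, 127, 128]
```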
Fixed length encodings
fixed32 and fixed64 are unsigned, fixed-width integers; sfixed32 and sfixed64 are signed. fixed32 and sfixed32 are always encoded as 32 bits; fixed64 and sfixed64 are always encoded as 64 bits.
You avoid the varint encoding/decoding, which is especially handy if you know you'll often need very large values, as those are encoded and decoded more efficiently this way.
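On the wire, the fixed-width types are plain little-endian integers, so Python's struct module mirrors them exactly (a sketch for illustration, not protobuf library code):

```python
import struct

# fixed32: always exactly four bytes, least significant byte first.
print(struct.pack('<I', 1).hex())      # 01000000

# fixed64: always exactly eight bytes.
print(struct.pack('<Q', 4960).hex())   # 6013000000000000
```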
Complex data types
We already saw that protobuf provides support for lists by way of the repeated keyword. What about maps?
Protobuf provides that, too! At least sort of.
You can define a map in your .proto file like so:
map<key_type, value_type> map_field = ID;
Remember, though, that protobuf is a serialization format. It’s not involved in adding or removing elements from the map, nor does it provide lookup mechanisms. Under the covers, this means that a map is nothing more than “syntactic sugar”. The above specification is encoded exactly like this:
message MapFieldEntry {
  optional key_type key = 1;
  optional value_type value = 2;
}

repeated MapFieldEntry map_field = ID;
This means protobuf implementations without map support or programming languages without a corresponding concept can still support your data.
Using “foreign” types
In our initial Person example, we saw how nested types worked. You can also use types defined elsewhere by importing other .proto files. At the top of your .proto file, you can add:
import "src/protos/other.proto";
The import process is not recursive. If A.proto imports B.proto, which in turn imports C.proto, any type defined in C.proto will not be available in A.proto. This avoids namespace collisions, but occasionally you might want B.proto to "export" types from C.proto. This happens if e.g. parts of B.proto got split off into C.proto and you want users of B.proto to not have to worry about this fact. In this case, you can have B.proto say:
import public "src/protos/C.proto";
This means that any types imported from C.proto will be included when someone includes B.proto.
Well-known Types
In addition to the basic types provided by protobuf, there is a set of so-called “well-known” types. They’re not first-class data types in protobuf, but they’re very commonly needed in information interchange, so they’re provided as a convenience. The most interesting one is a timestamp type. The programming language implementations handle the conversions between the protobuf type and the relevant types for the language.
Compatibility
Remember how I said these messages are backward and forward compatible? Let’s dig into that a little bit, because there are a few caveats.
Suppose you want to add an extra element to our Person message type: a date of birth.
import "google/protobuf/timestamp.proto";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
  optional google.protobuf.Timestamp date_of_birth = 5;
}
An application that has the old .proto file will parse the message and find a field with id = 5. It won't know what that field is, because it's not in its protocol definition, so the field will simply be discarded. Conversely, if that application serializes a new Person message, it will be without that field. Since the field is optional, any new application will still parse the message successfully.
Perhaps you can guess one of the main caveats: what happens if date_of_birth was required? Well, then you lose your forward compatibility. The lesson is: use required sparingly. Required is forever.
Extensibility
You can allow for your message types to be extended. You do this by reserving a range of field ids for extensions. For example:
// base.proto:
message Foo {
  [...]
  extensions 100 to 199;
}
This says that id’s from 100 to 199 are reserved for extensions.
In another .proto file, you can import the original .proto and add fields:
// extension.proto:
import "base.proto";

extend Foo {
  [...]
  optional int32 bar = 126;
}
This adds an optional field named bar of type int32 with id 126.
Note that the way you use these fields in application code is different from how regular fields work. See your language reference for more information: Java, Python, Go.
Encoding
Finally, I want to cover some brief details about how messages are serialized.
Protobuf messages are a series of key/value pairs. When encoded, the keys are the ids assigned in the .proto file. The names only exist in the .proto file and your application code. The values are the fields' data values.

The keys and values are encoded and concatenated to form the message. To allow for backwards compatibility, a parser needs to be able to skip over fields it doesn’t know, so the parsing can’t rely on the .proto file. To enable this, the key actually has two values in it: the id specified in the .proto file and the "wire" type.
The wire types are:
Type ID Meaning Used for
--------- ------------------ --------------------------------------
0 Varint int32, int64, uint32, uint64,
sint32, sint64, bool, enum
1 64-bit fixed64, sfixed64, double
2         Length-delimited   string, bytes, embedded messages,
packed repeated fields
3 Start group groups (deprecated)
4 End group groups (deprecated)
5 32-bit fixed32, sfixed32, float
The key in the stream is a varint with the lowest 3 bits used to indicate the wire type, and the field id in the remaining bits. I.e. a string field with id = 9 and value “Hello, Mastercard!” would be encoded like so:
String: "Hello, Mastercard!"
Wire type: 2
Wire type as a 3-digit binary number: 010
ID: 9
ID in binary: 1001
Encoded ID becomes: 1001010
This is only 7 bits, so as an encoded varint, it becomes:
0100 1010 = 0x4a
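Going the other way, a parser splits the key byte back into its two parts (a sketch; a real parser first varint-decodes the key, which here fits in one byte):

```python
key = 0x4A
wire_type = key & 0b111         # lowest three bits
field_number = key >> 3         # everything above them
print(wire_type, field_number)  # 2 9
```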
Being a length-delimited field means it starts with a (varint-encoded) length and then the specified number of bytes. Strings are encoded in UTF-8.
String: "Hello, Mastercard!"
Length: 18
Length in binary: 10010
Length as varint encoded binary: 0001 0010 = 0x12
String in hex:
48 65 6c 6c 6f 2c 20 4d 61 73 74 65 72 63 61 72 64 21
H e l l o , M a s t e r c a r d !
All put together:
4a 12 48 65 6c 6c 6f 2c 20 4d 61 73 74 65 72 63 61 72 64 21
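Putting the key, length, and payload together in Python reproduces the bytes above (again a from-scratch sketch, not the official library):

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 bits per byte, msb flags continuation."""
    out = bytearray()
    while True:
        chunk = value & 0x7F
        value >>= 7
        if value:
            out.append(chunk | 0x80)
        else:
            out.append(chunk)
            return bytes(out)

def encode_string_field(field_number: int, text: str) -> bytes:
    key = (field_number << 3) | 2        # wire type 2: length-delimited
    payload = text.encode('utf-8')
    return encode_varint(key) + encode_varint(len(payload)) + payload

print(encode_string_field(9, "Hello, Mastercard!").hex())
# 4a1248656c6c6f2c204d61737465726361726421
```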
You may wonder about an envelope or some sort of delimiter between messages. There isn’t one. It is your job to determine, in whatever way makes sense for your particular application, where one message ends and the next one starts.
Wrapping up
That’s it! I hope you learned something today and you enjoy the format of the post.