Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Protocol Buffers Handbook

You're reading from   Protocol Buffers Handbook Getting deeper into Protobuf internals and its usage

Arrow left icon
Product type Paperback
Published in Apr 2024
Publisher Packt
ISBN-13 9781805124672
Length 226 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Clément Jean Clément Jean
Author Profile Icon Clément Jean
Clément Jean
Arrow right icon
View More author details
Toc

Table of Contents (13) Chapters Close

Preface 1. Chapter 1: Serialization Primer FREE CHAPTER 2. Chapter 2: Protobuf is a Language 3. Chapter 3: Describing Data with Protobuf Text Format 4. Chapter 4: The Protobuf Compiler 5. Chapter 5: Serialization Internals 6. Chapter 6: Schema Evolution over Time 7. Chapter 7: Implementing the Address Book in Go 8. Chapter 8: Implementing the Address Book in Python 9. Chapter 9: Developing a Protoc Plugin in Golang 10. Chapter 10: Advanced Build 11. Index 12. Other Books You May Enjoy

What about Protobuf?

So far, we have talked only a little bit about Protobuf. In this section, we are going to start explaining the advantages and disadvantages of Protobuf, and we are going to base the analysis on the criteria we saw in the previous section.

Serialized data size

Due to a lot of optimization and its binary format, Protobuf is one of the best formats in the area. This is especially true for serializing numbers. Let us see why this is the case.

The main reason why Protobuf is good at serializing data in a small number of bytes is that it is a binary format. In Protobuf, additional structural elements such as curly braces, colons, and brackets, typically found in formats such as JSON, XML, and YAML, are absent. Instead, Protobuf data is represented simply as raw bytes.

On top of that, we have optimization such as bitpacking and the use of varints, which help a lot. It makes serializing the same data to a smaller number of bytes easier than it would be if we naively serialized the binary representation of the data as it is in memory.

Bitpacking is a method that compresses data into as few bits as possible. Notice that we are talking about bits and not bytes here. You can think of bitpacking as trying to put multiple pieces of data in the same number of bytes. Here is an example of bitpacking in Protobuf:

0a  -> 0101 00000
    -> 010 is 2 (length-delimited type like string)
    -> 1 is the field tag

Do not worry too much about what this data is representing just yet. We are going to see that in Chapter 5, Serialization, when we talk about the internals of serialization. The most important thing to understand is that we have one byte containing two pieces of metadata. At scale, this reduces the payload size quite a lot.

Most of the integers in Protobuf are varints. Varints is short for variable-size integers and they map integers to different numbers of bytes. How this works is that the smaller values will get mapped to a smaller number of bytes and the bigger ones will get translated to a larger number of bytes. Let us see an example (decimal -> byte(s) in hexadecimal):

1 -> 01
128 -> 80 01
16,384 -> 80 80 01
...

As you can see, compared to fixed-size integers (four or eight bytes), it saves space in a lot of cases. However, this also can result in non-efficient compression of the data. We can also store a 32-bit number (normally 4 bytes) in 5 bytes. Here is an example of such a case:

2,147,483,647 (max 32-bit number value) -> ff ff ff ff 07

We have five bytes instead of four if we serialize with a fixed-size integer.

While we did not get into too much detail about the internals of serialization that make the serialized data size smaller, we learned about the two main optimizations that are used in Protobuf. We saw that it uses bitpacking to compress data into smaller amounts of bits. This is what Protobuf does to limit the impact of metadata on serialized data size. Then, we saw that Protobuf also uses varints. They map smaller integers to smaller numbers of bytes. This is how most integers are serialized in Protobuf.

Availability of data

Deserializing data in Protobuf is fast because we are parsing binary instead of text. Now, this is hard to demonstrate because it is highly dependent on the programming language you are using and the data you have. However, a study conducted by Auth0 (https://auth0.com/blog/beating-json-performance-with-protobuf/) showed that Protobuf, in their use case, has better availability than JSON.

What is interesting in this study is that they found that Protobuf messages were available in their JavaScript code in 4% less time than JSON. Now, this might seem like a marginal improvement, but you have to consider the fact that JSON is JavaScript object literal format. This means that by deserializing Protobuf in JavaScript, we have to convert binary to JSON in order to access the data in our code. And even with that conversion overhead, Protobuf manages to be faster than parsing JSON directly.

This shows that, in JavaScript (JS), even with the overhead of having to transform binary to JSON, Protobuf performs better than JSON itself.

Be aware that, for every implementation, you will have different numbers. Some are more optimized than others. However, a lot of implementations are just wrappers around the C implementation, which means you will get quasi-consistent deserialization.

Finally, it is important to mention that serialization and deserialization speeds depend a lot on the underlying data. Some data formats perform better on certain kinds of data. So, if you are considering using any data schema, I urge you to do some benchmarking first. It will prevent you from being surprised.

Readability of data

As you might expect, readability of data is not where Protobuf shines. As the serialized data is in binary, it requires more effort or tools for humans to be able to understand it. Let’s take a look at the following hexadecimal:

0a 07 43 6c 65 6d 65 6e 74 10 64 1a 06 0a 04 4d 61 72 6b 1a 06 0a 04 4a 6f 68 6e

We have no way to guess that the preceding hexadecimal represents a Person object.

Now, while the default way of serializing data is not human-readable, Protobuf provides libraries to serialize data into JSON or the Protobuf text format. We can take the previous data that we had in hexadecimal and get text output similar to that in the following (Protobuf text format) code block:

name: "Clément"
age: 100
friends: {
  name: "Mark"
}
friends: {
  name: "John"
}

Furthermore, this text format can also be used to create binary. This means that we could have a configuration file written in text and still use the data with Protobuf.

So, as we saw, Protobuf serialized data is by default not human-readable. It is binary and it would take some extra effort or more tools to read. However, since Protobuf provides a way to serialize data to its own text format or to JSON, we can still have human-readable serialized data. This is nice for configuration files, test data, and so on.

Type safety

An advantage that schema-backed data formats have is that the schemas are generally defined with types in mind. Protobuf can define objects called messages that contain fields that are themselves typed. We saw a few examples previously in this chapter. This means that we will be able to check at compile time that the values provided for certain fields are correct.

On top of that, since Protobuf has an extensive set of types, there is the possibility to optimize for specific use cases. An example of this is the type for the age field that we saw in Person. In JSON Schema, for example, we would have a type called integer; however, this does not ensure that we do not provide negative numbers. In Protobuf, we could use uint32 (unsigned 32-bit integer) or a uint64 (usigned 64-bit integer) and thus we would, at compile time, ensure that we do not get negative numbers.

Finally, Protobuf also provides the possibility to nest types and reference these types by their fully qualified name. This basically means that the types have scopes. Let us take an example to make things clearer. Let us say that we have the following definitions (note that this is simplified):

message PhoneNumber {
  enum Type {
    Mobile,
    Work,
    Fax
  }
  Type type;
  string number;
}
message Email {
  enum Type {
    Personal,
    Work
  }
  Type type;
  string email;
}

As we can see, we defined two enums called Type. It is necessary because PhoneNumber supports Fax and Mobile but Email does not. Now, here when we are dealing with the Type enum in PhoneNumber, we are going to refer to it as PhoneNumber.Type, and when we deal with the one in Email, we will refer to it as Email.Type. These names are the fully qualified names of the types and they differentiate the two types, which have the same name, by providing a scope.

Now, let us think about what would happen if we had the following definitions:

enum Type {
  Mobile,
  Work,
  Fax,
  Personal
}
message PhoneNumber {
  Type type;
  string number;
}
message Email {
  Type type;
  string email;
}

We could still create valid Email instances and PhoneNumber instances because we have all the types in the Type enum. However, in this case, we could also create invalid Emails and PhoneNumbers. For example, it is possible to create an Email type with the Mobile type. This means that by providing scoping for types, we can have type safety without having to create two enums:

enum PhoneType { /*...*/ }
enum EmailType { /*...*/ }

This is verbose and unnecessary since Protobuf lets us create types that will be called Phone.Type and Email.Type.

To summarize, Protobuf lets us use types in a very explicit and safe way. We have an extensive set of types that we can use and that will let us ensure that our data is correct at compile time. Finally, we saw that by providing nested types referenced by their fully qualified names, we can differentiate types with the same names and ensure that only certain correct values get set to fields.

Readability of schema

Finally, Protobuf has readable schemas because it was designed as self-documenting and as a language. We can find a lot of concepts that we are already familiar with such as type, comments, and imports. All of this makes onboarding new developers and using Protobuf-backed APIs easy.

The first thing worth mentioning is that we can have comments in our schemas to describe what the fields and types are doing. One of the best examples of that is in the proto files provided in Protobuf itself. Let’s look at duration.proto (simplified for brevity) in the following code block:

// A Duration represents a signed, fixed-length span of time represented
// as a count of seconds and fractions of seconds at nanosecond
// resolution. It is independent of any calendar and concepts like "day" or "month".
message Duration {
  // Signed seconds of the span of time. Must be from -315,576,000,000
  // to +315,576,000,000 inclusive. Note: these bounds are computed from:
  // 60 sec/min * 60 min/hr * 24 hr/day * 365.25 days/year * 10000 
  years
  int64 seconds;
  // Signed fractions of a second at nanosecond resolution of the span
  // of time. Durations less than one second are represented with a 0
  // `seconds` field and a positive or negative `nanos` field. For 
  durations
  // of one second or more, a non-zero value for the `nanos` field 
  must be
  // of the same sign as the `seconds` field. Must be from 
  -999,999,999
  // to +999,999,999 inclusive.
  int32 nanos;
}

We can see that we have a comment explaining the purpose of Duration and two others explaining what the fields represent, what their possible values are, and so on. This is important because it helps users use the types properly and it can let us make assumptions.

Next, we can import some other schema files to reuse types. This helps a lot, especially when the project becomes bigger. Instead of having to duplicate code or write everything in one file, we can separate by concern and reuse definitions in multiple places. Furthermore, since imports use the path of the schema file to access the definition in it, we can look at the definition. Now, this requires knowing a little bit more about the compiler and its options, but other than that, we can access the code and start poking around.

Finally, as we mentioned earlier, Protobuf has explicit typing. By looking at a field, we know exactly what kind of value we need to set and what the boundaries are (if any). Furthermore, after learning a little bit about the internals of serialization in Protobuf, you will be able to optimize the types that you use in a granular way.

You have been reading a chapter from
Protocol Buffers Handbook
Published in: Apr 2024
Publisher: Packt
ISBN-13: 9781805124672
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image