class: title, smokescreen, shelf, bottom, no-footer
background-image: url(images/protobuf.png)

# 181U Spring 2020
## Message Serialization

<style>
h1 {
  border-bottom: 8px solid rgb(32,67,143);
  border-radius: 2px;
  width: 90%;
}
.smokescreen h1 { border-bottom: none; }
.small {font-size: 80%}
.smaller {font-size: 70%}
.small-code.remark-slide-content.compact code {font-size:1.0rem}
.very-small-code.remark-slide-content.compact code {font-size:0.9rem}
.line-numbers{
  /* Set "line-numbers-counter" to 0 */
  counter-reset: line-numbers-counter;
}
.line-numbers .remark-code-line::before {
  /* Increment "line-numbers-counter" by 1 */
  counter-increment: line-numbers-counter;
  content: counter(line-numbers-counter);
  text-align: right;
  width: 20px;
  border-right: 1px solid #aaa;
  display: inline-block;
  margin-right: 10px;
  padding: 0 5px;
}
</style>

---
layout: true

.footer[
- 181U
- See acknowledgements
]

---
class: compact
# Agenda

* The problem
* XDR
* JSON
* Protocol Buffers

---
class: compact,smaller
# The Problem

* MQTT provides a protocol for exchanging data `topic value` but no guidance about the syntax for "value"
* Every receiver of an MQTT message must
  - decode the topic and route the message appropriately
  - decode the value, not so bad for simple things like 3.14159 or 42, but what if you want to send a structure?
* Every sender of an MQTT message must
  - encode the topic, encode the value
* This problem arises in other contexts
  - Configuration of routers
  - Remote procedure calls
  - Storing/retrieving binary data

---
class: compact
# Data Serialization

![](images/space.png# w-20pct) ![](images/serialize-deserialize.png# w-60pct)

https://www.geeksforgeeks.org/serialization-in-java/

---
class: compact
# Language Specific Solutions

* Java -- `java.io.Serializable` interface
  - `writeObject(Object obj)`
  - serialization runtime associates a version number, called a `serialVersionUID`, with each Serializable class
  - Reader and writer have to use the same code/version for the object library
* Python -- pickle
  - `pickle` module implements binary protocols for serializing and de-serializing a Python object structure
  - not secure -- unpickling untrusted data can execute arbitrary code

---
class: compact
# Language Independent Solutions

* XDR (External Data Representation Standard) [rfc1832](https://tools.ietf.org/html/rfc1832)
  - Internet protocol developed for transferring data
* JSON (JavaScript Object Notation)
  - Human readable format
* Google Protocol Buffers (protobuf) -- binary encoding, machine independent

---
class: compact
# XDR: External Data Representation

* Uses a language to describe data formats
* Used by ONC RPC (remote procedure calls) and NFS (network file system)
* assumes bytes are portable
* standard allows encoding/decoding on different architectures
* Format defined by an IDL file (a data description language)
* IDL file compiled to C code with rpcgen

---
class: compact
# XDR Encoding

* All items encoded as a multiple of four bytes.
* XDR Data types
  - Integer (32 bits, big endian)
  - Unsigned integer
  - Enumerations
  - Long integer (64 bits)
  - Floating point
  - Strings
  - Structures
  - Unions

---
class: compact
# XDR Encoding (example)

* Integer

```plaintext
  (MSB)                   (LSB)
+-------+-------+-------+-------+
|byte 0 |byte 1 |byte 2 |byte 3 |
+-------+-------+-------+-------+
<------------32 bits------------>
```

---
class: compact
# XDR Encoding (example)

* Structure

```c
struct {
   component-declaration-A;
   component-declaration-B;
   ...
} identifier;
```

The components of the structure are encoded in the order of their
declaration in the structure. Each component's size is a multiple of
four bytes, though the components may be different sizes.

```plaintext
+-------------+-------------+...
| component A | component B |...
+-------------+-------------+...
```

---
class: compact
# XDR Encoding (string)

* Length/data format
* Always rounded to multiple of four bytes

```plaintext
   0     1     2     3     4     5   ...
+-----+-----+-----+-----+-----+-----+...+-----+-----+...+-----+
|        length n       |byte0|byte1|...| n-1 |  0  |...|  0  |
+-----+-----+-----+-----+-----+-----+...+-----+-----+...+-----+
|<-------4 bytes------->|<------n bytes------>|<---r bytes--->|
                        |<----n+r (where (n+r) mod 4 = 0)---->|
                                                         STRING
```

---
class: compact
# XDR Observations

* C-centric -- all of the types are C types
* (Very) inefficient -- every type is allocated space for the "worst case" (arrays and strings are exceptions)
* Very sensitive to change -- not possible to extend the message type (perhaps for other receivers) without recompiling the code everywhere
* No real sanity checking for messages
* Tools (rpcgen) are aimed at remote procedure calls
* rpcgen creates
  - header file with data structure for message
  - encoder function
  - decoder function

---
class: compact
# JSON (JavaScript Object Notation)

![](images/json-object.png# w-40pct fr)

* Human readable/writable data format
* Subset of JavaScript
* Built on two structures
  * A collection of name/value pairs
    (e.g. a dictionary)
  * An ordered list of values (e.g. an array or list)

---
class: compact
# JSON array and value

![](images/json-value.png# w-40pct) ![](images/space.png# w-10pct) ![](images/json-array.png# w-40pct)

---
class: very-small-code,compact,hljs-tomorrow-night-eighties,line-numbers
# JSON Example

```javascript
{"widget": {
    "debug": "on",
    "window": {
        "title": "Sample Konfabulator Widget",
        "name": "main_window",
        "width": 500,
        "height": 500
    },
    "image": {
        "src": "Images/Sun.png",
        "name": "sun1",
        "hOffset": 250,
        "vOffset": 250,
        "alignment": "center"
    },
    "text": {
        "data": "Click Here",
        "size": 36,
        "style": "bold",
        "name": "text1",
        "hOffset": 250,
        "vOffset": 100,
        "alignment": "center",
        "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"
    }
}}
```

---
class: compact
# JSON Encoding/Decoding

* Relatively easy to "parse", but meaning is left as an exercise to the programmer
  - no error checking (is the JSON structure the correct one?)
  - no specification of the "correct" JSON structure for validation
* There are libraries to help, but there is still string to xxx interpretation
  - Example JSMN

---
class: compact
# JSMN

* portable to embedded processors
* C89 compatible output
* No library dependencies
* Small footprint
* Just parses into tokens, other work left to user.
* Use C libraries to parse numbers
* Need to map objects to data structures

---
class: compact,small-code,hljs-tomorrow-night-eighties,line-numbers
# JSMN Parsing

```javascript
'{ "name" : "Jack", "age" : 27 }'
```

JSMN creates tokens with boundaries in the string

* Object [0..31]
* String [3..7], String [12..16], String [20..23]
* Number [27..29]

`jsmntok_t` type is:

```C
typedef struct {
  jsmntype_t type; /* Token type */
  int start;       /* Token start position */
  int end;         /* Token end position */
  int size;        /* Number of child (nested) tokens */
} jsmntok_t;
```

---
class: compact,small-code,hljs-tomorrow-night-eighties,line-numbers
# Protocol Buffers

* Google's mechanism for serializing structured data
  - language-neutral
  - platform-neutral
  - extensible

```c
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}
```

```C++
Person john;
fstream input(argv[1], ios::in | ios::binary);
john.ParseFromIstream(&input);
id = john.id();
name = john.name();
email = john.email();
```

---
class: compact,small-code,hljs-tomorrow-night-eighties,line-numbers
# Protocol Buffer Basics: (e.g. Python)

* Message formats defined in a .proto file
* Converted to target language with protocol buffer compiler
* Example: Python protocol buffer API to write, read messages
* Why use protocol buffers
  - very compact binary format
  - generate language specific API for the message format chosen
  - message formats can be extended without affecting existing applications

---
class: compact,very-small-code,hljs-tomorrow-night-eighties,line-numbers,col-2
# Protocol Buffer Example

```protobuf
syntax = "proto2";

package tutorial;

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }
```

```protobuf
  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phones = 4;
}

message AddressBook {
  repeated Person people = 1;
}
```

---
class: compact,very-small-code,hljs-tomorrow-night-eighties,line-numbers
# Compiling .proto file

```
protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/addressbook.proto
```

* This generates `addressbook_pb2.py`
* Here is an example of creating a person

```python
import addressbook_pb2

person = addressbook_pb2.Person()
person.id = 1234
person.name = "John Doe"
person.email = "jdoe@example.com"
phone = person.phones.add()
phone.number = "555-4321"
phone.type = addressbook_pb2.Person.HOME
```

---
class: compact,very-small-code,hljs-tomorrow-night-eighties,line-numbers
# Serialization Methods Generated

* `SerializeToString()`: serializes the message and returns it as a string (a `bytes` object in Python 3)
* `ParseFromString(data)`: parses a message from a string

---
class: compact
# C++ Compilation

```
protoc -I=$SRC_DIR --cpp_out=$DST_DIR $SRC_DIR/addressbook.proto
```

* `addressbook.pb.h` : C++ header
* `addressbook.pb.cc` : C++ implementation of classes

---
class: compact,very-small-code,hljs-tomorrow-night-eighties,line-numbers,col-2
# C++ Generated API

```c++
// name
inline bool has_name() const;
inline void clear_name();
inline const ::std::string& name() const;
inline void set_name(const ::std::string& value);
inline void set_name(const char* value);
inline ::std::string* mutable_name();

// id
inline bool has_id() const;
inline void clear_id();
inline int32_t id() const;
inline void set_id(int32_t value);
```

<br>

```c++
// email
inline bool has_email() const;
inline void clear_email();
inline const ::std::string& email() const;
inline void set_email(const ::std::string& value);
inline void set_email(const char* value);
inline ::std::string* mutable_email();

// phones
inline int phones_size() const;
inline void clear_phones();
inline const ::google::protobuf::RepeatedPtrField< ::tutorial::Person_PhoneNumber >& phones() const;
inline ::google::protobuf::RepeatedPtrField< ::tutorial::Person_PhoneNumber >* mutable_phones();
inline const ::tutorial::Person_PhoneNumber& phones(int index) const;
inline ::tutorial::Person_PhoneNumber* mutable_phones(int index);
inline ::tutorial::Person_PhoneNumber* add_phones();
```

---
class: compact
# C++ Parsing and Serialization

* `bool SerializeToString(string* output) const;`: serializes the message and stores the bytes in the given string. Note that the bytes are binary, not text; we only use the string class as a convenient container.
* `bool ParseFromString(const string& data);`: parses a message from the given string.
* `bool SerializeToOstream(ostream* output) const;`: writes the message to the given C++ ostream.
* `bool ParseFromIstream(istream* input);`: parses a message from the given C++ istream.

---
class: compact,very-small-code,hljs-tomorrow-night-eighties,line-numbers
# Encoding -- A simple message

```c
message Test1 {
  optional int32 a = 1;
}
```

Suppose you create a message and set `a` to 150.
The serialized stream is three bytes (smaller than a 4-byte int)

```
08 96 01
```

---
class: compact,very-small-code,hljs-tomorrow-night-eighties,line-numbers
# Encoding (example)

* Encoding of integers is variable length -- only the bytes needed are generated
  - Each byte carries 7 bits of data; the msb is set on every byte except the last
  - 7-bit groups are sent least significant group first
  - The number 1 is encoded as `0000 0001` (a single byte)
  - The number 300 is encoded as `1010 1100 0000 0010`; decoding those two bytes:

```plaintext
1010 1100 0000 0010
 --> drop msb
 010 1100  000 0010
 --> reverse order of "bytes"
 000 0010  010 1100
 --> simplify
1 0010 1100
 --> convert to decimal
256 + 32 + 8 + 4 --> 300
```

---
class: compact
# Message Structure

* Protocol buffer message is a series of key-value pairs.
* Keys in the binary message are the "tags"
* Encoded keys are tag + "wire type", packed as `(tag << 3) | wire_type` and written as a varint

Wire types

| Type | Meaning | Used for |
| -----|---------|----------|
| 0 | Varint | int32, int64, uint32, uint64, sint32, sint64, bool, enum |
| 1 | 64-bit | fixed64, sfixed64, double |
| 2 | length-delimited | string, bytes, ... |
| 5 | 32-bit | fixed32, sfixed32, float |

---
class: compact,small-code,hljs-tomorrow-night-eighties,line-numbers
# Message -- String

```c
message Test2 {
  optional string b = 2;
}
```

Suppose we have a message with b = "testing"

```plaintext
12 07 74 65 73 74 69 6e 67
```

The first byte `12` is the key (tag 2, wire type 2), `07` is the length, and the last 7 bytes are the utf-8 encoding of "testing"

---
class: compact
# Protocol Buffers Summary (so far)

* Simple language for describing message types
* Very compact "wire" encoding
* Compiler generates message specific APIs for a variety of languages
  - C++
  - Java
  - Python
  - Go
  - C# ...
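---
class: compact,very-small-code,hljs-tomorrow-night-eighties,line-numbers
# Encoding -- a sketch in Python

The varint, key, and length-delimited encodings from the preceding slides can be sketched in a few lines of Python. This is an illustration only, not the protobuf library: the helper names `encode_varint`, `encode_key`, `encode_int_field`, and `encode_string_field` are made up for this slide, and the sketch assumes non-negative integers.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (hypothetical helper)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # least significant 7-bit group goes first
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # msb set: more bytes follow
        else:
            out.append(low7)         # msb clear: this is the last byte
            return bytes(out)


def encode_key(tag: int, wire_type: int) -> bytes:
    """A key is the tag shifted left by 3 bits, OR'ed with the wire type."""
    return encode_varint((tag << 3) | wire_type)


def encode_int_field(tag: int, value: int) -> bytes:
    """Wire type 0 (varint): key followed by the varint value."""
    return encode_key(tag, 0) + encode_varint(value)


def encode_string_field(tag: int, text: str) -> bytes:
    """Wire type 2 (length-delimited): key, byte length, then utf-8 bytes."""
    payload = text.encode("utf-8")
    return encode_key(tag, 2) + encode_varint(len(payload)) + payload


print(encode_varint(300).hex(" "))                 # ac 02
print(encode_int_field(1, 150).hex(" "))           # 08 96 01
print(encode_string_field(2, "testing").hex(" "))  # 12 07 74 65 73 74 69 6e 67
```

The last two lines should match the `Test1` and `Test2` byte strings shown earlier; `bytes.hex(" ")` with a separator needs Python 3.8 or later.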
---
class: compact
# Protocol Buffer Message Definitions Can be Extended

* you *must not* change the tag numbers of existing fields
* you *must not* delete any required fields
* you *may* delete optional or repeated fields
* you *may* add new optional or repeated fields provided fresh tag numbers are used

---
class: compact
# Protocol Buffer support for embedded code

* Nanopb -- an extension that uses protoc to generate compact C code
* Typical project includes these files
  - Nanopb runtime
  - Protocol description
    - protodef.proto
    - protodef.pb.c (generated)
    - protodef.pb.h (generated)
* Small code size (~10KB compiled)
* Small RAM usage (around 300 bytes plus message structs)
* I've used this for my research in sub-gram data loggers -- all communication with tags is using protobuf

---
class: compact
# Summary

* Acknowledgements
  - Cover: Alternative and Flexible Control Approaches for Robotic Manipulators: on the Challenge of Developing a Flexible Control Architecture that Allows for Controlling Different Manipulators - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/To-use-Protocol-Buffers-it-is-necessary-to-generate-code-for-each-message-that-needs_fig17_285578991 [accessed 6 Feb, 2020]