Struct data_encoding::Specification

pub struct Specification {
    pub symbols: String,
    pub bit_order: BitOrder,
    pub check_trailing_bits: bool,
    pub padding: Option<char>,
    pub ignore: String,
    pub wrap: Wrap,
    pub translate: Translate,
}

Base-conversion specification

It is possible to define custom encodings given a specification. To do so, it is important to understand the theory first.

Theory

Each subsection has an equivalent subsection in the Practice section.

The main idea of a base-conversion encoding is to see [u8] as numbers written in little-endian base256 and convert them to another little-endian base. For performance reasons, this crate restricts this other base to be of size 2 (binary), 4 (base4), 8 (octal), 16 (hexadecimal), 32 (base32), or 64 (base64). The converted number is written as [u8] although it doesn’t use all the 256 possible values of u8. This crate encodes to ASCII, so only values smaller than 128 are allowed.

More precisely, we need the following elements:

The bit-width N: 1 for binary, 2 for base4, 3 for octal, 4 for hexadecimal, 5 for base32, and 6 for base64
The bit-order: most or least significant bit first
The symbols function S from [0, 2^N) (called values and written uN) to symbols (represented as u8 although only ASCII symbols are allowed, i.e. smaller than 128)
The values partial function V from ASCII to [0, 2^N), i.e. from u8 to uN
Whether trailing bits are checked: trailing bits are leading zeros in theory, but since numbers are little-endian they come last

For the encoding to be correct (i.e. encoding then decoding gives back the initial input), V(S(i)) must be defined and equal to i for all i in [0, 2^N). For the encoding to be canonical (i.e. different inputs decode to different outputs, or equivalently, decoding then encoding gives back the initial input), trailing bits must be checked and if V(i) is defined then S(V(i)) is equal to i for all i.
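The two conditions on S and V can be checked mechanically. The following is a minimal std-only sketch, not the crate's implementation: the names Values, inverse_of, is_correct, and has_canonical_values are hypothetical, S is represented as a slice of ASCII symbols indexed by value, and V as a 128-entry table where None means "undefined".

```rust
/// Partial function V from ASCII to values; `None` means "undefined".
/// (Hypothetical representation, chosen for this sketch only.)
type Values = [Option<u8>; 128];

/// Builds the V that exactly inverts S: every non-symbol is undefined.
/// Assumes all symbols are ASCII (smaller than 128) and unique.
fn inverse_of(symbols: &[u8]) -> Values {
    let mut values = [None; 128];
    for (i, &s) in symbols.iter().enumerate() {
        values[s as usize] = Some(i as u8);
    }
    values
}

/// Correct: V(S(i)) is defined and equal to i for every value i.
fn is_correct(symbols: &[u8], values: &Values) -> bool {
    symbols
        .iter()
        .enumerate()
        .all(|(i, &s)| values[s as usize] == Some(i as u8))
}

/// Symbol part of canonicity: whenever V(c) is defined, S(V(c)) == c.
/// Canonicity additionally requires trailing bits to be checked.
fn has_canonical_values(symbols: &[u8], values: &Values) -> bool {
    values.iter().enumerate().all(|(c, v)| match v {
        Some(i) => symbols.get(*i as usize) == Some(&(c as u8)),
        None => true,
    })
}
```

With the lowercase hexadecimal symbols, the exact inverse satisfies both conditions. Additionally defining V('A') = 10 keeps the encoding correct but makes it non-canonical, since S(10) is 'a' and not 'A'.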

Encoding and decoding are given by the following pipeline:

[u8] <--1--> [[bit; 8]] <--2--> [[bit; N]] <--3--> [uN] <--4--> [u8]
1: Map bit-order between each u8 and [bit; 8]
2: Base conversion between base 2^8 and base 2^N (check trailing bits)
3: Map bit-order between each [bit; N] and uN
4: Map symbols/values between each uN and u8 (values must be defined)
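Read from left to right, the pipeline above can be sketched in a few lines of std-only Rust. This is an illustration under stated assumptions, not how the crate is implemented: encode_msb_first is a hypothetical helper, it handles only the most-significant-bit-first order, and it applies no padding or wrapping.

```rust
/// Encodes `input` using `symbols` (2^N ASCII symbols, indexed by value),
/// most significant bit first. Hypothetical helper for illustration only.
fn encode_msb_first(symbols: &[u8], input: &[u8]) -> String {
    // Bit-width N such that symbols.len() == 2^N.
    let n = symbols.len().trailing_zeros() as usize;
    assert!(symbols.len().is_power_of_two() && (1..=6).contains(&n));
    // Step 1: split each u8 into 8 bits, most significant bit first.
    let bits: Vec<u8> = input
        .iter()
        .flat_map(|&byte| (0..8u32).rev().map(move |i| (byte >> i) & 1))
        .collect();
    // Steps 2 to 4: regroup the bits by N (zero-filling the last group,
    // which produces the trailing bits), read each group as a value,
    // then map the value to its symbol.
    bits.chunks(n)
        .map(|group| {
            let mut value = 0usize;
            for k in 0..n {
                let bit = group.get(k).copied().unwrap_or(0);
                value = (value << 1) | bit as usize;
            }
            symbols[value] as char
        })
        .collect()
}
```

For instance, encode_msb_first(b"01234567", b"B") regroups the bits 01000010 as 010, 000, and 10 followed by one zero trailing bit, which gives the symbols 2, 0, and 4.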

Extensions

All these extensions make the encoding not canonical.

Padding

Padding is useful if the following conditions are met:

the bit-width is 3 (octal), 5 (base32), or 6 (base64)
the length of the data to encode is not known in advance
the data must be sent without buffering

For bases whose bit-width N does not divide 8, encoded data cannot be safely concatenated. This is because it is not possible to distinguish trailing bits from encoding bits. Padding solves this issue by adding a new character that discriminates between the two. The idea is to work in blocks of lcm(8, N) bits, where lcm(8, N) is the least common multiple of 8 and N. When such a block is not complete, it is padded.

To preserve correctness, the padding character must not be a symbol.
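The block arithmetic can be made concrete with a small std-only sketch. The helpers gcd, block_shape, and padded_len are hypothetical names introduced here for illustration. They only compute sizes, assuming each incomplete block is filled up with padding characters as described above.

```rust
/// Greatest common divisor (Euclid's algorithm).
fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Returns (bytes per block, symbols per block) for bit-width `n`,
/// where a block is lcm(8, n) bits.
fn block_shape(n: usize) -> (usize, usize) {
    let block_bits = 8 * n / gcd(8, n); // lcm(8, n)
    (block_bits / 8, block_bits / n)
}

/// Length in characters of the padded encoding of `input_len` bytes.
fn padded_len(n: usize, input_len: usize) -> usize {
    let (bytes, symbols) = block_shape(n);
    let blocks = (input_len + bytes - 1) / bytes; // round up
    blocks * symbols
}
```

For example, block_shape(6) is (3, 4): base64 maps each 3-byte block to 4 characters, so a single input byte is encoded as 2 symbols followed by 2 padding characters.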

Ignore characters when decoding

Ignoring characters when decoding is useful if, after encoding, some characters are added for convenience or any other reason (like wrapping). In that case we want to first ignore those characters before decoding.

To preserve correctness, ignored characters must not contain symbols or the padding character.
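Conceptually, ignoring characters amounts to filtering the input before the decoding pipeline runs. A minimal std-only sketch follows; strip_ignored is a hypothetical helper, and the crate may well interleave this step with decoding rather than run it as a separate pass.

```rust
/// Removes every byte of `input` that appears in `ignore`.
/// Hypothetical helper; models the "ignore" step as a pre-pass.
fn strip_ignored(input: &[u8], ignore: &[u8]) -> Vec<u8> {
    input
        .iter()
        .copied()
        .filter(|c| !ignore.contains(c))
        .collect()
}
```

The correctness condition is visible here: if a symbol or the padding character were listed in ignore, the filter would silently drop meaningful input.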

Wrap output when encoding

Wrapping output when encoding is useful if the output is meant to be printed in a document where width is limited (typically 80-columns documents). In that case, the wrapping width and the wrapping separator have to be defined.

To preserve correctness, the wrapping separator characters must be ignored (see previous subsection). As such, wrapping separator characters must also not contain symbols or the padding character.
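Wrapping can be pictured as a post-processing step on the encoded output. The following std-only sketch uses a hypothetical wrap helper and assumes the encoded text is ASCII. It appends the separator after every line, including the last one, and does not model any constraint the crate may place on the width.

```rust
/// Splits ASCII `encoded` text into lines of at most `width` characters
/// and appends `separator` after each line (including the last one).
/// Hypothetical helper for illustration only.
fn wrap(encoded: &str, width: usize, separator: &str) -> String {
    assert!(width > 0, "the wrapping width must be positive");
    let mut out = String::new();
    for line in encoded.as_bytes().chunks(width) {
        // Encoded output is ASCII, so any byte chunk is valid UTF-8.
        out.push_str(std::str::from_utf8(line).expect("ASCII input"));
        out.push_str(separator);
    }
    out
}
```

Since the separator ends up inside the encoded text, decoding the wrapped output only works if the separator characters are ignored.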

Translate characters when decoding

Translating characters when decoding is useful when encoded data may be copied by a human instead of a machine. Humans tend to confuse some characters for others. In that case we want to translate those characters before decoding.

To preserve correctness, the characters we translate from must not contain symbols or the padding character, and the characters we translate to must only contain symbols or the padding character.
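Translation can be modelled as a character-by-character substitution applied before decoding. A minimal std-only sketch follows; translate is a hypothetical helper in which the k-th byte of from is replaced by the k-th byte of to, and all other bytes pass through unchanged.

```rust
/// Replaces each byte found in `from` by the byte at the same position
/// in `to`; other bytes are left unchanged.
/// Hypothetical helper; models translation as a pre-pass before decoding.
fn translate(input: &[u8], from: &[u8], to: &[u8]) -> Vec<u8> {
    assert_eq!(from.len(), to.len(), "`from` and `to` must pair up");
    input
        .iter()
        .map(|&c| match from.iter().position(|&f| f == c) {
            Some(k) => to[k],
            None => c,
        })
        .collect()
}
```

If a symbol were listed in from, it would be rewritten before it could be decoded, which is why the characters translated from must not contain symbols or the padding character.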

Practice

Basics

use data_encoding::{Encoding, Specification};
fn make_encoding(symbols: &str) -> Encoding {
    let mut spec = Specification::new();
    spec.symbols.push_str(symbols);
    spec.encoding().unwrap()
}
let binary = make_encoding("01");
let octal = make_encoding("01234567");
let hexadecimal = make_encoding("0123456789abcdef");
assert_eq!(binary.encode(b"Bit"), "010000100110100101110100");
assert_eq!(octal.encode(b"Bit"), "20464564");
assert_eq!(hexadecimal.encode(b"Bit"), "426974");

The binary base has 2 symbols 0 and 1 with value 0 and 1 respectively. The octal base has 8 symbols 0 to 7 with value 0 to 7. The hexadecimal base has 16 symbols 0 to 9 and a to f with value 0 to 15. The following diagram gives the idea of how encoding works in the previous example (note that we can actually write such a diagram only because the bit-order is most significant first):

[      octal] |  2  :  0  :  4  :  6  :  4  :  5  :  6  :  4  |
[     binary] |0 1 0 0 0 0 1 0|0 1 1 0 1 0 0 1|0 1 1 1 0 1 0 0|
[hexadecimal] |   4   :   2   |   6   :   9   |   7   :   4   |
               ^-- LSB                                       ^-- MSB

Note that in theory, these little-endian numbers are read from right to left (the most significant bit is at the right). Since leading zeros are meaningless (in our usual decimal notation 0123 is the same as 123), it explains why trailing bits must be zero. Trailing bits may occur when the bit-width of a base does not divide 8. Only binary, base4, and hexadecimal don’t have trailing bits issues. So let’s consider octal and base64, which have trailing bits in similar circumstances:

use data_encoding::{Specification, BASE64_NOPAD};
let octal = {
    let mut spec = Specification::new();
    spec.symbols.push_str("01234567");
    spec.encoding().unwrap()
};
assert_eq!(BASE64_NOPAD.encode(b"B"), "Qg");
assert_eq!(octal.encode(b"B"), "204");

We have the following diagram, where the base64 values are written between parentheses:

[base64] |   Q(16)   :   g(32)   : [has 4 zero trailing bits]
[ octal] |  2  :  0  :  4  :       [has 1 zero trailing bit ]
         |0 1 0 0 0 0 1 0|0 0 0 0
[ ascii] |       B       |
                          ^-^-^-^-- leading zeros / trailing bits
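The number of trailing bits in the diagram follows directly from the input length and the bit-width. A small std-only sketch of that arithmetic, for unpadded encodings, is given below; symbol_count and trailing_bits are hypothetical names introduced for illustration.

```rust
/// Number of symbols needed to encode `input_len` bytes with bit-width
/// `n`, without padding: 8 * input_len bits, rounded up to whole symbols.
fn symbol_count(n: usize, input_len: usize) -> usize {
    (8 * input_len + n - 1) / n
}

/// Number of zero trailing bits carried by the last symbol.
fn trailing_bits(n: usize, input_len: usize) -> usize {
    symbol_count(n, input_len) * n - 8 * input_len
}
```

For one input byte this gives 2 symbols and 4 trailing bits in base64, and 3 symbols and 1 trailing bit in octal, matching the diagram. For hexadecimal the result is always 0, which is why trailing bits are not an issue there.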

Extensions

Padding

For octal and base64, lcm(8, 3) == lcm(8, 6) == 24 bits or 3 bytes. For base32, lcm(8, 5) is 40 bits or 5 bytes. Let’s consider octal and base64:

use data_encoding::{Specification, BASE64};
let octal = {
    let mut spec = Specification::new();
    spec.symbols.push_str("01234567");
    spec.padding = Some('=');
    spec.encoding().unwrap()
};
// We start encoding but we only have "B" for now.
assert_eq!(BASE64.encode(b"B"), "Qg==");
assert_eq!(octal.encode(b"B"), "204=====");
// Now we have "it".
assert_eq!(BASE64.encode(b"it"), "aXQ=");
assert_eq!(octal.encode(b"it"), "322720==");
// By concatenating everything, we may decode the original data.
assert_eq!(BASE64.decode(b"Qg==aXQ=").unwrap(), b"Bit");
assert_eq!(octal.decode(b"204=====322720==").unwrap(), b"Bit");

We have the following diagrams:

[base64] |   Q(16)   :   g(32)   :     =     :     =     |
[ octal] |  2  :  0  :  4  :  =  :  =  :  =  :  =  :  =  |
         |0 1 0 0 0 0 1 0|. . . . . . . .|. . . . . . . .|
[ ascii] |       B       |        end of block aligned --^
         ^-- beginning of block aligned

[base64] |   a(26)   :   X(23)   :   Q(16)   :     =     |
[ octal] |  3  :  2  :  2  :  7  :  2  :  0  :  =  :  =  |
         |0 1 1 0 1 0 0 1|0 1 1 1 0 1 0 0|. . . . . . . .|
[ ascii] |       i       |       t       |

Ignore characters when decoding

The typical use-case is to ignore newlines (\r and \n). But to keep the example small, we will ignore spaces and tabs.

let mut spec = data_encoding::HEXLOWER.specification();
spec.ignore.push_str(" \t");
let base = spec.encoding().unwrap();
assert_eq!(base.decode(b"42 69 74"), base.decode(b"426974"));

Wrap output when encoding

The typical use-case is to wrap after 64 or 76 characters with a newline (\r\n or \n). But to keep the example small, we will wrap after 8 characters with a space.

let mut spec = data_encoding::BASE64.specification();
spec.wrap.width = 8;
spec.wrap.separator.push_str(" ");
let base64 = spec.encoding().unwrap();
assert_eq!(base64.encode(b"Hey you"), "SGV5IHlv dQ== ");

Note that the output always ends with the separator.

Translate characters when decoding

The typical use-case is to translate lowercase to uppercase or vice versa, but it is also used for letters that look alike, like O0 or Il1. Let’s illustrate both examples.

let mut spec = data_encoding::HEXLOWER.specification();
spec.translate.from.push_str("ABCDEFOIl");
spec.translate.to.push_str("abcdef011");
let base = spec.encoding().unwrap();
assert_eq!(base.decode(b"BOIl"), base.decode(b"b011"));

Features

Requires the alloc feature.

Fields

symbols: String

Symbols

The number of symbols must be 2, 4, 8, 16, 32, or 64. Symbols must be ASCII characters (smaller than 128) and they must be unique.

bit_order: BitOrder

Bit-order

The default is to use most significant bit first since it is the most common.

check_trailing_bits: bool

Check trailing bits

The default is to check trailing bits. This field is ignored when unnecessary (i.e. for base2, base4, and base16).

padding: Option<char>

Padding

The default is to not use padding. The padding character must be ASCII and must not be a symbol.

ignore: String

Characters to ignore when decoding

The default is to not ignore characters when decoding. The characters to ignore must be ASCII and must not be symbols or the padding character.

wrap: Wrap

How to wrap the output when encoding

The default is to not wrap the output when encoding. The wrapping characters must be ASCII and must not be symbols or the padding character.

translate: Translate

How to translate characters when decoding

The default is to not translate characters when decoding. The characters to translate from must be ASCII and must not have already been assigned a semantics. The characters to translate to must be ASCII and must have been assigned a semantics (symbol, padding character, or ignored character).

Implementations

impl Specification

pub fn new() -> Specification

pub fn encoding(&self) -> Result<Encoding, SpecificationError>

Trait Implementations

impl Clone for Specification

fn clone(&self) -> Specification

fn clone_from(&mut self, source: &Self)

impl Debug for Specification

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl Default for Specification

fn default() -> Self

Auto Trait Implementations

impl RefUnwindSafe for Specification

impl Send for Specification

impl Sync for Specification

impl Unpin for Specification

impl UnwindSafe for Specification

Blanket Implementations

impl<T> Any for T where T: 'static + ?Sized

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized

fn borrow_mut(&mut self) -> &mut T

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for T where U: From<T>

fn into(self) -> U

impl<T> ToOwned for T where T: Clone

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for T where U: Into<T>

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T where U: TryFrom<T>

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>
