🔬Build A Parser By Rust (Part 2)

Note: this article was copied out of a Feishu document for search indexing, so some formatting may not have survived. Viewing the original Feishu document is recommended.

Background#

In the previous article (Build A Parser By Rust (Part 1)), we covered the basic concepts behind parsers and how Rust's parser libraries express them. In this article we put those concepts into practice on a concrete case, both to consolidate them and to show how they are applied. Along the way we will also look at some of the design decisions behind parser implementations, which should provide ideas for building custom parsers later.

Specifically, we will walk through a simple implementation of a JSON parser, a common and relatively straightforward target, using both nom and pest.

Note:

  • The nom and pest dependency versions used below were the latest at the time of writing:

[dependencies]
pest = "2.6"
pest_derive = "2.6"
nom = "7.1.3"

  • In the walkthrough below, only the important or previously unmentioned APIs are explained; for the rest, refer to the previous article or the library documentation.
  • The parsers built here are only verified for parsing correctness; no performance testing has been done. Interested readers can try that themselves.

The specific source code can be found at:

build parser by rust

JSON Standard#

JSON (JavaScript Object Notation) is a lightweight data interchange format. It represents data as text that is easy for humans to read and write, built from key-value pairs with a syntax similar to JavaScript object literals, hence the name. JSON is commonly used for transmitting data over networks, storing configuration, and exchanging data between systems: it is ubiquitous in web development (API payloads, front-end/back-end interaction) and widely used in mobile development, big data processing, and other fields. Thanks to its simplicity and readability, JSON has become one of the standard data exchange formats in many applications.

If we want to implement a JSON parser, we first need to understand the JSON standard protocol, which is mainly divided into the following six parts:

| Token | Description |
| --- | --- |
| Whitespace | Whitespace can be inserted between any pair of tokens |
| Number | A number is much like a C or Java number, except that the octal and hexadecimal formats are not used |
| String | A string is a sequence of zero or more Unicode characters wrapped in double quotes, using backslash escapes; a character is represented as a single-character string |
| Value | A value can be a string in double quotes, a number, true, false, null, an object, or an array; these structures can be nested |
| Array | An array is an ordered collection of values; it begins with a left bracket [ and ends with a right bracket ], with values separated by commas , |
| Object | An object is an unordered set of name/value pairs; it begins with a left brace { and ends with a right brace }, each name is followed by a colon :, and the pairs are separated by commas , |

(The original table also showed the corresponding railroad diagram from json.org for each token.)

As the table shows, the JSON standard defines its data types and their parsing rules clearly, and they are relatively simple.
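To make the table concrete, here is a small illustrative JSON document (the names and values are made up for the example) that touches all six constructs:

{
  "name": "parser",
  "version": 0.1,
  "stable": false,
  "tags": ["rust", "nom", "pest"],
  "meta": { "license": null }
}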

Based on this standard, we will implement the parser with the nom and pest knowledge from the previous article. The code lives in nom/json and pest/json.

Implementation Based on Nom#

JSON Model#

Here we use an enum to represent JSON Value excluding whitespace:

#[derive(Debug, PartialEq)]
pub enum JsonValue {
    Str(String),
    Boolean(bool),
    Num(f64),
    Array(Vec<JsonValue>),
    Object(HashMap<String, JsonValue>),
    Null,
}

Specific Type Parsing#

  1. Whitespace

As the standard shows, whitespace can be any of the following characters. The parser keeps consuming input until it hits anything else, returning the whitespace it has accumulated:

  • space -> " "
  • linefeed -> "\n"
  • carriage return -> "\r"
  • horizontal tab -> "\t"

Here, nom has two implementation methods: one can directly use the built-in function multispace0, and the other is to use take_while to construct a parsing function:

As mentioned earlier, take_while as a predicate parser continuously consumes input until its input no longer satisfies the predicate.

// whitespace JSON space parsing (equivalent to nom built-in function multispace0)
fn whitespace(i: &str) -> IResult<&str, &str> {
    let chars = " \t\r\n";
    take_while(move |c| chars.contains(c))(i)
}
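As a quick sanity check (a sketch in the spirit of the nom documentation examples), it behaves like multispace0, including succeeding on zero whitespace characters:

assert_eq!(whitespace(" \t\r\n{"), Ok(("{", " \t\r\n")));
assert_eq!(whitespace("abc"), Ok(("abc", "")));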
  2. Number

From the previous section, we can see that JSON supports positive and negative numbers, decimals, and scientific notation. Although we can parse using combinations of parsers like alt and be_f64, it is more common in this scenario to use the built-in function double provided by nom. Its usage can be referenced in the example:

use nom::number::complete::double;

let parser = |s| {
  double(s)
};

assert_eq!(parser("1.1"), Ok(("", 1.1)));
assert_eq!(parser("123E-02"), Ok(("", 1.23)));
assert_eq!(parser("123K-01"), Ok(("K-01", 123.0)));
assert_eq!(parser("abc"), Err(Err::Error(("abc", ErrorKind::Float))));
  3. String

Here we need to handle two things: the string as a whole, and the content inside the quotes:

  • First, per the standard, there are three cases to the right of the opening quote. Apart from the empty-string case (nothing between the quotes), all of them can be handled by combinators that strip the quotes and yield the content between them. There are many ways to combine parsers; two common approaches are:
    • alt + delimited: parse according to the overall structure of the string
    • preceded + cut + terminated: parse according to the order in which the pieces appear
// string: whole-string parsing, approach 1 (preceded + cut + terminated)
fn string(i: &str) -> IResult<&str, &str> {
    context(
        "string",
        // parse_str is defined below
        preceded(char('\"'), cut(terminated(parse_str, char('\"')))))(i)
}

// string: whole-string parsing, approach 2 (alt + delimited)
fn string(i: &str) -> IResult<&str, &str> {
    context(
        "string",
        alt((tag("\"\""), delimited(tag("\""), parse_str, tag("\"")))),
    )(i)
}

The cut combinator prevents backtracking: once the wrapped parser fails, parsing stops immediately with a failure instead of letting alt try other branches. This avoids unnecessary work and produces clearer parse errors. Here is the official example for reference:

use nom::combinator::cut;

fn parser(input: &str) -> IResult<&str, &str> {
  alt((
    preceded(one_of("+-"), cut(digit1)),
    rest
  ))(input)
}

assert_eq!(parser("+10 ab"), Ok((" ab", "10")));
assert_eq!(parser("ab"), Ok(("", "ab")));
assert_eq!(parser("+"), Err(Err::Failure(Error { input: "", code: ErrorKind::Digit })));
  • After extracting the content between the quotes, we still need to deal with escape sequences to recover the actual content. nom provides a dedicated escaped function for this. It takes three parameters, escaped(normal, control, escapable), where:
    • normal: a parser matching ordinary characters; it must not accept the control character
    • control: the control character that introduces an escape (e.g. the \ used in most languages)
    • escapable: a parser matching the characters allowed after the control character
// Official example
use nom::bytes::complete::escaped;
use nom::character::complete::one_of;

fn esc(s: &str) -> IResult<&str, &str> {
  // digit1: built-in parser that matches one or more digits
  // '\\': the backslash control character '\'
  // r#""n\"#: a raw string literal (r#"..."#), meaning the escapable characters are ", n and \
  escaped(digit1, '\\', one_of(r#""n\"#))(s)
}

assert_eq!(esc("123;"), Ok((";", "123")));
assert_eq!(esc(r#"12\"34;"#), Ok((";", r#"12\"34"#)));
  • Finally, based on the escaped function and the JSON standard, we construct the parse_str function. Here its three parameters mean:
    • normal: matches "any codepoint except " or \ or control characters"
    • '\\': in JSON the escape character is also the backslash
    • escapable: matches the escape characters listed in the standard, such as ", \, /, b, and so on. Note that the hexadecimal form (\uXXXX) needs separate handling.
      • In the hex parser, the peek built-in is used precisely because it does not consume input, so the subsequent take_while_m_n can proceed normally.
// parse_str single string parsing
fn parse_str(i: &str) -> IResult<&str, &str> {
    escaped(normal, '\\', escapable)(i)
}

// normal ordinary character parsing
fn normal(i: &str) -> IResult<&str, &str> {
    take_till1(|c: char| c == '\\' || c == '"' || c.is_ascii_control())(i)
}

// escapable escape character parsing
fn escapable(i: &str) -> IResult<&str, &str> {
    context(
        "escaped",
        alt((
            tag("\""),
            tag("\\"),
            tag("/"),
            tag("b"),
            tag("f"),
            tag("n"),
            tag("r"),
            tag("t"),
            hex
        )))(i)
}

// hex hexadecimal character parsing
fn hex(i: &str) -> IResult<&str, &str> {
    context(
        "hex",
        preceded(
            peek(tag("u")),
            take_while_m_n(5, 5, |c: char| c.is_ascii_hexdigit() || c == 'u'),
        ))(i)
}
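A couple of quick checks for parse_str (a sketch): it returns the text with escape sequences still intact, and stops at the first unescaped quote:

assert_eq!(parse_str(r#"ab\ncd" tail"#), Ok((r#"" tail"#, r#"ab\ncd"#)));
assert_eq!(parse_str(r#"a\u00e9b"#), Ok(("", r#"a\u00e9b"#)));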
  4. Value

Having implemented the parsers for whitespace, numbers, and strings, we can now complete the basic types boolean and null:

// boolean boolean data type parsing
fn boolean(i: &str) -> IResult<&str, bool> {
    alt((
        value(true, tag("true")),
        value(false, tag("false"))
    ))(i)
}

// null null parsing
fn null(i: &str) -> IResult<&str, JsonValue> {
    map(tag("null"), |_| JsonValue::Null)(i)
}
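Quick checks for these two parsers (a sketch): boolean consumes only the literal it matches, and null maps straight to JsonValue::Null:

assert_eq!(boolean("truexyz"), Ok(("xyz", true)));
assert_eq!(boolean("false"), Ok(("", false)));
assert_eq!(null("null"), Ok(("", JsonValue::Null)));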

With the type parsers implemented so far, we can now build the value parser (the composite types are implemented later):

One piece of syntax in the implementation below may be hard to read, so a brief explanation:

  • map takes a nom parser plus a function FnMut(O1) -> O2.
  • A tuple enum variant such as JsonValue::Num is itself a function f64 -> JsonValue, so we can pass the variant constructor directly instead of writing a closure.
// json_value JsonValue parsing
fn json_value(i: &str) -> IResult<&str, JsonValue> {
    context(
        "json value",
        delimited(
            whitespace,
            alt((
                map(string, |s| JsonValue::Str(String::from(s))),
                map(double, JsonValue::Num),
                map(boolean, JsonValue::Boolean),
                null,
                map(array, JsonValue::Array),
                map(object, JsonValue::Object)
            )),
            whitespace,
        ),
    )(i)
}
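A few quick checks of json_value (a sketch), showing that surrounding whitespace is stripped and each alt branch yields the matching JsonValue variant:

assert_eq!(json_value(" 42 "), Ok(("", JsonValue::Num(42.0))));
assert_eq!(json_value(r#""hi""#), Ok(("", JsonValue::Str("hi".to_string()))));
assert_eq!(json_value("null,"), Ok((",", JsonValue::Null)));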
  5. Array

According to the standard description of arrays:

  • First, use delimited to remove the left and right brackets, making it easier to parse the content in between.
  • Use the built-in function separated_list0 to parse the comma-separated contents between the brackets into a Vec<JsonValue>:
// array array parsing
fn array(i: &str) -> IResult<&str, Vec<JsonValue>> {
    context(
        "array",
        delimited(
            tag("["),
            separated_list0(tag(","), delimited(whitespace, json_value, whitespace)),
            tag("]"),
        ),
    )(i)
}
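Quick checks for array (a sketch), including the empty case, which works because separated_list0 is happy to match zero elements:

assert_eq!(array("[]"), Ok(("", vec![])));
assert_eq!(
    array(r#"[1, "a"]"#),
    Ok(("", vec![JsonValue::Num(1.0), JsonValue::Str("a".to_string())]))
);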
  6. Object

For a composite parser like object, we can implement it by splitting it into sub-parsers:

  • First, parse the key-value pair format in the object using separated_pair + preceded:
// key_value kv format parsing
fn key_value(i: &str) -> IResult<&str, (&str, JsonValue)> {
    separated_pair(preceded(whitespace, string), cut(preceded(whitespace, char(':'))), json_value)(i)
}
  • Then, for the overall structure of the object, the parsing idea is:
    • left brace -> contents of the braces -> parse the (previously implemented) key-value pairs into a list -> convert that list into a HashMap -> right brace
// object object format parsing
fn object(i: &str) -> IResult<&str, HashMap<String, JsonValue>> {
    context(
        "object",
        preceded(
            char('{'),
            cut(terminated(
                map(
                    separated_list0(preceded(whitespace, char(',')), key_value),
                    |tuple_vec| {
                        tuple_vec.into_iter().map(|(k, v)| (String::from(k), v)).collect()
                    },
                ),
                preceded(whitespace, char('}')),
            )),
        ),
    )(i)
}
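A quick check for object (a sketch); since the pairs end up in a HashMap, we look the key up instead of relying on ordering:

let (rest, obj) = object(r#"{"a": 1}"#).unwrap();
assert_eq!(rest, "");
assert_eq!(obj.get("a"), Some(&JsonValue::Num(1.0)));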

Top-Level Parsing Function#

We have implemented all the annotated types in the JSON standard. Finally, we just need to construct the top-level function to use this parser.

The outermost value of a JSON document is either an object or an array, so our top-level function is:

fn root(i: &str) -> IResult<&str, JsonValue> {
    delimited(
        whitespace,
        alt((
            map(object, JsonValue::Object),
            map(array, JsonValue::Array),
        )),
        opt(whitespace),
    )(i)
}

Finally, you can run the following test function to see if the final return result is normal:

#[cfg(test)]
mod test_json {
    use crate::nom::json::json::root;

    #[test]
    fn test_parse_json() {
        let data = "  { \"a\"\t: 42,
  \"b\": [ \"x\", \"y\", 12 ] ,
  \"c\": { \"hello\" : \"world\"}
  } ";
        println!("will try to parse valid JSON data:\n\n**********\n{}\n**********\n", data);
        //
        // will try to parse valid JSON data:
        //
        //     **********
        // { "a" : 42,
        //     "b": [ "x", "y", 12 ] ,
        //     "c": { "hello" : "world"}
        // }
        // **********


        println!(
            "parsing a valid file:\n{:#?}\n",
            root(data)
        );
        // parsing a valid file:
        // Ok(
        //     (
        //         "",
        //         Object(
        //             {
        //                 "c": Object(
        //                     {
        //                         "hello": Str(
        //                             "world",
        //                         ),
        //                     },
        //                 ),
        //                 "b": Array(
        //                     [
        //                         Str(
        //                             "x",
        //                         ),
        //                         Str(
        //                             "y",
        //                         ),
        //                         Num(
        //                             12.0,
        //                         ),
        //                     ],
        //                 ),
        //                 "a": Num(
        //                     42.0,
        //                 ),
        //             },
        //         ),
        //     ),
        // )
    }
}

Thus, the JSON parser implemented using nom is complete. No specific performance testing has been conducted here; interested readers can perform stress testing themselves.

Implementation Based on Pest#

JSON Model#

Similar to the previous implementation with nom, here we use an enum to construct JSON Value excluding whitespace.

#[derive(Debug, PartialEq)]
pub enum JsonValue<'a> {  
    Number(f64),
    String(&'a str),
    Boolean(bool),
    Array(Vec<JsonValue<'a>>),
    Object(Vec<(&'a str, JsonValue<'a>)>),
    Null,
}

In fact, declaring a lifetime is not strictly necessary; String would work too. We use &str here because it saves the type conversions later.

To make the parsed JsonValue easier to display and work with, we also add a serializer for it:

pub fn serialize_json_value(val: &JsonValue) -> String {
    use JsonValue::*; // bring the variants into scope for the match below

    match val {
        Number(n) => format!("{}", n),
        String(s) => format!("\"{}\"", s),
        Boolean(b) => format!("{}", b),
        Array(a) => {
            let contents: Vec<_> = a.iter().map(serialize_json_value).collect();
            format!("[{}]", contents.join(","))
        }
        Object(o) => {
            let contents: Vec<_> = o
                .iter()
                .map(|(key, value)| format!("\"{}\":{}", key, serialize_json_value(value)))
                .collect();
            format!("{{{}}}", contents.join(","))
        }
        Null => "null".to_string(),
    }
}

Note that for the composite types (arrays and objects), the serializer recurses to reach the values nested inside them.
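A minimal usage sketch of the serializer with the variants defined above:

let value = JsonValue::Array(vec![
    JsonValue::Number(1.0),
    JsonValue::String("a"),
    JsonValue::Boolean(true),
    JsonValue::Null,
]);
assert_eq!(serialize_json_value(&value), r#"[1,"a",true,null]"#);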

Pest Grammar Parsing#

Here we create json.pest to write the JSON standard we need to parse using Pest Grammar.

  1. Whitespace

According to the standard description, we implement it with the choice operator |:

WHITESPACE = _{ " " | "\t" | "\r" | "\n" }

There are a few special syntax treatments that need to be explained in advance:

  • A rule prefixed with _ is a silent rule. Unlike ordinary rules, it produces no token pairs of its own during parsing and does not appear in error reports; you only get token pairs from the outermost non-silent rules.
  • In pest, once a WHITESPACE rule is defined, it is implicitly matched between the elements of every sequence and repetition (except inside atomic rules).
    • Note the "except inside atomic rules" part; it matters for some rules later.
    • A similar implicit convention exists for a COMMENT rule, which lets comments in the input be skipped the same way whitespace is.

In short, every sequence and repetition in the rest of the grammar file, except inside atomic rules, skips whitespace during parsing.

  2. Number

Based on the standard description, we chain the parts of the expression with the sequence operator ~, making use of pest's built-in digit rules:

// 2. number
number = @{
    "-"?
    ~ ("0" | ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)
    ~ ("." ~ ASCII_DIGIT*)?
    ~ (^"e" ~ ("+"|"-")? ~ ASCII_DIGIT+)?
}

Here, too, some special syntax deserves explanation:

  • A rule prefixed with @ is an atomic rule, which has the following characteristics:
    • It is not affected by the implicit WHITESPACE handling above: interior whitespace is not skipped, and nothing is implicitly matched between the elements of a ~ sequence.
    • Rules called from inside an atomic rule are treated as atomic as well.
    • All rules matched inside an atomic rule are silent, so only the outermost match of the whole rule is produced.
  • The postfix operators ?, * and + mean at most one match, zero or more matches, and one or more matches, respectively.
  • The prefix operator ^ makes a string match case-insensitive.
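We can sanity-check the rule in isolation; this sketch assumes the JsonParser struct derived in the AST section below:

let num = JsonParser::parse(Rule::number, "-12.5e3").unwrap().next().unwrap();
assert_eq!(num.as_str(), "-12.5e3");
// a leading zero is only valid on its own, so matching stops after "0"
let zero = JsonParser::parse(Rule::number, "007").unwrap().next().unwrap();
assert_eq!(zero.as_str(), "0");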
  3. String

Based on the standard description, we combine three rules to express string parsing:

// 3. string
string = ${ "\"" ~ inner ~ "\"" }
inner = @{ char* }
char = {
    !( "\"" | "\\") ~ ANY
    | "\\" ~ ("\"" | "\\" | "/" | "b" | "f" | "n" | "r" | "t")
    | "\\" ~ ("u" ~ ASCII_HEX_DIGIT{4})
}

Pest has no built-in escape handling like nom's escaped function, so we spell the escape forms out in the grammar ourselves.

Here are some special syntax treatments used in the rules:

  • A rule prefixed with $ is a compound-atomic rule. It is similar to an atomic rule, with some differences worth noting:
    • It is likewise unaffected by the implicit WHITESPACE handling.
    • Unlike atomic rules, inner rules are not forced to become atomic or silent; in every other respect it behaves like an ordinary rule.
  • !(...) ~ ANY indicates matching any character except those given in the parentheses.
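Again assuming the JsonParser struct from the AST section below, a quick sketch of what the string rule produces:

let s = JsonParser::parse(Rule::string, r#""a\nb""#).unwrap().next().unwrap();
assert_eq!(s.as_str(), r#""a\nb""#);
// the compound-atomic string rule still exposes the `inner` pair, i.e. the content without quotes
assert_eq!(s.into_inner().next().unwrap().as_str(), r#"a\nb"#);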
  4. Value

Similar to the previous implementation, we first complete the basic types boolean and null:

// 4. boolean
boolean = {"true" | "false"}
// 5. null
null = {"null"}

Combining the parsing rules of each data type, we construct value. Since we do not care about the individual parse steps inside a value, we mark the rule silent with _ to reduce nesting:

value = _{ number | string | boolean | array | object | null}

For arrays and objects, we will describe them below.

  5. Array

According to the standard, we treat empty and non-empty arrays separately, using ~ and * to allow any number of comma-separated values:

// 6. array
array = {
    "[" ~ "]"|
    "[" ~ value ~ ("," ~ value)* ~ "]"
}
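A quick sketch for the array rule (same JsonParser assumption as above); note how the implicit WHITESPACE rule lets spaces appear between tokens:

let arr = JsonParser::parse(Rule::array, "[1, true, \"a\"]").unwrap().next().unwrap();
assert_eq!(arr.as_str(), "[1, true, \"a\"]");
// value is silent, so the children are the concrete number/boolean/string pairs
assert_eq!(arr.into_inner().count(), 3);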
  6. Object

Here we split the object's name/value pairs into a separate pair rule that reuses the earlier string and value rules; as with arrays, we distinguish empty objects from non-empty ones:

// 7. object
object = {
    "{" ~ "}"|
    "{" ~ pair ~ ("," ~ pair)* ~ "}"
}
pair = { string ~ ":" ~ value }
  7. Final Rule

Finally, we need a top-level rule representing the whole JSON document, whose only valid content is an object or an array.

Since we only need the parsed value itself (plus the EOI pair), we mark this rule as silent too:

// 9. json
json = _{ SOI ~ (object | array) ~ EOI}

Thus, we have completed the pest rules we need to write. The next step is to generate the parsing structure based on these rules to create the AST.

AST Generation and Parsing#

  1. Define the Structure Bound to Pest Rules

Pest binds the grammar file to a Rust struct via the grammar attribute of its derive macro:

use pest::Parser;
use pest_derive::Parser;

#[derive(Parser)]
#[grammar = "pest/json/json.pest"] // according to your project file
pub struct JsonParser;
  2. Construct the AST Generation Function

With the pest rules bound to JsonParser, we call pest::Parser's parse method to obtain the AST produced by the json rule:

pub fn root(content: &str) -> Result<JsonValue, Error<Rule>> {
    let json = JsonParser::parse(Rule::json, content)?.next().unwrap();
    // ......
}

Since json is a silent rule, the token pair we care about, a Pair<Rule>, is the first one produced, so a single call to next() retrieves it.

Our goal is to parse the AST to obtain the final JsonValue, so we need another method to parse this Pair<Rule>.

  3. AST Parsing Function

We add the parse_json_value function:

  • Using the AST result, it maps each of the pest rules we defined earlier onto the corresponding JsonValue variant.
  • The composite types (arrays and objects) require recursive calls to collect their nested values.
  • If the matched Rule is none of the JsonValue types, we bail out via unreachable!().
pub fn parse_json_value(pair: Pair<Rule>) -> JsonValue {
    match pair.as_rule() {
        Rule::number => JsonValue::Number(pair.as_str().parse().unwrap()),
        Rule::string => JsonValue::String(pair.into_inner().next().unwrap().as_str()),
        Rule::boolean => JsonValue::Boolean(pair.as_str().parse().unwrap()),
        Rule::null => JsonValue::Null,
        Rule::array => JsonValue::Array(pair.into_inner().map(parse_json_value).collect()),
        Rule::object => JsonValue::Object(
            pair.into_inner()
                .map(|pair| {
                    let mut inner_rules = pair.into_inner();
                    let key = inner_rules
                        .next() // get the pair rule
                        .unwrap()
                        .into_inner()
                        .next() // get the first token pair of the pair rule, i.e., key
                        .unwrap()
                        .as_str();
                    let value = parse_json_value(inner_rules.next().unwrap());
                    (key, value)
                })
                .collect()
        ),
        _ => unreachable!()
    }
}
  4. Final Top-Level Function and Testing

Based on the previous parsing function, we can complete the top-level function root to obtain the final parsing result:

pub fn root(content: &str) -> Result<JsonValue, Error<Rule>> {
    let json = JsonParser::parse(Rule::json, content)?.next().unwrap();
    Ok(parse_json_value(json))
}

Finally, you can run the following test function to verify the parsing results:

#[cfg(test)]
mod test {
    use crate::pest::json::json::{JsonValue, root, serialize_json_value};

    #[test]
    fn test_parse_json_by_pest() {
        let data = "  { \"a\"\t: 42,
  \"b\": [ \"x\", \"y\", 12 ] ,
  \"c\": { \"hello\" : \"world\"}
  } ";
        println!("will try to parse valid JSON data:\n\n**********\n{}\n**********\n", data);

        // will try to parse valid JSON data:
        //
        //     **********
        // { "a" : 42,
        //     "b": [ "x", "y", 12 ] ,
        //     "c": { "hello" : "world"}
        // }
        // **********


        let json_result: JsonValue = root(data).expect("unsuccessful JSON");

        println!("{}", serialize_json_value(&json_result))
        // {"a":42,"b":["x","y",12],"c":{"hello":"world"}}
    }
}

Thus, we have completed the JSON parser implemented based on pest.

Summary#

We have now built a JSON parser twice, once with nom and once with pest. Both are classic parser libraries that approach parsing with different but equally solid designs, and either can cover most parser-related needs.

After reading these two articles, you should have a basic grasp of how to quickly build a custom parser on top of Rust's parser libraries, sparing yourself the pain of hand-writing one. Along the way we have also covered concepts such as parsers, parser combinators, and the JSON standard.

Finally, thank you all for reading, and I hope this can help those interested in understanding parsers or similar parser needs.

In the future the repo will also gain a Redis protocol parser, but for reasons of length it is not covered here.

References#

https://zhuanlan.zhihu.com/p/146455601

https://www.json.org/json-zh.html

https://pest.rs/book/examples/json.html
