Background#
Through the previous article (Build A Parser By Rust (Part 1)), we gained a basic understanding of parser concepts and the parser libraries available in Rust. In this article, we will put that knowledge into practice on a concrete case, to better understand the concepts and master their practical application. Along the way, we will also look at some design aspects of parser implementations, which should provide ideas for implementing custom parsers in the future.
We will walk through a simple implementation of a common and straightforward target, a JSON parser, using both nom and pest.
Note:
- The nom and pest dependency versions used below are the latest at the time of writing:

[dependencies]
pest = "2.6"
pest_derive = "2.6"
nom = "7.1.3"
- During the practical process below, only some important or previously unmentioned API meanings will be explained. For more details, please refer to the previous article or the library documentation.
- The parser constructed in this document only verifies the correctness of parsing and does not conduct related performance testing. Interested readers can try it themselves.
The specific source code can be found at:
build parser by rust
JSON Standard#
JSON (JavaScript Object Notation) is a lightweight data interchange format. It represents data in a text format that is easy for humans to read and write. The JSON format consists of key-value pairs and uses a syntax similar to JavaScript objects, hence the name JavaScript Object Notation. JSON is commonly used for transmitting data over networks, storing configuration information, and exchanging data between different systems. It is very common in web development, such as for data transmission in APIs and front-end and back-end data interaction. JSON is also widely used in mobile application development, big data processing, and other fields. Due to its simplicity and readability, JSON has become one of the standard formats for data exchange in many applications.
If we want to implement a JSON parser, we first need to understand the JSON standard protocol, which is mainly divided into the following six parts:
| JSON Element | Description |
| --- | --- |
| Whitespace | Whitespace can be inserted between any pair of tokens |
| Number | Numbers are much like C or Java numbers, except that octal and hexadecimal formats are not used |
| String | A string is a sequence of zero or more Unicode characters enclosed in double quotes, using backslash escapes. A character is represented as a single-character string |
| Value | A value can be a string in double quotes, a number, true, false, null, an object, or an array. These structures can be nested |
| Array | An array is an ordered collection of values. It begins with a left bracket `[` and ends with a right bracket `]`, with values separated by commas `,` |
| Object | An object is an unordered collection of name/value pairs. It begins with a left brace `{` and ends with a right brace `}`. Each name is followed by a colon `:`, and name/value pairs are separated by commas `,` |
As we can see in the JSON standard protocol, the definitions of data types and specific parsing situations are very clear and relatively simple.
Based on this standard, we will implement it simply using the previously learned nom and pest. The specific code paths are in nom/json and pest/json.
Implementation Based on Nom#
JSON Model#
Here we use an enum to represent JSON Value excluding whitespace:
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum JsonValue {
Str(String),
Boolean(bool),
Num(f64),
Array(Vec<JsonValue>),
Object(HashMap<String, JsonValue>),
Null,
}
Specific Type Parsing#
- Whitespace
From the previous section, we know that a whitespace element can be any of the following; when parsing, we consume input until some other element is encountered, and what was consumed is the whitespace:
- space -> `" "`
- linefeed -> `"\n"`
- carriage return -> `"\r"`
- horizontal tab -> `"\t"`
nom offers two ways to implement this: one is to directly use the built-in function `multispace0`, and the other is to build a parsing function with `take_while`. As mentioned earlier, `take_while` is a predicate parser that keeps consuming input until the predicate is no longer satisfied.
use nom::bytes::complete::take_while;
use nom::IResult;

// whitespace JSON whitespace parsing (equivalent to nom's built-in multispace0)
fn whitespace(i: &str) -> IResult<&str, &str> {
let chars = " \t\r\n";
take_while(move |c| chars.contains(c))(i)
}
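A quick sanity check of its behavior (a minimal sketch; note that, like `multispace0`, a zero-length match is also a success, which is what allows it to sit between any two tokens):

#[test]
fn test_whitespace() {
    assert_eq!(whitespace(" \t\r\n{"), Ok(("{", " \t\r\n")));
    // no leading whitespace: still Ok, just with an empty match
    assert_eq!(whitespace("42"), Ok(("42", "")));
}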
- Number
From the previous section, we can see that JSON supports positive and negative numbers, decimals, and scientific notation. Although we could parse these using combinations of parsers like `alt` and `be_f64`, in this scenario it is more common to use the built-in function `double` provided by nom. Its usage can be seen in the example:
use nom::number::complete::double;
use nom::error::ErrorKind;
use nom::Err;
let parser = |s| {
double(s)
};
assert_eq!(parser("1.1"), Ok(("", 1.1)));
assert_eq!(parser("123E-02"), Ok(("", 1.23)));
assert_eq!(parser("123K-01"), Ok(("K-01", 123.0)));
assert_eq!(parser("abc"), Err(Err::Error(("abc", ErrorKind::Float))));
- String
Here we need to handle both the quoted string as a whole and the content inside the quotes:
- First, note that in the string rule there are three possible cases to the right of the opening quote. Apart from the empty-string case (nothing between the quotes), all of them can be handled by combinators that strip the quotes and yield the content between them. There are many possible combinator arrangements; here are two common approaches:
  - `alt` + `delimited`: parse according to the overall structure of the string
  - `preceded` + `cut` + `terminated`: parse according to the order of the characters
// string entire string parsing
fn string(i: &str) -> IResult<&str, &str> {
context(
"string",
preceded(char('\"'), cut(terminated(parse_str, char('\"')))))(i)
// parse_str will be described later
}
// alternative implementation: alt + delimited
fn string(i: &str) -> IResult<&str, &str> {
context(
"string",
alt((tag("\"\""), delimited(tag("\""), parse_str, tag("\"")))),
)(i)
}
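A quick usage check (a minimal sketch using the `preceded` + `cut` + `terminated` version; the quotes are consumed and only the inner content is returned):

#[test]
fn test_string() {
    assert_eq!(string("\"hello\": 1"), Ok((": 1", "hello")));
}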
The `cut` combinator's role is to prevent backtracking: once a parse failure occurs inside it, parsing stops immediately instead of trying other possible parse paths. This is very useful for avoiding unnecessary performance overhead and misleading parse errors. Here is an official example for better understanding:

use nom::branch::alt;
use nom::character::complete::{digit1, one_of};
use nom::combinator::{cut, rest};
use nom::error::{Error, ErrorKind};
use nom::sequence::preceded;
use nom::{Err, IResult};

fn parser(input: &str) -> IResult<&str, &str> {
    alt((
        preceded(one_of("+-"), cut(digit1)),
        rest
    ))(input)
}

assert_eq!(parser("+10 ab"), Ok((" ab", "10")));
assert_eq!(parser("ab"), Ok(("", "ab")));
assert_eq!(parser("+"), Err(Err::Failure(Error { input: "", code: ErrorKind::Digit })));
- After obtaining the content inside the quotes, we still need to handle escape characters to get the actual content. nom provides a dedicated function `escaped` for this. It takes parameters `escaped(normal, control, escapable)`, which mean:
  - `normal`: a parser matching ordinary characters; it must not accept the control character
  - `control`: the control character (e.g., the `\` used in most languages)
  - `escapable`: a parser matching the characters that may follow the control character
// Official example
use nom::bytes::complete::escaped;
use nom::character::complete::{digit1, one_of};

fn esc(s: &str) -> IResult<&str, &str> {
    // digit1: built-in parser matching at least one digit
    // '\\': the backslash character '\'
    // r#""n\"#: a raw string literal (r#"..."#) saying the escapable characters are ", n and \
    escaped(digit1, '\\', one_of(r#""n\"#))(s)
}
assert_eq!(esc("123;"), Ok((";", "123")));
assert_eq!(esc(r#"12\"34;"#), Ok((";", r#"12\"34"#)));
- Finally, based on the `escaped` function and the JSON standard, we construct the `parse_str` function. In this scenario its three arguments mean:
  - `normal`: matches "any codepoint except `"` or `\` or control characters"
  - `'\\'`: the control character; in JSON the escape character is also the backslash
  - `escapable`: matches the escape characters described in the standard, such as `"`, `\`, `/`, `b`, etc. Note that hexadecimal escapes (`\uXXXX`) need to be handled separately.
  - In particular, the built-in function `peek` used in the hex handling does not consume input after parsing, which allows the subsequent parsing to proceed normally.
// parse_str single string parsing
fn parse_str(i: &str) -> IResult<&str, &str> {
escaped(normal, '\\', escapable)(i)
}
// normal ordinary character parsing
fn normal(i: &str) -> IResult<&str, &str> {
take_till1(|c: char| c == '\\' || c == '"' || c.is_ascii_control())(i)
}
// escapable escape character parsing
fn escapable(i: &str) -> IResult<&str, &str> {
context(
"escaped",
alt((
tag("\""),
tag("\\"),
tag("/"),
tag("b"),
tag("f"),
tag("n"),
tag("r"),
tag("t"),
hex
)))(i)
}
// hex hexadecimal escape parsing: matches the "uXXXX" form
// (peek does not consume the 'u', so take_while_m_n consumes all five
// characters: the 'u' plus four hex digits)
fn hex(i: &str) -> IResult<&str, &str> {
context(
"hex",
preceded(
peek(tag("u")),
take_while_m_n(5, 5, |c: char| c.is_ascii_hexdigit() || c == 'u'),
))(i)
}
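Two quick checks (a minimal sketch): `escaped` only validates and slices the input, it does not decode the escapes, so the returned slice still contains the backslashes:

#[test]
fn test_parse_str() {
    // stops at the closing quote; escapes are validated but left as-is
    assert_eq!(parse_str(r#"say \"hi\"" rest"#), Ok((r#"" rest"#, r#"say \"hi\""#)));
    assert_eq!(parse_str(r#"caf\u00e9"#), Ok(("", r#"caf\u00e9"#)));
}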
- Value
Having implemented the parsers for whitespace, numbers, and strings, we can now complete the basic types boolean and null:
// boolean boolean data type parsing
fn boolean(i: &str) -> IResult<&str, bool> {
alt((
value(true, tag("true")),
value(false, tag("false"))
))(i)
}
// null null parsing
fn null(i: &str) -> IResult<&str, JsonValue> {
map(tag("null"), |_| JsonValue::Null)(i)
}
Now, based on the type parsers implemented so far, we can construct the value parser (the composite types will be implemented later).
In the implementation below there is one piece of syntax that may be hard to read at first, so a brief explanation (see the sketch after this list):
- The `map` function takes a nom parser and a closure `FnMut(O1) -> O2` as parameters.
- A tuple enum variant is itself a constructor function, so it can be passed directly in place of a closure.
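For illustration, the two hypothetical parsers below are equivalent; the second simply passes the `JsonValue::Num` constructor where the first writes out a closure:

use nom::combinator::map;
use nom::number::complete::double;
use nom::IResult;

// JsonValue::Num is itself a fn(f64) -> JsonValue
fn num_explicit(i: &str) -> IResult<&str, JsonValue> {
    map(double, |n| JsonValue::Num(n))(i)
}

fn num_constructor(i: &str) -> IResult<&str, JsonValue> {
    map(double, JsonValue::Num)(i)
}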
// json_value JsonValue parsing
fn json_value(i: &str) -> IResult<&str, JsonValue> {
context(
"json value",
delimited(
whitespace,
alt((
map(string, |s| JsonValue::Str(String::from(s))),
map(double, JsonValue::Num),
map(boolean, JsonValue::Boolean),
null,
map(array, JsonValue::Array),
map(object, JsonValue::Object)
)),
whitespace,
),
)(i)
}
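A quick usage check (a minimal sketch; the surrounding `whitespace` inside `json_value` is what allows the padding here):

#[test]
fn test_json_value() {
    assert_eq!(json_value(" true ,"), Ok((",", JsonValue::Boolean(true))));
    assert_eq!(json_value("\"x\""), Ok(("", JsonValue::Str(String::from("x")))));
}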
- Array
According to the standard description of arrays:
- First, use `delimited` to strip the left and right brackets, making it easier to parse the content in between.
- Then use the built-in function `separated_list0` to parse the content within the brackets into a `Vec<JsonValue>`:
// array array parsing
fn array(i: &str) -> IResult<&str, Vec<JsonValue>> {
context(
"array",
delimited(
tag("["),
separated_list0(tag(","), delimited(whitespace, json_value, whitespace)),
tag("]"),
),
)(i)
}
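A quick usage check (a minimal sketch):

#[test]
fn test_array() {
    assert_eq!(
        array(r#"[ "a" , 1 ]"#),
        Ok((
            "",
            vec![JsonValue::Str(String::from("a")), JsonValue::Num(1.0)]
        ))
    );
}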
- Object
For a complex parser like object, we can implement it by splitting it into sub-parsers:
- First, parse the key-value pair format in the object using `separated_pair` + `preceded`:
// key_value kv format parsing
fn key_value(i: &str) -> IResult<&str, (&str, JsonValue)> {
separated_pair(preceded(whitespace, string), cut(preceded(whitespace, char(':'))), json_value)(i)
}
- Then, for the overall structure of the object, the parsing idea is:
- Left brace -> content within the braces -> parse according to the (previously implemented) key-value pair format to construct an array -> convert the array to a HashMap type -> right brace
// object object format parsing
fn object(i: &str) -> IResult<&str, HashMap<String, JsonValue>> {
context(
"object",
preceded(
char('{'),
cut(terminated(
map(
separated_list0(preceded(whitespace, char(',')), key_value),
|tuple_vec| {
tuple_vec.into_iter().map(|(k, v)| (String::from(k), v)).collect()
},
),
preceded(whitespace, char('}')),
)),
),
)(i)
}
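A quick usage check (a minimal sketch; assumes `use std::collections::HashMap;` as above):

#[test]
fn test_object() {
    let mut expected = HashMap::new();
    expected.insert(String::from("a"), JsonValue::Num(1.0));
    assert_eq!(object(r#"{ "a" : 1 }"#), Ok(("", expected)));
}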
Top-Level Parsing Function#
We have now implemented all the element types of the JSON standard. Finally, we just need to construct a top-level function to drive the parser.
Here, the outermost JSON value can be either an object or an array, so our top-level function is:
fn root(i: &str) -> IResult<&str, JsonValue> {
delimited(
whitespace,
alt((
map(object, JsonValue::Object),
map(array, JsonValue::Array),
)),
opt(whitespace),
)(i)
}
Finally, you can run the following test function to see if the final return result is normal:
#[cfg(test)]
mod test_json {
use crate::nom::json::json::root;
#[test]
fn test_parse_json() {
let data = " { \"a\"\t: 42,
\"b\": [ \"x\", \"y\", 12 ] ,
\"c\": { \"hello\" : \"world\"}
} ";
println!("will try to parse valid JSON data:\n\n**********\n{}\n**********\n", data);
//
// will try to parse valid JSON data:
//
// **********
// { "a" : 42,
// "b": [ "x", "y", 12 ] ,
// "c": { "hello" : "world"}
// }
// **********
println!(
"parsing a valid file:\n{:#?}\n",
root(data)
);
// parsing a valid file:
// Ok(
// (
// "",
// Object(
// {
// "c": Object(
// {
// "hello": Str(
// "world",
// ),
// },
// ),
// "b": Array(
// [
// Str(
// "x",
// ),
// Str(
// "y",
// ),
// Num(
// 12.0,
// ),
// ],
// ),
// "a": Num(
// 42.0,
// ),
// },
// ),
// ),
// )
}
}
Thus, the JSON parser implemented using nom is complete. No specific performance testing has been conducted here; interested readers can perform stress testing themselves.
Implementation Based on Pest#
JSON Model#
Similar to the previous implementation with nom, here we use an enum to construct JSON Value excluding whitespace.
#[derive(Debug, PartialEq)]
pub enum JsonValue<'a> {
Number(f64),
String(&'a str),
Boolean(bool),
Array(Vec<JsonValue<'a>>),
Object(Vec<(&'a str, JsonValue<'a>)>),
Null,
}
Strictly speaking, it is not necessary to declare a lifetime here; you could simply use `String`. We declare one because `&str` is introduced, which saves the type-conversion work later.
To make the parsed JsonValue easier to display and process, we add a serializer for it:
pub fn serialize_json_value(val: &JsonValue) -> String {
use JsonValue::*; // so the variants can be used without the JsonValue:: prefix
match val {
Number(n) => format!("{}", n),
String(s) => format!("\"{}\"", s),
Boolean(b) => format!("{}", b),
Array(a) => {
let contents: Vec<_> = a.iter().map(serialize_json_value).collect();
format!("[{}]", contents.join(","))
}
Object(o) => {
let contents: Vec<_> = o
.iter()
.map(|(key, value)| format!("\"{}\":{}", key, serialize_json_value(value)))
.collect();
format!("{{{}}}", contents.join(","))
}
Null => "null".to_string(),
}
}
It is important to note that for the composite types, array and object, recursion is needed to serialize the values nested inside them.
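A quick usage check (a minimal sketch):

#[test]
fn test_serialize_json_value() {
    let val = JsonValue::Object(vec![(
        "nums",
        JsonValue::Array(vec![JsonValue::Number(1.0), JsonValue::Null]),
    )]);
    assert_eq!(serialize_json_value(&val), r#"{"nums":[1,null]}"#);
}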
Pest Grammar Parsing#
Here we create `json.pest` and describe, in pest grammar, the JSON standard we need to parse.
- Whitespace
According to the standard description, we implement it using the choice operator `|`:
WHITESPACE = _{ " " | "\t" | "\r" | "\n" }
There are a few special pieces of syntax that should be explained up front:
- Marking a rule with a leading `_` creates a silent rule. Unlike ordinary rules, a silent rule produces no token pairs of its own during parsing and does not report errors; only the outermost pair of the whole parse is ultimately produced.
- In pest, if a `WHITESPACE` rule is defined, it is implicitly inserted between every sequence and repetition (except inside atomic rules).
  - Note the phrase "except inside atomic rules"; rules related to this appear later.
- A similar implicit convention exists for the `COMMENT` rule, which handles comments hidden inside the input.
In summary, every sequence and repetition in the rest of the file, except those in atomic rules, will ignore whitespace during parsing.
- Number
Based on the standard description, we use the sequence operator `~` to chain the different parsing conditions of the expression, and we can make use of pest's built-in rules related to digits:
// 2. number
number = @{
"-"?
~ ("0" | ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*)
~ ("." ~ ASCII_DIGIT*)?
~ (^"e" ~ ("+"|"-")? ~ ASCII_DIGIT+)?
}
Here, too, there is special syntax that should be explained for better understanding:
- Marking a rule with `@` creates an atomic rule, which has the following characteristics:
  - It is not affected by the `WHITESPACE` handling mentioned earlier, meaning internal whitespace is not hidden, and no characters are skipped between the parts of a `~` sequence.
  - Other rules called from inside an atomic rule are also treated as atomic.
  - All rules matched inside an atomic rule are treated as silent, so only the outermost parse result of the whole rule is produced.
- The postfix operators `?`, `*`, and `+` mean match at most once (optional), match zero or more times, and match one or more times, respectively.
- The prefix operator `^` on a string literal makes the match case-insensitive.
- String
Based on the standard description, we will combine three rules to clarify the parsing of strings:
// 3. string
string = ${ "\"" ~ inner ~ "\"" }
inner = @{ char* }
char = {
!( "\"" | "\\") ~ ANY
| "\\" ~ ("\"" | "\\" | "/" | "b" | "f" | "n" | "r" | "t")
| "\\" ~ ("u" ~ ASCII_HEX_DIGIT{4})
}
Unlike nom, pest does not provide a built-in helper for handling escape characters, so we have to spell out the escape forms in the rules ourselves.
Here are the special pieces of syntax used in these rules:
- Marking a rule with `$` creates a compound atomic rule. It is similar to an atomic rule, but with differences worth noting:
  - It is likewise not affected by the `WHITESPACE` handling mentioned earlier.
  - Unlike atomic rules, inner rules are not forced to be atomic or silent; in every other respect it behaves like an ordinary rule.
- `!(...) ~ ANY` matches any single character except those given in the parentheses.
- Value
Similar to the previous implementation, we first complete the basic types boolean and null:
// 4. boolean
boolean = {"true" | "false"}
// 5. null
null = {"null"}
Combining the parsing rules of the individual data types into value, and considering that we do not care about the inner parse tree of the different alternatives, we mark it as a silent rule with `_` to reduce nesting:
value = _{ number | string | boolean | array | object | null}
For arrays and objects, we will describe them below.
- Array
According to the standard description, we separate empty arrays from non-empty arrays, using `~` and `*` to express that multiple values may exist:
// 6. array
array = {
"[" ~ "]"|
"[" ~ value ~ ("," ~ value)* ~ "]"
}
- Object
Here we split the object's name/value entries into a separate pair rule, reusing the earlier string and value rules, and, as with arrays, distinguish between empty and non-empty objects:
// 7. object
object = {
"{" ~ "}"|
"{" ~ pair ~ ("," ~ pair)* ~ "}"
}
pair = { string ~ ":" ~ value }
- Final Rule
Finally, we need a top rule to represent the entire JSON document; the only valid top-level content is an object or an array.
Also, since we only need the parsed values themselves, not the SOI/EOI markers, we mark this rule as silent too:
// 9. json
json = _{ SOI ~ (object | array) ~ EOI}
Thus, we have completed the pest rules we need to write. The next step is to generate the parsing structure based on these rules to create the AST.
AST Generation and Parsing#
- Define the Structure Bound to Pest Rules
Pest binds the grammar file to a Rust struct via the derive macro and the grammar attribute:
use pest::Parser;
use pest_derive::Parser;
#[derive(Parser)]
#[grammar = "pest/json/json.pest"] // according to your project file
pub struct JsonParser;
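With the derive in place, individual rules can already be exercised (a minimal sanity check, assuming the json.pest grammar above):

#[test]
fn test_number_rule() {
    // number is atomic, so the whole literal comes back as a single token pair
    let pair = JsonParser::parse(Rule::number, "-12.5e3")
        .expect("number should parse")
        .next()
        .unwrap();
    assert_eq!(pair.as_str(), "-12.5e3");
}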
- Construct the AST Generation Function
By binding the pest rules to JsonParser, we can use the parse method of the `pest::Parser` trait to obtain the AST generated by the json rule defined above:
pub fn root(content: &str) -> Result<JsonValue, Error<Rule>> {
let json = JsonParser::parse(Rule::json, content)?.next().unwrap();
// ......
}
Since json is a silent rule, the result contains only the final generated token pair, i.e., a `Pair<Rule>`, so we only need to call `next()` once.
Our goal is to turn this AST into the final JsonValue, so we need another method to process this `Pair<Rule>`.
- Parsing AST Function
We add the `parse_json_value` function:
- Using the obtained AST result, we assign values to each type in JsonValue based on the pest rules we defined earlier.
- The handling of composite types like arrays and objects requires recursive functions to search for nested values.
- If the matched Rule is not one of the types in JsonValue, we exit with an error.
pub fn parse_json_value(pair: Pair<Rule>) -> JsonValue {
match pair.as_rule() {
Rule::number => JsonValue::Number(pair.as_str().parse().unwrap()),
Rule::string => JsonValue::String(pair.into_inner().next().unwrap().as_str()),
Rule::boolean => JsonValue::Boolean(pair.as_str().parse().unwrap()),
Rule::null => JsonValue::Null,
Rule::array => JsonValue::Array(pair.into_inner().map(parse_json_value).collect()),
Rule::object => JsonValue::Object(
pair.into_inner()
.map(|pair| {
let mut inner_rules = pair.into_inner();
let key = inner_rules
.next() // get the pair rule
.unwrap()
.into_inner()
.next() // get the first token pair of the pair rule, i.e., key
.unwrap()
.as_str();
let value = parse_json_value(inner_rules.next().unwrap());
(key, value)
})
.collect()
),
_ => unreachable!()
}
}
- Final Top-Level Function and Function Testing
Based on the parsing function above, we can complete the top-level function `root` to obtain the final parse result:
pub fn root(content: &str) -> Result<JsonValue, Error<Rule>> {
let json = JsonParser::parse(Rule::json, content)?.next().unwrap();
Ok(parse_json_value(json))
}
Finally, you can run the following test function to verify the parsing results:
#[cfg(test)]
mod test {
use crate::pest::json::json::{JsonValue, root, serialize_json_value};
#[test]
fn test_parse_json_by_pest() {
let data = " { \"a\"\t: 42,
\"b\": [ \"x\", \"y\", 12 ] ,
\"c\": { \"hello\" : \"world\"}
} ";
println!("will try to parse valid JSON data:\n\n**********\n{}\n**********\n", data);
// will try to parse valid JSON data:
//
// **********
// { "a" : 42,
// "b": [ "x", "y", 12 ] ,
// "c": { "hello" : "world"}
// }
// **********
let json_result: JsonValue = root(data).expect("unsuccessful JSON");
println!("{}", serialize_json_value(&json_result))
// {"a":42,"b":["x","y",12],"c":{"hello":"world"}}
}
}
Thus, we have completed the JSON parser implemented based on pest.
Summary#
We have practiced constructing a JSON parser using nom and pest. Both are classic parser libraries, each built on different but excellent design ideas, and both can satisfy most parser-related requirements.
Having read these two articles, you should now have a basic grasp of how to quickly build custom parsers with Rust's parser libraries, freeing yourself from the pain of writing parsers by hand. Along the way, we have also covered concepts such as parsers, parser combinators, and the JSON standard.
Finally, thank you all for reading, and I hope this can help those interested in understanding parsers or similar parser needs.
In the future, the repo will also add a Redis protocol parser, but for reasons of length it is not covered here.