Pipelines

Regular Expressions

The string-regexp-extract step runs a regular expression against a pipeline input, such as the query, and turns the matched text into structured variables. Patterns use RE2 syntax.

The step is mostly used to extract structured information from queries, also referred to as query scoping. The examples below cover common use cases. More advanced NLP models are also supported, but for many cases a regular expression is sufficient.

Extracting a size

A common requirement is to extract a pattern such as "size 14" from a query like "size 14 shoes". In this case, no products mention "size 14" in their text, so search implementations often fail on these types of queries. A pipeline can handle this in a few lines of YAML.

Below is a sample query from which we will extract the size information:

{
    "q": "nike size 14 shoes"
}

The regular expression looks for the pattern size <sizeValue>. The matchName param names the capture group to use (sizeValue), and the captured value is written to the size variable, which is bound to the match param.

- id: string-regexp-extract
  params:
    match:
      bind: size
    matchName:
      constant: sizeValue
    outText:
      bind: q
    pattern:
      constant: size (?P<sizeValue>\d+(\.\d+)?)
    text:
      bind: q

The variables after this step look like:

{
    "q": "nike shoes",
    "size": "14"
}
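The effect of this step on the sample query can be reproduced locally with a short Python sketch. It is only an approximation under stated assumptions: Python's re module is not RE2, although it accepts this pattern with the same named-group syntax; the function name extract_size is hypothetical; the decimal point is escaped as \. so that it matches a literal dot; and collapsing leftover whitespace is an assumption made so the result matches the output shown above.

```python
import re

# Pattern from the step above, with the decimal point escaped.
PATTERN = re.compile(r"size (?P<sizeValue>\d+(\.\d+)?)")


def extract_size(variables):
    """Hypothetical stand-in for the string-regexp-extract step."""
    result = dict(variables)
    m = PATTERN.search(result["q"])
    if m is None:
        # Assumption: with no match, the variables are left unchanged.
        return result
    # matchName selects the capture group; match binds it to "size".
    result["size"] = m.group("sizeValue")
    # text and outText are both bound to q, so the matched text is removed.
    remaining = result["q"][:m.start()] + result["q"][m.end():]
    # Assumption: leftover whitespace is collapsed.
    result["q"] = " ".join(remaining.split())
    return result


print(extract_size({"q": "nike size 14 shoes"}))
```

Running the pattern against a handful of real queries in this way is a quick check before putting it into a pipeline.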

This is useful because we can now filter on size whenever a size variable exists. This is done using the add-filter step:

- id: add-filter
  params:
    filter:
      constant: option_size ~ [size]
  condition: size

Note: The condition ensures the filter is only added if the size variable exists. The expression option_size ~ [size] matches products whose option_size array field contains the value of the size variable, in this case "14". The query executed against the indexes would be nike shoes.
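The behaviour described in this note can be modelled in a few lines of Python. This is an illustrative sketch only, not how the search engine evaluates filters: the functions add_filter and matches_size and the sample products are hypothetical, and the containment check is a plain reading of "the option_size array contains the value".

```python
def add_filter(filters, variables, expression, condition):
    """Hypothetical model of add-filter: append only if `condition` is set."""
    if variables.get(condition):
        return filters + [expression]
    return filters


def matches_size(product, size):
    """Models option_size ~ [size]: the array field contains the value."""
    return size in product.get("option_size", [])


# Made-up sample data for illustration.
products = [
    {"title": "nike runner", "option_size": ["13", "14"]},
    {"title": "nike court", "option_size": ["9", "10"]},
]

with_size = add_filter([], {"q": "nike shoes", "size": "14"},
                       "option_size ~ [size]", "size")
without_size = add_filter([], {"q": "nike shoes"},
                          "option_size ~ [size]", "size")

print(with_size)     # filter added because size exists
print(without_size)  # no filter added
print([p["title"] for p in products if matches_size(p, "14")])
```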

Note: text defines the input variable and outText defines the output variable. In this case both are bound to q, which is why the matched text was removed from the query. Binding outText to a different variable would keep the original query unchanged.

Extracting a year

Below is an example illustrating how a year can be extracted from a free text query.

{
    "q": "2018 tax legislation"
}

For the above query, we have a year field on every document, but the year is not mentioned in other fields that are typically indexed such as the title and description.

Ideally, we would filter on the year and use the rest of the query as a normal query. The first part is to extract the year:

- id: string-regexp-extract
  params:
    match:
      bind: yearFilter
    matchTemplate:
      const: year = ${yearValue}
    outText:
      bind: q
    pattern:
      const: (?P<yearValue>\b(19|20)[0-9]{2}\b)
    text:
      bind: q

This converts the input into the following:

{
    "q": "tax legislation",
    "yearFilter": "year = 2018"
}
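The same transformation can be approximated in Python to try the pattern and template against sample queries. This is a sketch under assumptions: Python's re module stands in for RE2 (the pattern is valid in both), string.Template is used only because it happens to share the ${name} placeholder syntax, the function name extract_year_filter is hypothetical, and whitespace collapsing is assumed so the result matches the output shown above.

```python
import re
from string import Template

# Pattern and template taken from the step above.
PATTERN = re.compile(r"(?P<yearValue>\b(19|20)[0-9]{2}\b)")
MATCH_TEMPLATE = Template("year = ${yearValue}")


def extract_year_filter(variables):
    """Hypothetical stand-in for string-regexp-extract with matchTemplate."""
    result = dict(variables)
    m = PATTERN.search(result["q"])
    if m is None:
        # Assumption: with no match, the variables are left unchanged.
        return result
    # Fill the template from the named capture groups.
    result["yearFilter"] = MATCH_TEMPLATE.substitute(m.groupdict())
    # Remove the matched year from the query and tidy whitespace.
    remaining = result["q"][:m.start()] + result["q"][m.end():]
    result["q"] = " ".join(remaining.split())
    return result


print(extract_year_filter({"q": "2018 tax legislation"}))
```

The \b word boundaries keep the pattern from matching four digits inside a longer number, so a query such as "route 20180 map" is left untouched in this sketch.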

This version uses the matchTemplate option to insert the extracted year, ${yearValue}, directly into a filter expression. The step could also have extracted just the year, but this example illustrates conversion into an expression; the template could be any other string pattern.

The filter can then be added simply using:

- id: add-filter
  params:
    filter:
      bind: yearFilter
  condition: yearFilter

Note: The condition checks that the yearFilter variable exists before the filter is added.