MATCH_RECOGNIZE

The MATCH_RECOGNIZE expression performs pattern recognition in a sequence of rows and returns the found results. This functionality is important for various business areas, such as fraud detection, pricing analysis in finance, and sensor data processing. This area is known as Complex Event Processing (CEP), and pattern recognition is a valuable tool for this. An example of how MATCH_RECOGNIZE works is provided in the link.

Data processing algorithm

The MATCH_RECOGNIZE expression performs the following actions:

  1. The input table is divided into non-overlapping groups. Each group consists of a set of rows from the input table with identical values in the columns listed after PARTITION BY.
  2. Each group is ordered according to the ORDER BY clause.
  3. Recognition of pattern from PATTERN is performed independently in each ordered group.
  4. Pattern search in the sequence of rows is a step-by-step process: rows are checked one by one if they fit the pattern. Among all matches starting in the earliest row, the one consisting of the largest number of rows is selected. If no matches were found starting in the earliest row, the search continues starting from the next row.
  5. After a match is found, the columns defined by expressions in the MEASURES block are calculated.
  6. Depending on the ROWS PER MATCH mode, one or all rows for the found match are output.
  7. The AFTER MATCH SKIP mode determines from which row the pattern recognition will resume.

Syntax

MATCH_RECOGNIZE (
    [ PARTITION BY <partition_1> [ ... , <partition_N> ] ]
    [ ORDER BY <sort_key_1> [ ... , <sort_key_N> ] ]
    [ MEASURES <expression_1> AS <column_name_1> [ ... , <expression_N> AS <column_name_N> ] ]
    [ ROWS PER MATCH ]
    [ AFTER MATCH SKIP ]
    PATTERN (<search_pattern>)
    DEFINE <variable_1> AS <predicate_1> [ ... , <variable_N> AS <predicate_N> ]
)

Here is a brief description of the SQL syntax elements of the MATCH_RECOGNIZE expression:

  • DEFINE: Block for declaring variables that describe the search pattern and the conditions that rows must meet for each variable.
  • PATTERN: Regular expressions describing the search pattern.
  • MEASURES: Defines the list of columns for the returned data. Each column is specified by an SQL expression for its computation.
  • ROWS PER MATCH: Determines the structure of the returned data and the number of rows for each match found.
  • AFTER MATCH SKIP: Defines the method of moving to the point of the next match search.
  • ORDER BY: Determines sorting of input data. Pattern search is performed within the data sorted according to the list of columns or expressions listed in <sort_key_1> [ ... , <sort_key_N> ].
  • PARTITION BY: Divides the input table according to the specified rules in accordance with <partition_1> [ ... , <partition_N> ]. Pattern search is performed independently in each part.

DEFINE

DEFINE <variable_1> AS <predicate_1> [ ... , <variable_N> AS <predicate_N> ]

DEFINE declares variables that are used to describe the desired pattern defined in PATTERN. Variables are named SQL statements evaluated over the input data. The syntax of the SQL statements in DEFINE is the same as the SQL statements of the WHERE predicate. For example, the button = 1 expression searches for rows with the value 1 in the button column. Any SQL expressions that can be used to perform a search, including aggregation functions (LAST, FIRST). For example, button > 2 AND zone_id < 12 or LAST(button) > 10.

In the example below, the SQL statement A.button = 1 is declared as variable A.

DEFINE
    A AS A.button = 1

Note

DEFINE does not currently support aggregation functions (e.g., AVG, MIN, or MAX) and PREV and NEXT functions.

When processing each row of data, all SQL statements describing variables in DEFINE are calculated. When the SQL-expression describing the corresponding variable from DEFINE gets the TRUE value, such a row is labeled with the DEFINE variable name and added to the list of rows subject to pattern matching.

Example

When defining variables in SQL expressions, you can reference other variables:

DEFINE
    A AS A.button = 1 AND LAST(A.zone_id) = 12,
    B AS B.button = 2 AND FIRST(A.zone_id) = 12

An input data row will be computed as variable A if it contains a button column with value 1 and the last row of the set of previously matched A has a zone_id column with value 12. The row will be computed as variable B if the data row contains a button column with value 2 and the first row of the set of previously matched variables A has a zone_id column with value 12.

PATTERN

PATTERN (<search_pattern>)

The PATTERN keyword describes the search pattern in the format derived from variables in the DEFINE section. The PATTERN syntax is similar to the one of regular expressions.

Warning

If a variable used in the PATTERN section has not been previously described in the DEFINE section, it is assumed that it is always TRUE.

You can use quantifiers in PATTERN. In regular expressions, they determine the number of repetitions of an element or subsequence in the matched pattern. Here is the list of supported quantifiers:

Quantifier Description
A+ One or more occurrences of A
A* Zero or more occurrences of A
A? Zero or one occurrence of A
B{n} Exactly n occurrences of B
C{n, m} From n to m occurrences of C
D{n,} At least n occurrences of D
(A|B) Occurrence of A or B in the data
(A|B){,m} From zero to m occurrences of A or B

Supported pattern search sequences:

Supported sequences Syntax Description
Sequence A B+ C+ D+ The system searches for the exact specified sequence, the occurrence of other variables within the sequence is not allowed. The pattern search is performed in the order of the pattern variables.
One of A | B | C Variables are listed in any order with a pipe | between them. The search is performed for any variable from the specified list.
Grouping (A | B)+ | C Variables inside round brackets are considered a single group. In this case, quantifiers apply to the entire group.
Exclusion from result {- A B+ C -} Rows found by the pattern in parentheses will be excluded from the result in ALL ROWS PER MATCH mode

Example

PATTERN (B1 E* B2+ B3)
DEFINE
    B1 as B1.button = 1,
    B2 as B2.button = 2,
    B3 as B3.button = 3

The DEFINE section describes the B1, B2, and B3 variables, while it does not describe E. Such notation allows interpreting E as any event, so the following pattern will be searched: one button 1 click, one or more button 2 clicks, and one button 3 click. Meanwhile, between a click of button 1 and button 2, any number of any other events may occur.

MEASURES

MEASURES <expression_1> AS <column_name_1> [ ... , <expression_N> AS <column_name_N> ]

MEASURES describes the set of returned columns when a pattern is found. A set of returned columns should be represented by an SQL expression with the aggregate functions over the variables declared in the DEFINE statement.

Example

The input data for the example are:

ts button device_id zone_id
100 1 3 0
200 1 3 1
300 2 2 0
400 3 1 1
MEASURES
    AGGREGATE_LIST(B1.zone_id * 10 + B1.device_id) AS ids,
    COUNT(DISTINCT B1.zone_id) AS count_zones,
    LAST(B3.ts) - FIRST(B1.ts) AS time_diff,
    42 AS meaning_of_life
PATTERN (B1+ B2 B3)
DEFINE
    B1 AS B1.button = 1,
    B2 AS B2.button = 2,
    B3 AS B3.button = 3

Result:

ids count_zones time_diff meaning_of_life
[3,13] 2 300 42

The ids column contains the list of zone_id * 10 + device_id values counted among the rows matched with the B1 variable. The count_zones column contains the number of unique values of the zone_id column among the rows matched with the B1 variable. Column time_diff contains the difference between the value of column ts in the last row of the set of rows matched with variable B3 and the value of column ts in the first row of the set of rows matched with variable B1. The meaning_of_life column contains the constant 42. Thus, an expression in MEASURES may contain aggregate functions over multiple variables, but there must be only one variable within a single aggregate function.

ROWS PER MATCH

ROWS PER MATCH determines the number of result rows for each match found, as well as the number of columns returned. The default mode is ONE ROW PER MATCH.

ONE ROW PER MATCH sets the ROWS PER MATCH mode to output one row for the match found. The structure of the returned data corresponds to the columns listed in PARTITION BY and MEASURES.

ALL ROWS PER MATCH sets the ROWS PER MATCH mode to output all rows of the match found except explicitly excluded by parentheses. In addition to the columns of the source table, the structure of the returned data includes the columns listed in the MEASURES.

Examples

The input data for all examples are:

ts button
100 1
200 2
300 3
Example 1
MEASURES
    FIRST(B1.ts) AS first_ts,
    FIRST(B2.ts) AS mid_ts,
    LAST(B3.ts) AS last_ts
ONE ROW PER MATCH
PATTERN (B1 {- B2 -} B3)
DEFINE
    B1 AS B1.button = 1,
    B2 AS B2.button = 2,
    B3 AS B3.button = 3

Result:

first_ts mid_ts last_ts
100 200 300
Example 2
MEASURES
    FIRST(B1.ts) AS first_ts,
    FIRST(B2.ts) AS mid_ts,
    LAST(B3.ts) AS last_ts
ALL ROWS PER MATCH
PATTERN (B1 {- B2 -} B3)
DEFINE
    B1 AS B1.button = 1,
    B2 AS B2.button = 2,
    B3 AS B3.button = 3

Result:

first_ts mid_ts last_ts button ts
100 200 300 1 100
100 200 300 3 300

AFTER MATCH SKIP

AFTER MATCH SKIP determines the method of transitioning from the found match to searching for the next one. In the AFTER MATCH SKIP TO NEXT ROW mode, the search for the next match starts after the first row of the previous one, while in the AFTER MATCH SKIP PAST LAST ROW mode it starts after the last row of the previous match. The default mode is PAST LAST ROW.

Examples

The input data for all examples are:

ts button
100 1
200 1
300 2
400 3
Example 1
MEASURES
    FIRST(B1.ts) AS first_ts,
    LAST(B3.ts) AS last_ts
AFTER MATCH SKIP TO NEXT ROW
PATTERN (B1+ B2 B3)
DEFINE
    B1 AS B1.button = 1,
    B2 AS B2.button = 2,
    B3 AS B3.button = 3

Result:

first_ts last_ts
100 400
200 400
Example 2
MEASURES
    FIRST(B1.ts) AS first_ts,
    LAST(B3.ts) AS last_ts
AFTER MATCH SKIP PAST LAST ROW
PATTERN (B1+ B2 B3)
DEFINE
    B1 AS B1.button = 1,
    B2 AS B2.button = 2,
    B3 AS B3.button = 3

Result:

first_ts last_ts
100 400

ORDER BY

ORDER BY <sort_key_1> [ ... , <sort_key_N> ]

<sort_key> ::= { <column_names> | <expression> }

ORDER BY determines sorting of the input data. That is, before all pattern search operations are performed, the data will be pre-sorted according to the specified keys or expressions. The syntax is similar to the ORDER BY SQL expression.

Example

ORDER BY CAST(ts AS Timestamp)

PARTITION BY

PARTITION BY <partition_1> [ ... , <partition_N> ]

<partition> ::= { <column_names> | <expression> }

PARTITION BY partitions the source data into multiple non-overlapping groups, each used for an independent pattern search. If the expression is not specified, all data is processed as a single group. Records with the same values of the columns listed after PARTITION BY fall into the same group.

Example

PARTITION BY device_id, zone_id

Limitations

Our support for the MATCH_RECOGNIZE expression will eventually comply with SQL-2016; currently, however, the following limitations apply:

  • MEASURES. Functions PREV/NEXT are not supported.
  • AFTER MATCH SKIP. Only the AFTER MATCH SKIP TO NEXT ROW and AFTER MATCH SKIP PAST LAST ROW modes are supported.
  • PATTERN. Union pattern variables are not implemented.
  • DEFINE. Aggregation functions are not supported.

Example of usage

Here is a hands-on example of pattern recognizing in a data table produced by an IoT device, where pressing its buttons triggers certain events. Let's assume you need to find and process the following sequence of button clicks: button 1, button 2, and button 3.

The structure of the data to transmit is as follows:

ts button device_id zone_id
600 3 17 3
500 3 4 2
400 2 17 3
300 2 4 2
200 1 17 3
100 1 4 2

The body of the SQL query looks like this:

PRAGMA FeatureR010="prototype"; -- pragma for enabling MATCH_RECOGNIZE

SELECT * FROM input MATCH_RECOGNIZE ( -- Performing pattern matching from input
    PARTITION BY device_id, zone_id -- Partitioning the input data into groups by columns device_id and zone_id
    ORDER BY ts -- Viewing events based on the ts column data sorted ascending
    MEASURES
        LAST(B1.ts) AS b1, -- Going to get the latest timestamp of clicking button 1 in the query results
        LAST(B3.ts) AS b3  -- Going to get the latest timestamp of clicking button 3 in the query results
    ONE ROW PER MATCH            -- Going to get one result row per match hit
    AFTER MATCH SKIP TO NEXT ROW -- Going to move to the next row once the match is found
    PATTERN (B1 B2+ B3) -- Searching for a pattern that includes one button 1 click, one or more button 2 clicks, and one button 3 click
    DEFINE
        B1 AS B1.button = 1, -- Defining the B1 variable as event of clicking button 1 (the button field equals 1)
        B2 AS B2.button = 2, -- Defining the B2 variable as event of clicking button 2 (the button field equals 2)
        B3 AS B3.button = 3  -- Defining the B3 variable as event of clicking button 3 (the button field equals 3)
);

Result:

b1 b3 device_id zone_id
100 500 4 2
200 600 17 3