Patterns

From DKPLP Doc

This article is about the patterns functionality. For the configuration tab, see Patterns (configuration tab).

Patterns are the building blocks of pattern sets and are based on Java regular expressions. Their syntax differs slightly depending on the input type.

Patterns for Text

Text patterns are regular expressions that are applied to individual lines of text.
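The following is a minimal sketch of line-by-line matching, not the actual implementation. The class name, the method name, the sample input and the use of find() (rather than a full-line match) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the documented tool.
class TextLinePatternSketch {
    // Applies one regular expression to each line separately and collects group 1.
    static List<String> captureFromLines(String regex, String text) {
        Pattern pattern = Pattern.compile(regex);
        List<String> captures = new ArrayList<>();
        for (String line : text.split("\\R")) {      // \R matches any line break
            Matcher m = pattern.matcher(line);
            if (m.find()) {                          // assumption: partial match is enough
                captures.add(m.group(1));
            }
        }
        return captures;
    }

    public static void main(String[] args) {
        String log = "user=alice action=login\nheartbeat\nuser=bob action=logout";
        System.out.println(captureFromLines("user=(\\w+)", log)); // prints [alice, bob]
    }
}
```

Lines that do not match the expression, such as the second line of the sample input, simply contribute no capture.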

Patterns in XML

An additional syntax is layered on top of regular expressions to specify which XML elements contain the data of interest. The body specifies the base that has to be satisfied before the head starts attempting to match anything. Each head element has two parts: the first is the regular expression applied to the element name, and the second is the regular expression applied to the element's contents.

BNF

The following BNF describes the syntax.

xmlPattern ::= <body> ":" <head>
body ::= <element> <moreElements>
element ::= <regexp>
moreElements ::= ">" <element> <moreElements>
               | ''empty''
regexp ::= "!" <regexpContents>
         | <regexpContents>
regexpContents ::= <character> <moreCharacters>
character ::= ''[^:<>#"~]''
            | "#".
moreCharacters ::= <character> <moreCharacters>
                 | ''empty''
head ::= <headCondition> <moreHeadConditions>
headCondition ::= """ <headCondition> <moreHeadConditions> """
                | <atomicHeadCondition>
atomicHeadCondition ::= <regexp> "<" <regexp>
moreHeadConditions ::= ">" <headCondition> <moreHeadConditions>
                     | "~" <headCondition> <moreHeadConditions>
                     | "?" <moreHeadConditions>
                     | ''empty'' 

Special characters

  • > - The delimiter between elements. Can also be thought of as an operator for creating conjunctions of conditions in the head.
  • : - The delimiter between the body and head.
  • < - The delimiter between a head's element name and contents regexp.
  • # - Escape character.
  • ! - When placed directly where a regular expression is expected, it negates the regular expression that follows.
  • " - Can be used to encapsulate head conditions so that operators may operate on the whole condition.
  • ? - If placed after a head condition, it makes that condition optional, meaning that the condition does not have to be matched in order for the head to be captured.
  • ~ - An operator for creating disjunctions of conditions in the head, has lower precedence than the > operator.
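The roles of the : and > delimiters and the # escape character can be sketched as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the head is left unparsed, and the real parser may behave differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of the documented tool.
class XmlPatternSplitSketch {
    // Returns the index of the first target character that is not escaped with '#', or -1.
    static int indexOfUnescaped(String s, char target) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '#') {
                i++;            // skip the escaped character
                continue;
            }
            if (c == target) {
                return i;
            }
        }
        return -1;
    }

    // Splits a pattern into { body, head } at the first unescaped ':'.
    static String[] splitBodyAndHead(String pattern) {
        int colon = indexOfUnescaped(pattern, ':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing ':' between body and head");
        }
        return new String[] { pattern.substring(0, colon), pattern.substring(colon + 1) };
    }

    // Splits a body into its element regular expressions at every unescaped '>'.
    static List<String> bodyElements(String body) {
        List<String> elements = new ArrayList<>();
        String rest = body;
        int gt;
        while ((gt = indexOfUnescaped(rest, '>')) >= 0) {
            elements.add(rest.substring(0, gt));
            rest = rest.substring(gt + 1);
        }
        elements.add(rest);
        return elements;
    }

    public static void main(String[] args) {
        // prints [person>home, .*<(.+)]
        System.out.println(Arrays.toString(splitBodyAndHead("person>home:.*<(.+)")));
        // prints [person|person, home]
        System.out.println(bodyElements("person|person>home"));
    }
}
```

Because the body is split at > before any regular expression is evaluated, a | inside an element only ever applies to that one element.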

Examples

Let's say we have the following XML contents describing a person and where that person keeps her or his cookies. These examples assume that you know regular expressions. The results shown are everything that is captured.

<person>
    <hand1>cookie 1</hand1>
    <hand2></hand2>
    <home>
        <jar>cookie 2</jar>
        <jar>cookie 3</jar>
    </home>
</person>

Here are some examples of XML patterns and what they match.

Example 1

Pattern

person:hand1<(.+)

Result

  • cookie 1

Reason

The pattern first attempts to match the body ("person"), meaning a tag named person in the root. It does so and then starts to match the head ("hand1<(.+)"), which translates into: find a tag named hand1 and capture its contents if they are at least one character long. It finds exactly one tag named hand1, and that tag contains "cookie 1", which is at least one character long, so it is captured.
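The evaluation of a single head condition such as the one in Example 1 can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that both regular expressions must match the whole element name and the whole contents; it also ignores the # escape and the ! negation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the documented tool.
class AtomicHeadConditionSketch {
    // Returns group 1 of the contents regexp if the element satisfies the condition, else null.
    static String capture(String condition, String elementName, String elementText) {
        int lt = condition.indexOf('<');              // assumes no escaped '<' in the name part
        String nameRegex = condition.substring(0, lt);
        String contentRegex = condition.substring(lt + 1);
        if (!Pattern.matches(nameRegex, elementName)) {
            return null;                              // element name does not match
        }
        Matcher m = Pattern.compile(contentRegex).matcher(elementText);
        return m.matches() ? m.group(1) : null;       // contents must match as a whole
    }

    public static void main(String[] args) {
        System.out.println(capture("hand1<(.+)", "hand1", "cookie 1")); // prints cookie 1
        System.out.println(capture("hand1<(.+)", "hand2", ""));         // prints null
        System.out.println(capture(".*<(.+)", "jar", "cookie 2"));      // prints cookie 2
    }
}
```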

Example 2

Pattern

home:jar<(.+)

Result (nothing)

Reason

The pattern first attempts to match the body ("home"), meaning a tag named home in the root. However, there are no tags named home in the root; there is only a tag named person. Hence the pattern doesn't find anything.

Example 3

Pattern

person>home:.*<(.+)

Result

  • cookie 2
  • cookie 3

Reason

The pattern first attempts to match the body ("person>home"), meaning a tag named person in the root followed by a tag named home. It does so and then starts to match the head (".*<(.+)"), which translates into: find any tag and capture its contents if they are at least one character long. It finds two tags named jar, each with contents that are at least one character long (cookie 2, cookie 3).

Example 4

Pattern

person:!hand1|hand2<(.*)

Result (nothing)

Reason

The pattern first attempts to match the body ("person"), meaning a tag named person in the root. It does so and then starts to match the head ("!hand1|hand2<(.*)"), which translates into: find all tags that are not named hand1 or hand2 and capture any contents of those tags. Note that '!' has lower precedence than '|', so the negation applies to the whole expression hand1|hand2 and everything named hand1 or hand2 is not matched. There are no such tags, so nothing is captured.

Example 5

Pattern

person|person>home:hand\d<(.*)>jar<(.*)

Result

  • cookie 2
  • cookie 3

Reason

The pattern first attempts to match the body ("person|person>home"). The body is split at '>' first, so it means a tag in the root named person or person, followed by a tag named home. It does not mean "a tag named person in the root, or a tag named person in the root followed by a tag named home", because '|' belongs to the regular expression of the first element and never spans the '>' delimiter. The head then matches tags named jar or hand followed by a digit and captures all contents. The only tags the head can match are the jar tags, so it captures the contents of both.

Example 6

Pattern

person:"hand1<(.+)"?>"hand2<(.+)"?

Result

  • cookie 1
  • null

Reason

The pattern first attempts to match the body ("person"), meaning a tag named person, and succeeds. It then looks for two head conditions, both of which are optional because of the ? construct. It tries to match the first condition, "hand1<(.+)", and finds a matching element (an element named hand1 with non-empty contents). It captures "cookie 1" from that condition and moves on to the second. It tries to find an element named hand2 with non-empty contents but fails, because the hand2 element has zero-length contents. Since the condition is optional, the failure is not fatal: the missed capture is recorded as null and matching continues. The null is later interpreted by the formats as a missed capture.
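The handling of optional conditions described in Example 6 can be sketched as follows. The class, its Condition type and the map of child elements are hypothetical simplifications (a map cannot hold repeated element names such as jar); the sketch only shows how a missed optional capture becomes null while a missed required capture makes the whole head fail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the documented tool.
class OptionalConditionSketch {
    // One head condition: name regexp, contents regexp, and whether '?' followed it.
    static final class Condition {
        final String nameRegex;
        final String contentRegex;
        final boolean optional;

        Condition(String nameRegex, String contentRegex, boolean optional) {
            this.nameRegex = nameRegex;
            this.contentRegex = contentRegex;
            this.optional = optional;
        }
    }

    // Returns the capture from the first child that satisfies the condition, or null.
    static String capture(Condition c, Map<String, String> children) {
        for (Map.Entry<String, String> child : children.entrySet()) {
            if (!Pattern.matches(c.nameRegex, child.getKey())) {
                continue;
            }
            Matcher m = Pattern.compile(c.contentRegex).matcher(child.getValue());
            if (m.matches()) {
                return m.group(1);
            }
        }
        return null;
    }

    // Conjunction ('>'): returns null if a required condition fails, else one entry per condition.
    static List<String> matchAll(List<Condition> conditions, Map<String, String> children) {
        List<String> captures = new ArrayList<>();
        for (Condition c : conditions) {
            String value = capture(c, children);
            if (value == null && !c.optional) {
                return null;            // a required condition failed: the head does not match
            }
            captures.add(value);        // null marks a missed optional capture
        }
        return captures;
    }

    public static void main(String[] args) {
        Map<String, String> person = new LinkedHashMap<>();
        person.put("hand1", "cookie 1");
        person.put("hand2", "");
        List<Condition> head = Arrays.asList(
                new Condition("hand1", "(.+)", true),
                new Condition("hand2", "(.+)", true));
        System.out.println(matchAll(head, person)); // prints [cookie 1, null]
    }
}
```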

Example 7

Pattern

person:"hand1<(.+)"~"hand2<(.+)"?

Result

  • cookie 1
  • null

Reason

The pattern first attempts to match the body ("person"), meaning a tag named person, and succeeds. It then looks for two head conditions, hand1 and hand2. Only one of them has to be found, because they are joined by the OR operator ~ instead of the AND operator >. It finds an element named hand1 with non-empty contents, but the element named hand2 has zero-length contents and cannot be matched. Only the contents of hand1 are therefore captured. The other condition could not be captured, but it was not required, so a match is still found. The missed capture is recorded as null, which is later interpreted by the formats as a missed capture. Had it been possible to match both, the contents of both elements would have been captured.
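The disjunction in Example 7 can be sketched in the same spirit. Everything here is an assumption made for illustration: the class and method names are hypothetical, each condition is reduced to a name regexp and a contents regexp, and child elements are held in a map keyed by element name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the documented tool.
class DisjunctionSketch {
    // Returns the capture from the first child that satisfies { nameRegex, contentRegex }, or null.
    static String capture(String[] condition, Map<String, String> children) {
        for (Map.Entry<String, String> child : children.entrySet()) {
            if (!Pattern.matches(condition[0], child.getKey())) {
                continue;
            }
            Matcher m = Pattern.compile(condition[1]).matcher(child.getValue());
            if (m.matches()) {
                return m.group(1);
            }
        }
        return null;
    }

    // Disjunction ('~'): succeeds if at least one condition captures; misses are recorded as null.
    static List<String> matchAny(List<String[]> conditions, Map<String, String> children) {
        List<String> captures = new ArrayList<>();
        boolean anyCaptured = false;
        for (String[] condition : conditions) {
            String value = capture(condition, children);
            anyCaptured |= (value != null);
            captures.add(value);
        }
        return anyCaptured ? captures : null;   // null means no condition matched at all
    }

    public static void main(String[] args) {
        Map<String, String> person = new LinkedHashMap<>();
        person.put("hand1", "cookie 1");
        person.put("hand2", "");
        List<String[]> head = Arrays.asList(
                new String[] { "hand1", "(.+)" },
                new String[] { "hand2", "(.+)" });
        System.out.println(matchAny(head, person)); // prints [cookie 1, null]
    }
}
```

If both hands held a cookie, both captures would be returned; if both were empty, the head would not match.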
