Segmentation rules

Two types of rules

There are two ways to identify text segments with rules:

By specifying a scope and a condition: an instance of the segment is identified whenever the condition is met, and the extension of the instance coincides with the scope of the rule.
By specifying two conditions: one that is matched by the beginning of each segment instance and one that is matched by its end.

Scope based rules

The syntax of a segmentation rule of the first type is:

SCOPE scopeOption
{
    SEGMENT(segmentName)
    {
        condition
    }
}

For example, consider the following sample text:

About a month ago I was diagnosed with "pre-diabetes" after a blood test.
I had complained to my doctor of constant tiredness and lack of energy throughout the day.
I will be the first to admit my diet is lousy - pizzas, burgers, chocolate, takeaways, fizzy drinks are all vices of mine.
To help me monitor this I purchased this accu-check gadget and although the concept of taking my own blood samples was a bit daunting, it really is very easy indeed.
Unfortunately, in the last days my blood glucose monitor seems to give incorrect readings, I tried several times turning it off and on again, but it still doesn't work.

and suppose that you want a segment for each paragraph dealing with malfunctions.
The following rule:

SCOPE PARAGRAPH
{
    SEGMENT(MALFUNCTION)
    {
        KEYWORD("n't","not")
        >
        LEMMA ("work")
    }

}

is triggered by the lemma work preceded by a negation and produces an instance of segment MALFUNCTION that coincides with the paragraph in which the condition is met, because the extension of the segment instance is determined by the scope of the rule, as highlighted below:


About a month ago I was diagnosed with "pre-diabetes" after a blood test.
I had complained to my doctor of constant tiredness and lack of energy throughout the day.
I will be the first to admit my diet is lousy - pizzas, burgers, chocolate, takeaways, fizzy drinks are all vices of mine.
To help me monitor this I purchased this accu-check gadget and although the concept of taking my own blood samples was a bit daunting, it really is very easy indeed.
Unfortunately, in the last days my blood glucose monitor seems to give incorrect readings, I tried several times turning it off and on again, but it still doesn't work.

Begin-end rules

The second type of rules has this syntax:

SCOPE scopeOption
{
    SEGMENT(segmentName|BEGIN)
    {
        condition
    }

    SEGMENT(segmentName|END)
    {
        condition
    }
}

The rule is triggered by any portion of text in which both the BEGIN condition and the END condition are met. Each generated segment instance begins with the tokens matched by the BEGIN condition and ends with the tokens matched by the END condition.

For example, consider this excerpt of an insurance contract:

Contract of Reinsurance
SUM REINSURED
USD 200,000,000 per occurrence (combined single limit or Damage and Business Interruption)
LIMITS
Contingent business interruption
USD 125,000
DEDUCTIBLES
Earthquake, Earth Movement or Volcanic Eruption 5% of loss amount, minimum USD 125,000 and maximum USD 425,000 combined Property Damage and Business Interruption

The re-insured sum can be found between SUM REINSURED and LIMITS. The following rule creates a corresponding segment instance:

SCOPE SENTENCE
{
    SEGMENT(SUM_REINSURED|BEGIN)
    {
        KEYWORD("SUM REINSURED")
    }

    SEGMENT(SUM_REINSURED|END)
    {
        KEYWORD("LIMITS")
    }
}

The segment is highlighted below:


Contract of Reinsurance
SUM REINSURED
USD 200,000,000 per occurrence (combined single limit or Damage and Business Interruption)
LIMITS
Contingent business interruption
USD 125,000
DEDUCTIBLES
Earthquake, Earth Movement or Volcanic Eruption 5% of loss amount, minimum USD 125,000 and maximum USD 425,000 combined Property Damage and Business Interruption
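
The pairing of BEGIN and END matches described above can be modelled with a short Python sketch. This is only an illustration of the behaviour, not the engine's implementation: it assumes that each line of the excerpt is one sentence, that keywords are matched as plain substrings, and that every BEGIN match is paired with the nearest END match that follows it. The function name begin_end_segments is hypothetical.

```python
# Illustrative model only, not the engine's implementation.
# Assumption: each list item stands for one sentence of the contract excerpt.
sentences = [
    "Contract of Reinsurance",
    "SUM REINSURED",
    "USD 200,000,000 per occurrence (combined single limit or Damage and Business Interruption)",
    "LIMITS",
    "Contingent business interruption",
    "USD 125,000",
    "DEDUCTIBLES",
]


def begin_end_segments(sentences, begin_keyword, end_keyword):
    """Return (first, last) sentence indices, both inclusive, for each segment.

    Assumption: every BEGIN match is paired with the nearest END match
    that follows it; a BEGIN match with no later END yields no segment.
    """
    segments = []
    for i, sentence in enumerate(sentences):
        if begin_keyword not in sentence:
            continue
        for j in range(i + 1, len(sentences)):
            if end_keyword in sentences[j]:
                segments.append((i, j))
                break
    return segments


spans = begin_end_segments(sentences, "SUM REINSURED", "LIMITS")
for first, last in spans:
    # Prints the sentences from "SUM REINSURED" through "LIMITS".
    print(sentences[first:last + 1])
```

Under these assumptions the only span found runs from the sentence containing SUM REINSURED to the sentence containing LIMITS, both included.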

To use segmentation rules most effectively, it is important that they are set up to identify concepts that often recur in the set of documents to be processed for a given project. With the exception of sporadic special cases, where the beginnings and the endings of segments can be identified with almost ad hoc rules, a good set of segmentation rules must be in some way predictive, so that they can also encompass variants of known forms and layouts.

Note

If two or more instances of the same segment overlap, they are merged into a single, larger instance.
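
The merging described in this note can be pictured with the following Python sketch. It is a model under stated assumptions rather than the actual engine logic: instances are represented as (first, last) sentence indices, both inclusive, and two instances are treated as overlapping only when they share at least one sentence. The function name merge_overlapping is hypothetical.

```python
def merge_overlapping(instances):
    """Merge overlapping (first, last) sentence spans, both ends inclusive."""
    merged = []
    for first, last in sorted(instances):
        if merged and first <= merged[-1][1]:
            # Shares at least one sentence with the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


# Spans (0, 2) and (1, 3) overlap and become (0, 3); (4, 6) stays separate.
print(merge_overlapping([(4, 6), (0, 2), (1, 3)]))
```

Whether merely adjacent instances are also merged is not stated here, so the sketch leaves them separate.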

BEFORE and AFTER

Advanced segmentation syntax allows the developer to single out phraseology that precedes or follows the segment to be detected by using the keywords BEFORE or AFTER as follows:

SCOPE scopeOption
{
    SEGMENT(segmentName|BEGIN_option) 
    {
        condition
    }

    SEGMENT(segmentName|END_option) 
    {
        condition
    }
}

where BEGIN_option and END_option correspond to one of the following conditions:

BEGIN_BEFORE: the segment begins with the sentence before the sentence matched by the linguistic condition.
BEGIN_AFTER: the segment begins with the sentence after the sentence matched by the linguistic condition.
END_BEFORE: the segment ends with the sentence before the sentence matched by the linguistic condition.
END_AFTER: the segment ends with the sentence after the sentence matched by the linguistic condition.
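
As a rough illustration of the four options, the Python sketch below shifts a boundary by one sentence relative to the sentence matched by the condition. It is a model only: the names OFFSETS and boundary_sentence are hypothetical, and clamping the result to the first or last sentence of the document is an assumption, since the behaviour at the document edges is not described here.

```python
# Hypothetical table: shift of the boundary relative to the matched sentence.
OFFSETS = {
    "BEGIN": 0,
    "END": 0,
    "BEGIN_BEFORE": -1,
    "END_BEFORE": -1,
    "BEGIN_AFTER": 1,
    "END_AFTER": 1,
}


def boundary_sentence(match_index, option, sentence_count):
    """Return the index of the boundary sentence.

    Assumption: the result is clamped so that it stays inside the document.
    """
    index = match_index + OFFSETS[option]
    return max(0, min(index, sentence_count - 1))


# Contract excerpt, assuming one sentence per line (8 sentences):
# "SUM REINSURED" is sentence 1 and "LIMITS" is sentence 3.
first = boundary_sentence(1, "BEGIN_AFTER", 8)
last = boundary_sentence(3, "END_BEFORE", 8)
print((first, last))
```

With BEGIN_AFTER and END_BEFORE, the modelled segment contains only the sentence between the two headings, that is, the one stating the reinsured amount.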

Segmentation rules score

When working with segments, it is possible to define several rules for each boundary, as the number of opening and closing conditions may vary according to the type of document. In some cases, concepts identified by segmentation rules are stronger points of reference for defining a segment's boundaries than others. This difference can be expressed in the rules by marking some concepts as more relevant and others as less relevant. To do so, add a score option to the rules using the following syntax:

SCOPE scopeOption
{   
    SEGMENT(segmentName|boundaryTypeOption:scoreOption)
    {
        condition
    }
}

The name of the segment must be followed by the boundary type defined by the rule as well as one of the score options. Score options can be of two types:

Default score option
Custom score option

Default score option

Segmentation and categorization rules share the same default score options listed in the table below:

Option	Description
NORMAL	The default/implicit score option
LOW	Lower than the default
HIGH	Higher than the default

The options LOW and HIGH assign a boundary a slightly different score from the default, and they can also be used to make one boundary more or less relevant than another. Correct use of these options involves:

The use of the default score in most cases.
The use of HIGH to give emphasis to a particular rule, for example one containing a concept or combination of concepts which is not ambiguous and will certainly result in a valid boundary (e.g. the main or most frequent beginning or end of a segment).
The use of LOW to give less importance to a rule, for example one containing a slightly ambiguous concept which you are neither willing to exclude a priori nor willing to rely on in every case (for example, special-case or unusual segment beginnings or ends).

Custom score option

As with categorization rules, it is possible to create custom score options. They are defined in the config.cr file and can be shared between categorization and segmentation rules.
The syntax is:

SCORES
{
  @scoreOptionName:points,
  ...
}

For example:

SCORES
{
  @LOWER:1,
  @HIGHER:20
}

Once defined, the names of the new options can be used in segmentation rules, allowing a wider range of rule scores.

Note

Don't use language keywords as score option names.

Scope options in segmentation rules

As with categorization and extraction rules, every segmentation rule requires a SCOPE option, which defines two elements:

The portion of text on which a single rule or a group of rules acts.
The portion of text over which the segment extends.

Any of the standard or custom scope options available can be used. However, there are some restrictions specific to segmentation rules that must be detailed.

The SCOPE options: SENTENCE / PARAGRAPH / CLAUSE / PHRASE can always be used.
The SCOPE options: SECTION / SEGMENT / CLAUSE (clause_type) / PHRASE (phrase_type) can be used except in those cases where the BEGIN or END statements are used to separately define the boundaries of a segment.

Phrase and clause

PHRASE and CLAUSE scope options can be used in the cases specified above. However, they must be understood only as the portions of text in which a segmentation rule has to be verified. Since a segment's extension cannot disregard sentence boundaries (for example, a segment cannot be shorter than a sentence), the CLAUSE and PHRASE scope options do not determine the portion of text over which the segment extends.

Sentence and paragraph

The SCOPE options SENTENCE and PARAGRAPH can be used in all of the cases described above. However, when the following syntax is used:

SCOPE PARAGRAPH|SENTENCE*n
{
    segmentationRule(s)
}

a distinction must be made between the programmed scope and the real scope of a rule: the "programmed scope" is the largest portion of text on which a rule can act, and the "real scope" is the portion of text that is actually included in the segment.

For example, if we define a rule scope in the following way:

SCOPE SENTENCE*3
{
    SEGMENT(segment_name)
    {
        //condition//
    }
}

we are declaring that the rule condition has to be verified within three consecutive sentences of the input document. Three sentences are in fact the maximum possible scope for the rule to be verified: the rule could also be verified in a single sentence or in two sentences, depending on where the elements specified in the condition are found. Therefore, regardless of the maximum scope declared in a rule, the real scope is determined by the portion of text that actually contains the concepts the rule looks for.
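
The difference between programmed scope and real scope can be modelled with the Python sketch below. It is an illustration under simplifying assumptions, not the engine's matching logic: the condition is reduced to a list of keywords that must all appear, keywords are matched as plain substrings, and the smallest qualifying span is preferred. The function name real_scope and the sample sentences are hypothetical.

```python
def real_scope(sentences, keywords, max_sentences):
    """Return the smallest (first, last) span, both inclusive, of at most
    max_sentences consecutive sentences that together contain every keyword,
    or None if no such span exists.
    """
    for size in range(1, max_sentences + 1):
        for first in range(len(sentences) - size + 1):
            window = sentences[first:first + size]
            # Every keyword must occur in at least one sentence of the window.
            if all(any(k in s for s in window) for k in keywords):
                return (first, first + size - 1)
    return None


sample = [
    "The monitor arrived last week.",
    "The readings look wrong.",
    "I restarted it twice.",
    "It still does not work.",
]

# Programmed scope: 3 sentences. Real scope: only the 2 that are needed.
print(real_scope(sample, ["restarted", "work"], 3))
# Here all 3 sentences of the programmed scope are needed.
print(real_scope(sample, ["readings", "work"], 3))
# The keywords are 4 sentences apart, so the rule is not verified.
print(real_scope(sample, ["monitor", "work"], 3))
```

In this model, SENTENCE*3 only caps the size of the window; the span that is returned is as short as the position of the matched elements allows.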

Section and segment

The SECTION and SEGMENT scope options have a particular meaning in segmentation rules. When these options are used in categorization or extraction rules, the user’s aim is to look for concepts in a specific portion of text. In segmentation rules, on the other hand, the output of a rule acting within a section or a previously defined segment is a new segment created within the section or segment specified in the rule SCOPE. This technique serves two possible aims:

Create nested segments.
Upgrade a whole section or a whole segment to a new segment.

Nested segments

Using the scope option SEGMENT it is possible to define dynamic segments within other previously created segments. The syntax is the following:

SCOPE scopeOption
{
    SEGMENT(segmentName1)
    {
        condition
    }
}

SCOPE SENTENCE IN SEGMENT(segmentName1)
{
    SEGMENT (segmentName2)
    {
        condition
    }
}

The first rule (or set of rules) defines a segment using any scope options other than SEGMENT. The second rule uses the first segment as scope in order to define, within the first segment itself, another segment, nested in the first one.

Circular References

When defining nested segments, take care not to create circular references. If one occurs, the software cannot determine the order in which to run the segmentation rules and therefore cannot execute them.

Consider the following examples:

SCOPE SENTENCE
{
    SEGMENT(segment1)
    {
        //condition//
    }
}

SCOPE SENTENCE IN SEGMENT(segment1)
{
    SEGMENT(segment2)
    {
        //condition//
    }
}

SCOPE SENTENCE IN SEGMENT(segment2)
{
    SEGMENT(segment3)
    {
        //condition//
    }
}

SCOPE SENTENCE IN SEGMENT(segment3)
{
    SEGMENT(segment1)
    {
        //condition//
    }
}

The rules above define:

First, segment1 is defined.
Then segment2 is defined within segment1.
Then segment3 is defined within segment2.
Finally, segment1 is defined again, this time within segment3.

The last rule invalidates the whole set because it introduces a circular reference in the code. This would generate an error and no rule would be compiled and applied.
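
The ordering problem caused by a circular reference can be illustrated with a small dependency check in Python. This sketch is not how the compiler is implemented. It assumes each rule can be reduced to a pair (segment defined, segment used as scope, or None), and the function name execution_order is hypothetical.

```python
def execution_order(rules):
    """Return an order in which the segments can be computed.

    rules: list of (defined_segment, scope_segment or None) pairs.
    Raises ValueError if the rules contain a circular reference.
    """
    depends_on = {}
    for defined, scope in rules:
        deps = depends_on.setdefault(defined, set())
        if scope is not None:
            deps.add(scope)
            depends_on.setdefault(scope, set())

    order = []
    remaining = {name: set(deps) for name, deps in depends_on.items()}
    while remaining:
        # Segments whose scope segments have all been computed already.
        ready = sorted(name for name, deps in remaining.items() if not deps)
        if not ready:
            names = ", ".join(sorted(remaining))
            raise ValueError("circular reference among: " + names)
        for name in ready:
            order.append(name)
            del remaining[name]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


valid = [("segment1", None), ("segment2", "segment1"), ("segment3", "segment2")]
print(execution_order(valid))

circular = valid + [("segment1", "segment3")]
try:
    execution_order(circular)
except ValueError as error:
    print(error)
```

Without the last rule, the segments can be ordered as segment1, segment2, segment3. With it, no segment is free of dependencies, so no valid order exists.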

Sections and segments promotion

By using segmentation rules it is possible to promote a whole section or segment to a new segment which coincides with the original section or segment. In other words, it is possible to generate a segment identical in position and extension to another segment or section in order to create a sort of “duplicate” of an existing segment or section. This technique is useful when different operations must be performed within a single section or segment (linguistic rules, filters, post-processing…) and the developer needs to differentiate a document portion where these actions need to be performed. This can be achieved only when the new segment includes the entire original section or segment, not just a part of it.

For example, the following sample rule:

SCOPE SECTION(HEADLINE)
{
    SEGMENT(BOLD)
    {
        //condition//
    }
}

is correct and accepted because the entire HEADLINE section is going to be part of the new segment BOLD.