Similarities

Overview

The similarities function is common to categorization, extraction and thesaurus projects.
It lets you create a similarity model and publish it to NL Flow, where it is used in the workflows that implement the similarity use case.

The similarity model is a JSON document containing instructions for two workflow components:

- In blocks of the Similarity Document Preparator component, the model specifies which metadata from the linguistic model output, taken from which sections of the text, must be indexed.
- In blocks of the Similarity Calculator component, the model specifies which fields of the Elasticsearch index, corresponding to the metadata above, must have the same value in different documents for those documents to be considered similar. It can also indicate a boost for certain fields so that they weigh more in the similarity score.

The project of the predictive language model is the right place to define the similarity model. It is the project, and more specifically an experiment on the linguistic model, that reveals the metadata the model will produce for a given document, including its predictions (categories or extractions). With that information you can choose which metadata is most suitable for determining the similarity between documents.

For this reason, the similarity model is created from the outcome of an experiment, which reports all the metadata, including the model's specific predictions, produced from the documents of the test library.

The similarity function involves the creation of an Elasticsearch index in which the chosen data is actually indexed, for all the documents of the test library. This allows you to search among the documents and then immediately find, for each document, similar documents with respect to the similarity model.
If the similarity model isn't satisfactory, you can create more models with different settings.

Creating a similarity index

Creating a similarity index means defining a similarity model and asking the system to build an index based on the chosen metadata, starting from the outcome of an experiment.

To start the wizard for creating a similarity index:

Select Similarities on the project dashboard toolbar.

If there are no indexes defined yet:
- Select Create similarity index at the center of the page.
Otherwise:
- Select Create similarity index on the left panel.

In the first dialog that appears, enter the name of the index and select Next. The Similarity index generation dialog opens.
This dialog has several tabs corresponding to the steps of the wizard. Once you have finished with a tab, you move to the next one by selecting Next. From a tab after the first you can go back to the previous tab by selecting Back. To cancel the wizard select Cancel or click anywhere outside the dialog.

In the Experiment tab, choose the experiment you want to use from the list. To filter the list of experiments, type something in the search box at the top right and press Enter: only experiments whose names contain what you typed will be shown. To cancel the filter, empty the search box and press Enter or select the "X" icon on the right of the search box.
In the Document scope tab, choose the source of the metadata. If you choose Document, metadata from any part of the text is considered. If you choose Section, you must then select the sections to consider, and only the metadata produced from the text of those sections is used.
In the Correlation tab, choose the metadata that will be indexed and on which similarity between documents will be computed.
- In a categorization project, the Categorization choice corresponds to the categories predicted by the linguistic model.
- In an extraction project, Classes corresponds to the information extracted by the linguistic model, divided by group.
- In a thesaurus project, Thesaurus corresponds to the extractions of the taxonomy concepts cited in the text.
For each metadata item you choose, you can set a boost. When two documents have the same value for a metadata item, the match contributes to the similarity score between the two documents. If you set a boost, this contribution is multiplied by it, so with a boost of 2 or more a match on that metadata counts more than a match on metadata left at the default boost of 1.
In the Stop terms tab, choose any stop terms. When a document is indexed, metadata values that coincide with a stop term are skipped, so they do not contribute to the similarity computation.

If stop terms have already been defined at the project level, the tab controls are enabled. Otherwise, to enable them, select the toggle switch at the top right.

Stop terms defined at the project level are checked by default to indicate that they are included. To exclude them, clear their check marks individually, or clear the check mark above the list to exclude all of them at once.

To add a stop term:
1. Select Add stop term.
2. Type the term.
3. Select Confirm to confirm or Undo to cancel.
To delete a stop term, select Remove to the right of the term.

Info

Stop terms defined at the project level cannot be deleted in this step of the wizard.

To load stop terms from a file, select Import stop terms and select a text file containing the stop terms. This must be a UTF-8 encoded plain text file with one stop term per line.

To filter the list of stop terms, type something in the search box at the top right and press Enter: only stop terms that contain what you typed will be shown. To clear the filter, empty the search box and press Enter or select the "X" icon on the right of the search box.
In the Summary tab you can find the wizard report. Select Start to start the operation.

A progress bar appears while the index is being generated.

When the index is generated, it is added to the list in the left panel and the details of the similarity model are shown in the central panel, while the experiment that the system used to populate the index with the metadata of the documents is shown in the right panel.
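
The effect of the boosts and stop terms chosen in the wizard can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the field names, the `similarity_score` and `drop_stop_terms` helpers, and the rule of one point per shared value are all hypothetical, and the actual score is computed by Elasticsearch in a way this page does not describe.

```python
from typing import Dict, Iterable, Set

# Hypothetical shape: metadata field name -> set of values indexed for a document.
Metadata = Dict[str, Set[str]]


def drop_stop_terms(metadata: Metadata, stop_terms: Iterable[str]) -> Metadata:
    """Remove metadata values that coincide exactly with a stop term."""
    blocked = set(stop_terms)
    return {field: values - blocked for field, values in metadata.items()}


def similarity_score(
    doc_a: Metadata,
    doc_b: Metadata,
    boosts: Dict[str, float],
    stop_terms: Iterable[str] = (),
) -> float:
    """Sum the contributions of shared values, each multiplied by its field's boost.

    Assumption for illustration only: every shared value contributes one point.
    Only the fields listed in `boosts` (the chosen metadata) are compared.
    """
    blocked = list(stop_terms)
    a = drop_stop_terms(doc_a, blocked)
    b = drop_stop_terms(doc_b, blocked)
    score = 0.0
    for field, boost in boosts.items():
        shared = a.get(field, set()) & b.get(field, set())
        score += boost * len(shared)
    return score


# Illustrative documents; the values are made up.
doc_a = {"categories": {"finance", "sport"}, "thesaurus": {"bank", "loan"}}
doc_b = {"categories": {"finance"}, "thesaurus": {"bank", "loan", "rate"}}
boosts = {"categories": 2.0, "thesaurus": 1.0}  # 1.0 plays the role of the default boost

print(similarity_score(doc_a, doc_b, boosts))                       # 4.0
print(similarity_score(doc_a, doc_b, boosts, stop_terms={"bank"}))  # 3.0
```

In this sketch, the shared category counts double because of its boost, and adding "bank" as a stop term removes one shared thesaurus value from the computation.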

Rename an index

To rename a similarity index:

Select Similarities on the project dashboard toolbar.
Select the index from the list in the left panel.
Select Edit index name on the right panel.
Edit the name and select Save.

Delete an index

To delete a similarity index:

Select Similarities on the project dashboard toolbar.
Select the index from the list in the left panel.
Select Delete index on the right panel.

Publish a similarity model

To publish a similarity model to NL Flow so that you can use it in the blocks of the Similarity Document Preparator and Similarity Calculator components:

Select Similarities on the project dashboard toolbar.
Select the index that the similarity model corresponds to from the list in the left panel.
Select Publish similarity model on the right panel.

Managing project-level stop terms

To manage stop terms at the project level:

Select Similarities on the project dashboard toolbar.
Select Manage stop terms on the left panel. In the dialog that appears:
- To add a stop term:
  1. Select Add stop term.
  2. Type the term.
  3. Select Confirm to confirm or Undo to cancel.
- To delete a stop term, select Remove to the right of the term.
- To load stop terms from a file, select Import stop terms and choose a text file containing the stop terms. This must be a UTF-8 encoded plain text file with one stop term per line.
- To filter the list of stop terms, type something in the search box at the top right and press Enter: only stop terms that contain what you typed will be shown. To clear the filter, empty the search box and press Enter or select the "X" icon to the right of the search box.
- To save your changes and close the dialog, select Save.
- To discard your changes and close the dialog, select Cancel or click anywhere outside the dialog.
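
A stop terms file in the format described above (UTF-8 plain text, one stop term per line) can be prepared or checked with a few lines of Python. This is a minimal sketch: the function names and the file name are hypothetical, and skipping blank lines and duplicates is an assumption about sensible cleanup, not documented behavior of the import.

```python
from pathlib import Path
from typing import Iterable, List


def save_stop_terms(path: str, terms: Iterable[str]) -> None:
    """Write the terms as UTF-8 plain text, one term per line."""
    Path(path).write_text("\n".join(terms) + "\n", encoding="utf-8")


def load_stop_terms(path: str) -> List[str]:
    """Read a stop terms file, dropping blank lines and repeated terms.

    The cleanup is an assumption made here for convenience.
    """
    terms: List[str] = []
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        term = line.strip()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms
```

For example, `save_stop_terms("stop_terms.txt", ["inc", "ltd"])` produces a file that could then be selected with Import stop terms.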

Get info about a similarity model

To get info about a similarity model:

Select Similarities on the project dashboard toolbar.
Select the index corresponding to the model in the left panel.

You can find the information about the similarity model in the central panel and information about the experiment the index was based upon in the right panel.

Similarity checks

To test the search for similar documents within a similarity index:

Select Similarities on the project dashboard toolbar.
Select the index from the list in the left panel.
Select the document icon on the left panel.

The dashboard looks like the console for browsing and searching the documents of an experiment and is used in exactly the same way.

Once you have identified a document of interest, to find similar documents select the name of the document in the list. The view of similar documents opens.
The left panel shows the list of documents found in the previous view, with the selected document in first position. The right panel lists the similar documents, sorted in descending order of similarity score, with the first document selected.
The central panel shows the details of the similarities between the two selected documents. It is divided horizontally into two parts: on the left, the data of the document selected in the left list; on the right, the data of the similar document selected in the right list. Each part is in turn divided in two, symmetrically. The areas closest to the center of the panel show the list of metadata, divided by type, with matching metadata highlighted in bold. The areas furthest from the vertical centerline show the text of the documents divided into sections. These areas may be hidden if the browser window is not large enough.
The similarity score is displayed at the top right of the central panel and in the right list under each document.
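
What the view of similar documents presents can be mimicked in a few lines of Python: a list ordered by descending score, and for a pair of documents the metadata values they share, grouped by type. The function names, metadata types, and sample values below are hypothetical and serve only to illustrate the layout described above.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical shape: metadata type -> set of values for one document.
Metadata = Dict[str, Set[str]]


def rank_similar(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order documents by descending similarity score, as in the right panel."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def matching_metadata(selected: Metadata, similar: Metadata) -> Dict[str, List[str]]:
    """Return, per metadata type, the values shared by the two documents.

    These are the values the central panel would highlight in bold.
    """
    matches: Dict[str, List[str]] = {}
    for kind in sorted(selected.keys() & similar.keys()):
        shared = selected[kind] & similar[kind]
        if shared:
            matches[kind] = sorted(shared)
    return matches


# Made-up data for illustration.
scores = {"a.txt": 1.5, "b.txt": 4.0, "c.txt": 2.0}
selected_doc = {"Categorization": {"finance", "sport"}, "Classes": {"ACME"}}
similar_doc = {"Categorization": {"finance"}, "Classes": {"Globex"}}

print(rank_similar(scores))
print(matching_metadata(selected_doc, similar_doc))
```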