Introducing the description of mining algorithms developed by ICM #15

Merged
schatz merged 3 commits from enrichment_mining_icm into main 2022-11-17 13:44:26 +01:00
5 changed files with 191 additions and 1 deletions
Showing only changes of commit 0e96fae405 - Show all commits

@ -0,0 +1,54 @@
# Affiliation matching
***Short description:***
The goal of the affiliation matching module is to match affiliations extracted from PDF and XML documents with organizations from the OpenAIRE organization database.
***Algorithmic details:***
*The buckets concept*
In order to get the best possible results, the algorithm should compare every affiliation with every organization. However, this approach would be very inefficient and slow, because it would involve processing the Cartesian product (all possible pairs) of millions of affiliations and thousands of organizations. To avoid this, IIS has introduced the concept of buckets. A bucket is a smaller group of affiliations and organizations that have been selected to be matched with one another. The matching algorithm compares only those affiliations and organizations that belong to the same bucket.
*Affiliation matching process*
Every affiliation in a given *bucket* is compared with every organization in the same bucket multiple times, each time by using a different algorithm (*voter*). Each *voter* is assigned a number (match strength) that describes the estimated correctness of the result of its comparison. All the affiliation-organization pairs that have been matched by at least one *voter* are assigned a match strength greater than 0 (the actual value depends on the voters; its calculation method is described below).
It is very important for the algorithm to group the affiliations and organizations properly, i.e. the ones that have a chance to match should end up in the same *bucket*. To guarantee this, the affiliation matching module allows creating different methods of dividing the affiliations and organizations into *buckets*, and using all of these methods in a single matching process. A specific method of grouping the affiliations and organizations into *buckets* and then joining them into pairs is carried out by a service called a *joiner*.
Every *joiner* can be linked with many different *voters* that decide whether the joined affiliation-organization pairs match or not. By providing new *joiners* and *voters*, one can extend the matching algorithm with countless new methods for matching affiliations with organizations, thus adjusting the algorithm to one's needs.
All the affiliations and organizations are processed sequentially by all the *matchers*. In every *matcher* they are grouped into pairs by some *joiner*, and then these pairs are processed by all the *voters* in that *matcher*. Every affiliation-organization pair that has been matched at least once is assigned a match strength that depends on the match strengths of the *voters* that indicated the given pair as a match.
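These building blocks can be pictured with the following conceptual sketch; all type and method names below are assumptions made for illustration and do not come from the IIS codebase.
```java
// Hypothetical types illustrating the joiner/voter/matcher concept described above.
// None of these names are taken from the IIS code; they are assumptions.
import java.util.List;

record Affiliation(String rawText) {}
record Organization(String id, String name) {}
record AffOrgPair(Affiliation affiliation, Organization organization) {}

interface Joiner {
    // Groups affiliations and organizations into buckets and emits candidate pairs
    // (only pairs whose elements fall into the same bucket are produced).
    List<AffOrgPair> join(List<Affiliation> affiliations, List<Organization> organizations);
}

interface Voter {
    // Does this voter consider the pair a match?
    boolean matches(AffOrgPair pair);

    // Estimated correctness of this voter's positive answers, in the range (0, 1].
    double matchStrength();
}

// A matcher bundles one joiner with many voters: every pair produced by the joiner
// is checked by every voter, and the voters' match strengths are combined into the
// pair's final match strength (the combination rule is described below).
record Matcher(Joiner joiner, List<Voter> voters) {}
```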
**NOTE:** There can be many organizations matched with a given affiliation, each of them matched with a different match strength. The user of the module can set a match strength threshold which will limit the results to only those matches that have the match strength greater than the specified threshold.
*Calculation of the match strength of the affiliation-organization pair matched by multiple matchers*
It often happens that the given affiliation-organization pair is returned as a match by more than one matcher, each time with a different match strength. In such a case **the match with the highest match strength will be selected**.
*Calculation of the match strength of the affiliation-organization pair within a single matcher*
Every voter has a match strength that is in the range (0, 1]. **The voter match strength expresses the ratio of correct matches to all matches returned by this voter, and is based on real data and hundreds of manually prepared matches.**
The match strength of the given affiliation-organization pair is based on the match strengths of all the voters in the matcher that have pointed that the pair is a match. It will always be less than or equal to 1 and greater than the match strength of each single voter that matched the given pair.
The total match strength is calculated in such a way that each consecutive voter reduces (by its match strength) the gap of uncertainty about the correctness of the given match.
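One way to read this rule is sketched below, with assumed example voter strengths (this is an illustration, not the module's actual code).
```java
// Each voter closes part of the remaining "gap of uncertainty" proportionally to
// its match strength, so the total equals 1 - product(1 - strength_i).
public class MatchStrengthExample {
    public static void main(String[] args) {
        double[] voterStrengths = {0.8, 0.5}; // assumed example values
        double gap = 1.0;                     // initial uncertainty about the match
        for (double s : voterStrengths) {
            gap *= 1.0 - s;                   // each consecutive voter shrinks the gap
        }
        double total = 1.0 - gap;
        System.out.println(total);            // 0.9: higher than either single voter's strength
    }
}
```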
***Parameters:***
* input
* input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations.
* input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location.
* input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (along with inferred document-to-project and project-to-organization relations) are used to generate document-organization pairs which serve as a hint for matching affiliations.
* input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations.
* input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (along with imported and inferred document-to-project relations) are used to generate document-organization pairs which serve as a hint for matching affiliations.
* output
* [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with publications matched with organizations.
***Limitations:*** -
***Environment:*** Java, Spark
***References:*** -
***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching)

@ -0,0 +1,41 @@
# Citation matching
***Short description:***
During a citation matching task, bibliographic entries are linked to the documents that they reference. The citation matching module - one of the modules of the Information Inference Service (IIS) - receives as input a list of documents accompanied by their metadata and bibliography. Among them, it discovers the links described above and returns them as a list. In this document we evaluate whether the module has been properly integrated with the whole system and assess the accuracy of the algorithm used. It is worth mentioning that the implemented algorithm has been described in detail in arXiv:1303.6906 [cs.IR]. In the referenced paper the algorithm was tested on small datasets, whereas here we focus on the larger datasets that are expected to be analysed by the system in the production environment.
***Algorithmic details:***
*General description*
The algorithm used in the citation matching task consists of two phases. In the first one, for each citation string a set of potentially matching documents is retrieved using a heuristic. In the second one, the metadata of these documents is analysed in order to assess which of them is the most similar to the given citation. We assume that citations are parsed, i.e. fragments containing meaningful pieces of metadata information are marked in a special way. Note that in the IIS system, the citation parsing step is executed by another module. The following metadata fields are used by the described solution:
* an author,
* a title,
* a journal name,
* pages,
* a year of publication.
*Heuristic matching*
The heuristic is based on indexing document metadata by author names. For each citation we extract author names and try to find documents in the index which have the same author entries. As spelling errors and inaccuracies commonly occur in citations, we have implemented an approximate index which enables retrieval of entries with an edit distance less than or equal to 1.
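For illustration only, the sketch below shows such an "edit distance at most 1" check; it is a self-contained example, not the module's index implementation.
```java
// Checks whether two strings differ by at most one insertion, deletion or substitution.
public class EditDistanceAtMostOne {
    static boolean withinDistanceOne(String a, String b) {
        if (Math.abs(a.length() - b.length()) > 1) return false;
        int i = 0, j = 0, edits = 0;
        while (i < a.length() && j < b.length()) {
            if (a.charAt(i) == b.charAt(j)) { i++; j++; continue; }
            if (++edits > 1) return false;
            if (a.length() == b.length()) { i++; j++; }  // substitution
            else if (a.length() > b.length()) { i++; }   // extra character in a
            else { j++; }                                // extra character in b
        }
        // account for a possible trailing extra character
        return edits + (a.length() - i) + (b.length() - j) <= 1;
    }

    public static void main(String[] args) {
        System.out.println(withinDistanceOne("kowalski", "kowalsky")); // true
        System.out.println(withinDistanceOne("kowalski", "nowak"));    // false
    }
}
```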
*Strict matching*
In this step, all the potentially matching pairs obtained in the heuristic step are evaluated and only the most probable ones are returned as the final result. As citations tend to contain spelling errors and differ in style, there is a need to introduce fuzzy similarity measures fitted to the specifics of the various metadata fields. Most of them compute the fraction of tokens or trigrams that occur in both fields being compared. When comparing journal names, we have taken the longest common subsequence (LCS) of the two strings into consideration. This can be seen as an instance of the assignment problem with some refinements added. The overall similarity of two citation strings is obtained by applying a linear Support Vector Machine (SVM) using the field similarities as features.
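As an example of the kind of fuzzy field similarity mentioned above, here is a hedged sketch of a trigram-overlap measure (a Dice-style fraction of shared character trigrams); the module's exact measures and parameters may differ.
```java
import java.util.HashSet;
import java.util.Set;

public class TrigramSimilarityExample {
    // Collects all character trigrams of a string, padded to capture word boundaries.
    static Set<String> trigrams(String s) {
        Set<String> result = new HashSet<>();
        String padded = "  " + s.toLowerCase() + "  ";
        for (int i = 0; i + 3 <= padded.length(); i++) {
            result.add(padded.substring(i, i + 3));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = trigrams("J Mach Learn Res");
        Set<String> b = trigrams("Journal of Machine Learning Research");
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        // fraction of trigrams occurring in both fields being compared
        double similarity = 2.0 * common.size() / (a.size() + b.size());
        System.out.println(similarity); // value in [0, 1]
    }
}
```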
***Parameters:***
* input:
* input_metadata: [ExtractedDocumentMetadataMergedWithOriginal](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/transformers/metadatamerger/ExtractedDocumentMetadataMergedWithOriginal.avdl) avro datastore location with the metadata of both publications and bibliographic references to be matched
* input_matched_citations: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with citations which were already matched and should be excluded from fuzzy matching
* output: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with matched publications
***Limitations:*** -
***Environment:*** Java, Spark
***References:*** -
***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/citation-matching](https://github.com/CeON/CoAnSys/tree/master/citation-matching)

@ -0,0 +1,49 @@
# Documents similarity
***Short description:***
The document similarity module is responsible for finding similar documents among the ones available in the OpenAIRE Information Space. It produces "similarity" links between the documents stored in the OpenAIRE Information Space. Each link is assigned a similarity score from the [0,1] range; it is expected that the higher the score, the more similar the documents are with respect to their content.
***Algorithmic details:***
The similarity between two documents is expressed as the similarity between the weights of their common terms (i.e., words reduced to their root form) within the context of all terms from the first and the second document. In this approach, the computation can be divided into three consecutive steps:
1. selection of proper terms,
2. calculation of weights of terms for each document,
3. calculation of a given similarity function on weights of terms corresponding to each pair of documents.
 
The document similarity module uses the term frequency inverse document frequency (TFIDF) measure and the cosine similarity to produce term weights and to calculate document similarity, respectively.
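For reference, the standard formulations of these two measures are shown below; the module's exact TFIDF variant is not spelled out here, so treat this as a sketch.
```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
\qquad
\mathrm{sim}(d_1, d_2) = \frac{\sum_{t} w_{t,d_1}\, w_{t,d_2}}{\lVert w_{d_1} \rVert \, \lVert w_{d_2} \rVert}
```
Here `tf(t, d)` is the frequency of term `t` in document `d`, `df(t)` is the number of documents containing `t`, `N` is the total number of documents, and `w_{t,d}` is the resulting term weight.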
*Steps of execution*
Computation of similarity between documents is executed in the following steps.
1. First, we create a text representation of each document. The text is a concatenation of three attributes of the document object coming from the Information Space: title, abstract, and keywords.
2. The text representation of each document is split into words. Next, stop words, words which occur in more than N percent of documents (say 99%), and those occurring in fewer than M documents (say 5) are discarded, as we assume that they carry no important information.
3. Next, the words are stemmed (reduced to their root form) and thus converted to terms. The importance of each term in each document is calculated using the TFIDF measure (resulting in a vector of term weights for each document). Only the top P (say 20) most important terms per document remain for further computation.
4. In order to calculate the cosine similarity value for the documents, we execute the following steps.
a. Triples [document id, term, term weight] are grouped by a common term and for each pair of triples from the group, term importance is recalculated as the multiplication of terms weights, producing quads [document id 1, document id 2, term, multiplied term weight].
b. Quads are grouped by [document id 1, document id 2] and the values of the multiplied term weight are summed up, resulting in the creation of triples [document id 1, document id 2, total common weight].
c. Finally, the triples are normalized using the product of the norms of the term weight vectors. The normalized value is the final similarity measure with a value between 0 and 1.
5. For a given document, only the top R (say 20) links to similar documents are returned. The links that are discarded are assumed to be uninteresting for the end user, so storing them would only needlessly consume disk space.
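A minimal, self-contained illustration of step 4 on toy data is given below; the weights are assumed example values, and the real module performs these groupings as distributed Pig/Java jobs.
```java
import java.util.Map;

public class CommonTermCosineExample {
    public static void main(String[] args) {
        // [term -> TFIDF weight] vectors for two documents (assumed example values)
        Map<String, Double> doc1 = Map.of("graph", 0.7, "mining", 0.5, "openaire", 0.2);
        Map<String, Double> doc2 = Map.of("graph", 0.6, "citation", 0.8);

        // (a)+(b): multiply the weights of every common term and sum them up,
        // producing the "total common weight" for the document pair
        double totalCommonWeight = 0.0;
        for (Map.Entry<String, Double> e : doc1.entrySet()) {
            Double other = doc2.get(e.getKey());
            if (other != null) {
                totalCommonWeight += e.getValue() * other;
            }
        }

        // (c): normalize by the product of the norms of the term weight vectors
        double norm1 = Math.sqrt(doc1.values().stream().mapToDouble(w -> w * w).sum());
        double norm2 = Math.sqrt(doc2.values().stream().mapToDouble(w -> w * w).sum());
        double similarity = totalCommonWeight / (norm1 * norm2);
        System.out.println(similarity); // cosine similarity between 0 and 1
    }
}
```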
***Parameters:***
* input:
* input_document: [DocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/documentssimilarity/DocumentMetadata.avdl) avro datastore location
* parallel: sets parameter parallel for Pig actions (default=80)
* mapredChildJavaOpts: MapReduce map and reduce child Java opts applied to all Pig actions (default=Xmx12g)
* tfidfTopnTermPerDocument: number of the most important terms taken into account (default=20)
* similarityTopnDocumentPerDocument: maximum number of similar documents for each publication (default=20)
* removal_rate: removal rate (default=0.99)
* removal_least_used: removal of the least used terms (default=20)
* threshold_num_of_vector_elems_length: vector elements length threshold, when set to less than 2 all documents will be included in similarity matching (default=2)
* output: [DocumentSimilarity](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/documentssimilarity/DocumentSimilarity.avdl) avro datastore location
***Limitations:*** -
***Environment:*** Pig, Java
***References:***
* P. J. Dendek, A. Czeczko, M. Fedoryszak, A. Kawa, and L. Bolikowski, "Content Analysis of Scientific Articles in Apache Hadoop Ecosystem", Stud. Comp. Intelligence, vol. 541, 2014.
***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/document-similarity](https://github.com/CeON/CoAnSys/tree/master/document-similarity)

@ -0,0 +1,37 @@
# Metadata extraction
***Short description:***
The metadata extraction algorithm is responsible for plain text and metadata extraction from PDF documents. It is based on the [CERMINE](http://cermine.ceon.pl/about.html) project.
CERMINE is a comprehensive open source system for extracting metadata and content from scientific articles in born-digital form. The system is able to process documents in PDF format and extracts:
* document's metadata, including title, authors, affiliations, abstract, keywords, journal name, volume and issue,
* parsed bibliographic references
* the structure of document's sections, section titles and paragraphs
CERMINE is based on a modular workflow whose architecture ensures that individual workflow steps can be maintained separately. As a result, it is easy to evaluate, train, improve or replace the implementation of a single step without changing other parts of the workflow. Most step implementations utilize supervised and unsupervised machine-learning techniques, which increases the maintainability of the system, as well as its ability to adapt to new document layouts.
***Algorithmic details:***
CERMINE workflow is composed of four main parts:
* Basic structure extraction takes a PDF file as input and produces a geometric hierarchical structure representing the document. The structure is composed of pages, zones, lines, words and characters. The reading order of all elements is determined. Every zone is labelled with one of four general categories: METADATA, REFERENCES, BODY and OTHER.
* Metadata extraction analyses the parts of the geometric hierarchical structure labelled as METADATA and extracts a rich set of the document's metadata from them.
* References extraction analyses the parts of the geometric hierarchical structure labelled as REFERENCES, and the result is a list of the document's parsed bibliographic references.
* Text extraction analyses the parts of the geometric hierarchical structure labelled as BODY and extracts the document's body structure composed of sections, subsections and paragraphs.
CERMINE uses supervised and unsupervised machine-learning techniques, such as Support Vector Machines, K-means clustering and Conditional Random Fields. The content classifiers are trained on the [GROTOAP2 dataset](http://cermine.ceon.pl/grotoap2/). More information about CERMINE can be found in the [presentation](http://cermine.ceon.pl/static/docs/slides.pdf).
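A hypothetical usage sketch based on the entry point shown in CERMINE's own documentation follows; the class and method names (`ContentExtractor`, `setPDF`, `getContentAsNLM`) are recalled from that documentation and should be verified against the CERMINE release in use.
```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.jdom.Element;
import org.jdom.output.XMLOutputter;
import pl.edu.icm.cermine.ContentExtractor;

public class CermineUsageSketch {
    public static void main(String[] args) throws Exception {
        ContentExtractor extractor = new ContentExtractor();
        try (InputStream in = new FileInputStream("article.pdf")) { // assumed sample file
            extractor.setPDF(in);
            // NLM (JATS) representation of the extracted metadata, references and body
            Element result = extractor.getContentAsNLM();
            System.out.println(new XMLOutputter().outputString(result));
        }
    }
}
```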
***Parameters:***
* input: [DocumentText](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/DocumentText.avdl) avro datastore location
* output: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location
***Limitations:***
Only born-digital PDF documents are supported. Large PDF documents may require more than the 4 GB of memory assigned by default.
***Environment:*** Java, Hadoop
***References:***
* Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski. CERMINE: automatic extraction of structured metadata from scientific literature. In International Journal on Document Analysis and Recognition, 2015, vol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8.
***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CERMINE](https://github.com/CeON/CERMINE)

@ -3,4 +3,13 @@ sidebar_position: 1
---
# Mining algorithms
The Text and Data Mining (TDM) algorithms used for enriching the OpenAIRE Graph are grouped in the following main categories:
* [Metadata extraction](metadata_extraction.md)
* [Affiliation matching](affiliation_matching.md)
* [Citation matching](citation_matching.md)
* [Documents similarity](documents_similarity.md)