Introducing the description of mining algorithms developed by ICM #15
@@ -34,7 +34,7 @@ The match strength of the given affiliation-organization pair is based on the ma

The total match strength is calculated in such a way that each consecutive voter reduces (by its match strength) the gap of uncertainty about the correctness of the given match.

***Parameters:*** -
***Parameters:***

* input
* input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations.
@@ -45,10 +45,12 @@ The total match strength is calculated in such a way that each consecutive voter
* output
* [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with publications matched with organizations.

***Limitations:*** -
***Limitations:***

***Environment:*** Java, Spark
***Environment:***

***References:*** -
Java, Spark

***References:***

***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching)
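One possible reading of the "gap of uncertainty" sentence in the affiliation matching description is that each voter closes a fraction of the remaining distance to 1, where that fraction equals the voter's own match strength. The sketch below illustrates that reading only; the function name `total_match_strength` and the assumption that strengths lie in [0, 1] are not taken from the project code, and the actual implementation may differ.

```python
def total_match_strength(voter_strengths):
    """Combine per-voter match strengths (assumed to lie in [0, 1]).

    Each consecutive voter reduces the remaining uncertainty (1 - total)
    by a fraction equal to its own match strength.
    """
    total = 0.0
    for strength in voter_strengths:
        if not 0.0 <= strength <= 1.0:
            raise ValueError("match strength must be within [0, 1]")
        # Close part of the gap between the current total and full certainty.
        total += (1.0 - total) * strength
    return total
```

Under this reading, two voters with strength 0.5 each give a total of 0.75, and the total never exceeds 1 regardless of how many voters agree.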
@@ -26,16 +26,18 @@ The heuristic is based on indexing of document metadata by their author names. F
In this step, all the potentially matching pairs obtained in the heuristic step are evaluated and only the most probable ones are returned as the final result. As citations tend to contain spelling errors and differ in style, there is a need to introduce fuzzy similarity measures fitted to the specifics of various metadata fields. Most of them compute a fraction of tokens or trigrams that occur in both fields being compared. When comparing journal
names, we take the longest common subsequence (LCS) of the two strings into consideration. This can be seen as an instance of the assignment problem with some refinements added. The overall similarity of two citation strings is obtained by applying a linear Support Vector Machine (SVM) using field similarities as features.

***Parameters:*** -
***Parameters:***
* input:
* input_metadata: [ExtractedDocumentMetadataMergedWithOriginal](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/transformers/metadatamerger/ExtractedDocumentMetadataMergedWithOriginal.avdl) avro datastore location with the metadata of both publications and bibliographic references to be matched
* input_matched_citations: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with citations which were already matched and should be excluded from fuzzy matching
* output: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with matched publications

***Limitations:*** -
***Limitations:***

***Environment:*** Java, Spark
***Environment:***

***References:*** -
Java, Spark

***References:***

***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/citation-matching](https://github.com/CeON/CoAnSys/tree/master/citation-matching)
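The field-level measures mentioned in the citation matching description (trigram overlap, LCS for journal names, and a linear model over field similarities) could look roughly like the sketch below. It is illustrative only: the Jaccard-style denominator, the character-level LCS, and the names `trigram_similarity`, `lcs_similarity` and `linear_score` are assumptions, and the weights of the real SVM are not given in the text.

```python
def trigrams(text):
    """Return the set of character trigrams of a whitespace-normalized, lowercased string."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_similarity(a, b):
    """Fraction of trigrams shared by both fields (assumed here: shared / all distinct)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def lcs_length(a, b):
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length scaled by the longer string (the scaling is an assumption)."""
    if not a or not b:
        return 0.0
    return lcs_length(a.lower(), b.lower()) / max(len(a), len(b))


def linear_score(features, weights, bias):
    """Decision value of a linear model: w . x + b (weights here are hypothetical)."""
    return sum(w * x for w, x in zip(weights, features)) + bias
```

A pair of citations would then be scored by computing one similarity per metadata field and passing the resulting feature vector to the linear model; a positive decision value would indicate a probable match.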
@@ -27,7 +27,7 @@ Computation of similarity between documents is executed in the following steps.
c. Finally, triples are normalized using the product of the norms of the term-weight vectors. The normalized value is the final similarity measure, with a value between 0 and 1.
5. For a given document, only the top R (for example, 20) links to similar documents are returned. The links that are discarded are assumed to be uninteresting for the end user, so storing them would needlessly take up disk space.

***Parameters:*** -
***Parameters:***
* input:
* input_document: [DocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/documentssimilarity/DocumentMetadata.avdl) avro datastore location
* parallel: sets parameter parallel for Pig actions (default=80)
@@ -38,9 +38,12 @@ Computation of similarity between documents is executed in the following steps.
* removal_least_used: removal of the least used terms (default=20)
* threshold_num_of_vector_elems_length: vector elements length threshold; when set to less than 2, all documents will be included in similarity matching (default=2)
* output: [DocumentSimilarity](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/documentssimilarity/DocumentSimilarity.avdl) avro datastore location
***Limitations:*** -

***Environment:*** Pig, Java
***Limitations:***

***Environment:***

Pig, Java

***References:***
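The normalization and top-R steps of the document similarity description amount to cosine similarity over term-weight vectors followed by a per-document cut-off. The sketch below shows this in plain Python, assuming non-negative term weights stored in dictionaries; the names `cosine_similarity` and `top_similar` are hypothetical, and the real workflow runs as Pig jobs rather than in memory.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Dot product of two sparse term-weight vectors divided by the product of their norms."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_product = (math.sqrt(sum(w * w for w in u.values()))
                    * math.sqrt(sum(w * w for w in v.values())))
    # With non-negative weights the result lies between 0 and 1.
    return dot / norm_product if norm_product else 0.0


def top_similar(doc_id, vectors, r=20):
    """Return up to r (other_id, similarity) pairs with the highest similarity to doc_id."""
    scored = [(other, cosine_similarity(vectors[doc_id], vec))
              for other, vec in vectors.items() if other != doc_id]
    return heapq.nlargest(r, scored, key=lambda pair: pair[1])
```

For example, with `vectors = {"d1": {"a": 1.0, "b": 1.0}, "d2": {"a": 1.0}, "d3": {"c": 1.0}}`, calling `top_similar("d1", vectors, r=1)` keeps only the link to `d2` and drops the zero-similarity link to `d3`.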
@@ -1,6 +1,7 @@
# Metadata extraction

***Short description:***

The Metadata Extraction algorithm is responsible for extracting plaintext and metadata from PDF documents. It is based on the [CERMINE](http://cermine.ceon.pl/about.html) project.

CERMINE is a comprehensive open-source system for extracting metadata and content from scientific articles in born-digital form. The system is able to process documents in PDF format and extracts:
@@ -22,14 +23,17 @@ CERMINE workflow is composed of four main parts:

CERMINE uses supervised and unsupervised machine-learning techniques, such as Support Vector Machines, K-means clustering and Conditional Random Fields. Content classifiers are trained on the [GROTOAP2 dataset](http://cermine.ceon.pl/grotoap2/). More information about CERMINE can be found in the [presentation](http://cermine.ceon.pl/static/docs/slides.pdf).

***Parameters:*** -
***Parameters:***
* input: [DocumentText](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/DocumentText.avdl) avro datastore location
* output: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location

***Limitations:***

Only born-digital PDF documents are supported. Large PDF documents may require more than the 4g of memory assigned by default.

***Environment:*** Java, Hadoop
***Environment:***

Java, Hadoop

***References:***
* Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski. CERMINE: automatic extraction of structured metadata from scientific literature. In International Journal on Document Analysis and Recognition, 2015, vol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8.