merging with main
@ -0,0 +1,31 @@
|
||||
---
|
||||
sidebar_position: 3
|
||||
---
|
||||
|
||||
# Extraction of acknowledged concepts
|
||||
|
||||
***Short description:***
|
||||
Scans the plaintexts of publications for acknowledged concepts, including grant identifiers (projects) of funders, accession numbers of bioetities, EPO patent mentions, as well as custom concepts that can link research objects to specific research communities and initiatives in OpenAIRE.
|
||||
|
||||
***Algorithmic details:***
|
||||
The algorithm processes the publication's fulltext and extracts references to acknowledged concepts. It applies pattern matching and string join between the fulltext and a target database which contains the title, the acronym and the identifier of the searched concept.
|
||||
|
||||
***Parameters:***
|
||||
Concept titles, acronyms, and identifiers, publication's identifiers and fulltexts
|
||||
|
||||
***Limitations:*** -
|
||||
|
||||
***Environment:***
|
||||
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
|
||||
|
||||
***References:***
|
||||
* Foufoulas, Y., Zacharia, E., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2022). DETEXA: Declarative Extensible Text Exploration and Analysis. In: , et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. [doi:10.1007/978-3-031-16802-4_9](https://doi.org/10.1007/978-3-031-16802-4_9)
|
||||
|
||||
***Authority:*** ATHENA RC • ***License:*** CC-BY/CC-0 • ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
@ -0,0 +1,58 @@
|
||||
---
|
||||
sidebar_position: 1
|
||||
---
|
||||
|
||||
# Affiliation matching
|
||||
|
||||
***Short description:***
|
||||
The goal of the affiliation matching module is to match affiliations extracted from the pdf and xml documents with organizations from the OpenAIRE organization database.
|
||||
|
||||
***Algorithmic details:***
|
||||
|
||||
*The buckets concept*
|
||||
|
||||
In order to get the best possible results, the algorithm should compare every affiliation with every organization. However, this approach would be very inefficient and slow, because it would involve the processing of the cartesian product (all possible pairs) of millions of affiliations and thousands of organizations. To avoid this, IIS has introduced the concept of buckets. A bucket is a smaller group of affiliations and organizations that have been selected to be matched with one another. The matching algorithm compares only these affiliations and organizations that belong to the same bucket.
|
||||
|
||||
*Affiliation matching process*
|
||||
|
||||
Every affiliation in a given *bucket* is compared with every organization in the same bucket multiple times, each time by using a different algorithm (*voter*). Each *voter* is assigned a number (match strength) that describes the estimated correctness of the result of its comparison. All the affiliation-organization pairs that have been matched by at least one *voter*, will be assigned the match strength > 0 (the actual number depends on the voters, its calculation method will be shown later).
|
||||
|
||||
It is very important for the algorithm to group the affiliations and organizations properly i.e. the ones that have a chance to match should be in the same *bucket*. To guarantee this, the affiliation matching module allows to create different methods of dividing the affiliations and organizations into *buckets*, and to use all of these methods in a single matching process. The specific method of grouping the affiliations and organizations into *bucket* and then joining them into pairs is carried out by the service called *Joiner*.
|
||||
|
||||
Every *joiner* can be linked with many different *voters* that will tell if the affiliation-organization pairs joined match or not. By providing new *joiners* and *voters* one can extend the matching algorithm with countless new methods for matching affiliations with organizations, thus adjusting the algorithm to his or her needs.
|
||||
|
||||
All the affiliations and organizations are sequentially computed by all the *matchers*. In every *matcher* they are grouped by some *joiner* in pairs, and then these pairs are processed by all the *voters* in the *matcher*. Every affiliation-organization pair that has been matched at least once is assigned the match strength that depends on the match strengths of the *voters* that pointed the given pair is a match.
|
||||
|
||||
**NOTE:** There can be many organizations matched with a given affiliation, each of them matched with a different match strength. The user of the module can set a match strength threshold which will limit the results to only those matches that have the match strength greater than the specified threshold.
|
||||
|
||||
*Calculation of the match strength of the affiliation-organization pair matched by multiple matchers*
|
||||
|
||||
It often happens that the given affiliation-organization pair is returned as a match by more than one matcher, each time with a different match strength. In such a case **the match with the highest match strength will be selected**.
|
||||
|
||||
*Calculation of the match strength of the affiliation-organization pair within a single matcher*
|
||||
|
||||
Every voter has a match strength that is in the range (0, 1]. **The voter match strength says what the quotient of correct matches to all matches guessed by this voter is, and is based on real data and hundreds of matches prepared by hand.**
|
||||
|
||||
The match strength of the given affiliation-organization pair is based on the match strengths of all the voters in the matcher that have pointed that the pair is a match. It will always be less than or equal to 1 and greater than the match strength of each single voter that matched the given pair.
|
||||
|
||||
The total match strength is calculated in such a way that each consecutive voter reduces (by its match strength) the gap of uncertainty about the correctness of the given match.
|
||||
|
||||
***Parameters:***
|
||||
|
||||
* input
|
||||
* input_document_metadata: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location. Document metadata is the source of affiliations.
|
||||
* input_organizations: [Organization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/Organization.avdl) avro datastore location.
|
||||
* input_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/DocumentToProject.avdl) avro datastore location with **imported** document-to-project relations. These relations (alongside with inferred document-project and project-organization relations) are used to generate document-organization pairs which are used as a hint for matching affiliations.
|
||||
* input_inferred_document_to_project: [DocumentToProject](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/referenceextraction/project/DocumentToProject.avdl) avro datastore location with **inferred** document-to-project relations.
|
||||
* input_project_to_organization: [ProjectToOrganization](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/importer/ProjectToOrganization.avdl) avro datastore location. These relations (alongside with infered document-project and document-project relations) are used to generate document-organization pairs which are used as a hint for matching affiliations
|
||||
* output
|
||||
* [MatchedOrganization](https://github.com/openaire/iis/blob/master/iis-wf/iis-wf-affmatching/src/main/resources/eu/dnetlib/iis/wf/affmatching/model/MatchedOrganization.avdl) avro datastore location with matched publications with organizations.
|
||||
|
||||
***Limitations:*** -
|
||||
|
||||
***Environment:***
|
||||
Java, Spark
|
||||
|
||||
***References:*** -
|
||||
|
||||
***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/affiliation-organization-matching](https://github.com/CeON/CoAnSys/tree/master/affiliation-organization-matching)
|
@ -0,0 +1,42 @@
|
||||
# Citation matching
|
||||
|
||||
***Short description:***
|
||||
During a citation matching task, bibliographic entries are linked to the documents that they reference. The citation matching module - one of the modules of the Information Inference Service (IIS) - receives as an input a list of documents accompanied by their metadata and bibliography. Among them, it discovers links described above and returns them as a list. In this document we shall evaluate if the module has been properly integrated with the whole
|
||||
system and assess the accuracy of the algorithm used. It is worth mentioning that the implemented algorithm has been described in detail in arXiv:1303.6906 [cs.IR]1. However, in the referenced paper the algorithm was tested on small datasets, but here we will focus on larger datasets, which are expected to be analysed by the system in the production environment.
|
||||
|
||||
***Algorithmic details:***
|
||||
|
||||
*General description*
|
||||
|
||||
The algorithm used in citation matching task consists of two phases. In the first one, for each citation string a set of potentially matching documents is retrieved using a heuristic. In the second one, the metadata of these documents is analysed in order to assess which of them is the most similar to given citation. We assume that citations are parsed, i.e. fragments containing meaningful pieces of metadata information are marked in a special way. Note that in the IIS system, the citation parsing step is executed by another module. The following metadata fields are used by the described solution:
|
||||
|
||||
* an author,
|
||||
* a title,
|
||||
* a journal name,
|
||||
* pages,
|
||||
* a year of publication.
|
||||
|
||||
*Heuristic matching*
|
||||
|
||||
The heuristic is based on indexing of document metadata by their author names. For each citation we extract author names and try to find documents in the index which have the same author entries. As spelling errors and inaccuracies commonly occur in citations, we have implemented approximate index which enables retrieval of entities with edit distance less than or equal 1.
|
||||
|
||||
*Strict matching*
|
||||
|
||||
In this step, all the potentially matching pairs obtained in the heuristic step are evaluated and only the most probable ones are returned as the final result. As citations tend to contain spelling errors and differ in style, there is a need to introduce fuzzy similarity measures fitted to the specifics of various metadata fields. Most of them compute a fraction of tokens or trigrams that occur in both fields being compared. When comparing journal
|
||||
names, we have taken longest common subsequence (LCS) of two strings into consideration. This can be seen as an instance of the assignment problem with some refinements added. The overall similarity of two citation strings is obtained by applying a linear Support Vector Machine (SVM) using field similarities as features.
|
||||
|
||||
***Parameters:***
|
||||
|
||||
* input:
|
||||
* input_metadata: [ExtractedDocumentMetadataMergedWithOriginal](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/transformers/metadatamerger/ExtractedDocumentMetadataMergedWithOriginal.avdl) avro datastore location with the metadata of both publications and bibliorgaphic references to be matched
|
||||
* input_matched_citations: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with citations which were already matched and should be excluded from fuzzy matching
|
||||
* output: [Citation](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/common/citations/Citation.avdl) avro datastore location with matched publications
|
||||
|
||||
***Limitations:*** -
|
||||
|
||||
***Environment:***
|
||||
Java, Spark
|
||||
|
||||
***References:*** -
|
||||
|
||||
***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CoAnSys/citation-matching](https://github.com/CeON/CoAnSys/tree/master/citation-matching)
|
@ -0,0 +1,24 @@
|
||||
---
|
||||
sidebar_position: 4
|
||||
---
|
||||
|
||||
# Extraction of cited concepts
|
||||
|
||||
***Short description:***
|
||||
Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.
|
||||
|
||||
***Algorithmic details:***
|
||||
The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database which includes an inverted index with dataset/software titles, urls and other metadata.
|
||||
|
||||
***Parameters:***
|
||||
Title, URL, creator names, publisher names and publication year for each concept to create the target database. Identifier and publication's fulltext to extract the cited concepts
|
||||
|
||||
***Limitations:*** -
|
||||
|
||||
***Environment:***
|
||||
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
|
||||
|
||||
***References:***
|
||||
* Foufoulas Y., Stamatogiannakis L., Dimitropoulos H., Ioannidis Y. (2017) “High-Pass Text Filtering for Citation Matching”. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. [doi:10.1007/978-3-319-67008-9_28](https://doi.org/10.1007/978-3-319-67008-9_28)
|
||||
|
||||
***Authority:*** ATHENA RC • ***License:*** CC-BY/CC-0 • ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)
|
@ -0,0 +1,22 @@
|
||||
---
|
||||
sidebar_position: 5
|
||||
---
|
||||
|
||||
# Classifiers
|
||||
|
||||
***Short description:*** A document classification algorithm that employs analysis of free text stemming from the abstracts of the publications. The purpose of applying a document classification module is to assign a scientific text to one or more predefined content classes.
|
||||
|
||||
***Algorithmic details:***
|
||||
The algorithm classifies publication's fulltexts using a Bayesian classifier and weighted terms according to an offline training phase. The training has been done using the following taxonomies: arXiv, MeSH (Medical Subject Headings), ACM, and DDC (Dewey Decimal Classification, or Dewey Decimal System).
|
||||
|
||||
***Parameters:*** Publication's identifier and fulltext
|
||||
|
||||
***Limitations:*** -
|
||||
|
||||
***Environment:***
|
||||
Python, [madIS](https://github.com/madgik/madis), [APSW](https://github.com/rogerbinns/apsw)
|
||||
|
||||
***References:***
|
||||
* Giannakopoulos, T., Stamatogiannakis, E., Foufoulas, I., Dimitropoulos, H., Manola, N., Ioannidis, Y. (2014). Content Visualization of Scientific Corpora Using an Extensible Relational Database Implementation. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds) Theory and Practice of Digital Libraries -- TPDL 2013 Selected Workshops. TPDL 2013. Communications in Computer and Information Science, vol 416. Springer, Cham. [doi:10.1007/978-3-319-08425-1_10](https://doi.org/10.1007/978-3-319-08425-1_10)
|
||||
|
||||
***Authority:*** ATHENA RC • ***License:*** CC-BY/CC-0 • ***Code:*** [iis/referenceextraction](https://github.com/openaire/iis/tree/master/iis-wf/iis-wf-referenceextraction/src/main/resources/eu/dnetlib/iis/wf/referenceextraction)
|
@ -0,0 +1,37 @@
|
||||
# Metadata extraction
|
||||
|
||||
***Short description:***
|
||||
Metadata Extraction algorithm is responsible for plaintext and metadata extraction out of the PDF documents. It based on [CERMINE](http://cermine.ceon.pl/about.html) project.
|
||||
|
||||
CERMINE is a comprehensive open source system for extracting metadata and content from scientific articles in born-digital form. The system is able to process documents in PDF format and extracts:
|
||||
|
||||
* document's metadata, including title, authors, affiliations, abstract, keywords, journal name, volume and issue,
|
||||
* parsed bibliographic references
|
||||
* the structure of document's sections, section titles and paragraphs
|
||||
|
||||
CERMINE is based on a modular workflow, whose architecture ensures that individual workflow steps can be maintained separately. As a result it is easy to perform evaluation, training, improve or replace one step implementation without changing other parts of the workflow. Most steps implementations utilize supervised and unsupervised machine-leaning techniques, which increases the maintainability of the system, as well as its ability to adapt to new document layouts.
|
||||
|
||||
***Algorithmic details:***
|
||||
CERMINE workflow is composed of four main parts:
|
||||
|
||||
* Basic structure extraction takes a PDF file on the input and produces a geometric hierarchical structure representing the document. The structure is composed of pages, zones, lines, words and characters. The reading order of all elements is determined. Every zone is labelled with one of four general categories: METADATA, REFERENCES, BODY and OTHER.
|
||||
* Metadata extraction part analyses parts of the geometric hierarchical structure labelled as METADATA and extracts a rich set of document's metadata from it.
|
||||
* References extraction part analyses parts of the geometric hierarchical structure labelled as REFERENCES and the result is a list of document's parsed bibliographic references.
|
||||
* Text extraction part analyses parts of the geometric hierarchical structure labelled as BODY and extracts document's body structure composed of sections, subsections and paragraphs.
|
||||
|
||||
CERMINE uses supervised and unsupervised machine-leaning techniques, such as Support Vector Machines, K-means clustering and Conditional Random Fields. Content classifiers are trained on [GROTOAP2 dataset](http://cermine.ceon.pl/grotoap2/). More information about CERMINE can be found in the [presentation](http://cermine.ceon.pl/static/docs/slides.pdf).
|
||||
|
||||
***Parameters:***
|
||||
* input: [DocumentText](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/DocumentText.avdl) avro datastore location
|
||||
* output: [ExtractedDocumentMetadata](https://github.com/openaire/iis/blob/master/iis-schemas/src/main/avro/eu/dnetlib/iis/metadataextraction/ExtractedDocumentMetadata.avdl) avro datastore location
|
||||
|
||||
***Limitations:***
|
||||
Born-digital form of PDF documents is supported only. Large PDF documents may require more than 4g of assgined memory (set by default).
|
||||
|
||||
***Environment:***
|
||||
Java, Hadoop
|
||||
|
||||
***References:***
|
||||
* Dominika Tkaczyk, Pawel Szostek, Mateusz Fedoryszak, Piotr Jan Dendek and Lukasz Bolikowski. CERMINE: automatic extraction of structured metadata from scientific literature. In International Journal on Document Analysis and Recognition, 2015, vol. 18, no. 4, pp. 317-335, doi: 10.1007/s10032-015-0249-8.
|
||||
|
||||
***Authority:*** ICM • ***License:*** AGPL-3.0 • ***Code:*** [CERMINE](https://github.com/CeON/CERMINE)
|
@ -1,6 +0,0 @@
|
||||
---
|
||||
sidebar_position: 1
|
||||
---
|
||||
|
||||
# Mining algorithms
|
||||
<span className="todo">TODO</span>
|
Before Width: | Height: | Size: 170 KiB After Width: | Height: | Size: 174 KiB |
Before Width: | Height: | Size: 129 KiB After Width: | Height: | Size: 130 KiB |
Before Width: | Height: | Size: 181 KiB After Width: | Height: | Size: 184 KiB |
Before Width: | Height: | Size: 78 KiB After Width: | Height: | Size: 79 KiB |