Closes #52: Improvement required for the "Extraction of cited concepts" Graph documentation page

Renaming title to "Extraction of referenced concepts".

Introducing the first batch of improvements related to "algorithmic details":
* describing “target database” for (a) the datasets, and (b) software
* the result of the process on the OpenAIRE Research Graph content
Marek Horst 2023-06-29 14:47:57 +02:00
parent 9fbc4cc6e0
commit e98b98ab71
1 changed files with 13 additions and 1 deletions

@@ -2,13 +2,25 @@
sidebar_position: 4
---
# Extraction of referenced concepts
***Short description:*** Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.
***Algorithmic details:***
The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database, which includes an inverted index with dataset/software titles, URLs and other metadata.
The following SQLite databases are involved in the mining process:
* [datasets] two separate databases, one for OpenTrials and one for datasets, both imported from the OpenAIRE Graph
* [software] a database including software entities imported from the OpenAIRE Graph and Software Heritage (SH) URLs imported from the SH API
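
The matching step described above can be illustrated with a minimal sketch. It assumes a hypothetical two-table SQLite schema (`concepts` and `title_tokens`) and a simple normalisation rule. Neither is taken from the actual mining code, and the function names are illustrative only. Titles are looked up through a token-level inverted index and then confirmed as whole phrases. URLs are matched by plain substring search.

```python
import re
import sqlite3


def normalize(text):
    # Lowercase and collapse every run of non-alphanumerics into one space.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def build_target_db(records):
    """Build an in-memory target database from (id, title, url) tuples.

    The schema is hypothetical: one table of concepts plus a
    token -> concept inverted index over the normalized titles.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE concepts (id TEXT PRIMARY KEY, title TEXT, url TEXT)")
    conn.execute("CREATE TABLE title_tokens (token TEXT, concept_id TEXT)")
    conn.execute("CREATE INDEX idx_title_tokens ON title_tokens(token)")
    for cid, title, url in records:
        conn.execute("INSERT INTO concepts VALUES (?, ?, ?)", (cid, title, url))
        for token in set(normalize(title).split()):
            conn.execute("INSERT INTO title_tokens VALUES (?, ?)", (token, cid))
    conn.commit()
    return conn


def match_concepts(conn, citation_text):
    """Return the sorted ids of concepts referenced in the citation text."""
    haystack = f" {normalize(citation_text)} "
    raw = citation_text.lower()

    # Step 1: the inverted index yields candidates sharing at least one token.
    candidates = set()
    for token in set(haystack.split()):
        rows = conn.execute(
            "SELECT concept_id FROM title_tokens WHERE token = ?", (token,)
        )
        candidates.update(row[0] for row in rows)

    # Step 2: keep candidates whose full normalized title occurs as a phrase.
    hits = set()
    for cid in candidates:
        (title,) = conn.execute(
            "SELECT title FROM concepts WHERE id = ?", (cid,)
        ).fetchone()
        if f" {normalize(title)} " in haystack:
            hits.add(cid)

    # Step 3: URLs are matched by substring search on the raw text.
    for cid, url in conn.execute("SELECT id, url FROM concepts WHERE url != ''"):
        if url.lower() in raw:
            hits.add(cid)
    return sorted(hits)
```

The two-step title lookup keeps the phrase check limited to concepts that share a token with the citation section, which is the usual reason for keeping an inverted index next to the metadata.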
***The result of the process on the OpenAIRE Research Graph content:***
The following content is generated as an outcome of the software mining:
* a new software entity in the OpenAIRE Graph including `title`, `description`, `codeRepositoryUrl`, and two instance objects pointing to the original repository and to the SH resource
* a bi-directional relation between the publication and the software
Dataset mining links to an already existing dataset entity, so only the following outcome is generated:
* a bi-directional relation between the publication and the dataset
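
The two outcomes can be sketched as plain data structures. The field names `title`, `description` and `codeRepositoryUrl` come from the description above. Everything else is an assumption made for illustration, not the OpenAIRE Graph data model: the `instances` layout, the relation labels `references` / `isReferencedBy`, and the function names.

```python
from dataclasses import dataclass, field


@dataclass
class SoftwareEntity:
    # Field names follow the documentation; the instance layout is hypothetical.
    title: str
    description: str
    codeRepositoryUrl: str
    instances: list = field(default_factory=list)


def bidirectional_relations(publication_id, target_id):
    """Return the relation in both directions (labels are hypothetical)."""
    return [
        {"source": publication_id, "target": target_id, "relClass": "references"},
        {"source": target_id, "target": publication_id, "relClass": "isReferencedBy"},
    ]


def software_mining_outcome(publication_id, software_id, title, description,
                            repo_url, sh_url):
    """Software mining: a new entity with two instances, plus the relations."""
    entity = SoftwareEntity(
        title=title,
        description=description,
        codeRepositoryUrl=repo_url,
        # One instance for the original repository, one for the SH resource.
        instances=[{"url": repo_url}, {"url": sh_url}],
    )
    return entity, bidirectional_relations(publication_id, software_id)


def dataset_mining_outcome(publication_id, dataset_id):
    """Dataset mining: the dataset already exists, so only relations are added."""
    return bidirectional_relations(publication_id, dataset_id)
```

The asymmetry is deliberate: software mining creates an entity and relations, while dataset mining only adds relations to an entity that is already in the graph.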
***Parameters:***
Title, URL, creator names, publisher names and publication year for each concept to create the target database. Identifier and publication's fulltext to extract the cited concepts.