1 Commits

Author SHA1 Message Date
Marek Horst e98b98ab71 Closes #52: Improvement required for the "Extraction of cited concepts" Graph documentation page
Renaming title to "Extraction of referenced concepts".

Introducing the first batch of improvements related to "algorithmic details":
* describing “target database” for (a) the datasets, and (b) software
* the result of the process on the OpenAIRE Research Graph content
2023-06-29 14:47:57 +02:00
1 changed file with 13 additions and 1 deletion

@@ -2,13 +2,25 @@
sidebar_position: 4
---
-# Extraction of cited concepts
+# Extraction of referenced concepts
***Short description:*** Scans the plaintexts of publications for cited concepts, currently for references to datasets and software URIs.
***Algorithmic details:***
The algorithm extracts citations to specific datasets and software. It extracts the citation section of a publication's fulltext and applies string matching against a target database, which includes an inverted index with dataset/software titles, URLs and other metadata.
The following SQLite databases are involved in the mining process:
* [datasets] two databases, opentrials and datasets, kept separately and both imported from the OpenAIRE Graph
* [software] a database including software entities imported from the OpenAIRE Graph and Software Heritage (SH) URLs imported from the SH API
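The matching step described above can be sketched as follows. This is a minimal illustration only, not the actual mining code: the table layout, the tokenization, and the function names `tokenize`, `build_target_db` and `match_citation` are assumptions made for the example, and the sample records are made up.

```python
import re
import sqlite3


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_target_db(concepts):
    """Build an in-memory SQLite target database with an inverted index.

    `concepts` is an iterable of (concept_id, title, url) tuples.
    The schema is hypothetical and chosen only for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE concept (id TEXT PRIMARY KEY, title TEXT, url TEXT)")
    conn.execute("CREATE TABLE inverted_index (token TEXT, concept_id TEXT)")
    conn.execute("CREATE INDEX idx_token ON inverted_index (token)")
    for concept_id, title, url in concepts:
        conn.execute("INSERT INTO concept VALUES (?, ?, ?)", (concept_id, title, url))
        # Index the tokens of both the title and the URL.
        tokens = set(tokenize(title)) | set(tokenize(url))
        conn.executemany(
            "INSERT INTO inverted_index VALUES (?, ?)",
            [(tok, concept_id) for tok in tokens],
        )
    return conn


def match_citation(conn, citation):
    """Return sorted ids of concepts whose full title or URL occurs in the citation."""
    tokens = sorted(set(tokenize(citation)))
    if not tokens:
        return []
    # Step 1: the inverted index narrows the search to candidate concepts.
    placeholders = ",".join("?" * len(tokens))
    candidates = conn.execute(
        "SELECT DISTINCT c.id, c.title, c.url FROM inverted_index i "
        "JOIN concept c ON c.id = i.concept_id "
        f"WHERE i.token IN ({placeholders})",
        tokens,
    ).fetchall()
    # Step 2: plain string matching confirms each candidate.
    normalized = " " + " ".join(tokenize(citation)) + " "
    matches = []
    for concept_id, title, url in candidates:
        title_hit = " " + " ".join(tokenize(title)) + " " in normalized
        url_hit = bool(url) and url.lower() in citation.lower()
        if title_hit or url_hit:
            matches.append(concept_id)
    return sorted(matches)


# Made-up sample records, for illustration only.
SAMPLE_CONCEPTS = [
    ("dataset::1", "Global Ocean Temperature Atlas", "https://example.org/gota"),
    ("software::1", "FastAligner", "https://example.org/fastaligner"),
]
```

Calling `match_citation(build_target_db(SAMPLE_CONCEPTS), citation)` for each entry of a publication's citation section yields the identifiers of the matched datasets or software. Keeping separate databases for datasets and software, as listed above, would amount to building one such connection per database.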
***The result of the process on the OpenAIRE Research Graph content:***
Software mining generates the following content:
* a new software entity in the OpenAIRE Graph including `title`, `description`, `codeRepositoryUrl` and two instance objects, one pointing to the original repository and one to the SH resource
* a bi-directional relation between the publication and the software
Dataset mining links to an already existing dataset entity, so it generates only:
* a bi-directional relation between the publication and the dataset
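The shape of these outcomes can be illustrated with a small sketch. Only the field names `title`, `description` and `codeRepositoryUrl` come from the description above; the class names, the `instances` field, the identifiers and the relation names `references` / `isReferencedBy` are hypothetical and do not reflect the actual OpenAIRE Graph schema.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical instance object: one location where the software is found."""
    url: str


@dataclass
class SoftwareEntity:
    """Hypothetical shape of the software entity created by software mining."""
    id: str
    title: str
    description: str
    codeRepositoryUrl: str
    instances: list = field(default_factory=list)


@dataclass(frozen=True)
class Relation:
    """A directed relation between two graph entities."""
    source: str
    target: str
    name: str


def bidirectional(publication_id, concept_id):
    """Return the pair of relations linking a publication and a concept.

    The relation names are placeholders chosen for this sketch.
    """
    return [
        Relation(publication_id, concept_id, "references"),
        Relation(concept_id, publication_id, "isReferencedBy"),
    ]


def software_mining_output(publication_id, software_id, title, description,
                           repo_url, sh_url):
    """Software mining: a new entity with two instances, plus the relation pair."""
    entity = SoftwareEntity(
        id=software_id,
        title=title,
        description=description,
        codeRepositoryUrl=repo_url,
        # One instance for the original repository, one for the SH resource.
        instances=[Instance(repo_url), Instance(sh_url)],
    )
    return entity, bidirectional(publication_id, software_id)


def dataset_mining_output(publication_id, dataset_id):
    """Dataset mining: the dataset already exists, so only relations are produced."""
    return bidirectional(publication_id, dataset_id)
```

The asymmetry between the two functions mirrors the text: software mining creates an entity and links it, whereas dataset mining only links to an entity that is already in the graph.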
***Parameters:***
Title, URL, creator names, publisher names and publication year for each concept to create the target database. Identifier and publication's fulltext to extract the cited concepts.