updated changelog

Claudio Atzori 2024-07-26 12:58:44 +02:00
parent da4568b7c9
commit a1819728e8
2 changed files with 20 additions and 57 deletions


@ -19,6 +19,20 @@ This section documents all notable changes for each graph version.
---
### v9.0.0
_Start Date: 2024-07-03 • Release Date: 2024-07-15 • Dataset release: **yes**_
#### Added
- General increase in scientific products with ORCID-identified authors: +0.43% (+145K)
#### Changed
- Improved matching of organizations in the deduplication algorithm, leading to fewer false positives
- Updated Crossref publications to include contents until May 2024
- Updated ORCID contents until June 2024
- Updated Datacite contents until June 2024
### v8.0.0
_Start Date: 2024-05-15 • Release Date: 2024-06-20 • Dataset release: **no**_
@ -31,8 +45,12 @@ _Start Date: 2024-05-15 • Release Date: 2024-06-20 • Dataset release:
#### Changed
- Revised deduplication configuration to better exploit resource types
- The DOIBoost dataset was superseded by the direct aggregation of its datasources: Crossref, Unpaywall, Microsoft Academic Graph, ORCID. See the [aggregation of the non compatible sources](category/non-compatible-sources) section to know more details
- Relaxed Crossref publication inclusion criteria, now accepting records without author information, leading to a +15% increase (from 127Mi to 146Mi records). Included contents until April 2024
- Updated ORCID contents until April 2024
- Updated Datacite contents until April 2024


@ -1,55 +0,0 @@
## Enrichment from ORCID
OpenAIRE enhances publication metadata by incorporating author information from ORCID. This involves adding persistent identifiers to authors and leveraging ORCID data to improve author disambiguation.
### Enrichment Process
The following steps outline how ORCID information is integrated into the OpenAIRE Graph:
#### Extracting Author and Work Information
1. **Data Collection:** OpenAIRE extracts the following from ORCID profiles:
* Author information: ORCID, family name, given name, other names, credit name
* Work information: Persistent identifiers (DOI, PMC, PMID, arXiv, handle)
2. **ORCID-Work Pair Creation:** For each work identified by a persistent identifier (PID), an ORCID-Work pair is created. For example:
* `<orcid1, doi1>`
* `<orcid1, pmc1>`
#### Grouping by Work Persistent Identifier
ORCID-Work pairs are grouped by the work's persistent identifier to identify multiple authors contributing to the same work. This results in structures like:
* `<doi1, [orcid1, orcid2]>`
**Note:**
* `orcidx`: ORCID identifier with associated author name information.
* `doix`: Persistent identifier schema and value (e.g., `<"doi", "10....">`).
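The pair creation and grouping steps above can be sketched as follows. This is a minimal illustration, not the OpenAIRE implementation: the `profiles` structure, the function names, and the placeholder identifiers (`orcid1`, `doi1`, ...) are assumptions chosen to mirror the notation used in this section.

```python
from collections import defaultdict

# Hypothetical, simplified ORCID profiles: each one lists the persistent
# identifiers of its works as (schema, value) tuples.
profiles = [
    {"orcid": "orcid1", "works": [("doi", "doi1"), ("pmc", "pmc1")]},
    {"orcid": "orcid2", "works": [("doi", "doi1")]},
]


def orcid_work_pairs(profiles):
    """Yield one <orcid, pid> pair per work PID found in each profile."""
    for profile in profiles:
        for pid in profile["works"]:
            yield profile["orcid"], pid


def group_by_work(pairs):
    """Group the pairs by work PID, producing <pid, [orcid, ...]>."""
    grouped = defaultdict(list)
    for orcid, pid in pairs:
        if orcid not in grouped[pid]:  # avoid listing the same author twice
            grouped[pid].append(orcid)
    return dict(grouped)


grouped = group_by_work(orcid_work_pairs(profiles))
```

Here `grouped[("doi", "doi1")]` holds both `orcid1` and `orcid2`, which corresponds to the `<doi1, [orcid1, orcid2]>` structure described above.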
#### Matching with Graph and Enriching Author Metadata
1. **Graph Search:** For each ORCID-Work pair, OpenAIRE searches the Graph for a corresponding result based on the persistent identifier.
2. **Author Matching:** Potential authors within the graph result are compared to ORCID profile authors using an *author name disambiguation* algorithm.
3. **Metadata Enrichment:** Successful matches enrich the graph's author information with the ORCID identifier.
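As a rough sketch of these three steps, the snippet below looks up each grouped work in an in-memory index of graph results and attaches the ORCID identifier to the matching author. The data layout, the `enrich_graph` function, the example DOI, and the `orcid-a` placeholder are all hypothetical; the author comparison is passed in as a function because the actual comparison is the disambiguation algorithm described in the next section.

```python
# Hypothetical index of graph results keyed by (schema, value) PID.
graph_index = {
    ("doi", "10.1234/example"): {
        "id": "result-1",
        "authors": [{"fullname": "Marek Kowalski", "pids": []}],
    },
}

# Hypothetical output of the grouping step: PID -> ORCID authors.
orcid_by_work = {
    ("doi", "10.1234/example"): [{"orcid": "orcid-a", "fullname": "Marek Kowalski"}],
    ("pmc", "PMC0"): [{"orcid": "orcid-b", "fullname": "Itai Sfaradi"}],
}


def enrich_graph(graph_index, orcid_by_work, same_author):
    """Attach ORCID ids to graph authors; return the number of enrichments."""
    enriched = 0
    for pid, orcid_authors in orcid_by_work.items():
        result = graph_index.get(pid)  # 1. graph search by PID
        if result is None:
            continue
        candidates = list(orcid_authors)
        for author in result["authors"]:
            for cand in candidates:
                # 2. author matching (delegated to the supplied function)
                if same_author(author["fullname"], cand["fullname"]):
                    # 3. metadata enrichment with the ORCID identifier
                    author["pids"].append(("orcid", cand["orcid"]))
                    candidates.remove(cand)
                    enriched += 1
                    break
    return enriched


# Exact string equality stands in for the real name comparison here.
count = enrich_graph(graph_index, orcid_by_work, lambda a, b: a == b)
```

Works whose PID is not present in the index (the `pmc` entry in this example) are skipped.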
#### Author Name Disambiguation Algorithm
The algorithm compares authors from the graph and ORCID profiles for the same persistent identifier. It employs the following matching strategies in decreasing order of confidence:
1. **Exact Full Name Match:** Matches full names (given name + family name) directly.
2. **Exact Reversed Full Name Match:** Matches full names with reversed order (family name + given name).
3. **Ordered Token Match:** Compares author names tokenized into individual words, considering word order and allowing for variations (e.g., abbreviations).
4. **Exact Credit Name Match:** Matches the graph author's full name with the ORCID author's credit name.
5. **Exact Other Names Match:** Matches the graph author's full name with ORCID author's other names.
Upon finding a match, the graph author's information is enriched with ORCID data, and the matched ORCID author is removed from the comparison list. This process continues until no more matches are found.
**Example:**
Consider the following author lists:
* **Graph List:** Robert Stein, Sjoert van Velzen, Marek Kowalski, ...
* **ORCID List:** Marek Kowalski, Itai Sfaradi, James Carl Miller-Jones, ...
The algorithm applies matching strategies sequentially, starting with exact full name matches and progressing to ordered token matching. For instance, "Marek Kowalski" would be matched using the exact full name strategy.
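The strategy cascade can be illustrated with the sketch below. It is an assumption-laden approximation, not the OpenAIRE code: the `OrcidAuthor` class, the `norm` helper, the strategy labels, and the `orcid-a`/`orcid-b`/`orcid-c` placeholders are invented for illustration, and the ordered token match is simplified to "same number of tokens, each pair equal or one an initial of the other".

```python
from dataclasses import dataclass, field


@dataclass
class OrcidAuthor:
    orcid: str
    given_name: str
    family_name: str
    credit_name: str = ""
    other_names: list = field(default_factory=list)


def norm(name):
    """Lowercase, drop dots and commas, collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def ordered_token_match(a, b):
    """Simplified assumption: tokens match pairwise, allowing initials."""
    ta, tb = norm(a).split(), norm(b).split()
    if not ta or len(ta) != len(tb):
        return False
    return all(
        x == y
        or (len(x) == 1 and y.startswith(x))
        or (len(y) == 1 and x.startswith(y))
        for x, y in zip(ta, tb)
    )


# Strategies in decreasing order of confidence, as listed above.
STRATEGIES = [
    ("exact_full_name",
     lambda g, o: norm(g) == norm(f"{o.given_name} {o.family_name}")),
    ("exact_reversed_full_name",
     lambda g, o: norm(g) == norm(f"{o.family_name} {o.given_name}")),
    ("ordered_token",
     lambda g, o: ordered_token_match(g, f"{o.given_name} {o.family_name}")),
    ("exact_credit_name",
     lambda g, o: bool(o.credit_name) and norm(g) == norm(o.credit_name)),
    ("exact_other_names",
     lambda g, o: any(norm(g) == norm(n) for n in o.other_names)),
]


def match_authors(graph_authors, orcid_authors):
    """Map each matched graph author name to (orcid, strategy label)."""
    remaining = list(orcid_authors)
    matches = {}
    for label, strategy in STRATEGIES:
        for g in graph_authors:
            if g in matches:
                continue
            for o in remaining:
                if strategy(g, o):
                    matches[g] = (o.orcid, label)
                    remaining.remove(o)  # a matched ORCID author is not reused
                    break
    return matches


graph_list = ["Robert Stein", "Sjoert van Velzen", "Marek Kowalski"]
orcid_list = [
    OrcidAuthor("orcid-a", "Marek", "Kowalski"),
    OrcidAuthor("orcid-b", "Itai", "Sfaradi"),
    OrcidAuthor("orcid-c", "James Carl", "Miller-Jones"),
]
matches = match_authors(graph_list, orcid_list)
```

With these inputs only "Marek Kowalski" is matched, through the exact full name strategy. A graph author recorded as "M. Kowalski" would instead fall through to the ordered token strategy under the simplified rule assumed here.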
**By combining these approaches, OpenAIRE improves the accuracy of author identification and linking.**