From a1819728e87577035aebdfa61f6ab7ccca050159 Mon Sep 17 00:00:00 2001
From: Claudio Atzori
Date: Fri, 26 Jul 2024 12:58:44 +0200
Subject: [PATCH] uodated cheangelog

---
 docs/changelog.md                             | 22 +++++++-
 .../enrichment-by-pid/orcid-alternative.md    | 55 -------------------
 2 files changed, 20 insertions(+), 57 deletions(-)
 delete mode 100644 docs/graph-production-workflow/enrichment-by-pid/orcid-alternative.md

diff --git a/docs/changelog.md b/docs/changelog.md
index 14e01a5..1b421f7 100644
--- a/docs/changelog.md
+++ b/docs/changelog.md
@@ -19,6 +19,20 @@ This section documents all notable changes for each graph version.
 
 ---
 
+### v9.0.0
+_Start Date: 2024-07-03 • Release Date: 2024-07-15 • Dataset release: **yes**_
+
+#### Added
+
+- General increase of the scientific products with ORCID identified authors +0.43% (+145K)
+
+#### Changed
+
+- Improved matching of organizations in the deduplication algorithm, leading to less false positives
+- Updated Crossref publications to include contents until May 2024
+- Updated ORCID contents until June 2024
+- Updated Datacite contents until June 2024
+
 ### v8.0.0
 _Start Date: 2024-05-15 • Release Date: 2024-06-20 • Dataset release: **no**_
 
@@ -31,8 +45,12 @@ _Start Date: 2024-05-15 • Release Date: 2024-06-20 • Dataset release:
 #### Changed
 
 - Revised deduplication configuration to better exploit resource types
-- The DOIBoost dataset was superseded by the direct aggregation of its datasources: Crossref, Unpaywall, Microsoft Academic Graph, ORCID
-- Relaxed Crossref publication inclusion criteria, now accepting records without author information, leading to a +15% increase (from 127Mi to 146Mi records). Included contents until April 2023
+- The DOIBoost dataset was superseded by the direct aggregation of its datasources: Crossref, Unpaywall, Microsoft
+  Academic Graph, ORCID. See the [aggregation of the non compatible sources](category/non-compatible-sources) section
+  to know more
+  details
+- Relaxed Crossref publication inclusion criteria, now accepting records without author information, leading to a
+  +15% increase (from 127Mi to 146Mi records). Included contents until April 2024
 - Updated ORCID contents until April 2024
 - Updated Datacite contents until April 2024
 
diff --git a/docs/graph-production-workflow/enrichment-by-pid/orcid-alternative.md b/docs/graph-production-workflow/enrichment-by-pid/orcid-alternative.md
deleted file mode 100644
index 2e32dc6..0000000
--- a/docs/graph-production-workflow/enrichment-by-pid/orcid-alternative.md
+++ /dev/null
@@ -1,55 +0,0 @@
-## Enrichment from ORCID
-
-OpenAIRE enhances publication metadata by incorporating author information from ORCID. This involves adding persistent identifiers to authors and leveraging ORCID data to improve author disambiguation.
-
-### Enrichment Process
-
-The following steps outline how ORCID information is integrated into the OpenAIRE Graph:
-
-#### Extracting Author and Work Information
-
-1. **Data Collection:** OpenAIRE extracts the following from ORCID profiles:
-    * Author information: ORCID, family name, given name, other names, credit name
-    * Work information: Persistent identifiers (DOI, PMC, PMID, arXiv, handle)
-
-2. **ORCID-Work Pair Creation:** For each work identified by a persistent identifier (PID), an ORCID-Work pair is created. For example:
-    * ``
-    * ``
-
-#### Grouping by Work Persistent Identifier
-
-ORCID-Work pairs are grouped by the work's persistent identifier to identify multiple authors contributing to the same work. This results in structures like:
-* ``
-
-**Note:**
-* `orcidx`: ORCID identifier with associated author name information.
-* `doix`: Persistent identifier schema and value (e.g., `<"doi", "10....">`).
-
-#### Matching with Graph and Enriching Author Metadata
-
-1. **Graph Search:** For each ORCID-Work pair, OpenAIRE searches the Graph for a corresponding result based on the persistent identifier.
-2. **Author Matching:** Potential authors within the graph result are compared to ORCID profile authors using an *author name disambiguation* algorithm.
-3. **Metadata Enrichment:** Successful matches enrich the graph's author information with the ORCID identifier.
-
-#### Author Name Disambiguation Algorithm
-
-The algorithm compares authors from the graph and ORCID profiles for the same persistent identifier. It employs the following matching strategies in decreasing order of confidence:
-
-1. **Exact Full Name Match:** Matches full names (given name + family name) directly.
-2. **Exact Reversed Full Name Match:** Matches full names with reversed order (family name + given name).
-3. **Ordered Token Match:** Compares author names tokenized into individual words, considering word order and allowing for variations (e.g., abbreviations).
-4. **Exact Credit Name Match:** Matches the graph author's full name with the ORCID author's credit name.
-5. **Exact Other Names Match:** Matches the graph author's full name with ORCID author's other names.
-
-Upon finding a match, the graph author's information is enriched with ORCID data, and the matched ORCID author is removed from the comparison list. This process continues until no more matches are found.
-
-**Example:**
-
-Consider the following author lists:
-
-* **Graph List:** Robert Stein, Sjoert van Velzen, Marek Kowalski, ...
-* **ORCID List:** Marek Kowalski, Itai Sfaradi, James Carl Miller-Jones, ...
-
-The algorithm applies matching strategies sequentially, starting with exact full name matches and progressing to ordered token matching. For instance, "Marek Kowalski" would be matched using the exact full name strategy.
-
-**By combining these approaches, OpenAIRE improves the accuracy of author identification and linking.**
\ No newline at end of file