From f187b1aafb465f156735cc0235ca55065a3be6a7 Mon Sep 17 00:00:00 2001
From: Claudio Atzori
Date: Fri, 26 Jul 2024 12:01:20 +0200
Subject: [PATCH] WIP: added ORCID enrichment alternative

---
 .../non-compatible-sources/orcid.md | 2 +-
 .../enrichment-by-pid/orcid-alternative.md | 55 +++++++++++++++++++
 2 files changed, 56 insertions(+), 1 deletion(-)
 create mode 100644 docs/graph-production-workflow/enrichment-by-pid/orcid-alternative.md

diff --git a/docs/graph-production-workflow/aggregation/non-compatible-sources/orcid.md b/docs/graph-production-workflow/aggregation/non-compatible-sources/orcid.md
index 321228f..e959c4b 100644
--- a/docs/graph-production-workflow/aggregation/non-compatible-sources/orcid.md
+++ b/docs/graph-production-workflow/aggregation/non-compatible-sources/orcid.md
@@ -58,6 +58,6 @@ For a more extensive description of the different fields and the schema of the r
 ## Process
 The information obtained by ORCID is used to enrich the Graph, in particular to add the author identifiers to the results not providing one.
-This process is described in the [enrichment by PID](/graph-production-workflow/enrichment-by-pid/orcid-enrichment) section.
+This process is described in the [enrichment by PID](../../enrichment-by-pid/orcid-enrichment) section.
diff --git a/docs/graph-production-workflow/enrichment-by-pid/orcid-alternative.md b/docs/graph-production-workflow/enrichment-by-pid/orcid-alternative.md
new file mode 100644
index 0000000..2e32dc6
--- /dev/null
+++ b/docs/graph-production-workflow/enrichment-by-pid/orcid-alternative.md
@@ -0,0 +1,55 @@
+## Enrichment from ORCID
+
+OpenAIRE enhances publication metadata by incorporating author information from ORCID. This involves adding persistent identifiers to authors and leveraging ORCID data to improve author disambiguation.
+
+### Enrichment Process
+
+The following steps outline how ORCID information is integrated into the OpenAIRE Graph:
+
+#### Extracting Author and Work Information
+
+1. **Data Collection:** OpenAIRE extracts the following from ORCID profiles:
+    * Author information: ORCID, family name, given name, other names, credit name
+    * Work information: Persistent identifiers (DOI, PMC, PMID, arXiv, handle)
+
+2. **ORCID-Work Pair Creation:** For each work identified by a persistent identifier (PID), an ORCID-Work pair is created. For example:
+    * ``
+    * ``
+
+#### Grouping by Work Persistent Identifier
+
+ORCID-Work pairs are grouped by the work's persistent identifier to identify multiple authors contributing to the same work. This results in structures like:
+* ``
+
+**Note:**
+* `orcidx`: ORCID identifier with associated author name information.
+* `doix`: Persistent identifier schema and value (e.g., `<"doi", "10....">`).
+
+#### Matching with Graph and Enriching Author Metadata
+
+1. **Graph Search:** For each ORCID-Work pair, OpenAIRE searches the Graph for a corresponding result based on the persistent identifier.
+2. **Author Matching:** Potential authors within the graph result are compared to ORCID profile authors using an *author name disambiguation* algorithm.
+3. **Metadata Enrichment:** Successful matches enrich the graph's author information with the ORCID identifier.
+
+#### Author Name Disambiguation Algorithm
+
+The algorithm compares authors from the graph and ORCID profiles for the same persistent identifier. It employs the following matching strategies in decreasing order of confidence:
+
+1. **Exact Full Name Match:** Matches full names (given name + family name) directly.
+2. **Exact Reversed Full Name Match:** Matches full names with reversed order (family name + given name).
+3. **Ordered Token Match:** Compares author names tokenized into individual words, considering word order and allowing for variations (e.g., abbreviations).
+4. **Exact Credit Name Match:** Matches the graph author's full name with the ORCID author's credit name.
+5. **Exact Other Names Match:** Matches the graph author's full name with ORCID author's other names.
+
+Upon finding a match, the graph author's information is enriched with ORCID data, and the matched ORCID author is removed from the comparison list. This process continues until no more matches are found.
+
+**Example:**
+
+Consider the following author lists:
+
+* **Graph List:** Robert Stein, Sjoert van Velzen, Marek Kowalski, ...
+* **ORCID List:** Marek Kowalski, Itai Sfaradi, James Carl Miller-Jones, ...
+
+The algorithm applies matching strategies sequentially, starting with exact full name matches and progressing to ordered token matching. For instance, "Marek Kowalski" would be matched using the exact full name strategy.
+
+**By combining these approaches, OpenAIRE improves the accuracy of author identification and linking.**
\ No newline at end of file
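
The matching cascade described in the new `orcid-alternative.md` page can be illustrated with a minimal Python sketch. This is not OpenAIRE's implementation: the `OrcidAuthor` class, the `normalize`, `ordered_token_match`, and `enrich_authors` helpers, the strategy labels, and the placeholder ORCID iDs are all hypothetical names introduced for illustration. In particular, the "ordered token" rule shown here (same number of tokens, each pair equal or one an initial of the other) is only one plausible reading of "allowing for variations (e.g., abbreviations)".

```python
from dataclasses import dataclass, field


@dataclass
class OrcidAuthor:
    """Hypothetical container for the author fields extracted from an ORCID profile."""
    orcid: str
    given_name: str
    family_name: str
    credit_name: str = ""
    other_names: list = field(default_factory=list)


def normalize(name: str) -> str:
    # Lowercase, drop dots and commas, collapse whitespace (assumed normalization).
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def ordered_token_match(a: str, b: str) -> bool:
    # Assumed rule: same token count, and each pair of tokens is equal
    # or one of them is a single-letter initial of the other.
    ta, tb = normalize(a).split(), normalize(b).split()
    if not ta or len(ta) != len(tb):
        return False
    return all(
        x == y
        or (len(x) == 1 and y.startswith(x))
        or (len(y) == 1 and x.startswith(y))
        for x, y in zip(ta, tb)
    )


# Strategies in decreasing order of confidence, mirroring the list in the document.
STRATEGIES = [
    ("exact_full_name",
     lambda g, o: normalize(g) == normalize(f"{o.given_name} {o.family_name}")),
    ("exact_reversed_full_name",
     lambda g, o: normalize(g) == normalize(f"{o.family_name} {o.given_name}")),
    ("ordered_token",
     lambda g, o: ordered_token_match(g, f"{o.given_name} {o.family_name}")),
    ("exact_credit_name",
     lambda g, o: bool(o.credit_name) and normalize(g) == normalize(o.credit_name)),
    ("exact_other_names",
     lambda g, o: any(normalize(g) == normalize(n) for n in o.other_names)),
]


def enrich_authors(graph_authors, orcid_authors):
    """Return {graph author name: (orcid, strategy label)} for every match found."""
    remaining = list(orcid_authors)  # matched ORCID authors are removed from this list
    matches = {}
    for label, matcher in STRATEGIES:
        for graph_name in graph_authors:
            if graph_name in matches:
                continue
            for candidate in remaining:
                if matcher(graph_name, candidate):
                    matches[graph_name] = (candidate.orcid, label)
                    remaining.remove(candidate)
                    break  # safe: we stop iterating right after the removal
    return matches


if __name__ == "__main__":
    graph_list = ["Robert Stein", "Sjoert van Velzen", "Marek Kowalski"]
    orcid_list = [
        OrcidAuthor("0000-0000-0000-0001", "Marek", "Kowalski"),  # placeholder iD
        OrcidAuthor("0000-0000-0000-0002", "Itai", "Sfaradi"),  # placeholder iD
        OrcidAuthor("0000-0000-0000-0003", "James Carl", "Miller-Jones"),  # placeholder iD
    ]
    print(enrich_authors(graph_list, orcid_list))
```

Running each strategy over all graph authors before moving to the next one keeps higher-confidence matches from being pre-empted by weaker ones, which is one way to honor the "decreasing order of confidence" described above.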