From 9c3ae8f47eaecb4e94814f7c3e86258a28e917a6 Mon Sep 17 00:00:00 2001
From: Claudio Atzori <claudio.atzori@isti.cnr.it>
Date: Fri, 26 Jul 2024 10:03:21 +0200
Subject: [PATCH] WIP: added enrichment by PID section, ORCID enrichment

---
 .../non-compatible-sources/orcid.md           |  4 +-
 .../enrichment-by-pid/enrichment-by-pid.md    |  8 ++
 .../enrichment-by-pid/orcid-enrichment.md     | 96 +++++++++++++++++++
 sidebars.js                                   |  8 ++
 4 files changed, 114 insertions(+), 2 deletions(-)
 create mode 100644 docs/graph-production-workflow/enrichment-by-pid/enrichment-by-pid.md
 create mode 100644 docs/graph-production-workflow/enrichment-by-pid/orcid-enrichment.md
diff --git a/docs/graph-production-workflow/aggregation/non-compatible-sources/orcid.md b/docs/graph-production-workflow/aggregation/non-compatible-sources/orcid.md
index 0a10acb..321228f 100644
--- a/docs/graph-production-workflow/aggregation/non-compatible-sources/orcid.md
+++ b/docs/graph-production-workflow/aggregation/non-compatible-sources/orcid.md
@@ -57,7 +57,7 @@ For a more extensive description of the different fields and the schema of the r
 
 ## Process
 
-In the following we describe the process applied to the ORCID contents.
+The information obtained from ORCID is used to enrich the Graph, in particular to add author identifiers to the results that do not provide one.
+This process is described in the [enrichment by PID](/graph-production-workflow/enrichment-by-pid/orcid-enrichment) section.
 
-### ... 
 
diff --git a/docs/graph-production-workflow/enrichment-by-pid/enrichment-by-pid.md b/docs/graph-production-workflow/enrichment-by-pid/enrichment-by-pid.md
new file mode 100644
index 0000000..7e7f326
--- /dev/null
+++ b/docs/graph-production-workflow/enrichment-by-pid/enrichment-by-pid.md
@@ -0,0 +1,8 @@
+import DocCardList from '@theme/DocCardList';
+
+
+# Enrichment by PID
+
+
+
+<DocCardList></DocCardList>
\ No newline at end of file
diff --git a/docs/graph-production-workflow/enrichment-by-pid/orcid-enrichment.md b/docs/graph-production-workflow/enrichment-by-pid/orcid-enrichment.md
new file mode 100644
index 0000000..b438915
--- /dev/null
+++ b/docs/graph-production-workflow/enrichment-by-pid/orcid-enrichment.md
@@ -0,0 +1,96 @@
+# Enrichment from ORCID
+
+OpenAIRE collects the ORCID dataset and uses it to enrich the metadata of the results by adding ORCID persistent
+identifiers to their authors.
+
+
+## How does the enrichment work?
+
+The following steps describe the pipeline that enriches the author information in the graph with the identifiers taken from ORCID.
+
+### Extracting Author and Work Information and creating ORCID-Work pairs
+OpenAIRE extracts the following information from each ORCID profile:
+- Author information: ORCID, family name, given name, other names, and credit name.
+- Work information: Persistent identifiers (DOI, PMC, PMID, arXiv) associated with the profile.
+
+For each work identified by a persistent identifier (PID), a pair is created linking the ORCID to the work PID. For
+example, if an ORCID profile (orcid1) has a DOI (doi1) and a PMC (pmc1) associated with it, the following pairs are generated:
+- P1: `<orcid1, doi1>`
+- P2: `<orcid1, pmc1>`
+
+### Grouping by work persistent identifier
+Once all ORCID-Work pairs are created, they are grouped by the work's persistent identifier. This makes it possible to identify
+multiple authors contributing to the same work. For instance, if two ORCIDs (orcid1 and orcid2) are associated with
+the same DOI (doi1), the structure `<doi1, [orcid1, orcid2]>` is created.
+
+Note: The term "orcidx" refers to a structure containing the ORCID identifier along with the author's name information
+(family name, given name, other names, and credit name) as extracted from the ORCID profile. The term "doix" refers to a structure
+containing the schema and value of the persistent identifier. In the example, "doix" is `<"doi","10....">`.
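+
+The pairing and grouping steps can be sketched as follows. This is a minimal illustration, not the actual OpenAIRE implementation: the function name `group_by_work_pid` and the use of plain tuples for the "orcidx" and "doix" structures are assumptions made for the example.
+
+```python
+from collections import defaultdict
+
+
+def group_by_work_pid(pairs):
+    """Group <orcid, work PID> pairs by work PID (hypothetical helper)."""
+    grouped = defaultdict(list)
+    for orcid, pid in pairs:
+        grouped[pid].append(orcid)
+    return dict(grouped)
+
+
+# Placeholder values mirroring the example: a PID is a (schema, value) tuple.
+pairs = [
+    ("orcid1", ("doi", "doi1")),
+    ("orcid1", ("pmc", "pmc1")),
+    ("orcid2", ("doi", "doi1")),
+]
+groups = group_by_work_pid(pairs)
+```
+
+Here `groups[("doi", "doi1")]` holds both `orcid1` and `orcid2`, which corresponds to the structure described above.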
+
+### Matching with the Graph result and enriching the author metadata
+For each persistent identifier pair, OpenAIRE searches for a corresponding result in the Graph based on the pair's
+schema and value. Once a match is found, OpenAIRE attempts to identify the corresponding authors within the result by
+comparing them to the authors listed in the ORCID profile. This process employs an algorithm called *author name disambiguation*
+to establish the correct matches. Successful matches allow OpenAIRE to enrich the result's author information with the
+ORCID identifier from the profile.
+
+### Author name disambiguation algorithm
+The process involves comparing authors from two sets: those extracted from the graph (graph authors) and those derived from ORCID profiles (ORCID authors) that share the same persistent identifier pair.
+For each graph author, the algorithm iterates through the following matching strategies, ordered by decreasing confidence:
+- Exact fullname match: If the full name of a graph author exactly matches the full name (constructed by concatenating the author given name and family name) of one author in the ORCID list, a match is found.
+- Exact reversed fullname match: Similar to the previous strategy, but the ORCID full name is constructed by concatenating family name and given name.
+- Ordered token match: Author names are tokenized into individual words. These tokens are then ordered and compared for matches or abbreviations. This strategy is applied to names with at least two words, when the difference in the number of words between the two names is two or less. It allows for variability in the name (an example is provided in the following).
+- Exact match of ORCID credit name: If the graph author's full name matches an ORCID author's credit name, a match is considered.
+- Exact match of ORCID other names: The graph author's full name is compared to each other name listed in the ORCID profile.
+
+Upon identifying a match, the graph author's information is enriched with the corresponding ORCID data, and the matched ORCID author is removed from the comparison list. This process continues until no further matches can be found.
+
+By applying this multi-faceted approach, OpenAIRE aims to maximize the accuracy of author identification and linking.
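+
+The cascade of exact-match strategies can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual OpenAIRE code: the `OrcidAuthor` class, the `enrich` function and the placeholder identifiers are hypothetical, strategies are applied one after the other over the whole list (as in the example that follows), and the ordered token match is left out for brevity.
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class OrcidAuthor:
+    # Hypothetical stand-in for the "orcidx" structure.
+    orcid: str
+    given_name: str
+    family_name: str
+    credit_name: str = ""
+    other_names: list = field(default_factory=list)
+
+
+# Exact-match strategies, ordered by decreasing confidence.
+STRATEGIES = [
+    lambda g, o: g == f"{o.given_name} {o.family_name}",  # exact fullname
+    lambda g, o: g == f"{o.family_name} {o.given_name}",  # exact reversed fullname
+    lambda g, o: bool(o.credit_name) and g == o.credit_name,  # credit name
+    lambda g, o: g in o.other_names,  # other names
+]
+
+
+def enrich(graph_authors, orcid_authors):
+    """Return {graph author full name: ORCID} for the matches found."""
+    remaining = list(orcid_authors)
+    matched = {}
+    for strategy in STRATEGIES:
+        for graph_name in graph_authors:
+            if graph_name in matched:
+                continue
+            for candidate in remaining:
+                if strategy(graph_name, candidate):
+                    matched[graph_name] = candidate.orcid
+                    # A matched ORCID author is not compared again.
+                    remaining.remove(candidate)
+                    break
+    return matched
+
+
+orcid_list = [
+    OrcidAuthor("id-kowalski", "Marek", "Kowalski"),
+    OrcidAuthor("id-kong", "Kong", "Albert"),
+]
+result = enrich(["Robert Stein", "Marek Kowalski", "Albert Kong"], orcid_list)
+```
+
+With these placeholder inputs, `Marek Kowalski` is matched by the first strategy and `Albert Kong` by the reversed one, while `Robert Stein` stays unmatched.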
+
+#### Author name disambiguation example
+Consider the following author lists:
+- Graph List: Robert Stein, Sjoert van Velzen, Marek Kowalski, Anna Franckowiak, James C. A. Miller-Jones, Sara Frederick, Itai Sfaradi, Assaf Horesh, Albert Kong, Ryan Foley
+- ORCID List: Marek Kowalski, Itai Sfaradi, James Carl Miller-Jones, Assaf Horesh, Kong Albert, Ryan Foley
+
+The graph list contains the full names of the authors as found in the metadata. Any potential ambiguities in splitting names into components (like first name and last name) are addressed by the first three steps.
+The ORCID list names are expressed as the concatenation of the given name and the family name as provided in the ORCID profile
+(e.g. "Kong Albert" means that Kong is the given name and Albert is the family name in the ORCID profile). For simplicity, other names and credit names are excluded from this list, since the corresponding strategies can be assimilated to an exact match comparison.
+
+Algorithm Application
+
+First of all the *Exact fullname match* strategy is applied.
+Each graph author's full name is compared to every full name in the ORCID list until a match is found. A full name in the
+ORCID list is constructed by concatenating the given name and family name in the order provided.
+If an exact match is found, the ORCID identifier is used to enrich the corresponding graph's author record, and the ORCID author
+is removed from the list for subsequent comparisons.
+By applying this strategy we can find a match for Marek Kowalski, Itai Sfaradi, Assaf Horesh, and Ryan Foley.
+
+Then the *Exact reversed fullname match* strategy is applied to the graph and ORCID authors that have not been matched in the previous step:
+- Graph List: Robert Stein, Sjoert van Velzen, Anna Franckowiak, James C. A. Miller-Jones, Sara Frederick, Albert Kong
+- ORCID List: James Carl Miller-Jones, Kong Albert
+
+The process is similar to step one, but the ORCID fullname is constructed by reversing the order of given name and family name.
+This step accommodates variation in name formatting. As before, if an exact match is found, the ORCID identifier is used to update the metadata of the
+graph author, and the ORCID author is removed from the list for subsequent comparisons. With this strategy we can find a match for
+Albert Kong.
+
+The third step is the application of the *Ordered token match* strategy to the remaining authors to be matched:
+- Graph List: Robert Stein, Sjoert van Velzen, Anna Franckowiak, James C. A. Miller-Jones, Sara Frederick
+- ORCID List: James Carl Miller-Jones
+
+Let us consider directly the names that can be matched by this strategy:
+- graph name = James C. A. Miller-Jones
+- orcid name = James Carl Miller-Jones
+
+The two names are broken down into individual words, or tokens, which are then sorted alphabetically to standardize the comparison process:
+- graph = A C James Miller-Jones
+- orcid = Carl James Miller-Jones
+
+The strategy scans the two lists of words until it finds a pair of words with the same starting character. Since the lists are ordered,
+when the starting characters differ only the alphabetically lower word is discarded.
+The first words to compare are A and Carl. They have different starting characters, so the index of the list holding the alphabetically lower word is increased.
+The comparison now moves to C and Carl. They have the same starting character, and one of the two words consists of a single character, so this is considered a short match (i.e.
+one word matches with just one character), and the check proceeds with the next two words. James is compared with James, an exact long match, and then Miller-Jones with Miller-Jones,
+another exact long match. At the end of the lists of words we have one short match and two long matches. Since the number of long matches is greater than zero, and the sum of the long and short matches
+equals the length of the shortest list of words, we can state that a match is found.
+The execution stops here, since the complete ORCID list has been matched.
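+
+The walk-through above can be sketched in code as follows. This is a minimal sketch, not the actual OpenAIRE implementation: the function names are hypothetical, and the tokenization details (lowercasing, stripping dots) as well as the handling of two different multi-character words sharing the same initial are assumptions not stated in the description.
+
+```python
+def tokenize(name):
+    """Split a name into lowercase words without dots, sorted alphabetically."""
+    tokens = (word.strip(".").lower() for word in name.split())
+    return sorted(token for token in tokens if token)
+
+
+def ordered_token_match(graph_name, orcid_name, max_word_diff=2):
+    a, b = tokenize(graph_name), tokenize(orcid_name)
+    # Applied only to names with at least two words and a small length gap.
+    if min(len(a), len(b)) < 2 or abs(len(a) - len(b)) > max_word_diff:
+        return False
+    i = j = short_matches = long_matches = 0
+    while i < len(a) and j < len(b):
+        x, y = a[i], b[j]
+        if x[0] == y[0] and x == y and len(x) > 1:
+            long_matches += 1  # identical full words
+            i, j = i + 1, j + 1
+        elif x[0] == y[0] and (len(x) == 1 or len(y) == 1):
+            short_matches += 1  # an initial matches a word
+            i, j = i + 1, j + 1
+        elif x < y:
+            # Different words: discard the alphabetically lower one.
+            # (Assumption: also used for different words sharing an initial.)
+            i += 1
+        else:
+            j += 1
+    return long_matches > 0 and long_matches + short_matches == min(len(a), len(b))
+
+
+is_match = ordered_token_match("James C. A. Miller-Jones", "James Carl Miller-Jones")
+```
+
+For the two names of the example, the sketch counts one short match (C against Carl) and two long matches (James and Miller-Jones), so `is_match` is true.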
diff --git a/sidebars.js b/sidebars.js
index 6934839..1b93bfd 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -153,6 +153,14 @@ const sidebars = {
           type: 'doc', 
           id: 'graph-production-workflow/merge-by-id'
         },
+        {
+          type: 'category',
+          label: "Enrichment by PID",
+          link: {type: 'doc', id: 'graph-production-workflow/enrichment-by-pid/enrichment-by-pid'},
+          items: [
+              { type: 'doc', id: 'graph-production-workflow/enrichment-by-pid/orcid-enrichment' }
+            ]
+        },
         {
           type: 'category', 
           label: "Enrichment by mining",