[Orcid Enrichment] update to the last part of the text for the author name disambiguation algo

2024-07-26 11:04:31 +02:00 · 2024-07-26 11:04:31 +02:00 · f0d9b74ba5
parent 9c3ae8f47e
commit f0d9b74ba5
1 changed files with 50 additions and 12 deletions
--- a/docs/graph-production-workflow/enrichment-by-pid/orcid-enrichment.md
+++ b/docs/graph-production-workflow/enrichment-by-pid/orcid-enrichment.md
@ -11,7 +11,7 @@ The following steps describe the pipeline to enrich the author information in th
 ### Extracting Author and Work Information and creating ORCID-Work pairs
 OpenAIRE extracts the following information from each ORCID profile:
 - Author information: ORCID, family name, given name, other names, and credit name.
- Work information: Persistent identifiers (DOI, PMC, PMID, arXiv) associated with the profile.
+- Work information: Persistent identifiers (DOI, PMC, PMID, arXiv, handle) associated with the profile.

 For each work identified by a persistent identifier (PID), a pair is created linking the ORCID to the work PID. For
 example, if an ORCID profile (orcid1) has a DOI (doi1) and a PMC (pmc1) associated with it, the following pairs are generated:
@ -35,7 +35,8 @@ to establish the correct matches. Successful matches allow OpenAIRE to enrich th
 ORCID identifier from the profile.

 ### Author name disambiguation algorithm
-The process involves comparing authors from two sets: those extracted from the graph (graph authors) and those derived from ORCID profiles (ORCID authors) that share the same persistent identifier pair.
+The process involves comparing authors from two sets: those extracted from the graph (graph authors) and those derived 
+from ORCID profiles (ORCID authors) that share the same persistent identifier pair.
 For each graph author, the algorithm iterates through the following matching strategies, ordered by decreasing confidence:
 - Exact fullname match: If the full name of a graph author exactly matches the full name (constructed by concatenating the author given name and family name) of one author in the ORCID list, a match is found.
 - Exact reversed fullname match: Similar to the previous strategy, but the ORCID full name is constructed by concatenating family name and given name.
@ -43,7 +44,8 @@ For each graph author, the algorithm iterates through the following matching str
 - Exact match of ORCID credit name: If the graph author's full name matches an ORCID author's credit name, a match is considered.
 - Exact match of ORCID other names: The graph author's full name is compared to each other name listed in the ORCID profile.

-Upon identifying a match, the graph author's information is enriched with the corresponding ORCID data, and the matched ORCID author is removed from the comparison list. This process continues until no further matches can be found.
+Upon identifying a match, the graph author's information is enriched with the corresponding ORCID data, and the matched 
+ORCID author is removed from the comparison list. This process continues until no further matches can be found.

 By applying this multi-faceted approach, OpenAIRE aims to maximize the accuracy of author identification and linking.

@ -74,7 +76,27 @@ This step accomodates variation in name formatting. As before if an exact match
 graph author, and the ORCID author is removed from the list for subsequent comparisons. With this strategy we can find a match for
 Albert Kong.

-The third step is the application of the *Oredered token match* stratedy to the remaining authors to be matched:
+The third step is the application of the *Ordered token match* strategy to the remaining authors to be matched. Before looking at
+a running example, let us describe how the strategy works.
+
+The tokens from the two lists are pairwise compared. The outcome of each comparison falls into one of three categories:
+- No Match: This occurs when the initial characters of the compared tokens differ, or when the entire words don't match despite sharing the same starting character. In the latter case the authors are considered different, and the comparison process terminates.
+- Short Match: A short match happens when both tokens begin with the same character, but one token consists solely of that character.
+- Long Match: Exact correspondence between the two compared words.
+
+When a no match is caused by different initial characters, the comparison does not stop: the algorithm skips the
+lexicographically lower token and compares the next token of that list with the current token of the other list. This makes
+the strategy tolerant of words that are missing from one of the two names.
+
+A successful match (short or long) moves the comparison to the subsequent tokens in both lists.
+This iterative process continues until either a no match is determined or both token lists have been exhausted.
+
+If both lists have been exhausted, a match is found if:
+- At least one long match exists
+- The sum of short and long matches equals the length of the shorter token list, indicating that all the words
+  in the shorter list have a match in the longer one.
+
+Going back to the example, the authors still to be matched are:
 - Graph List: Robert Stein, Sjoert van Velzen, Anna Franckowiak, James C. A. Miller-Jones, Sara Frederick
 - Orcid List: James Carl Miller-Jones

@ -86,11 +108,27 @@ So the two names are broken down into individual words or token that sorted alph
 graph = A C James Miller-Jones
 orcid = Carl James Miller-Jones

-The strategy scroll the list of words until it finds a couple of words with the same starting char. Since the lists are ordered
-only the alphabetically lower word is discarded if the chars are different.
-So the first words to compare are A and Carl. They have different starting char so the index of the list with the word lower in the alphabetical order is increased.
-The comparison now moves to C and Carl. They have the same starting char. One of the two words is composed by only the character, so it is considered to be a short match (i.e.
-one word matches with just one character) and we proceed in the check with the next two words. Now we compare James with James, exact long match, and then Miller-Jones with Miller-Jones
-another exact long match. At the end of the list of words we got one short match and two long matches, since the number of long matches is bigger than zero, and the sum of the long and short matches
-equals the length of the shorted list of words, we can state we found a match.
-And here we stop the execution since the complete orcid list was matched.
+The comparison process works as follows:
+
+- *A* and *Carl* are compared. No match, since the initial characters are different. The graph list advances one step for the next comparison.
+- *C* and *Carl* are compared. A short match is detected, since both start with the same character and the graph word is only that character. Both lists advance one step for the next comparison.
+- *James* and *James* are compared. A long match is detected. Both lists advance one step for the next comparison.
+- *Miller-Jones* and *Miller-Jones* are compared. A long match is found. The lists are exhausted and the computation ends.
+
+Since at least one long match exists and the sum of long and short matches equals the length of the shorter list, the match is confirmed and
+the graph author can be enriched with the ORCID information.
+
+The ORCID list is empty after the application of the third strategy, so the author name disambiguation process ends.
+
+Note: the remaining two strategies can be reduced to the application of the *Exact name match* strategy.
+Note: Even if the third strategy can subsume the first two, they are applied before the third for efficiency.
+In this way,
+we can claim a match as soon as the first pair of matching names is found. Applying only the third strategy, all the comparisons would have to be performed and
+a way to determine the best match would have to be found before claiming a match.
+Example:
+
+graph = Mario Enrico Rossi, Mario Rossi
+ORCID = Mario Rossi
+
+As you can see, applying only the third strategy, we would associate Mario Rossi's ORCID with Mario Enrico Rossi if he appeared first in the author list.
+
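
The *Ordered token match* strategy described in this change can be sketched in Python as follows. This is only an illustrative sketch, not the OpenAIRE implementation: the function names `tokenize` and `ordered_token_match`, the lower-casing, the stripping of periods and commas, and the choice to count two identical single-character tokens as a short match are all assumptions made for the example.

```python
def tokenize(name):
    # Assumption: lower-case, treat periods and commas as separators, sort alphabetically.
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return sorted(cleaned.split())


def ordered_token_match(graph_name, orcid_name):
    g, o = tokenize(graph_name), tokenize(orcid_name)
    i = j = 0
    short_matches = long_matches = 0

    while i < len(g) and j < len(o):
        a, b = g[i], o[j]
        if a[0] != b[0]:
            # Different initials: skip the lexicographically lower token.
            if a < b:
                i += 1
            else:
                j += 1
        elif len(a) == 1 or len(b) == 1:
            # Same initial and one token is just that character: short match.
            short_matches += 1
            i += 1
            j += 1
        elif a == b:
            # Exact correspondence between full words: long match.
            long_matches += 1
            i += 1
            j += 1
        else:
            # Same initial but different words: the authors differ.
            return False

    shorter = min(len(g), len(o))
    return long_matches >= 1 and short_matches + long_matches == shorter
```

With the running example, `ordered_token_match("James C. A. Miller-Jones", "James Carl Miller-Jones")` skips *A*, records one short match (*C* and *Carl*) and two long matches, and so returns `True`. The call `ordered_token_match("Mario Enrico Rossi", "Mario Rossi")` also returns `True`, which illustrates why the exact match strategies are tried first.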