Change the test for author merger

[DOIBoost author merger] merge with beta and added new test to verify the match for 0000-0002-4333-2748
[DOIBoostAuthorMerger] merge with beta
2023-05-24 15:21:36 +02:00 · 2023-05-23 15:44:03 +02:00 · 2022-01-10 15:37:10 +01:00 · 2021-11-25 10:57:06 +01:00 · 2021-11-22 16:55:02 +01:00 · 2021-11-22 16:54:27 +01:00
16 changed files with 1014 additions and 37 deletions
--- a/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/AuthorAssoc.java
+++ b/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/AuthorAssoc.java
@ -0,0 +1,53 @@
+
+package eu.dnetlib.doiboost;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+
+import eu.dnetlib.dhp.schema.oaf.Author;
+
+/**
+ * This class stores the association information between the enriching author and the possibly enriched ones.
+ * It also contains the value of the similarity score between the enriching author and the possibly enriched ones.
+ * Possibly enriched authors with the same similarity score with the enriching are put in the to_be_enriched list.
+ */
+public class AuthorAssoc implements Serializable {
+	private Double score;
+	private List<Author> to_be_enriched;
+	private Author with_enricheing_content;
+
+	public Double getScore() {
+		return score;
+	}
+
+	public void setScore(Double score) {
+		this.score = score;
+	}
+
+	public List<Author> getTo_be_enriched() {
+		return to_be_enriched;
+	}
+
+	public void setTo_be_enriched(List<Author> to_be_enriched) {
+		this.to_be_enriched = to_be_enriched;
+	}
+
+	public Author getWith_enricheing_content() {
+		return with_enricheing_content;
+	}
+
+	public void setWith_enricheing_content(Author with_enricheing_content) {
+		this.with_enricheing_content = with_enricheing_content;
+	}
+
+	public static AuthorAssoc newInstance(Author a) {
+		AuthorAssoc ret = new AuthorAssoc();
+		ret.score = 0.0;
+		ret.to_be_enriched = new ArrayList<>();
+		ret.with_enricheing_content = a;
+
+		return ret;
+
+	}
+}
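Review note: the update rule that `AuthorAssoc` is meant to support (keep only the candidates sharing the highest score, append on ties) can be sketched in isolation. This is a minimal sketch, not the PR's code: `AssocSketch` and `offer` are hypothetical names, and plain strings stand in for `Author` objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for AuthorAssoc: names are plain strings, not Author objects.
public class AssocSketch {
    double score = 0.0;
    final List<String> toBeEnriched = new ArrayList<>();
    final String enriching;

    AssocSketch(String enriching) {
        this.enriching = enriching;
    }

    // Keep only the candidates that share the highest score seen so far.
    void offer(String candidate, double candidateScore) {
        if (candidateScore > score) {
            // strictly better: drop the previous candidates
            score = candidateScore;
            toBeEnriched.clear();
            toBeEnriched.add(candidate);
        } else if (candidateScore > 0 && candidateScore == score) {
            // tie with the current best: remember both
            toBeEnriched.add(candidate);
        }
        // lower score: ignored
    }

    public static void main(String[] args) {
        AssocSketch assoc = new AssocSketch("Maria Rossi");
        assoc.offer("M. Rossi", 0.5);
        assoc.offer("Mario Bianchi", 0.2);
        assoc.offer("Rossi M.", 0.5);
        System.out.println(assoc.score + " " + assoc.toBeEnriched);
    }
}
```

After the three calls the association holds the two candidates tied at 0.5, which is the "ambiguous" case that the merger later resolves by counting common words.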
--- a/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/DoiBoostAuthorMerger.java
+++ b/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/DoiBoostAuthorMerger.java
@ -0,0 +1,301 @@
+
+package eu.dnetlib.doiboost;
+
+import java.text.Normalizer;
+import java.util.*;
+import java.util.stream.Collectors;
+
+import com.wcohen.ss.Jaccard;
+import com.wcohen.ss.JaroWinkler;
+
+import eu.dnetlib.dhp.schema.oaf.Author;
+import eu.dnetlib.dhp.schema.oaf.Result;
+import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
+import eu.dnetlib.dhp.utils.DHPUtils;
+import scala.Tuple2;
+
+/**
+ * This is a version of the AuthorMerger specific for DoiBoost.
+ * Here we suppose a match must exist for the authors, since the records share the same DOI. We compare each author
+ * in the list that should be enriched with each author in the enriching list via a name similarity measure
+ * (Jaccard, see sim()). For each enriching author we select the best match, that is the author with the highest
+ * similarity score.
+ * The association is done from the enriching author to the enriched one, because in this way only one match per
+ * enriching author can be found. One enriching author can have the same maximum similarity score with more than
+ * one author to be enriched.
+ *
+ * The idea is to enrich the most similar authors having at least one word of the name in common.
+ *
+ * A data structure (AuthorAssoc) stores the association between the enriching and the enriched authors.
+ * It contains the list of authors that can possibly be enriched, the enriching author, and the similarity score
+ * between the enriching author and the enriched ones. It is the value of a map keyed by the (md5 of the) fullname
+ * of the enriching author.
+ *
+ * For each author that can be enriched, we check whether the map entry of the enriching author is currently
+ * associated with an author having a lower similarity score. If so, we replace the previously associated author
+ * with the new one, which has a higher score. If the score is the same, we add the new author to the list of
+ * authors associated with the enriching author.
+ *
+ * At the end, in the easy case each entry is associated with a single author to be enriched: we verify that at
+ * least one word is shared by the two names and, if so, we add the missing pids of the enriching author to the
+ * author to be enriched.
+ *
+ * In the bad case there are several authors to be enriched with the same similarity: we enrich the one whose
+ * fullname shares the largest number of words with the enriching author. In case of a tie, no author is enriched.
+ *
+ * Remember that we assume a match must exist, since we have the same DOI. Consequently, having the highest
+ * similarity score among the candidates, plus one word of the name in common, should be sufficient to enrich an
+ * author.
+ *
+ * Homonymy cases, which could cause problems with the author ranks, are not mapped.
+ */
+
+public class DoiBoostAuthorMerger {
+
+	public static List<Author> merge(List<List<Author>> authors, Boolean crossref) {
+
+		Iterator<List<Author>> it = authors.iterator();
+		List<Author> author = it.next();
+
+		while (it.hasNext()) {
+			List<Author> autList = it.next();
+			Tuple2<List<Author>, Boolean> tmp = mergeAuthor(author, autList, crossref);
+			author = tmp._1();
+			crossref = tmp._2();
+		}
+
+		return author;
+
+	}
+
+	// If we have a list of authors coming from crossref we take that and we enrich it
+	// If we do not have a list of authors coming from crossref we enrich the longest at each step
+	public static Tuple2<List<Author>, Boolean> mergeAuthor(final List<Author> baseAuthor,
+		final List<Author> otherAuthor,
+		final Boolean crossref) {
+
+		if (baseAuthor == null || baseAuthor.size() == 0)
+			return new Tuple2<>(otherAuthor, false);
+		if (otherAuthor == null || otherAuthor.size() == 0)
+			return new Tuple2<>(baseAuthor, crossref);
+
+		if (crossref) {
+			enrichPidFromList(baseAuthor, otherAuthor);
+			return new Tuple2<>(baseAuthor, true);
+		} else if (baseAuthor.size() > otherAuthor.size()) {
+			enrichPidFromList(baseAuthor, otherAuthor);
+			return new Tuple2<>(baseAuthor, false);
+		} else {
+			enrichPidFromList(otherAuthor, baseAuthor);
+			return new Tuple2<>(otherAuthor, false);
+		}
+
+	}
+
+	// to be evaluated whether this should be inverted: for each enriching author we should take the most similar
+	// one in the base list, not the contrary
+	private static void enrichPidFromList(List<Author> base, List<Author> enrich) {
+
+		// search authors having identifiers in the enrich list
+		final List<Author> authorsWithPids = enrich
+			.stream()
+			.filter(a -> a.getPid() != null && a.getPid().size() > 0)
+			.collect(Collectors.toList());
+
+		Map<String, AuthorAssoc> assocMap = authorsWithPids
+			.stream()
+			.map(
+				a -> new Tuple2<>(DHPUtils.md5(a.getFullname()), AuthorAssoc.newInstance(a)))
+			.collect(Collectors.toMap(Tuple2::_1, Tuple2::_2, (x1, x2) -> x1));
+
+		Map<String, Tuple2<String, Tuple2<List<String>, Double>>> baseAssoc = new HashMap<>();
+
+		// for each author in the base list, we search the best enriching match
+		// we create the association (author, list of (enriching author, similarity score))
+		base
+			.stream()
+			.map(
+				a -> new Tuple2<>(a,
+					authorsWithPids
+						.stream()
+						.map(e -> new Tuple2<>(e, sim(a, e)))
+						.filter(t2 -> t2._2() > 0.0)
+						.collect(Collectors.toList())))
+			.forEach(t2 -> {
+				String base_name = t2._1().getFullname();
+				String base_name_md5 = DHPUtils.md5(t2._1().getFullname());
+				Double max_score = 0.0;
+				List<String> enrich_name = new ArrayList<>();
+				for (Tuple2<Author, Double> t : t2._2()) {
+					// we get the fullname of the enriching
+					String mapEntry = DHPUtils.md5(t._1().getFullname());
+
+					if (t._2() > max_score) {
+						max_score = t._2();
+						enrich_name = new ArrayList<>();
+						enrich_name.add(mapEntry);
+					} else if (t._2() > 0 && t._2().equals(max_score)) {
+						enrich_name.add(mapEntry);
+					}
+
+					AuthorAssoc aa = assocMap.get(mapEntry);
+					if (aa.getScore() < t._2()) {
+						aa.setScore(t._2());
+						aa.setTo_be_enriched(new ArrayList<>());
+						aa.getTo_be_enriched().add(t2._1());
+					} else if (aa.getScore().equals(t._2())) {
+						aa.getTo_be_enriched().add(t2._1());
+					}
+				}
+				if (max_score > 0) {
+					baseAssoc.put(base_name_md5, new Tuple2<>(base_name, new Tuple2<>(enrich_name, max_score)));
+				}
+
+			});
+		List<Tuple2<Double, Tuple2<String, List<String>>>> list = baseAssoc.keySet().stream().map(k -> {
+			Tuple2<String, Tuple2<List<String>, Double>> map_entry = baseAssoc.get(k);
+			return new Tuple2<>(map_entry._2()._2(), new Tuple2<>(map_entry._1(), map_entry._2()._1()));
+		})
+			.collect(Collectors.toList());
+		list.sort(Comparator.comparing(e -> e._1()));
+		// baseAssoc is now sorted by max score; iterate from the highest score down
+		for (int i = list.size() - 1; i >= 0; i--) {
+			Tuple2<Double, Tuple2<String, List<String>>> tmp = list.get(i);
+			List<String> entries = tmp._2()._2();
+			// if size == 1, there is a single enriching author e that has the max score with this author a
+			if (entries.size() == 1) {
+				if (assocMap.containsKey(entries.get(0))) {
+					enrichAuthor(assocMap.get(entries.get(0)));
+					assocMap.remove(entries.get(0));
+				}
+			} else {
+				String author_fullname = tmp._2()._1();
+				long commonWords = 0;
+				String enriching = null;
+				for (String entry : entries) {
+					if (assocMap.containsKey(entry)) {
+						// entry is the md5 of the enriching fullname: compare the actual names, not the hash
+						long words = getCommonWords(
+							normalize(assocMap.get(entry).getWith_enricheing_content().getFullname()),
+							normalize(author_fullname));
+						if (words > commonWords) {
+							commonWords = words;
+							enriching = entry;
+						} else if (words == commonWords) {
+							enriching = null;
+						}
+					}
+
+				}
+				if (enriching != null) {
+					enrichAuthor(assocMap.get(enriching));
+					assocMap.remove(enriching);
+				}
+				// TODO think of a way to enrich this author with the best e
+				// This is the case in which an author has more than one e with the same similarity score
+			}
+		}
+		// assocMap.keySet().forEach(k -> enrichAuthor(assocMap.get(k)));
+
+	}
+
+	private static long getCommonWords(List<String> fullEnrich, List<String> fullEnriching) {
+		return fullEnrich.stream().filter(w -> fullEnriching.contains(w)).count();
+	}
+
+	private static void enrichAuthor(Author enrich, Author enriching) {
+		// verify if some of the words in the fullname are contained in the other
+		// get normalized fullname
+
+		long commonWords = getCommonWords(
+			normalize(enrich.getFullname()),
+			normalize(enriching.getFullname()));
+		if (commonWords > 0) {
+			if (enrich.getPid() == null) {
+				enrich.setPid(new ArrayList<>());
+			}
+			Set<String> aPids = enrich.getPid().stream().map(p -> pidToComparableString(p)).collect(Collectors.toSet());
+			ArrayList<StructuredProperty> newPids = new ArrayList<>();
+			newPids.addAll(enrich.getPid());
+			enriching.getPid().forEach(p -> {
+				String pidToComparableString = pidToComparableString(p);
+				if (!aPids.contains(pidToComparableString)) {
+					newPids.add(p);
+					aPids.add(pidToComparableString);
+				}
+			});
+			enrich.setPid(newPids);
+			if (enrich.getAffiliation() == null) {
+				if (enriching.getAffiliation() != null) {
+					enrich.setAffiliation(enriching.getAffiliation());
+				}
+			}
+		}
+
+	}
+
+	// Verify the number of words in common. The one that has more wins.
+	// If the number of words in common is the same for several candidates,
+	// no author is enriched.
+	private static void enrichAuthor(AuthorAssoc authorAssoc) {
+		if (authorAssoc.getTo_be_enriched().size() == 1) {
+			enrichAuthor(authorAssoc.getTo_be_enriched().get(0), authorAssoc.getWith_enricheing_content());
+		} else {
+			long common = 0;
+			List<Author> selected = new ArrayList<>();
+			for (Author a : authorAssoc.getTo_be_enriched()) {
+				long current_common = getCommonWords(
+					normalize(a.getFullname()),
+					normalize(authorAssoc.getWith_enricheing_content().getFullname()));
+				if (current_common > common) {
+					common = current_common;
+					selected = new ArrayList<>();
+					selected.add(a);
+				} else if (current_common == common) {
+					selected.add(a);
+				}
+			}
+			if (selected.size() == 1) {
+				enrichAuthor(selected.get(0), authorAssoc.getWith_enricheing_content());
+			}
+		}
+
+	}
+
+	public static String pidToComparableString(StructuredProperty pid) {
+		return (pid.getQualifier() != null
+			? pid.getQualifier().getClassid() != null ? pid.getQualifier().getClassid().toLowerCase() : ""
+			: "")
+			+ (pid.getValue() != null ? pid.getValue().toLowerCase() : "");
+	}
+
+	private static Double sim(Author a, Author b) {
+		return new Jaccard()
+			.score(normalizeString(a.getFullname()), normalizeString(b.getFullname()));
+
+	}
+
+	private static String normalizeString(String fullname) {
+		return String.join(" ", normalize(fullname));
+	}
+
+	private static List<String> normalize(final String s) {
+		String[] normalized = nfd(s)
+			.replaceAll("[^\\p{ASCII}]", "")
+			.toLowerCase()
+			// do not compact the regexes into a single expression:
+			// that would cause a StackOverflowError
+			// in case of large input strings
+			.replaceAll("(\\W)+", " ")
+			.replaceAll("(\\p{InCombiningDiacriticalMarks})+", " ")
+			.replaceAll("(\\p{Punct})+", " ")
+			.replaceAll("(\\d)+", " ")
+			.replaceAll("(\\n)+", " ")
+			.trim()
+			.split(" ");
+
+		Arrays.sort(normalized);
+
+		return Arrays.asList(normalized);
+
+	}
+
+	private static String nfd(final String s) {
+		return Normalizer.normalize(s, Normalizer.Form.NFD);
+	}
+}
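Review note: the matching in `DoiBoostAuthorMerger` relies on normalized name words, a Jaccard score and a count of common words. The sketch below illustrates those three pieces with the JDK only. It is an assumption-laden stand-in: `NameMatchSketch`, `words`, `commonWords` and `jaccard` are hypothetical names, the token-set Jaccard written here is not the `com.wcohen.ss.Jaccard` implementation, and the normalization is simpler than the regex chain in the PR (which also drops non-ASCII characters outright).

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the PR: JDK-only name normalization and matching.
public class NameMatchSketch {

    // Strip diacritics, lowercase, replace everything but a-z with spaces, return sorted words.
    static List<String> words(String fullname) {
        String cleaned = Normalizer.normalize(fullname, Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
            .toLowerCase(Locale.ROOT)
            .replaceAll("[^a-z]+", " ")
            .trim();
        return Arrays.stream(cleaned.split(" "))
            .filter(w -> !w.isEmpty())
            .sorted()
            .collect(Collectors.toList());
    }

    // Number of distinct words of a that also appear in b.
    static long commonWords(String a, String b) {
        Set<String> other = new HashSet<>(words(b));
        return words(a).stream().distinct().filter(other::contains).count();
    }

    // Token-set Jaccard: |A intersect B| / |A union B|, 0.0 when both names are empty.
    static double jaccard(String a, String b) {
        Set<String> sa = new HashSet<>(words(a));
        Set<String> sb = new HashSet<>(words(b));
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        if (union.isEmpty()) {
            return 0.0;
        }
        sa.retainAll(sb);
        return (double) sa.size() / union.size();
    }

    public static void main(String[] args) {
        // "N\u00fa\u00f1ez, Jos\u00e9" is "Nunez, Jose" written with accents
        String crossref = "N\u00fa\u00f1ez, Jos\u00e9";
        String orcid = "Nunez J.";
        System.out.println(words(crossref));
        System.out.println(commonWords(crossref, orcid));
        System.out.println(jaccard(crossref, orcid));
    }
}
```

With these two names only `nunez` is shared, so the pair passes the "at least one word in common" check even though the score is low, which is why the merger combines both signals.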
--- a/dhp-workflows/dhp-doiboost/src/main/resources/eu/dnetlib/dhp/doiboost/process/oozie_app/workflow.xml
+++ b/dhp-workflows/dhp-doiboost/src/main/resources/eu/dnetlib/dhp/doiboost/process/oozie_app/workflow.xml
@ -86,7 +86,7 @@
            <case to="ProcessUW">${wf:conf('resumeFrom') eq 'PreprocessUW'}</case>
            <case to="ProcessORCID">${wf:conf('resumeFrom') eq 'ProcessORCID'}</case>
            <case to="CreateDOIBoost">${wf:conf('resumeFrom') eq 'CreateDOIBoost'}</case>
-            <case to="GenerateActionSet">${wf:conf('resumeFrom') eq 'GenerateActionSet'}</case>
+<!--            <case to="GenerateActionSet">${wf:conf('resumeFrom') eq 'GenerateActionSet'}</case>-->
            <default to="ConvertCrossrefToOAF"/>
        </switch>
    </decision>
@ -226,40 +226,40 @@
            <arg>--workingPath</arg><arg>${workingPath}</arg>
            <arg>--master</arg><arg>yarn-cluster</arg>
        </spark>
-        <ok to="GenerateActionSet"/>
-        <error to="Kill"/>
-    </action>
-
-
-    <action name="GenerateActionSet">
-        <spark xmlns="uri:oozie:spark-action:0.2">
-            <master>yarn-cluster</master>
-            <mode>cluster</mode>
-            <name>Generate DOIBoost ActionSet</name>
-            <class>eu.dnetlib.doiboost.SparkGenerateDOIBoostActionSet</class>
-            <jar>dhp-doiboost-${projectVersion}.jar</jar>
-            <spark-opts>
-                --executor-memory=${sparkExecutorMemory}
-                --executor-cores=${sparkExecutorCores}
-                --driver-memory=${sparkDriverMemory}
-                --conf spark.sql.shuffle.partitions=3840
-                --conf spark.extraListeners=${spark2ExtraListeners}
-                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
-                --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
-                --conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
-            </spark-opts>
-            <arg>--dbPublicationPath</arg><arg>${workingPath}/doiBoostPublicationFiltered</arg>
-            <arg>--dbDatasetPath</arg><arg>${workingPath}/crossrefDataset</arg>
-            <arg>--crossRefRelation</arg><arg>${workingPath}/crossrefRelation</arg>
-            <arg>--dbaffiliationRelationPath</arg><arg>${workingPath}/doiBoostPublicationAffiliation</arg>
-            <arg>--dbOrganizationPath</arg><arg>${workingPath}/doiBoostOrganization</arg>
-            <arg>--targetPath</arg><arg>${workingPath}/actionDataSet</arg>
-            <arg>--sFilePath</arg><arg>${outputPath}</arg>
-            <arg>--master</arg><arg>yarn-cluster</arg>
-        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>

+
+<!--    <action name="GenerateActionSet">-->
+<!--        <spark xmlns="uri:oozie:spark-action:0.2">-->
+<!--            <master>yarn-cluster</master>-->
+<!--            <mode>cluster</mode>-->
+<!--            <name>Generate DOIBoost ActionSet</name>-->
+<!--            <class>eu.dnetlib.doiboost.SparkGenerateDOIBoostActionSet</class>-->
+<!--            <jar>dhp-doiboost-${projectVersion}.jar</jar>-->
+<!--            <spark-opts>-->
+<!--                &#45;&#45;executor-memory=${sparkExecutorMemory}-->
+<!--                &#45;&#45;executor-cores=${sparkExecutorCores}-->
+<!--                &#45;&#45;driver-memory=${sparkDriverMemory}-->
+<!--                &#45;&#45;conf spark.sql.shuffle.partitions=3840-->
+<!--                &#45;&#45;conf spark.extraListeners=${spark2ExtraListeners}-->
+<!--                &#45;&#45;conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}-->
+<!--                &#45;&#45;conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}-->
+<!--                &#45;&#45;conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}-->
+<!--            </spark-opts>-->
+<!--            <arg>&#45;&#45;dbPublicationPath</arg><arg>${workingPath}/doiBoostPublicationFiltered</arg>-->
+<!--            <arg>&#45;&#45;dbDatasetPath</arg><arg>${workingPath}/crossrefDataset</arg>-->
+<!--            <arg>&#45;&#45;crossRefRelation</arg><arg>${workingPath}/crossrefRelation</arg>-->
+<!--            <arg>&#45;&#45;dbaffiliationRelationPath</arg><arg>${workingPath}/doiBoostPublicationAffiliation</arg>-->
+<!--            <arg>&#45;&#45;dbOrganizationPath</arg><arg>${workingPath}/doiBoostOrganization</arg>-->
+<!--            <arg>&#45;&#45;targetPath</arg><arg>${workingPath}/actionDataSet</arg>-->
+<!--            <arg>&#45;&#45;sFilePath</arg><arg>${outputPath}</arg>-->
+<!--            <arg>&#45;&#45;master</arg><arg>yarn-cluster</arg>-->
+<!--        </spark>-->
+<!--        <ok to="End"/>-->
+<!--        <error to="Kill"/>-->
+<!--    </action>-->
+
    <end name="End"/>
 </workflow-app>
--- a/dhp-workflows/dhp-doiboost/src/main/scala/eu/dnetlib/doiboost/SparkGenerateDoiBoost.scala
+++ b/dhp-workflows/dhp-doiboost/src/main/scala/eu/dnetlib/doiboost/SparkGenerateDoiBoost.scala
@ -3,7 +3,7 @@ package eu.dnetlib.doiboost
 import eu.dnetlib.dhp.application.ArgumentApplicationParser
 import eu.dnetlib.dhp.oa.merge.AuthorMerger
 import eu.dnetlib.dhp.schema.common.ModelConstants
-import eu.dnetlib.dhp.schema.oaf.{Organization, Publication, Relation, Dataset => OafDataset}
+import eu.dnetlib.dhp.schema.oaf.{Author, Organization, Publication, Relation, Dataset => OafDataset}
 import eu.dnetlib.doiboost.mag.ConversionUtil
 import org.apache.commons.io.IOUtils
 import org.apache.spark.SparkConf
@ -45,6 +45,7 @@ object SparkGenerateDoiBoost {
      )
    )
    parser.parseArgument(args)
+    var crossref: Boolean = true
    val spark: SparkSession =
      SparkSession
        .builder()
@ -59,6 +60,53 @@ object SparkGenerateDoiBoost {
    val workingDirPath = parser.get("workingPath")
    val openaireOrganizationPath = parser.get("openaireOrganizationPath")

+    val crossrefAggregator = new Aggregator[(String, Publication), Publication, Publication] with Serializable {
+      override def zero: Publication = new Publication
+
+      override def reduce(b: Publication, a: (String, Publication)): Publication = {
+
+        if (b == null) {
+          if (a != null && a._2 != null) {
+            a._2.setId(a._1)
+            return a._2
+          }
+        } else {
+          if (a != null && a._2 != null) {
+            b.mergeFrom(a._2)
+            b.setId(a._1)
+            val authors = DoiBoostAuthorMerger.mergeAuthor(b.getAuthor, a._2.getAuthor, true)
+            b.setAuthor(authors._1)
+            return b
+          }
+        }
+        new Publication
+      }
+
+      override def merge(b1: Publication, b2: Publication): Publication = {
+        if (b1 == null) {
+          if (b2 != null)
+            return b2
+        } else {
+          if (b2 != null) {
+            b1.mergeFrom(b2)
+
+            val authors = DoiBoostAuthorMerger.mergeAuthor(b1.getAuthor, b2.getAuthor, true)
+            b1.setAuthor(authors._1)
+            if (b2.getId != null && b2.getId.nonEmpty)
+              b1.setId(b2.getId)
+            return b1
+          }
+        }
+        new Publication
+      }
+
+      override def finish(reduction: Publication): Publication = reduction
+
+      override def bufferEncoder: Encoder[Publication] = Encoders.kryo[Publication]
+
+      override def outputEncoder: Encoder[Publication] = Encoders.kryo[Publication]
+    }
+
    implicit val mapEncoderPub: Encoder[Publication] = Encoders.kryo[Publication]
    implicit val mapEncoderOrg: Encoder[Organization] = Encoders.kryo[Organization]
    implicit val mapEncoderDataset: Encoder[OafDataset] = Encoders.kryo[OafDataset]
@ -79,7 +127,11 @@ object SparkGenerateDoiBoost {
        val otherPub = item._2._2
        if (otherPub != null) {
          crossrefPub.mergeFrom(otherPub)
-          crossrefPub.setAuthor(AuthorMerger.mergeAuthor(crossrefPub.getAuthor, otherPub.getAuthor))
+          val mergeRes: (java.util.List[Author], java.lang.Boolean) =
+            DoiBoostAuthorMerger.mergeAuthor(crossrefPub.getAuthor, otherPub.getAuthor, crossref)
+          crossrefPub.setAuthor(mergeRes._1)
+          crossref = mergeRes._2
+
        }
      }
      crossrefPub
@ -142,8 +194,8 @@ object SparkGenerateDoiBoost {
            val b2 = right._2
            b1.mergeFrom(b2)
            b1.mergeOAFDataInfo(b2)
-            val authors = AuthorMerger.mergeAuthor(b1.getAuthor, b2.getAuthor)
-            b1.setAuthor(authors)
+            val authors = DoiBoostAuthorMerger.mergeAuthor(b1.getAuthor, b2.getAuthor, crossref)
+            b1.setAuthor(authors._1)
            if (b2.getId != null && b2.getId.nonEmpty)
              b1.setId(b2.getId)
            //Return publication Merged
@ -284,4 +336,4 @@ object SparkGenerateDoiBoost {
      .save(s"$workingDirPath/doiBoostOrganization")
  }

-}
+}
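Review note: the `crossref` flag passed to `mergeAuthor` in this file decides which author list is kept as the base and enriched. A minimal sketch of that selection rule, assuming plain string lists instead of `Author` lists and using the hypothetical names `BaseSelectionSketch` and `chooseBase` (the enrichment step itself is left out):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the base-list selection; it does not enrich anything.
public class BaseSelectionSketch {

    // Returns (list kept as base, whether that base is still the Crossref list).
    static Map.Entry<List<String>, Boolean> chooseBase(
        List<String> base, List<String> other, boolean crossref) {

        if (base == null || base.isEmpty()) {
            // nothing to enrich: fall back to the other list, which is not from Crossref
            return new SimpleEntry<>(other, false);
        }
        if (other == null || other.isEmpty()) {
            return new SimpleEntry<>(base, crossref);
        }
        if (crossref || base.size() > other.size()) {
            // a Crossref base always wins; otherwise the longer list is the base
            return new SimpleEntry<>(base, crossref);
        }
        return new SimpleEntry<>(other, false);
    }

    public static void main(String[] args) {
        List<String> crossrefAuthors = Arrays.asList("A. One", "B. Two");
        List<String> magAuthors = Arrays.asList("A. One", "B. Two", "C. Three");
        System.out.println(chooseBase(crossrefAuthors, magAuthors, true).getKey());
        System.out.println(chooseBase(crossrefAuthors, magAuthors, false).getKey());
    }
}
```

The first call keeps the shorter Crossref list, the second keeps the longer list because no Crossref list is available. One design point worth checking in the Scala code: the flag is a `var` captured by Spark closures, so updates made on executors are not visible on the driver.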
--- a/dhp-workflows/dhp-doiboost/src/test/java/eu/dnetlib/dhp/doiboost/DoiBoostAuthorMergerTest.java
+++ b/dhp-workflows/dhp-doiboost/src/test/java/eu/dnetlib/dhp/doiboost/DoiBoostAuthorMergerTest.java
@ -0,0 +1,545 @@
+
+package eu.dnetlib.dhp.doiboost;
+
+import java.io.BufferedReader;
+import java.io.FileReader;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+
+import eu.dnetlib.dhp.schema.common.ModelConstants;
+import eu.dnetlib.dhp.schema.oaf.Author;
+import eu.dnetlib.dhp.schema.oaf.Publication;
+import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
+import eu.dnetlib.doiboost.DoiBoostAuthorMerger;
+import eu.dnetlib.pace.util.MapDocumentUtil;
+import scala.Tuple2;
+
+public class DoiBoostAuthorMergerTest {
+
+	private String publicationsBasePath;
+
+	private List<List<Author>> authors;
+
+	@BeforeEach
+	public void setUp() throws Exception {
+
+		publicationsBasePath = Paths
+			.get(DoiBoostAuthorMergerTest.class.getResource("/eu/dnetlib/dhp/doiboost").toURI())
+			.toFile()
+			.getAbsolutePath();
+
+	}
+
+	@Test
+	public void mergeTestOrcid() {
+
+		authors = readSample(publicationsBasePath + "/matching_authors_first.json", Publication.class)
+			.stream()
+			.map(p -> p._2().getAuthor())
+			.collect(Collectors.toList());
+
+		for (List<Author> authors1 : authors) {
+			System.out.println("List " + (authors.indexOf(authors1) + 1));
+			for (Author author : authors1) {
+				System.out.println(authorToString(author));
+			}
+		}
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(10, merge.size());
+
+		Assertions.assertEquals(3, merge.stream().filter(a -> a.getPid() != null).count());
+
+		merge
+			.stream()
+			.filter(a -> a.getPid() != null)
+			.forEach(
+				a -> Assertions
+					.assertTrue(
+						a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID))));
+		merge.stream().filter(a -> a.getPid() != null).forEach(a -> {
+			try {
+				System.out.println(new ObjectMapper().writeValueAsString(a));
+			} catch (JsonProcessingException e) {
+				e.printStackTrace();
+			}
+		});
+
+	}
+
+	public <T> List<Tuple2<String, T>> readSample(String path, Class<T> clazz) {
+		List<Tuple2<String, T>> res = new ArrayList<>();
+		// try-with-resources closes the reader even when parsing a line fails
+		try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
+			String line = reader.readLine();
+			while (line != null) {
+				res
+					.add(
+						new Tuple2<>(
+							MapDocumentUtil.getJPathString("$.id", line),
+							new ObjectMapper().readValue(line, clazz)));
+				// read next line
+				line = reader.readLine();
+			}
+		} catch (IOException e) {
+			e.printStackTrace();
+		}
+
+		return res;
+	}
+
+	public String authorToString(Author a) {
+
+		String print = "Fullname = ";
+		print += a.getFullname() + " pid = [";
+		if (a.getPid() != null)
+			for (StructuredProperty sp : a.getPid()) {
+				print += sp.toComparableString() + " ";
+			}
+		print += "]";
+		return print;
+	}
+
+	@Test
+	public void mergeTestMAG() {
+
+		authors = readSample(publicationsBasePath + "/matching_authors_second", Publication.class)
+			.stream()
+			.map(p -> p._2().getAuthor())
+			.collect(Collectors.toList());
+
+		for (List<Author> authors1 : authors) {
+			System.out.println("List " + (authors.indexOf(authors1) + 1));
+			for (Author author : authors1) {
+				System.out.println(authorToString(author));
+			}
+		}
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(10, merge.size());
+
+		Assertions.assertEquals(10, merge.stream().filter(a -> a.getPid() != null).count());
+
+		merge
+			.stream()
+			.filter(a -> a.getPid() != null)
+			.forEach(
+				a -> Assertions
+					.assertTrue(
+						a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals("URL"))));
+		merge.stream().filter(a -> a.getPid() != null).forEach(a -> {
+			try {
+				System.out.println(new ObjectMapper().writeValueAsString(a));
+			} catch (JsonProcessingException e) {
+				e.printStackTrace();
+			}
+		});
+
+	}
+
+	@Test
+	public void mergeTestCrossrefEmpty1() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/empty_crossref_authors_first.json", Publication.class)
+			.stream()
+			.map(p -> p._2().getAuthor())
+			.collect(Collectors.toList());
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(3, merge.size());
+
+		Assertions.assertEquals(3, merge.stream().filter(a -> a.getPid() != null).count());
+
+		merge
+			.stream()
+			.filter(a -> a.getPid() != null)
+			.forEach(
+				a -> Assertions
+					.assertTrue(
+						a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID))));
+		merge.stream().filter(a -> a.getPid() != null).forEach(a -> {
+			try {
+				System.out.println(new ObjectMapper().writeValueAsString(a));
+			} catch (JsonProcessingException e) {
+				e.printStackTrace();
+			}
+		});
+
+		System.out.println(new ObjectMapper().writeValueAsString(merge));
+
+	}
+
+	@Test
+	public void mergeTestCrossrefEmpty2() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/empty_crossref_authors_second.json", Publication.class)
+			.stream()
+			.map(p -> p._2().getAuthor())
+			.collect(Collectors.toList());
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, false);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(10, merge.size());
+
+		Assertions.assertEquals(10, merge.stream().filter(a -> a.getPid() != null).count());
+
+		merge
+			.stream()
+			.filter(a -> a.getPid() != null)
+			.forEach(
+				a -> Assertions
+					.assertTrue(
+						a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals("URL"))));
+		merge.stream().filter(a -> a.getPid() != null).forEach(a -> {
+			try {
+				System.out.println(new ObjectMapper().writeValueAsString(a));
+			} catch (JsonProcessingException e) {
+				e.printStackTrace();
+			}
+		});
+
+		Assertions
+			.assertTrue(
+				3 == merge
+					.stream()
+					.filter(a -> a.getPid() != null)
+					.filter(
+						a -> a
+							.getPid()
+							.stream()
+							.anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID)))
+					.count());
+
+	}
+
+	@Test
+	public void mergeTestCrossrefEmpty3() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/empty_crossref_authors_third.json", Publication.class)
+			.stream()
+			.map(p -> p._2().getAuthor())
+			.collect(Collectors.toList());
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(10, merge.size());
+
+		Assertions.assertEquals(10, merge.stream().filter(a -> a.getPid() != null).count());
+
+		merge
+			.stream()
+			.filter(a -> a.getPid() != null)
+			.forEach(
+				a -> Assertions
+					.assertTrue(
+						a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals("URL"))));
+
+		Assertions
+			.assertTrue(
+				3 == merge
+					.stream()
+					.filter(a -> a.getPid() != null)
+					.filter(
+						a -> a
+							.getPid()
+							.stream()
+							.anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID)))
+					.count());
+
+	}
+
+	@Test
+	public void mergeTestCrossrefEmpty4() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/empty_crossref_authors_fourth.json", Publication.class)
+			.stream()
+			.map(p -> p._2().getAuthor())
+			.collect(Collectors.toList());
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(3, merge.size());
+
+		Assertions.assertEquals(3, merge.stream().filter(a -> a.getPid() != null).count());
+
+		Assertions
+			.assertTrue(
+				3 == merge
+					.stream()
+					.filter(a -> a.getPid() != null)
+					.filter(
+						a -> a
+							.getPid()
+							.stream()
+							.anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID)))
+					.count());
+
+	}
+
+	@Test
+	public void shouldMergeTest1() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/should_appear_author1.json", Publication.class)
+			.stream()
+			.map(p -> p._2().getAuthor())
+			.collect(Collectors.toList());
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, true);
+
+		Assertions
+			.assertEquals(
+				6, merge
+					.stream()
+					.filter(a -> a.getPid() != null)
+					.filter(
+						a -> a
+							.getPid()
+							.stream()
+							.anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID)))
+					.count());
+
+		Assertions
+			.assertEquals(
+				34, merge
+					.stream()
+					.filter(a -> a.getPid() != null)
+					.filter(
+						a -> a
+							.getPid()
+							.stream()
+							.anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID_PENDING)))
+					.count());
+
+		merge
+			.stream()
+			.filter(a -> a.getRank() == 26)
+			.forEach(
+				a -> Assertions
+					.assertTrue(
+						a
+							.getPid()
+							.stream()
+							.anyMatch(
+								pid -> pid.getValue().equals("0000-0002-2445-5275")
+									&& pid.getQualifier().getClassid().equals(ModelConstants.ORCID))));
+
+	}
+
+	@Test
+	public void shouldMergeTest2() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/should_appear_author2.json", Publication.class)
+			.stream()
+			.map(p -> p._2().getAuthor())
+			.collect(Collectors.toList());
+
+		// Iterate by index: indexOf would report the first equal list, not necessarily this one.
+		for (int i = 0; i < authors.size(); i++) {
+			System.out.println("List " + (i + 1));
+			for (Author author : authors.get(i)) {
+				System.out.println(authorToString(author));
+			}
+		}
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, true);
+
+		merge.stream().filter(a -> a.getPid() != null).forEach(a -> {
+			try {
+				System.out.println(new ObjectMapper().writeValueAsString(a));
+			} catch (JsonProcessingException e) {
+				e.printStackTrace();
+			}
+		});
+
+		Assertions
+			.assertEquals(
+				6, merge
+					.stream()
+					.filter(a -> a.getPid() != null)
+					.filter(
+						a -> a
+							.getPid()
+							.stream()
+							.anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID)))
+					.count());
+
+		Assertions
+			.assertEquals(
+				34, merge
+					.stream()
+					.filter(a -> a.getPid() != null)
+					.filter(
+						a -> a
+							.getPid()
+							.stream()
+							.anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID_PENDING)))
+					.count());
+
+		merge
+			.stream()
+			.filter(a -> a.getFullname().equals("da luz geraldo eduardo"))
+			.forEach(
+				a -> Assertions
+					.assertTrue(
+						a
+							.getPid()
+							.stream()
+							.anyMatch(
+								pid -> pid.getValue().equals("http://orcid.org/0000-0003-2434-0387")
+									&& pid.getQualifier().getClassid().equals(ModelConstants.ORCID_PENDING))));
+
+	}
+
+	@Test
+	public void shouldNotMergeTest1() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/should_appear_author3.json", Publication.class)
+			.stream()
+			.map(p -> p._2().getAuthor())
+			.collect(Collectors.toList());
+
+		// Iterate by index: indexOf would report the first equal list, not necessarily this one.
+		for (int i = 0; i < authors.size(); i++) {
+			System.out.println("List " + (i + 1));
+			for (Author author : authors.get(i)) {
+				System.out.println(authorToString(author));
+			}
+		}
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions
+			.assertEquals(
+				5, merge
+					.stream()
+					.filter(a -> a.getPid() != null)
+					.filter(
+						a -> a
+							.getPid()
+							.stream()
+							.anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID)))
+					.count());
+
+		Assertions
+			.assertEquals(
+				34, merge
+					.stream()
+					.filter(a -> a.getPid() != null)
+					.filter(
+						a -> a
+							.getPid()
+							.stream()
+							.anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID_PENDING)))
+					.count());
+
+		merge
+			.stream()
+			.filter(a -> a.getFullname().equals("da luz geraldo eduardo"))
+			.forEach(
+				a -> Assertions
+					.assertTrue(
+						a
+							.getPid()
+							.stream()
+							.anyMatch(
+								pid -> pid.getValue().equals("http://orcid.org/0000-0003-2434-0387")
+									&& pid.getQualifier().getClassid().equals(ModelConstants.ORCID_PENDING))));
+
+	}
+
+	@Test
+	public void mergeTestMatchOneill() {
+
+		authors = readSample(publicationsBasePath + "/matching_authors_Oneill.json", Publication.class)
+			.stream()
+			.map(p -> p._2().getAuthor())
+			.collect(Collectors.toList());
+
+		// Iterate by index: indexOf would report the first equal list, not necessarily this one.
+		for (int i = 0; i < authors.size(); i++) {
+			System.out.println("List " + (i + 1));
+			for (Author author : authors.get(i)) {
+				System.out.println(authorToString(author));
+			}
+		}
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(10, merge.size());
+
+		Assertions.assertEquals(3, merge.stream().filter(a -> a.getPid() != null).count());
+
+		merge
+			.stream()
+			.filter(a -> a.getPid() != null)
+			.forEach(
+				a -> Assertions
+					.assertTrue(a.getPid().stream().allMatch(p -> p.getQualifier().getClassid().equals("orcid"))));
+
+		Assertions
+			.assertEquals(
+				"0000-0002-4333-2748",
+				merge
+					.stream()
+					.filter(a -> a.getSurname().equalsIgnoreCase("o'neill"))
+					.collect(Collectors.toList())
+					.get(0)
+					.getPid()
+					.get(0)
+					.getValue());
+
+	}
+}
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_first.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_first.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_fourth.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_fourth.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_second.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_second.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_third.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_third.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/matching_authors_Oneill.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/matching_authors_Oneill.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/matching_authors_first.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/matching_authors_first.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/matching_authors_second
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/matching_authors_second
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author1.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author1.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author2.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author2.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author3.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author3.json
--- a/dhp-workflows/dhp-graph-mapper/src/test/java/eu/dnetlib/dhp/oa/graph/clean/GraphCleaningFunctionsTest.java
+++ b/dhp-workflows/dhp-graph-mapper/src/test/java/eu/dnetlib/dhp/oa/graph/clean/GraphCleaningFunctionsTest.java
@ -343,7 +343,11 @@ public class GraphCleaningFunctionsTest {
 			.toString(getClass().getResourceAsStream("/eu/dnetlib/dhp/oa/graph/clean/doiboostpub2.json"));
 		Publication p_in = MAPPER.readValue(json, Publication.class);
 		Publication p_out = OafCleaner.apply(GraphCleaningFunctions.fixVocabularyNames(p_in), mapping);
+<<<<<<< HEAD
+		Publication cleaned = GraphCleaningFunctions.cleanup(p_out);
+=======
 		Publication cleaned = GraphCleaningFunctions.cleanup(p_out, vocabularies);
+>>>>>>> beta

 		Assertions.assertEquals(true, GraphCleaningFunctions.filter(cleaned));
Author	SHA1	Message	Date
Miriam Baglioni	42cd19f5ea	Change the test for author merger	2023-05-24 15:21:36 +02:00
Miriam Baglioni	2aaa63dfa2	[DOIBoost author merger] merge with beta and added new test to verify the match for 0000-0002-4333-2748	2023-05-23 15:44:03 +02:00
Miriam Baglioni	2bb0a737aa	[DOIBoostAuthorMerger] merge with beta	2022-01-10 15:37:10 +01:00
Miriam Baglioni	38065d6ed6	-	2021-11-25 10:57:06 +01:00
Miriam Baglioni	9a4c2aff07	[DoiBoost Author Merger] -	2021-11-22 16:55:02 +01:00
Miriam Baglioni	910abcba04	[DoiBoost Author merger] -	2021-11-22 16:54:27 +01:00
Miriam Baglioni	41ea1b2177	mergin with branch beta	2021-11-22 14:00:06 +01:00
Miriam Baglioni	1dd15ee2f2	[Author Merger DoiBoost] add last part to consider also author->enrich map	2021-11-19 18:18:46 +01:00
Miriam Baglioni	f100dc5880	mergin with branch beta	2021-11-19 16:43:45 +01:00
Miriam Baglioni	89e4bfbb98	-	2021-11-15 16:45:36 +01:00
Miriam Baglioni	c7924008f7	mergin with branch beta	2021-11-15 10:54:20 +01:00
Miriam Baglioni	48ec750305	-	2021-11-12 15:44:54 +01:00
Miriam Baglioni	92b4026380	mergin with branch beta	2021-11-12 13:52:09 +01:00
Miriam Baglioni	08dd119bc6	DOIBoost AuthorMerger - adding test resources	2021-08-11 16:55:21 +02:00
Miriam Baglioni	a674587cfd	mergin with branch beta	2021-08-11 16:38:33 +02:00
Miriam Baglioni	6a4af4abfc	DoiBoostAuthorMerger - first implementation and test class for the specific Author Merger for doiboost. First change: crossref as base to be enriched. If crossref is empty, the longest from all the merging providers is taken. Modified the normalization function to remove accents from words. Added class to store the most similar author list to be enriched w.r.t. one enriching author (related to DoiBoostAuthorMerger)	2021-08-11 15:51:48 +02:00
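
The earliest commit message above (6a4af4abfc) describes two behaviours of the merger: the Crossref author list is the base to be enriched, with a fallback to the longest list among the merging providers when Crossref is empty, and the normalization function strips accents from words. The Java sketch below illustrates both ideas under stated assumptions. It is not the DoiBoostAuthorMerger code: `AuthorMergeSketch`, `normalize`, and `chooseBase` are hypothetical names, and the lower-casing and whitespace handling are assumptions that the commit message does not mention.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class AuthorMergeSketch {

	// Hypothetical helper: strips accents by decomposing characters (NFD)
	// and dropping the combining marks. Lower-casing and whitespace
	// collapsing are assumptions added for illustration.
	static String normalize(String s) {
		if (s == null) {
			return "";
		}
		String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
		return decomposed
			.replaceAll("\\p{M}", "")
			.toLowerCase(Locale.ROOT)
			.replaceAll("\\s+", " ")
			.trim();
	}

	// Hypothetical helper: returns the Crossref list when it has entries,
	// otherwise the longest list among the other providers
	// (an empty list if there are no other providers).
	static <T> List<T> chooseBase(List<T> crossref, List<List<T>> others) {
		if (crossref != null && !crossref.isEmpty()) {
			return crossref;
		}
		return others
			.stream()
			.max(Comparator.comparingInt((List<T> l) -> l.size()))
			.orElse(Collections.emptyList());
	}

	public static void main(String[] args) {
		// Unicode escapes keep the source file independent of its encoding.
		System.out.println(normalize("Jos\u00e9  \u00d1and\u00fa"));

		List<String> crossref = Collections.emptyList();
		List<List<String>> others = Arrays
			.asList(
				Arrays.asList("a"),
				Arrays.asList("a", "b", "c"),
				Arrays.asList("a", "b"));
		System.out.println(chooseBase(crossref, others));
	}
}
```

Characters that have no canonical decomposition (for example "ø") are left unchanged by this approach, so a real normalizer may need extra mappings for them.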