Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids

minor change
Adding spark.close() to avoid Only one SparkContext may be running in this JVM error while running test on Jenkins and fixed issue
2021-09-17 11:27:25 +02:00 · 2021-09-17 11:27:06 +02:00 · 2021-07-13 10:33:00 +02:00 · 2021-07-12 10:07:01 +02:00 · 2021-07-12 10:06:29 +02:00 · 2021-07-08 18:57:25 +02:00
16 changed files with 721 additions and 8 deletions
--- a/dhp-workflows/dhp-dedup-openaire/src/test/java/eu/dnetlib/dhp/oa/dedup/SparkDedupTest.java
+++ b/dhp-workflows/dhp-dedup-openaire/src/test/java/eu/dnetlib/dhp/oa/dedup/SparkDedupTest.java
@@ -72,8 +72,8 @@ public class SparkDedupTest implements Serializable {
 			.toAbsolutePath()
 			.toString();

-		FileUtils.deleteDirectory(new File(testOutputBasePath));
 		FileUtils.deleteDirectory(new File(testDedupGraphBasePath));
+		FileUtils.deleteDirectory(new File(testOutputBasePath));

 		final SparkConf conf = new SparkConf();
 		conf.set("spark.sql.shuffle.partitions", "200");
--- a/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/AuthorAssoc.java
+++ b/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/AuthorAssoc.java
@@ -0,0 +1,47 @@
+package eu.dnetlib.doiboost;
+
+import eu.dnetlib.dhp.schema.oaf.Author;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+
+public class AuthorAssoc implements Serializable {
+    private Double score ;
+    private List<Author> to_be_enriched;
+    private Author with_enricheing_content;
+
+    public Double getScore() {
+        return score;
+    }
+
+    public void setScore(Double score) {
+        this.score = score;
+    }
+
+    public List<Author> getTo_be_enriched() {
+        return to_be_enriched;
+    }
+
+    public void setTo_be_enriched(List<Author> to_be_enriched) {
+        this.to_be_enriched = to_be_enriched;
+    }
+
+    public Author getWith_enricheing_content() {
+        return with_enricheing_content;
+    }
+
+    public void setWith_enricheing_content(Author with_enricheing_content) {
+        this.with_enricheing_content = with_enricheing_content;
+    }
+
+    public static AuthorAssoc newInstance(Author a){
+        AuthorAssoc ret = new AuthorAssoc();
+        ret.score = 0.0;
+        ret.to_be_enriched = new ArrayList<>();
+        ret.with_enricheing_content = a;
+
+        return ret;
+
+    }
+}
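Review note: AuthorAssoc keeps, for one enriching author, the best similarity score seen so far and the candidates that reached it. The sketch below mirrors the branching used in the loop of DoiBoostAuthorMerger.enrichPidFromList later in this diff. It is only an illustration: AssocSketch, offer and HIGH_SIMILARITY are hypothetical names, plain strings stand in for Author objects, and the similarity values are made up rather than computed with Jaro-Winkler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for AuthorAssoc: candidate names are plain strings.
final class AssocSketch {
    // Assumption: same 0.9 threshold that appears in enrichPidFromList
    static final double HIGH_SIMILARITY = 0.9;

    double score = 0.0;
    List<String> toBeEnriched = new ArrayList<>();
    final String enriching;

    AssocSketch(String enriching) {
        this.enriching = enriching;
    }

    // A better score below the threshold replaces the candidates;
    // otherwise any candidate above the threshold is appended.
    void offer(String candidate, double similarity) {
        if (score < similarity && score < HIGH_SIMILARITY) {
            score = similarity;
            toBeEnriched = new ArrayList<>();
            toBeEnriched.add(candidate);
        } else if (similarity > HIGH_SIMILARITY) {
            toBeEnriched.add(candidate);
        }
    }

    public static void main(String[] args) {
        AssocSketch assoc = new AssocSketch("maria rossi");
        assoc.offer("m rossi", 0.80);     // first candidate, becomes the best so far
        assoc.offer("maria rossi", 0.95); // higher score, replaces the previous candidate
        assoc.offer("rossi maria", 0.92); // above the threshold, kept as an additional candidate
        System.out.println(assoc.score + " " + assoc.toBeEnriched);
    }
}
```

Note that once the stored score reaches the threshold, later candidates are appended even when their score differs from the stored one, so the list can hold authors with different similarities; the tie-break on common words then decides among them.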
--- a/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/DoiBoostAuthorMerger.java
+++ b/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/DoiBoostAuthorMerger.java
@@ -0,0 +1,236 @@
+
+package eu.dnetlib.doiboost;
+
+import java.text.Normalizer;
+import java.util.*;
+import java.util.stream.Collectors;
+
+import eu.dnetlib.dhp.schema.oaf.Result;
+import eu.dnetlib.dhp.utils.DHPUtils;
+
+import com.wcohen.ss.JaroWinkler;
+
+import eu.dnetlib.dhp.schema.oaf.Author;
+import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
+
+import scala.Tuple2;
+
+/**
+ * This is a version of the AuthorMerger specific to DoiBoost.
+ * Here we assume that a match must exist for the authors. Using the Jaro-Winkler similarity measure, we compare each
+ * author in the list to be enriched with each author in the enriching list. For each enriching author we select the
+ * best match, that is, the author with the highest similarity score.
+ * The association goes from the enriching author to the enriched one, so that at most one match per enriching author is kept.
+ * One enriching author can have the same maximum similarity score with more than one author to be enriched.
+ *
+ * The idea is to enrich the most similar authors having at least one word of the name in common.
+ *
+ * The approach is fairly simple: a data structure (AuthorAssoc) keeps the association between the record that can
+ * provide the enrichment and the one that has to be enriched. It holds the list of authors that can be enriched,
+ * the enriching author, and the similarity score between the enriching author and the enriched ones. It is the value
+ * of a map whose key is derived from the full name of the enriching author.
+ *
+ * For each author that can be enriched, we check whether the map entry of the enriching author is currently associated
+ * with an author having a lower similarity score. If so, we update the association, replacing the previously
+ * associated author with the new one that has the higher score. If the score is the same, we add the new author to be
+ * enriched to the list associated with the enriching author.
+ *
+ * At the end there are two cases.
+ * Easy case: the entry is associated with a single author to be enriched. We verify that at least one word is shared
+ * by the two names; if so, the PIDs of the enriching author that are missing from the author to be enriched are added.
+ * Hard case: several authors to be enriched have the same similarity. We enrich the one whose full name shares the
+ * highest number of words with the enriching author. In case of a tie, no author is enriched.
+ *
+ * Remember that we start from the assumption that a match must exist, since the records share the same DOI. Hence the
+ * highest similarity score among the candidates plus one name word in common should be enough to enrich an author.
+ * Cases of homonymy that could cause problems with the author ranks are not mapped.
+ */
+
+public class DoiBoostAuthorMerger {
+
+
+	public static List<Author> merge(List<List<Author>> authors,  Boolean crossref) {
+
+		Iterator<List<Author>> it = authors.iterator();
+		List<Author> author = it.next();
+
+		while (it.hasNext()){
+			List<Author> autList = it.next();
+			Tuple2<List<Author>, Boolean> tmp = mergeAuthor(author, autList, crossref);
+			author = tmp._1();
+			crossref = tmp._2();
+		}
+
+		return author;
+
+	}
+
+	public static Tuple2<List<Author>, Boolean> mergeAuthor(final List<Author> baseAuthor, final List<Author> otherAuthor,
+										    final Boolean crossref) {
+
+		if(baseAuthor == null || baseAuthor.size() == 0)
+			return new Tuple2<>(otherAuthor, false);
+		if(otherAuthor == null || otherAuthor.size() == 0)
+			return new Tuple2<>(baseAuthor, crossref);
+
+		if(crossref) {
+			enrichPidFromList(baseAuthor, otherAuthor);
+			return new Tuple2<>(baseAuthor, true);
+		}
+		else
+			if (baseAuthor.size() > otherAuthor.size()){
+				enrichPidFromList(baseAuthor, otherAuthor);
+				return new Tuple2<>(baseAuthor, false);
+			}else{
+				enrichPidFromList(otherAuthor, baseAuthor);
+				return new Tuple2<>(otherAuthor, false);
+			}
+
+	}
+
+
+	private static void enrichPidFromList(List<Author> base, List<Author> enrich) {
+
+		//search authors having identifiers in the enrich list
+        final List<Author> authorsWithPids = enrich
+                .stream()
+                .filter(a -> a.getPid() != null && a.getPid().size() > 0)
+                .collect(Collectors.toList());
+
+		Map<String, AuthorAssoc> assocMap = authorsWithPids
+				.stream()
+				.map(
+						a -> new Tuple2<>(DHPUtils.md5(a.getFullname()), AuthorAssoc.newInstance(a)))
+				.collect(Collectors.toMap(Tuple2::_1, Tuple2::_2, (x1, x2) -> x1));
+
+
+		//for each author in the base list, we search the best enriched match
+		base.stream()
+				.map(a -> new Tuple2<>(a, authorsWithPids.stream()
+						.map(e -> new Tuple2<>(e, sim(a, e))).collect(Collectors.toList())))
+                .forEach(t2 -> {
+
+                    for (Tuple2<Author, Double> t : t2._2()) {
+                    	String mapEntry = DHPUtils.md5(t._1().getFullname());
+                    	AuthorAssoc aa = assocMap.get(mapEntry);
+                    	if(aa.getScore() < t._2() && aa.getScore() < 0.9){
+							aa.setScore(t._2());
+							aa.setTo_be_enriched(new ArrayList<>());
+							aa.getTo_be_enriched().add(t2._1());
+						}else if(t._2() > 0.9){
+                    		aa.getTo_be_enriched().add(t2._1());
+						}
+                    }
+
+                });
+                
+		assocMap.values().forEach(aa -> enrichAuthor(aa));
+
+
+	}
+
+	private static long getCommonWords(List<String> fullEnrich, List<String> fullEnriching){
+		return fullEnrich.stream().filter( w -> fullEnriching.contains(w)).count();
+	}
+
+
+	private static void enrichAuthor(Author enrich, Author enriching){
+		//verify if some of the words in the fullname are contained in the other
+		//get normalized fullname
+
+		long commonWords = getCommonWords(normalize(enrich.getFullname()),
+				normalize(enriching.getFullname()));
+		if(commonWords > 0 ){
+			if(enrich.getPid() == null){
+				enrich.setPid(new ArrayList<>());
+			}
+			Set<String> aPids = enrich.getPid().stream().map(p -> pidToComparableString(p)).collect(Collectors.toSet());
+			enriching.getPid().forEach(p -> {
+				if (!aPids.contains(pidToComparableString(p))){
+					enrich.getPid().add(p);
+				}
+			});
+			if (enrich.getAffiliation() == null){
+				if (enriching.getAffiliation() != null){
+					enrich.setAffiliation(enriching.getAffiliation());
+				}
+			}
+		}
+
+
+	}
+
+	//Count the words in common with the enriching author: the candidate with the most wins. On a tie, no author is
+	//enriched
+	private static void enrichAuthor(AuthorAssoc authorAssoc) {
+		if (authorAssoc.getTo_be_enriched().size() == 1){
+			enrichAuthor(authorAssoc.getTo_be_enriched().get(0), authorAssoc.getWith_enricheing_content());
+		}else{
+			long common = 0;
+			List<Author> selected = new ArrayList<>() ;
+			for(Author a : authorAssoc.getTo_be_enriched()){
+				long current_common = getCommonWords(normalize(a.getFullname()),
+						normalize(authorAssoc.getWith_enricheing_content().getFullname()));
+				if (current_common > common){
+					common = current_common;
+					selected = new ArrayList<>();
+					selected.add(a);
+				}else if(current_common == common){
+					selected.add(a);
+				}
+			}
+			if (selected.size() == 1){
+				enrichAuthor(selected.get(0), authorAssoc.getWith_enricheing_content());
+			}
+		}
+
+	}
+
+
+	public static String pidToComparableString(StructuredProperty pid) {
+		return (pid.getQualifier() != null
+			? pid.getQualifier().getClassid() != null ? pid.getQualifier().getClassid().toLowerCase() : ""
+			: "")
+			+ (pid.getValue() != null ? pid.getValue().toLowerCase() : "");
+	}
+
+
+
+
+	private static Double sim(Author a, Author b) {
+			return new JaroWinkler()
+				.score(normalizeString(a.getFullname()), normalizeString(b.getFullname()));
+
+	}
+
+	private static String normalizeString(String fullname) {
+		return String.join(" ", normalize(fullname));
+	}
+
+
+	private static List<String> normalize(final String s) {
+		String[] normalized = nfd(s)
+			.replaceAll("[^\\p{ASCII}]", "")
+			.toLowerCase()
+			// do not compact the regexes in a single expression, would cause StackOverflowError
+			// in case
+			// of large input strings
+			.replaceAll("(\\W)+", " ")
+			.replaceAll("(\\p{InCombiningDiacriticalMarks})+", " ")
+			.replaceAll("(\\p{Punct})+", " ")
+			.replaceAll("(\\d)+", " ")
+			.replaceAll("(\\n)+", " ")
+			.trim()
+			.split("\\s+");
+
+		Arrays.sort(normalized);
+
+		return Arrays.asList(normalized);
+
+
+	}
+
+	private static String nfd(final String s) {
+		return Normalizer.normalize(s, Normalizer.Form.NFD);
+	}
+}
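Review note: the name handling in DoiBoostAuthorMerger can be tried in isolation. The sketch below assumes the same steps as normalize() (NFD decomposition, dropping non-ASCII characters, lowercasing, splitting into sorted words) and the tie-break described in the class comment. NameMatchSketch, commonWords and select are hypothetical names, strings stand in for Author objects, and the sketch splits on runs of spaces, which differs from the single-space split in the diff.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class NameMatchSketch {

    // Decompose accents, drop non-ASCII, lowercase, replace non-word characters and digits, then sort the words
    static List<String> normalize(String fullname) {
        String[] words = Normalizer.normalize(fullname, Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "")
            .toLowerCase()
            .replaceAll("\\W+", " ")
            .replaceAll("\\d+", " ")
            .trim()
            .split(" +");
        Arrays.sort(words);
        return Arrays.asList(words);
    }

    // Number of words of the first name that also occur in the second
    static long commonWords(List<String> a, List<String> b) {
        return a.stream().filter(b::contains).count();
    }

    // Pick the candidate sharing the most words with the enriching name;
    // empty on a tie or when no word is shared
    static Optional<String> select(List<String> candidates, String enriching) {
        List<String> target = normalize(enriching);
        long best = 0;
        List<String> selected = new ArrayList<>();
        for (String candidate : candidates) {
            long common = commonWords(normalize(candidate), target);
            if (common > best) {
                best = common;
                selected = new ArrayList<>();
                selected.add(candidate);
            } else if (common == best) {
                selected.add(candidate);
            }
        }
        return selected.size() == 1 && best > 0 ? Optional.of(selected.get(0)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Accented characters are written as unicode escapes to stay independent of the source encoding
        String enriching = "Garc\u00eda-L\u00f3pez, Jos\u00e9";
        System.out.println(normalize(enriching));
        System.out.println(select(Arrays.asList("J. Garcia", "Jose Garcia Lopez"), enriching));
        System.out.println(select(Arrays.asList("J. Garcia", "M. Garcia"), enriching));
    }
}
```

In the second call both candidates share exactly one word with the enriching name, so nothing is selected, which matches the "in case of a tie, no author is enriched" rule.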
--- a/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/SparkGenerateDoiBoost.scala
+++ b/dhp-workflows/dhp-doiboost/src/main/java/eu/dnetlib/doiboost/SparkGenerateDoiBoost.scala
@@ -3,7 +3,7 @@ package eu.dnetlib.doiboost
 import eu.dnetlib.dhp.application.ArgumentApplicationParser
 import eu.dnetlib.dhp.oa.merge.AuthorMerger
 import eu.dnetlib.dhp.schema.common.ModelConstants
-import eu.dnetlib.dhp.schema.oaf.{Organization, Publication, Relation, Dataset => OafDataset}
+import eu.dnetlib.dhp.schema.oaf.{Author, Organization, Publication, Relation, Dataset => OafDataset}
 import eu.dnetlib.doiboost.mag.ConversionUtil
 import org.apache.commons.io.IOUtils
 import org.apache.spark.SparkConf
@@ -25,6 +25,7 @@ object SparkGenerateDoiBoost {
    val conf: SparkConf = new SparkConf()
    val parser = new ArgumentApplicationParser(IOUtils.toString(getClass.getResourceAsStream("/eu/dnetlib/dhp/doiboost/generate_doiboost_params.json")))
    parser.parseArgument(args)
+    var crossref : Boolean = true
    val spark: SparkSession =
      SparkSession
        .builder()
@@ -104,7 +105,10 @@ object SparkGenerateDoiBoost {
        val otherPub = item._2._2
        if (otherPub != null) {
          crossrefPub.mergeFrom(otherPub)
-          crossrefPub.setAuthor(AuthorMerger.mergeAuthor(crossrefPub.getAuthor, otherPub.getAuthor))
+          val mergeRes : (java.util.List[Author], java.lang.Boolean) = DoiBoostAuthorMerger.mergeAuthor(crossrefPub.getAuthor, otherPub.getAuthor, crossref)
+          crossrefPub.setAuthor(mergeRes._1)
+          crossref = mergeRes._2
+
        }
      }
      crossrefPub
--- a/dhp-workflows/dhp-doiboost/src/main/resources/eu/dnetlib/dhp/doiboost/oozie_app/workflow.xml
+++ b/dhp-workflows/dhp-doiboost/src/main/resources/eu/dnetlib/dhp/doiboost/oozie_app/workflow.xml
@@ -111,7 +111,7 @@
            <case to="CreateDOIBoost">${wf:conf('resumeFrom') eq 'CreateDOIBoost'}</case>
            <case to="GenerateActionSet">${wf:conf('resumeFrom') eq 'GenerateActionSet'}</case>
            <case to="GenerateCrossrefDataset">${wf:conf('resumeFrom') eq 'GenerateCrossrefDataset'}</case>
-            <default to="ImportCrossRef"/>
+            <default to="ProcessORCID"/>
        </switch>
    </decision>

@@ -319,7 +319,7 @@
                --executor-memory=${sparkExecutorIntersectionMemory}
                --executor-cores=${sparkExecutorCores}
                --driver-memory=${sparkDriverMemory}
-                --conf spark.sql.shuffle.partitions=3840
+                --conf spark.sql.shuffle.partitions=7680
                --conf spark.extraListeners=${spark2ExtraListeners}
                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
                --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--- a/dhp-workflows/dhp-doiboost/src/test/java/eu/dnetlib/dhp/doiboost/DoiBoostAuthorMergerTest.java
+++ b/dhp-workflows/dhp-doiboost/src/test/java/eu/dnetlib/dhp/doiboost/DoiBoostAuthorMergerTest.java
@@ -0,0 +1,406 @@
+
+package eu.dnetlib.dhp.doiboost;
+
+import java.io.BufferedReader;
+import java.io.FileReader;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+
+import eu.dnetlib.dhp.schema.common.ModelConstants;
+import eu.dnetlib.dhp.schema.oaf.Author;
+import eu.dnetlib.dhp.schema.oaf.Publication;
+import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
+import eu.dnetlib.doiboost.DoiBoostAuthorMerger;
+import eu.dnetlib.pace.util.MapDocumentUtil;
+import scala.Tuple2;
+
+public class DoiBoostAuthorMergerTest {
+
+	private String publicationsBasePath;
+
+	private List<List<Author>> authors;
+
+
+	@BeforeEach
+	public void setUp() throws Exception {
+
+		publicationsBasePath = Paths
+			.get(DoiBoostAuthorMergerTest.class.getResource("/eu/dnetlib/dhp/doiboost").toURI())
+			.toFile()
+			.getAbsolutePath();
+
+	}
+
+	@Test
+	public void mergeTestOrcid() {
+
+		authors = readSample(publicationsBasePath + "/matching_authors_first.json", Publication.class)
+				.stream()
+				.map(p -> p._2().getAuthor())
+				.collect(Collectors.toList());
+
+		for (List<Author> authors1 : authors) {
+			System.out.println("List " + (authors.indexOf(authors1) + 1));
+			for (Author author : authors1) {
+				System.out.println(authorToString(author));
+			}
+		}
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors,  true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(10, merge.size());
+
+		Assertions.assertEquals(3, merge.stream().filter(a -> a.getPid() != null).count());
+
+		merge
+			.stream()
+			.filter(a -> a.getPid() != null)
+			.forEach(
+				a -> Assertions
+					.assertTrue(
+						a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID))));
+		merge.stream().filter(a -> a.getPid() != null).forEach(a -> {
+			try {
+				System.out.println(new ObjectMapper().writeValueAsString(a));
+			} catch (JsonProcessingException e) {
+				e.printStackTrace();
+			}
+		});
+
+	}
+
+	public <T> List<Tuple2<String, T>> readSample(String path, Class<T> clazz) {
+		List<Tuple2<String, T>> res = new ArrayList<>();
+		// reuse a single mapper instead of creating one per line
+		final ObjectMapper mapper = new ObjectMapper();
+		// try-with-resources closes the reader even when parsing fails
+		try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
+			String line = reader.readLine();
+			while (line != null) {
+				res
+					.add(
+						new Tuple2<>(
+							MapDocumentUtil.getJPathString("$.id", line),
+							mapper.readValue(line, clazz)));
+				// read next line
+				line = reader.readLine();
+			}
+		} catch (IOException e) {
+			throw new RuntimeException("Unable to read sample " + path, e);
+		}
+
+		return res;
+	}
+
+	public String authorToString(Author a) {
+
+		String print = "Fullname = ";
+		print += a.getFullname() + " pid = [";
+		if (a.getPid() != null)
+			for (StructuredProperty sp : a.getPid()) {
+				print += sp.toComparableString() + " ";
+			}
+		print += "]";
+		return print;
+	}
+
+	@Test
+	public void mergeTestMAG() {
+
+		authors = readSample(publicationsBasePath + "/matching_authors_second", Publication.class)
+				.stream()
+				.map(p -> p._2().getAuthor())
+				.collect(Collectors.toList());
+
+		for (List<Author> authors1 : authors) {
+			System.out.println("List " + (authors.indexOf(authors1) + 1));
+			for (Author author : authors1) {
+				System.out.println(authorToString(author));
+			}
+		}
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(10, merge.size());
+
+		Assertions.assertEquals(10, merge.stream().filter(a -> a.getPid() != null).count());
+
+		merge
+				.stream()
+				.filter(a -> a.getPid() != null)
+				.forEach(
+						a -> Assertions
+								.assertTrue(
+										a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals("URL"))));
+		merge.stream().filter(a -> a.getPid() != null).forEach(a -> {
+			try {
+				System.out.println(new ObjectMapper().writeValueAsString(a));
+			} catch (JsonProcessingException e) {
+				e.printStackTrace();
+			}
+		});
+
+	}
+
+
+	@Test
+	public void mergeTestCrossrefEmpty1() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/empty_crossref_authors_first.json", Publication.class)
+				.stream()
+				.map(p -> p._2().getAuthor())
+				.collect(Collectors.toList());
+
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors,  true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(3, merge.size());
+
+		Assertions.assertEquals(3, merge.stream().filter(a -> a.getPid() != null).count());
+
+		merge
+				.stream()
+				.filter(a -> a.getPid() != null)
+				.forEach(
+						a -> Assertions
+								.assertTrue(
+										a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID))));
+		merge.stream().filter(a -> a.getPid() != null).forEach(a -> {
+			try {
+				System.out.println(new ObjectMapper().writeValueAsString(a));
+			} catch (JsonProcessingException e) {
+				e.printStackTrace();
+			}
+		});
+
+		System.out.println(new ObjectMapper().writeValueAsString(merge));
+
+	}
+
+
+	@Test
+	public void mergeTestCrossrefEmpty2() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/empty_crossref_authors_second.json", Publication.class)
+				.stream()
+				.map(p -> p._2().getAuthor())
+				.collect(Collectors.toList());
+
+
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors, false);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(10, merge.size());
+
+		Assertions.assertEquals(10, merge.stream().filter(a -> a.getPid() != null).count());
+
+		merge
+				.stream()
+				.filter(a -> a.getPid() != null)
+				.forEach(
+						a -> Assertions
+								.assertTrue(
+										a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals("URL"))));
+		merge.stream().filter(a -> a.getPid() != null).forEach(a -> {
+			try {
+				System.out.println(new ObjectMapper().writeValueAsString(a));
+			} catch (JsonProcessingException e) {
+				e.printStackTrace();
+			}
+		});
+
+		Assertions.assertTrue(3 == merge.stream().filter(a -> a.getPid() !=null)
+		.filter(a -> a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID))).count());
+
+	}
+
+	@Test
+	public void mergeTestCrossrefEmpty3() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/empty_crossref_author_third.json", Publication.class)
+				.stream()
+				.map(p -> p._2().getAuthor())
+				.collect(Collectors.toList());
+
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors,  true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(10, merge.size());
+
+		Assertions.assertEquals(10, merge.stream().filter(a -> a.getPid() != null).count());
+
+		merge
+				.stream()
+				.filter(a -> a.getPid() != null)
+				.forEach(
+						a -> Assertions
+								.assertTrue(
+										a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals("URL"))));
+
+		Assertions.assertTrue(3 == merge.stream().filter(a -> a.getPid() !=null)
+				.filter(a -> a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID))).count());
+
+
+	}
+
+
+	@Test
+	public void mergeTestCrossrefEmpty4() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/empty_crossref_author_fourth.json", Publication.class)
+				.stream()
+				.map(p -> p._2().getAuthor())
+				.collect(Collectors.toList());
+
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors,  true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+		Assertions.assertEquals(3, merge.size());
+
+		Assertions.assertEquals(3, merge.stream().filter(a -> a.getPid() != null).count());
+
+
+		Assertions.assertTrue(3 == merge.stream().filter(a -> a.getPid() !=null)
+				.filter(a -> a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID))).count());
+
+
+	}
+
+	@Test
+	public void shouldMergeTest1() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/should_appear_author1.json", Publication.class)
+				.stream()
+				.map(p -> p._2().getAuthor())
+				.collect(Collectors.toList());
+
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors,  true);
+
+		Assertions.assertTrue(6 == merge.stream().filter(a -> a.getPid() !=null)
+				.filter(a -> a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID))).count());
+
+		Assertions.assertTrue(34 == merge.stream().filter(a -> a.getPid() !=null)
+				.filter(a -> a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID_PENDING))).count());
+
+		merge.stream().filter(a -> a.getRank() == 26)
+				.forEach(a ->
+						Assertions.assertTrue(a.getPid()
+								.stream()
+								.anyMatch(pid -> pid.getValue().equals("0000-0002-2445-5275")
+										&& pid.getQualifier().getClassid().equals(ModelConstants.ORCID)
+								)
+						)
+				);
+
+
+	}
+
+	@Test
+	public void shouldMergeTest2() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/should_appear_author2.json", Publication.class)
+				.stream()
+				.map(p -> p._2().getAuthor())
+				.collect(Collectors.toList());
+
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors,  true);
+
+
+
+		Assertions.assertTrue(5 == merge.stream().filter(a -> a.getPid() !=null)
+				.filter(a -> a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID))).count());
+
+		Assertions.assertTrue(34 == merge.stream().filter(a -> a.getPid() !=null)
+				.filter(a -> a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID_PENDING))).count());
+
+		merge.stream().filter(a -> a.getFullname().equals("da luz geraldo eduardo"))
+				.forEach(a ->
+						Assertions.assertTrue(a.getPid()
+								.stream()
+								.anyMatch(pid -> pid.getValue().equals("http://orcid.org/0000-0003-2434-0387")
+										&& pid.getQualifier().getClassid().equals(ModelConstants.ORCID_PENDING)
+								)
+						)
+				);
+
+
+	}
+
+	@Test
+	public void shouldNotMergeTest1() throws JsonProcessingException {
+
+		authors = readSample(publicationsBasePath + "/should_appear_author3.json", Publication.class)
+				.stream()
+				.map(p -> p._2().getAuthor())
+				.collect(Collectors.toList());
+
+
+		List<Author> merge = DoiBoostAuthorMerger.merge(authors,  true);
+
+		System.out.println("Merge ");
+		for (Author author : merge) {
+			System.out.println(authorToString(author));
+		}
+
+//		Assertions.assertTrue(5 == merge.stream().filter(a -> a.getPid() !=null)
+//				.filter(a -> a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID))).count());
+//
+//		Assertions.assertTrue(34 == merge.stream().filter(a -> a.getPid() !=null)
+//				.filter(a -> a.getPid().stream().anyMatch(p -> p.getQualifier().getClassid().equals(ModelConstants.ORCID_PENDING))).count());
+//
+//		merge.stream().filter(a -> a.getFullname().equals("da luz geraldo eduardo"))
+//				.forEach(a ->
+//						Assertions.assertTrue(a.getPid()
+//								.stream()
+//								.anyMatch(pid -> pid.getValue().equals("http://orcid.org/0000-0003-2434-0387")
+//										&& pid.getQualifier().getClassid().equals(ModelConstants.ORCID_PENDING)
+//								)
+//						)
+//				);
+
+
+	}
+}
--- a/dhp-workflows/dhp-doiboost/src/test/java/eu/dnetlib/doiboost/orcid/MappingORCIDToOAFTest.scala
+++ b/dhp-workflows/dhp-doiboost/src/test/java/eu/dnetlib/doiboost/orcid/MappingORCIDToOAFTest.scala
@@ -61,7 +61,7 @@ class MappingORCIDToOAFTest {
    assertTrue(oA == p.count())
    println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(p.first()))

-
+   spark.close()
  }


@@ -78,8 +78,8 @@ class MappingORCIDToOAFTest {
    val oaf = ORCIDToOAF.convertTOOAF(orcid)
    assert(oaf.getPid.size() == 1)
    oaf.getPid.toList.foreach(pid => assert(pid.getQualifier.getClassid.equals("doi")))
-    oaf.getPid.toList.foreach(pid => assert(pid.getValue.equals("10.1042/BCJ20160876".toLowerCase())))
-    //println(mapper.writeValueAsString(ORCIDToOAF.convertTOOAF(orcid)))
+    oaf.getPid.toList.foreach(pid => assert(pid.getValue.equals("10.1042/BCJ20160876")))
+    //println(mapper.writeValueAsString(oaf))


  }
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_author_fourth.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_author_fourth.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_author_third.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_author_third.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_first.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_first.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_second.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/empty_crossref_authors_second.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/matching_authors_first.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/matching_authors_first.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/matching_authors_second
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/matching_authors_second
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author1.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author1.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author2.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author2.json
--- a/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author3.json
+++ b/dhp-workflows/dhp-doiboost/src/test/resources/eu/dnetlib/dhp/doiboost/should_appear_author3.json
Author	SHA1	Message	Date
miconis	680bfa490f	Merge branch 'stable_ids' of code-repo.d4science.org:D-Net/dnet-hadoop into stable_ids	2021-09-17 11:27:25 +02:00
miconis	43ac539414	minor change	2021-09-17 11:27:06 +02:00
Miriam Baglioni	c26980f1c4	Adding spark.close() to avoid Only one SparkContext may be running in this JVM error while running test on Jenkins and fixed issue	2021-07-13 10:33:00 +02:00
Miriam Baglioni	4f309e625c	Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids	2021-07-12 10:07:01 +02:00
Miriam Baglioni	1ea66e8917	some more tests for authormerger	2021-07-12 10:06:29 +02:00
Miriam Baglioni	6e987fc084	Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids	2021-07-08 18:57:25 +02:00
Miriam Baglioni	b0d86d32b0	added list of author to be merged	2021-07-08 18:56:29 +02:00
Miriam Baglioni	abe546e5ba	added resource files for test author merger for empy crossref and other merging providers (related to DoiBoostAuthorMerger)	2021-07-08 18:55:55 +02:00
Miriam Baglioni	bf24f588e2	Added test for empty author list for crossref and other merging providers (related to DoiBoostAuthorMerger)	2021-07-08 18:55:13 +02:00
Miriam Baglioni	96255fa647	-	2021-07-08 18:54:27 +02:00
Miriam Baglioni	0e47e94099	Added variable to verify if crossref is base for the merging of authors (related to DoiBoostAuthorMerger)	2021-07-08 18:54:07 +02:00
Miriam Baglioni	434aa6380b	Adding description of the merging process for DoiBoost (related to DoiBoostAuthorMerger) - to be refined	2021-07-08 18:53:15 +02:00
Miriam Baglioni	e0e80cde22	Added class to store the most similar author list to be enriched w.r.t. one enriching author (related to DoiBoostAuthorMerger)	2021-07-08 18:52:25 +02:00
Miriam Baglioni	97e0c27db9	Added check for empty author list. If crossref is empty, the longest from all the merging providers is taken. If crossref is not empty, crossref is chosen as base for the enrichment	2021-07-08 15:27:05 +02:00
Miriam Baglioni	3ed90420e4	Merge branch 'stable_ids' of https://code-repo.d4science.org/D-Net/dnet-hadoop into stable_ids	2021-07-05 16:48:19 +02:00
Miriam Baglioni	7498e63174	added resource files for testing of DoiBoostAuthorMerger	2021-07-05 16:26:46 +02:00
Miriam Baglioni	22ce947335	added resource files for testing of DoiBoostAuthorMerger	2021-07-05 16:26:17 +02:00
Miriam Baglioni	f64f5d9e23	first implementation and test class for the specific Author Merger for doiboost. First change: crossref as base to be enriched. Modified the normalization function to remove accents from words	2021-07-05 16:24:47 +02:00
Miriam Baglioni	238d692a0a	apply specific AuthorMerger for doiboost	2021-07-05 16:23:33 +02:00