implemented methods to extract fulltext link from an API call

code adapted as per Michele's recommendations
code refactored
2024-09-11 14:57:38 +02:00 · 2024-09-04 15:29:13 +02:00 · 2024-08-01 09:52:19 +02:00 · 2024-08-01 01:23:04 +02:00 · 2024-07-31 18:05:11 +02:00 · 2024-07-15 16:31:04 +02:00
694 changed files with 150326 additions and 11677 deletions
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@ -0,0 +1,43 @@
+# Contributor Code of Conduct
+
+Openness, transparency and our community-driven participatory approach guide us in our day-to-day interactions and decision-making. Our open source projects are no exception. Trust, respect, collaboration and transparency are core values we believe should live and breathe within our projects. Our community welcomes participants from around the world with different experiences, unique perspectives, and great ideas to share.
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment include:
+
+- Using welcoming and inclusive language
+- Being respectful of differing viewpoints and experiences
+- Gracefully accepting constructive criticism
+- Attempting collaboration before conflict
+- Focusing on what is best for the community
+- Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+- Violence, threats of violence, or inciting others to commit self-harm
+- The use of sexualized language or imagery and unwelcome sexual attention or advances
+- Trolling, intentionally spreading misinformation, insulting/derogatory comments, and personal or political attacks
+- Public or private harassment
+- Publishing others' private information, such as a physical or electronic address, without explicit permission
+- Abuse of the reporting process to intentionally harass or exclude others
+- Advocating for, or encouraging, any of the above behavior
+- Other conduct which could reasonably be considered inappropriate in a professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/), [version 1.4](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html).
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -0,0 +1,10 @@
+# Contributing to D-Net Hadoop
+
+:+1::tada: First off, thanks for taking the time to contribute! :tada::+1:
+
+This project and everyone participating in it is governed by our [Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
+
+The following is a set of guidelines for contributing to this project and its packages. These are mostly guidelines, not rules, and they apply to this project as a whole, including all its sub-modules.
+Use your best judgment, and feel free to propose changes to this document in a pull request.
+
+All contributions are welcome and will be considered to be contributed under the [project license](LICENSE.md).
--- a/LICENSE.md
+++ b/LICENSE.md
--- a/README.md
+++ b/README.md
@ -2,6 +2,11 @@

 Dnet-hadoop is the project that defined all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing, provisioning.

+This project adheres to the Contributor Covenant [code of conduct](CODE_OF_CONDUCT.md).
+By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
+
+This project is licensed under the [AGPL v3 or later version](LICENSE.md).
+
 How to build, package and run oozie workflows
 ====================

--- a/dhp-common/pom.xml
+++ b/dhp-common/pom.xml
@ -63,11 +63,13 @@

 	<dependencies>
 		<dependency>
-			<groupId>eu.dnetlib.dhp</groupId>
-			<artifactId>dhp-pace-core</artifactId>
-			<version>${project.version}</version>
+			<groupId>edu.cmu</groupId>
+			<artifactId>secondstring</artifactId>
+		</dependency>
+		<dependency>
+			<groupId>com.ibm.icu</groupId>
+			<artifactId>icu4j</artifactId>
 		</dependency>
-
 		<dependency>
 			<groupId>org.apache.hadoop</groupId>
 			<artifactId>hadoop-common</artifactId>
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/common/DbClient.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/common/DbClient.java
@ -7,12 +7,12 @@ import java.sql.*;
 import java.util.function.Consumer;

 import org.apache.commons.lang3.StringUtils;
-import org.apache.commons.logging.Log;
-import org.apache.commons.logging.LogFactory;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;

 public class DbClient implements Closeable {

-	private static final Log log = LogFactory.getLog(DbClient.class);
+	private static final Logger log = LoggerFactory.getLogger(DbClient.class);

 	private final Connection connection;

@ -37,6 +37,8 @@ public class DbClient implements Closeable {
 		try (final Statement stmt = connection.createStatement()) {
 			stmt.setFetchSize(100);

+			log.info("running SQL:\n\n{}\n\n", sql);
+
 			try (final ResultSet rs = stmt.executeQuery(sql)) {
 				while (rs.next()) {
 					consumer.accept(rs);
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/CategorySummary.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/CategorySummary.java
@ -0,0 +1,39 @@
+
+package eu.dnetlib.dhp.common.api.context;
+
+public class CategorySummary {
+
+	private String id;
+
+	private String label;
+
+	private boolean hasConcept;
+
+	public String getId() {
+		return id;
+	}
+
+	public String getLabel() {
+		return label;
+	}
+
+	public boolean isHasConcept() {
+		return hasConcept;
+	}
+
+	public CategorySummary setId(final String id) {
+		this.id = id;
+		return this;
+	}
+
+	public CategorySummary setLabel(final String label) {
+		this.label = label;
+		return this;
+	}
+
+	public CategorySummary setHasConcept(final boolean hasConcept) {
+		this.hasConcept = hasConcept;
+		return this;
+	}
+
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/CategorySummaryList.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/CategorySummaryList.java
@ -0,0 +1,7 @@
+
+package eu.dnetlib.dhp.common.api.context;
+
+import java.util.ArrayList;
+
+public class CategorySummaryList extends ArrayList<CategorySummary> {
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/ConceptSummary.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/ConceptSummary.java
@ -0,0 +1,52 @@
+
+package eu.dnetlib.dhp.common.api.context;
+
+import java.util.List;
+
+public class ConceptSummary {
+
+	private String id;
+
+	private String label;
+
+	public boolean hasSubConcept;
+
+	private List<ConceptSummary> concepts;
+
+	public String getId() {
+		return id;
+	}
+
+	public String getLabel() {
+		return label;
+	}
+
+	public List<ConceptSummary> getConcepts() {
+		return concepts;
+	}
+
+	public ConceptSummary setId(final String id) {
+		this.id = id;
+		return this;
+	}
+
+	public ConceptSummary setLabel(final String label) {
+		this.label = label;
+		return this;
+	}
+
+	public boolean isHasSubConcept() {
+		return hasSubConcept;
+	}
+
+	public ConceptSummary setHasSubConcept(final boolean hasSubConcept) {
+		this.hasSubConcept = hasSubConcept;
+		return this;
+	}
+
+	public ConceptSummary setConcept(final List<ConceptSummary> concepts) {
+		this.concepts = concepts;
+		return this;
+	}
+
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/ConceptSummaryList.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/ConceptSummaryList.java
@ -0,0 +1,7 @@
+
+package eu.dnetlib.dhp.common.api.context;
+
+import java.util.ArrayList;
+
+public class ConceptSummaryList extends ArrayList<ConceptSummary> {
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/ContextSummary.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/ContextSummary.java
@ -0,0 +1,50 @@
+
+package eu.dnetlib.dhp.common.api.context;
+
+public class ContextSummary {
+
+	private String id;
+
+	private String label;
+
+	private String type;
+
+	private String status;
+
+	public String getId() {
+		return id;
+	}
+
+	public String getLabel() {
+		return label;
+	}
+
+	public String getType() {
+		return type;
+	}
+
+	public String getStatus() {
+		return status;
+	}
+
+	public ContextSummary setId(final String id) {
+		this.id = id;
+		return this;
+	}
+
+	public ContextSummary setLabel(final String label) {
+		this.label = label;
+		return this;
+	}
+
+	public ContextSummary setType(final String type) {
+		this.type = type;
+		return this;
+	}
+
+	public ContextSummary setStatus(final String status) {
+		this.status = status;
+		return this;
+	}
+
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/ContextSummaryList.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/common/api/context/ContextSummaryList.java
@ -0,0 +1,7 @@
+
+package eu.dnetlib.dhp.common.api.context;
+
+import java.util.ArrayList;
+
+public class ContextSummaryList extends ArrayList<ContextSummary> {
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/common/collection/HttpConnector2.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/common/collection/HttpConnector2.java
@ -8,10 +8,13 @@ import java.io.InputStream;
 import java.net.*;
 import java.util.List;
 import java.util.Map;
+import java.util.concurrent.TimeUnit;

 import org.apache.commons.io.IOUtils;
 import org.apache.commons.lang3.math.NumberUtils;
+import org.apache.commons.lang3.time.DateUtils;
 import org.apache.http.HttpHeaders;
+import org.joda.time.Instant;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

@ -94,14 +97,16 @@ public class HttpConnector2 {
 			throw new CollectorException(msg);
 		}

-		log.info("Request attempt {} [{}]", retryNumber, requestUrl);
-
 		InputStream input = null;

+		long start = System.currentTimeMillis();
 		try {
 			if (getClientParams().getRequestDelay() > 0) {
 				backoffAndSleep(getClientParams().getRequestDelay());
 			}
+
+			log.info("Request attempt {} [{}]", retryNumber, requestUrl);
+
 			final HttpURLConnection urlConn = (HttpURLConnection) new URL(requestUrl).openConnection();
 			urlConn.setInstanceFollowRedirects(false);
 			urlConn.setReadTimeout(getClientParams().getReadTimeOut() * 1000);
@ -115,9 +120,8 @@ public class HttpConnector2 {
 					urlConn.addRequestProperty(headerEntry.getKey(), headerEntry.getValue());
 				}
 			}
-			if (log.isDebugEnabled()) {
-				logHeaderFields(urlConn);
-			}
+
+			logHeaderFields(urlConn);

 			int retryAfter = obtainRetryAfter(urlConn.getHeaderFields());
 			String rateLimit = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT);
@ -132,9 +136,7 @@ public class HttpConnector2 {
 			}

 			if (is2xx(urlConn.getResponseCode())) {
-				input = urlConn.getInputStream();
-				responseType = urlConn.getContentType();
-				return input;
+				return getInputStream(urlConn, start);
 			}
 			if (is3xx(urlConn.getResponseCode())) {
 				// REDIRECTS
@ -144,6 +146,7 @@ public class HttpConnector2 {
 					.put(
 						REPORT_PREFIX + urlConn.getResponseCode(),
 						String.format("Moved to: %s", newUrl));
+				logRequestTime(start);
 				urlConn.disconnect();
 				if (retryAfter > 0) {
 					backoffAndSleep(retryAfter);
@ -159,26 +162,50 @@ public class HttpConnector2 {
 						if (retryAfter > 0) {
 							log
 								.warn(
-									"{} - waiting and repeating request after suggested retry-after {} sec.",
-									requestUrl, retryAfter);
+									"waiting and repeating request after suggested retry-after {} sec for URL {}",
+									retryAfter, requestUrl);
 							backoffAndSleep(retryAfter * 1000);
 						} else {
 							log
 								.warn(
-									"{} - waiting and repeating request after default delay of {} sec.",
-									requestUrl, getClientParams().getRetryDelay());
-							backoffAndSleep(retryNumber * getClientParams().getRetryDelay() * 1000);
+									"waiting and repeating request after default delay of {} sec for URL {}",
+									getClientParams().getRetryDelay(), requestUrl);
+							backoffAndSleep(retryNumber * getClientParams().getRetryDelay());
 						}
 						report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
+
+						logRequestTime(start);
+
 						urlConn.disconnect();
+
 						return attemptDownload(requestUrl, retryNumber + 1, report);
+					case 422: // UNPROCESSABLE ENTITY
+						report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
+						log.warn("waiting and repeating request after 10 sec for URL {}", requestUrl);
+						backoffAndSleep(10000);
+						urlConn.disconnect();
+						logRequestTime(start);
+						try {
+							return getInputStream(urlConn, start);
+						} catch (IOException e) {
+							log
+								.error(
+									"server returned 422 and got IOException accessing the response body from URL {}",
+									requestUrl);
+							log.error("IOException:", e);
+							return attemptDownload(requestUrl, retryNumber + 1, report);
+						}
 					default:
+						log.error("got error {} from URL: {}", urlConn.getResponseCode(), urlConn.getURL());
+						log.error("response message: {}", urlConn.getResponseMessage());
 						report
 							.put(
 								REPORT_PREFIX + urlConn.getResponseCode(),
 								String
 									.format(
 										"%s Error: %s", requestUrl, urlConn.getResponseMessage()));
+						logRequestTime(start);
+						urlConn.disconnect();
 						throw new CollectorException(urlConn.getResponseCode() + " error " + report);
 				}
 			}
@ -199,13 +226,27 @@ public class HttpConnector2 {
 		}
 	}

+	private InputStream getInputStream(HttpURLConnection urlConn, long start) throws IOException {
+		InputStream input = urlConn.getInputStream();
+		responseType = urlConn.getContentType();
+		logRequestTime(start);
+		return input;
+	}
+
+	private static void logRequestTime(long start) {
+		log
+			.info(
+				"request time elapsed: {}sec",
+				TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis() - start));
+	}
+
 	private void logHeaderFields(final HttpURLConnection urlConn) throws IOException {
-		log.debug("StatusCode: {}", urlConn.getResponseMessage());
+		log.info("Response: {} - {}", urlConn.getResponseCode(), urlConn.getResponseMessage());

 		for (Map.Entry<String, List<String>> e : urlConn.getHeaderFields().entrySet()) {
 			if (e.getKey() != null) {
 				for (String v : e.getValue()) {
-					log.debug("  key: {} - value: {}", e.getKey(), v);
+					log.info("  key: {} - value: {}", e.getKey(), v);
 				}
 			}
 		}
@ -225,7 +266,7 @@ public class HttpConnector2 {
 		for (String key : headerMap.keySet()) {
 			if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (!headerMap.get(key).isEmpty())
 				&& NumberUtils.isCreatable(headerMap.get(key).get(0))) {
-				return Integer.parseInt(headerMap.get(key).get(0)) + 10;
+				return Integer.parseInt(headerMap.get(key).get(0));
 			}
 		}
 		return -1;
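Review note: the last hunk stops adding 10 seconds to the server-provided `Retry-After` value. A minimal standalone sketch of that header lookup follows, assuming plain JDK types only. The class name `RetryAfterSketch` is hypothetical, the `"Retry-After"` literal stands in for `HttpHeaders.RETRY_AFTER`, and the digit check is a simplification of `NumberUtils.isCreatable`, not the project's code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RetryAfterSketch {

	// Stand-in for org.apache.http.HttpHeaders.RETRY_AFTER (assumption: same literal)
	static final String RETRY_AFTER = "Retry-After";

	// Returns the Retry-After delay in seconds, or -1 when absent or not a plain integer
	static int obtainRetryAfter(final Map<String, List<String>> headerMap) {
		for (Map.Entry<String, List<String>> e : headerMap.entrySet()) {
			final String key = e.getKey();
			final List<String> values = e.getValue();
			// the status line is reported under a null key, so null keys are skipped
			if (key != null && key.equalsIgnoreCase(RETRY_AFTER) && !values.isEmpty()
				&& values.get(0).matches("\\d{1,9}")) { // simplified numeric check
				return Integer.parseInt(values.get(0)); // no extra padding added
			}
		}
		return -1;
	}

	public static void main(String[] args) {
		final Map<String, List<String>> headers = new HashMap<>();
		headers.put(null, Collections.singletonList("HTTP/1.1 429 Too Many Requests"));
		headers.put("retry-after", Collections.singletonList("120"));
		System.out.println(obtainRetryAfter(headers)); // 120

		headers.put("retry-after", Collections.singletonList("Wed, 21 Oct 2015 07:28:00 GMT"));
		System.out.println(obtainRetryAfter(headers)); // -1: HTTP-date values are not parsed
	}
}
```

The header name is matched case-insensitively, and an HTTP-date value falls through to `-1`, which the caller treats as "no suggestion" and replaces with the default retry delay.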
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/common/vocabulary/Vocabulary.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/common/vocabulary/Vocabulary.java
@ -63,7 +63,10 @@ public class Vocabulary implements Serializable {
 	}

 	public VocabularyTerm getTermBySynonym(final String syn) {
-		return getTerm(synonyms.get(syn.toLowerCase()));
+		return Optional
+			.ofNullable(syn)
+			.map(s -> getTerm(synonyms.get(s.toLowerCase())))
+			.orElse(null);
 	}

 	public Qualifier getTermAsQualifier(final String termId) {
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/common/vocabulary/VocabularyGroup.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/common/vocabulary/VocabularyGroup.java
@ -135,6 +135,24 @@ public class VocabularyGroup implements Serializable {
 		return vocs.get(vocId.toLowerCase()).getSynonymAsQualifier(syn);
 	}

+	public Qualifier lookupTermBySynonym(final String vocId, final String syn) {
+		return find(vocId)
+			.map(
+				vocabulary -> Optional
+					.ofNullable(vocabulary.getTerm(syn))
+					.map(
+						term -> OafMapperUtils
+							.qualifier(term.getId(), term.getName(), vocabulary.getId(), vocabulary.getName()))
+					.orElse(
+						Optional
+							.ofNullable(vocabulary.getTermBySynonym(syn))
+							.map(
+								term -> OafMapperUtils
+									.qualifier(term.getId(), term.getName(), vocabulary.getId(), vocabulary.getName()))
+							.orElse(null)))
+			.orElse(null);
+	}
+
 	/**
 	 * getSynonymAsQualifierCaseSensitive
 	 *
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/oa/merge/AuthorMerger.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/oa/merge/AuthorMerger.java
@ -10,6 +10,7 @@ import org.apache.commons.lang3.StringUtils;
 import com.wcohen.ss.JaroWinkler;

 import eu.dnetlib.dhp.schema.oaf.Author;
+import eu.dnetlib.dhp.schema.oaf.Qualifier;
 import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
 import eu.dnetlib.pace.model.Person;
 import scala.Tuple2;
@ -119,11 +120,47 @@ public class AuthorMerger {
 				});
 	}

+	public static String normalizeFullName(final String fullname) {
+		return nfd(fullname)
+			.toLowerCase()
+			// do not compact the regexes in a single expression, would cause StackOverflowError
+			// in case
+			// of large input strings
+			.replaceAll("(\\W)+", " ")
+			.replaceAll("(\\p{InCombiningDiacriticalMarks})+", " ")
+			.replaceAll("(\\p{Punct})+", " ")
+			.replaceAll("(\\d)+", " ")
+			.replaceAll("(\\n)+", " ")
+
+			.trim();
+	}
+
+	private static String authorFieldToBeCompared(Author author) {
+		if (StringUtils.isNotBlank(author.getSurname())) {
+			return author.getSurname();
+
+		}
+		if (StringUtils.isNotBlank(author.getFullname())) {
+			return author.getFullname();
+		}
+		return null;
+	}
+
 	public static String pidToComparableString(StructuredProperty pid) {
-		final String classid = pid.getQualifier().getClassid() != null ? pid.getQualifier().getClassid().toLowerCase()
-			: "";
-		return (pid.getQualifier() != null ? classid : "")
-			+ (pid.getValue() != null ? pid.getValue().toLowerCase() : "");
+		final String classId = Optional
+			.ofNullable(pid)
+			.map(
+				p -> Optional
+					.ofNullable(p.getQualifier())
+					.map(Qualifier::getClassid)
+					.map(String::toLowerCase)
+					.orElse(""))
+			.orElse("");
+		return Optional
+			.ofNullable(pid)
+			.map(StructuredProperty::getValue)
+			.map(v -> String.join("|", v, classId))
+			.orElse("");
 	}

 	public static int countAuthorsPids(List<Author> authors) {
@ -171,7 +208,7 @@ public class AuthorMerger {
 		}
 	}

-	private static String normalize(final String s) {
+	public static String normalize(final String s) {
 		String[] normalized = nfd(s)
 			.toLowerCase()
 			// do not compact the regexes in a single expression, would cause StackOverflowError
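Review note: the rewritten `pidToComparableString` changes the key format as well as the null handling. The removed code concatenated the lower-cased class id and the lower-cased value; the new code emits `value|classid`, with only the class id lower-cased. A minimal sketch of the new behaviour, using hypothetical stand-ins (`Qual`, `Pid`) for `Qualifier` and `StructuredProperty`:

```java
import java.util.Optional;

final class PidKeySketch {

	// Hypothetical minimal stand-in for eu.dnetlib.dhp.schema.oaf.Qualifier
	static final class Qual {
		private final String classid;

		Qual(final String classid) {
			this.classid = classid;
		}

		String getClassid() {
			return classid;
		}
	}

	// Hypothetical minimal stand-in for eu.dnetlib.dhp.schema.oaf.StructuredProperty
	static final class Pid {
		private final Qual qualifier;
		private final String value;

		Pid(final Qual qualifier, final String value) {
			this.qualifier = qualifier;
			this.value = value;
		}

		Qual getQualifier() {
			return qualifier;
		}

		String getValue() {
			return value;
		}
	}

	// Same Optional chain as in the hunk above, applied to the stand-in types
	static String pidToComparableString(final Pid pid) {
		final String classId = Optional
			.ofNullable(pid)
			.map(
				p -> Optional
					.ofNullable(p.getQualifier())
					.map(Qual::getClassid)
					.map(String::toLowerCase)
					.orElse(""))
			.orElse("");
		return Optional
			.ofNullable(pid)
			.map(Pid::getValue)
			.map(v -> String.join("|", v, classId))
			.orElse("");
	}

	public static void main(String[] args) {
		System.out.println(pidToComparableString(new Pid(new Qual("ORCID"), "0000-0002-1825-0097")));
		System.out.println(pidToComparableString(new Pid(null, "X1"))); // missing qualifier
		System.out.println(pidToComparableString(new Pid(new Qual("doi"), null))); // missing value
	}
}
```

A null pid or a null value yields an empty string, while a null qualifier or class id yields the value followed by a trailing `|`.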
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/oa/merge/GroupEntitiesSparkJob.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/oa/merge/GroupEntitiesSparkJob.java
@ -14,17 +14,22 @@ import java.util.stream.Collectors;
 import org.apache.commons.io.IOUtils;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.function.MapFunction;
-import org.apache.spark.api.java.function.ReduceFunction;
+import org.apache.spark.api.java.function.MapGroupsFunction;
 import org.apache.spark.sql.*;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

 import eu.dnetlib.dhp.application.ArgumentApplicationParser;
 import eu.dnetlib.dhp.common.HdfsSupport;
+import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup;
 import eu.dnetlib.dhp.schema.common.EntityType;
 import eu.dnetlib.dhp.schema.common.ModelSupport;
 import eu.dnetlib.dhp.schema.oaf.OafEntity;
-import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
+import eu.dnetlib.dhp.schema.oaf.utils.GraphCleaningFunctions;
+import eu.dnetlib.dhp.schema.oaf.utils.MergeUtils;
+import eu.dnetlib.dhp.utils.ISLookupClientFactory;
+import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
+import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
 import scala.Tuple2;

 /**
@ -35,6 +40,12 @@ public class GroupEntitiesSparkJob {

 	private static final Encoder<OafEntity> OAFENTITY_KRYO_ENC = Encoders.kryo(OafEntity.class);

+	private ArgumentApplicationParser parser;
+
+	public GroupEntitiesSparkJob(ArgumentApplicationParser parser) {
+		this.parser = parser;
+	}
+
 	public static void main(String[] args) throws Exception {

 		String jsonConfiguration = IOUtils
@ -51,6 +62,17 @@ public class GroupEntitiesSparkJob {
 			.orElse(Boolean.TRUE);
 		log.info("isSparkSessionManaged: {}", isSparkSessionManaged);

+		final String isLookupUrl = parser.get("isLookupUrl");
+		log.info("isLookupUrl: {}", isLookupUrl);
+
+		final ISLookUpService isLookupService = ISLookupClientFactory.getLookUpService(isLookupUrl);
+
+		new GroupEntitiesSparkJob(parser).run(isSparkSessionManaged, isLookupService);
+	}
+
+	public void run(Boolean isSparkSessionManaged, ISLookUpService isLookUpService)
+		throws ISLookUpException {
+
 		String graphInputPath = parser.get("graphInputPath");
 		log.info("graphInputPath: {}", graphInputPath);

@ -60,19 +82,21 @@ public class GroupEntitiesSparkJob {
 		String outputPath = parser.get("outputPath");
 		log.info("outputPath: {}", outputPath);

-		boolean filterInvisible = Boolean.valueOf(parser.get("filterInvisible"));
+		boolean filterInvisible = Boolean.parseBoolean(parser.get("filterInvisible"));
 		log.info("filterInvisible: {}", filterInvisible);

 		SparkConf conf = new SparkConf();
 		conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
 		conf.registerKryoClasses(ModelSupport.getOafModelClasses());

+		final VocabularyGroup vocs = VocabularyGroup.loadVocsFromIS(isLookUpService);
+
 		runWithSparkSession(
 			conf,
 			isSparkSessionManaged,
 			spark -> {
 				HdfsSupport.remove(checkpointPath, spark.sparkContext().hadoopConfiguration());
-				groupEntities(spark, graphInputPath, checkpointPath, outputPath, filterInvisible);
+				groupEntities(spark, graphInputPath, checkpointPath, outputPath, filterInvisible, vocs);
 			});
 	}

@ -81,7 +105,7 @@ public class GroupEntitiesSparkJob {
 		String inputPath,
 		String checkpointPath,
 		String outputPath,
-		boolean filterInvisible) {
+		boolean filterInvisible, VocabularyGroup vocs) {

 		Dataset<OafEntity> allEntities = spark.emptyDataset(OAFENTITY_KRYO_ENC);

@ -106,11 +130,15 @@ public class GroupEntitiesSparkJob {
 		}

 		Dataset<?> groupedEntities = allEntities
-			.groupByKey((MapFunction<OafEntity, String>) OafEntity::getId, Encoders.STRING())
-			.reduceGroups((ReduceFunction<OafEntity>) (b, a) -> OafMapperUtils.mergeEntities(b, a))
 			.map(
-				(MapFunction<Tuple2<String, OafEntity>, Tuple2<String, OafEntity>>) t -> new Tuple2(
-					t._2().getClass().getName(), t._2()),
+				(MapFunction<OafEntity, OafEntity>) entity -> GraphCleaningFunctions
+					.applyCoarVocabularies(entity, vocs),
+				OAFENTITY_KRYO_ENC)
+			.groupByKey((MapFunction<OafEntity, String>) OafEntity::getId, Encoders.STRING())
+			.mapGroups((MapGroupsFunction<String, OafEntity, OafEntity>) MergeUtils::mergeById, OAFENTITY_KRYO_ENC)
+			.map(
+				(MapFunction<OafEntity, Tuple2<String, OafEntity>>) t -> new Tuple2<>(
+					t.getClass().getName(), t),
 				Encoders.tuple(Encoders.STRING(), OAFENTITY_KRYO_ENC));

 		// pivot on "_1" (classname of the entity)
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/oozie/RunSQLSparkJob.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/oozie/RunSQLSparkJob.java
@ -0,0 +1,77 @@
+
+package eu.dnetlib.dhp.oozie;
+
+import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkHiveSession;
+
+import java.net.URL;
+import java.nio.charset.StandardCharsets;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Optional;
+
+import org.apache.commons.lang3.time.DurationFormatUtils;
+import org.apache.commons.text.StringSubstitutor;
+import org.apache.spark.SparkConf;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.common.io.Resources;
+
+import eu.dnetlib.dhp.application.ArgumentApplicationParser;
+
+public class RunSQLSparkJob {
+	private static final Logger log = LoggerFactory.getLogger(RunSQLSparkJob.class);
+
+	private final ArgumentApplicationParser parser;
+
+	public RunSQLSparkJob(ArgumentApplicationParser parser) {
+		this.parser = parser;
+	}
+
+	public static void main(String[] args) throws Exception {
+
+		Map<String, String> params = new HashMap<>();
+		for (int i = 0; i < args.length - 1; i++) {
+			if (args[i].startsWith("--")) {
+				params.put(args[i].substring(2), args[++i]);
+			}
+		}
+
+		/*
+		 * String jsonConfiguration = IOUtils .toString( Objects .requireNonNull( RunSQLSparkJob.class
+		 * .getResourceAsStream( "/eu/dnetlib/dhp/oozie/run_sql_parameters.json"))); final ArgumentApplicationParser
+		 * parser = new ArgumentApplicationParser(jsonConfiguration); parser.parseArgument(args);
+		 */
+
+		Boolean isSparkSessionManaged = Optional
+			.ofNullable(params.get("isSparkSessionManaged"))
+			.map(Boolean::valueOf)
+			.orElse(Boolean.TRUE);
+		log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
+
+		URL url = com.google.common.io.Resources.getResource(params.get("sql"));
+		String raw_sql = Resources.toString(url, StandardCharsets.UTF_8);
+
+		String sql = StringSubstitutor.replace(raw_sql, params);
+		log.info("sql: {}", sql);
+
+		SparkConf conf = new SparkConf();
+		conf.set("hive.metastore.uris", params.get("hiveMetastoreUris"));
+
+		runWithSparkHiveSession(
+			conf,
+			isSparkSessionManaged,
+			spark -> {
+				for (String statement : sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*")) {
+					log.info("executing: {}", statement);
+					long startTime = System.currentTimeMillis();
+					spark.sql(statement).show();
+					log
+						.info(
+							"executed in {}",
+							DurationFormatUtils.formatDuration(System.currentTimeMillis() - startTime, "HH:mm:ss.S"));
+				}
+			});
+	}
+
+}
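Review note: `RunSQLSparkJob` parses its own `--name value` pairs and splits the SQL script on `;` followed by an `/* EOS */` marker, so a bare semicolon does not end a statement. A minimal sketch of those two steps in isolation, assuming JDK classes only; the class name `RunSqlArgsSketch` and the sample arguments are hypothetical, and the `StringSubstitutor` step is left out.

```java
import java.util.HashMap;
import java.util.Map;

final class RunSqlArgsSketch {

	// Same loop as in main(): every "--name" consumes the following token as its value
	static Map<String, String> parseArgs(final String[] args) {
		final Map<String, String> params = new HashMap<>();
		for (int i = 0; i < args.length - 1; i++) {
			if (args[i].startsWith("--")) {
				params.put(args[i].substring(2), args[++i]);
			}
		}
		return params;
	}

	// Same regex as in the job: a statement ends at ';' followed by an /* EOS */ comment
	static String[] splitStatements(final String sql) {
		return sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*");
	}

	public static void main(String[] args) {
		final Map<String, String> params = parseArgs(
			new String[] {
				"--sql", "a.sql", "--isSparkSessionManaged", "false"
			});
		System.out.println(params.get("sql")); // a.sql

		final String script = "SELECT 1; /* EOS */\nSELECT ';' AS x; /*EOS*/";
		for (String statement : splitStatements(script)) {
			System.out.println("[" + statement + "]");
		}
	}
}
```

The semicolon inside the string literal of the second statement is preserved because no `/* EOS */` marker follows it, and `String.split` drops the trailing empty element after the final marker.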
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/CleaningFunctions.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/CleaningFunctions.java
@ -0,0 +1,76 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import java.util.HashSet;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+
+import org.apache.commons.lang3.StringUtils;
+
+import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
+
+public class CleaningFunctions {
+
+	public static final String DOI_PREFIX_REGEX = "(^10\\.|\\/10\\.)";
+	public static final String DOI_PREFIX = "10.";
+
+	public static final Set<String> PID_BLACKLIST = new HashSet<>();
+
+	static {
+		PID_BLACKLIST.add("none");
+		PID_BLACKLIST.add("na");
+	}
+
+	public CleaningFunctions() {
+	}
+
+	/**
+	 * Utility method that filters PID values on a per-type basis.
+	 * @param s the PID whose value will be checked.
+	 * @return false if the pid matches the filter criteria, true otherwise.
+	 */
+	public static boolean pidFilter(StructuredProperty s) {
+		final String pidValue = s.getValue();
+		if (Objects.isNull(s.getQualifier()) ||
+			StringUtils.isBlank(pidValue) ||
+			StringUtils.isBlank(pidValue.replaceAll("(?:\\n|\\r|\\t|\\s)", ""))) {
+			return false;
+		}
+		if (CleaningFunctions.PID_BLACKLIST.contains(pidValue)) {
+			return false;
+		}
+		return !PidBlacklistProvider.getBlacklist(s.getQualifier().getClassid()).contains(pidValue);
+	}
+
+	/**
+	 * Utility method that normalises PID values on a per-type basis.
+	 * @param pid the PID whose value will be normalised.
+	 * @return the PID containing the normalised value.
+	 */
+	public static StructuredProperty normalizePidValue(StructuredProperty pid) {
+		pid
+			.setValue(
+				normalizePidValue(
+					pid.getQualifier().getClassid(),
+					pid.getValue()));
+
+		return pid;
+	}
+
+	public static String normalizePidValue(String pidType, String pidValue) {
+		String value = Optional
+			.ofNullable(pidValue)
+			.map(String::trim)
+			.orElseThrow(() -> new IllegalArgumentException("PID value cannot be empty"));
+
+		switch (pidType) {
+
+			// TODO add cleaning for more PID types as needed
+			case "doi":
+				return value.toLowerCase().replaceFirst(DOI_PREFIX_REGEX, DOI_PREFIX);
+		}
+		return value;
+	}
+
+}
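Review note: a minimal sketch of what `normalizePidValue(String, String)` returns for a few inputs. The method body is copied from the file above with the two constants inlined; the class name `NormalizePidValueSketch` and the sample values are hypothetical.

```java
import java.util.Optional;

final class NormalizePidValueSketch {

	static final String DOI_PREFIX_REGEX = "(^10\\.|\\/10\\.)";
	static final String DOI_PREFIX = "10.";

	static String normalizePidValue(final String pidType, final String pidValue) {
		// a null value is rejected; any other value is trimmed first
		final String value = Optional
			.ofNullable(pidValue)
			.map(String::trim)
			.orElseThrow(() -> new IllegalArgumentException("PID value cannot be empty"));

		switch (pidType) {
			case "doi":
				// DOIs are lower-cased; a '/' directly before the first "10." is dropped
				return value.toLowerCase().replaceFirst(DOI_PREFIX_REGEX, DOI_PREFIX);
		}
		// every other PID type is only trimmed
		return value;
	}

	public static void main(String[] args) {
		System.out.println(normalizePidValue("doi", " 10.1234/ABC ")); // 10.1234/abc
		System.out.println(normalizePidValue("handle", " 20.500.1/XyZ ")); // 20.500.1/XyZ
		System.out.println(normalizePidValue("doi", "doi.org/10.1234/ABC")); // doi.org10.1234/abc
	}
}
```

The last call shows that this method does not strip a resolver prefix: only the slash before `10.` is removed and the host part stays in the value. A null `pidType` would make the `switch` throw a `NullPointerException`, so callers need a non-null qualifier class id.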
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/DoiCleaningRule.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/DoiCleaningRule.java
@ -1,6 +1,8 @@

 package eu.dnetlib.dhp.schema.oaf.utils;

+import org.apache.commons.lang3.StringUtils;
+
 public class DoiCleaningRule {

 	public static String clean(final String doi) {
@ -11,4 +13,26 @@ public class DoiCleaningRule {
 			.replaceFirst(CleaningFunctions.DOI_PREFIX_REGEX, CleaningFunctions.DOI_PREFIX);
 	}

+	public static String normalizeDoi(final String input) {
+		if (input == null)
+			return null;
+		final String replaced = input
+			.replaceAll("\\n|\\r|\\t|\\s", "")
+			.toLowerCase()
+			.replaceFirst(CleaningFunctions.DOI_PREFIX_REGEX, CleaningFunctions.DOI_PREFIX);
+		if (StringUtils.isEmpty(replaced))
+			return null;
+
+		if (!replaced.contains("10."))
+			return null;
+
+		final String ret = replaced.substring(replaced.indexOf("10."));
+
+		if (!ret.startsWith(CleaningFunctions.DOI_PREFIX))
+			return null;
+
+		return ret;
+
+	}
+
 }
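Review note: a minimal sketch of how the new `normalizeDoi` behaves on a few inputs. The logic is copied from the hunk above, with the `CleaningFunctions` constants inlined and `StringUtils.isEmpty` replaced by `String.isEmpty` (equivalent here because the value is already non-null); the class name `NormalizeDoiSketch` and the sample DOIs are hypothetical.

```java
final class NormalizeDoiSketch {

	static final String DOI_PREFIX_REGEX = "(^10\\.|\\/10\\.)";
	static final String DOI_PREFIX = "10.";

	static String normalizeDoi(final String input) {
		if (input == null)
			return null;
		// strip all whitespace, lower-case, and drop a '/' directly before the first "10."
		final String replaced = input
			.replaceAll("\\n|\\r|\\t|\\s", "")
			.toLowerCase()
			.replaceFirst(DOI_PREFIX_REGEX, DOI_PREFIX);
		if (replaced.isEmpty())
			return null;

		if (!replaced.contains("10."))
			return null;

		// keep everything from the first "10." onwards, discarding any resolver prefix
		final String ret = replaced.substring(replaced.indexOf("10."));

		if (!ret.startsWith(DOI_PREFIX))
			return null;

		return ret;
	}

	public static void main(String[] args) {
		System.out.println(normalizeDoi("https://doi.org/10.1234/ABC")); // 10.1234/abc
		System.out.println(normalizeDoi(" 10.5555/XyZ \n")); // 10.5555/xyz
		System.out.println(normalizeDoi("not a doi")); // null
	}
}
```

Inputs that contain no `10.` at all, as well as null and empty strings, come back as null, so callers can use the result as a validity check.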
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/GraphCleaningFunctions.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/GraphCleaningFunctions.java
@ -1,8 +1,12 @@

 package eu.dnetlib.dhp.schema.oaf.utils;

+import static eu.dnetlib.dhp.schema.common.ModelConstants.*;
+import static eu.dnetlib.dhp.schema.common.ModelConstants.OPENAIRE_META_RESOURCE_TYPE;
 import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.getProvenance;

+import java.net.MalformedURLException;
+import java.net.URL;
 import java.time.LocalDate;
 import java.time.ZoneId;
 import java.time.format.DateTimeFormatter;
@ -19,6 +23,7 @@ import com.google.common.collect.Lists;
 import com.google.common.collect.Sets;

 import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup;
+import eu.dnetlib.dhp.common.vocabulary.VocabularyTerm;
 import eu.dnetlib.dhp.schema.common.ModelConstants;
 import eu.dnetlib.dhp.schema.common.ModelSupport;
 import eu.dnetlib.dhp.schema.oaf.*;
@ -26,6 +31,10 @@ import me.xuender.unidecode.Unidecode;

 public class GraphCleaningFunctions extends CleaningFunctions {

+	public static final String DNET_PUBLISHERS = "dnet:publishers";
+
+	public static final String DNET_LICENSES = "dnet:licenses";
+
 	public static final String ORCID_CLEANING_REGEX = ".*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9x]{4})";
 	public static final int ORCID_LEN = 19;
 	public static final String CLEANING_REGEX = "(?:\\n|\\r|\\t)";
@ -36,6 +45,13 @@ public class GraphCleaningFunctions extends CleaningFunctions {

 	public static final int TITLE_FILTER_RESIDUAL_LENGTH = 5;
 	private static final String NAME_CLEANING_REGEX = "[\\r\\n\\t\\s]+";
+
+	private static final Set<String> INVALID_AUTHOR_NAMES = new HashSet<>();
+
+	private static final Set<String> INVALID_URLS = new HashSet<>();
+
+	private static final Set<String> INVALID_URL_HOSTS = new HashSet<>();
+
 	private static final HashSet<String> PEER_REVIEWED_TYPES = new HashSet<>();

 	static {
@ -48,6 +64,49 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 		PEER_REVIEWED_TYPES.add("Thesis");
 		PEER_REVIEWED_TYPES.add("Bachelor thesis");
 		PEER_REVIEWED_TYPES.add("Conference object");
+
+		INVALID_AUTHOR_NAMES.add("(:null)");
+		INVALID_AUTHOR_NAMES.add("(:unap)");
+		INVALID_AUTHOR_NAMES.add("(:tba)");
+		INVALID_AUTHOR_NAMES.add("(:unas)");
+		INVALID_AUTHOR_NAMES.add("(:unav)");
+		INVALID_AUTHOR_NAMES.add("(:unkn)");
+		INVALID_AUTHOR_NAMES.add("(:unkn) unknown");
+		INVALID_AUTHOR_NAMES.add(":none");
+		INVALID_AUTHOR_NAMES.add(":null");
+		INVALID_AUTHOR_NAMES.add(":unas");
+		INVALID_AUTHOR_NAMES.add(":unav");
+		INVALID_AUTHOR_NAMES.add(":unkn");
+		INVALID_AUTHOR_NAMES.add("[autor desconocido]");
+		INVALID_AUTHOR_NAMES.add("[s. n.]");
+		INVALID_AUTHOR_NAMES.add("[s.n]");
+		INVALID_AUTHOR_NAMES.add("[unknown]");
+		INVALID_AUTHOR_NAMES.add("anonymous");
+		INVALID_AUTHOR_NAMES.add("n.n.");
+		INVALID_AUTHOR_NAMES.add("nn");
+		INVALID_AUTHOR_NAMES.add("no name supplied");
+		INVALID_AUTHOR_NAMES.add("none");
+		INVALID_AUTHOR_NAMES.add("none available");
+		INVALID_AUTHOR_NAMES.add("not available not available");
+		INVALID_AUTHOR_NAMES.add("null &na;");
+		INVALID_AUTHOR_NAMES.add("null anonymous");
+		INVALID_AUTHOR_NAMES.add("unbekannt");
+		INVALID_AUTHOR_NAMES.add("unknown");
+		INVALID_AUTHOR_NAMES.add("autor, sin");
+		INVALID_AUTHOR_NAMES.add("desconocido / inconnu,");
+
+		INVALID_URL_HOSTS.add("creativecommons.org");
+		INVALID_URL_HOSTS.add("www.academia.edu");
+		INVALID_URL_HOSTS.add("academia.edu");
+		INVALID_URL_HOSTS.add("researchgate.net");
+		INVALID_URL_HOSTS.add("www.researchgate.net");
+
+		INVALID_URLS.add("http://repo.scoap3.org/api");
+		INVALID_URLS.add("http://ora.ox.ac.uk/objects/uuid:");
+		INVALID_URLS.add("http://ntur.lib.ntu.edu.tw/news/agent_contract.pdf");
+		INVALID_URLS.add("https://media.springer.com/full/springer-instructions-for-authors-assets/pdf/SN_BPF_EN.pdf");
+		INVALID_URLS.add("http://www.tobaccoinduceddiseases.org/dl/61aad426c96519bea4040a374c6a6110/");
+		INVALID_URLS.add("https://www.bilboard.nl/verenigingsbladen/bestuurskundige-berichten");
 	}

 	public static <T extends Oaf> T cleanContext(T value, String contextId, String verifyParam) {
@ -60,7 +119,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 							.getContext()
 							.stream()
 							.filter(c -> !StringUtils.startsWith(c.getId().toLowerCase(), contextId))
-							.collect(Collectors.toList()));
+							.collect(Collectors.toCollection(ArrayList::new)));
 			}
 			return (T) res;
 		} else {
@ -255,7 +314,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 		}

 		if (value instanceof Datasource) {
-			// nothing to evaluate here
+			final Datasource d = (Datasource) value;
+			return Objects.nonNull(d.getOfficialname()) && StringUtils.isNotBlank(d.getOfficialname().getValue());
 		} else if (value instanceof Project) {
 			final Project p = (Project) value;
 			return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
@ -357,6 +417,14 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 									.getPublisher()
 									.getValue()
 									.replaceAll(NAME_CLEANING_REGEX, " "));
+
+						if (vocs.vocabularyExists(DNET_PUBLISHERS)) {
+							vocs
+								.find(DNET_PUBLISHERS)
+								.map(voc -> voc.getTermBySynonym(r.getPublisher().getValue()))
+								.map(VocabularyTerm::getName)
+								.ifPresent(publisher -> r.getPublisher().setValue(publisher));
+						}
 					}
 				}
 				if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) {
@ -440,6 +508,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 								.filter(Objects::nonNull)
 								.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
 								.map(GraphCleaningFunctions::cleanValue)
+								.sorted((s1, s2) -> s2.getValue().length() - s1.getValue().length())
+								.limit(ModelHardLimits.MAX_ABSTRACTS)
 								.collect(Collectors.toList()));
 				}
 				if (Objects.isNull(r.getResourcetype()) || StringUtils.isBlank(r.getResourcetype().getClassid())) {
@ -517,6 +587,14 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 							i.setRefereed(qualifier("0000", "Unknown", ModelConstants.DNET_REVIEW_LEVELS));
 						}

+						if (Objects.nonNull(i.getLicense()) && Objects.nonNull(i.getLicense().getValue())) {
+							vocs
+								.find(DNET_LICENSES)
+								.map(voc -> voc.getTermBySynonym(i.getLicense().getValue()))
+								.map(VocabularyTerm::getId)
+								.ifPresent(license -> i.getLicense().setValue(license));
+						}
+
 						// from the script from Dimitris
 						if ("0000".equals(i.getRefereed().getClassid())) {
 							final boolean isFromCrossref = Optional
@ -558,6 +636,15 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 								ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
 							i.setFulltext(null);
 						}
+						if (Objects.nonNull(i.getUrl())) {
+							i
+								.setUrl(
+									i
+										.getUrl()
+										.stream()
+										.filter(GraphCleaningFunctions::urlFilter)
+										.collect(Collectors.toList()));
+						}
 					}
 				}
 				if (Objects.isNull(r.getBestaccessright())
@ -580,8 +667,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 								.getAuthor()
 								.stream()
 								.filter(Objects::nonNull)
-								.filter(a -> StringUtils.isNotBlank(a.getFullname()))
-								.filter(a -> StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")))
+								.filter(GraphCleaningFunctions::isValidAuthorName)
 								.map(GraphCleaningFunctions::cleanupAuthor)
 								.collect(Collectors.toList()));

@ -608,6 +694,9 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 										.filter(Objects::nonNull)
 										.filter(p -> Objects.nonNull(p.getQualifier()))
 										.filter(p -> StringUtils.isNotBlank(p.getValue()))
+										.filter(
+											p -> StringUtils
+												.contains(StringUtils.lowerCase(p.getQualifier().getClassid()), ORCID))
 										.map(p -> {
 											// hack to distinguish orcid from orcid_pending
 											String pidProvenance = getProvenance(p.getDataInfo());
@ -617,7 +706,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 												.toLowerCase()
 												.contains(ModelConstants.ORCID)) {
 												if (pidProvenance
-													.equals(ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY)) {
+													.equals(ModelConstants.SYSIMPORT_CROSSWALK_ENTITYREGISTRY) ||
+													pidProvenance.equals("ORCID_ENRICHMENT")) {
 													p.getQualifier().setClassid(ModelConstants.ORCID);
 												} else {
 													p.getQualifier().setClassid(ModelConstants.ORCID_PENDING);
@ -739,12 +829,30 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 	// HELPERS

 	private static boolean isValidAuthorName(Author a) {
-		return !Stream
-			.of(a.getFullname(), a.getName(), a.getSurname())
-			.filter(s -> s != null && !s.isEmpty())
-			.collect(Collectors.joining(""))
-			.toLowerCase()
-			.matches(INVALID_AUTHOR_REGEX);
+		return StringUtils.isNotBlank(a.getFullname()) &&
+			StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")) &&
+			!INVALID_AUTHOR_NAMES.contains(StringUtils.lowerCase(a.getFullname()).trim()) &&
+			!Stream
+				.of(a.getFullname(), a.getName(), a.getSurname())
+				.filter(StringUtils::isNotBlank)
+				.collect(Collectors.joining(""))
+				.toLowerCase()
+				.matches(INVALID_AUTHOR_REGEX);
+	}
+
+	private static boolean urlFilter(String u) {
+		try {
+			final URL url = new URL(u);
+			if (StringUtils.isBlank(url.getPath()) || "/".equals(url.getPath())) {
+				return false;
+			}
+			if (INVALID_URL_HOSTS.contains(url.getHost())) {
+				return false;
+			}
+			return !INVALID_URLS.contains(url.toString());
+		} catch (MalformedURLException ex) {
+			return false;
+		}
 	}

 	private static List<StructuredProperty> processPidCleaning(List<StructuredProperty> pids) {
@ -794,4 +902,142 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 		return s;
 	}

+	public static OafEntity applyCoarVocabularies(OafEntity entity, VocabularyGroup vocs) {
+
+		if (entity instanceof Result) {
+			final Result result = (Result) entity;
+
+			Optional
+				.ofNullable(result.getInstance())
+				.ifPresent(
+					instances -> instances
+						.forEach(
+							instance -> {
+								if (Objects.isNull(instance.getInstanceTypeMapping())) {
+									List<InstanceTypeMapping> mapping = Lists.newArrayList();
+									mapping
+										.add(
+											OafMapperUtils
+												.instanceTypeMapping(
+													instance.getInstancetype().getClassname(),
+													OPENAIRE_COAR_RESOURCE_TYPES_3_1));
+									instance.setInstanceTypeMapping(mapping);
+								}
+								Optional<InstanceTypeMapping> optionalItm = instance
+									.getInstanceTypeMapping()
+									.stream()
+									.filter(GraphCleaningFunctions::originalResourceType)
+									.findFirst();
+								if (optionalItm.isPresent()) {
+									InstanceTypeMapping coarItm = optionalItm.get();
+									Optional
+										.ofNullable(
+											vocs
+												.lookupTermBySynonym(
+													OPENAIRE_COAR_RESOURCE_TYPES_3_1, coarItm.getOriginalType()))
+										.ifPresent(type -> {
+											coarItm.setTypeCode(type.getClassid());
+											coarItm.setTypeLabel(type.getClassname());
+										});
+									final List<InstanceTypeMapping> mappings = Lists.newArrayList();
+									if (vocs.vocabularyExists(OPENAIRE_USER_RESOURCE_TYPES)) {
+										Optional
+											.ofNullable(
+												vocs
+													.lookupTermBySynonym(
+														OPENAIRE_USER_RESOURCE_TYPES, coarItm.getTypeCode()))
+											.ifPresent(
+												type -> mappings
+													.add(
+														OafMapperUtils
+															.instanceTypeMapping(coarItm.getTypeCode(), type)));
+									}
+									if (!mappings.isEmpty()) {
+										instance.getInstanceTypeMapping().addAll(mappings);
+									}
+								}
+							}));
+			result.setMetaResourceType(getMetaResourceType(result.getInstance(), vocs));
+		}
+
+		return entity;
+	}
+
+	private static boolean originalResourceType(InstanceTypeMapping itm) {
+		return StringUtils.isNotBlank(itm.getOriginalType()) &&
+			OPENAIRE_COAR_RESOURCE_TYPES_3_1.equals(itm.getVocabularyName()) &&
+			StringUtils.isBlank(itm.getTypeCode()) &&
+			StringUtils.isBlank(itm.getTypeLabel());
+	}
+
+	private static Qualifier getMetaResourceType(final List<Instance> instances, final VocabularyGroup vocs) {
+		return Optional
+			.ofNullable(instances)
+			.map(ii -> {
+				if (vocs.vocabularyExists(OPENAIRE_META_RESOURCE_TYPE)) {
+					Optional<InstanceTypeMapping> itm = ii
+						.stream()
+						.filter(Objects::nonNull)
+						.flatMap(
+							i -> Optional
+								.ofNullable(i.getInstanceTypeMapping())
+								.map(Collection::stream)
+								.orElse(Stream.empty()))
+						.filter(t -> OPENAIRE_COAR_RESOURCE_TYPES_3_1.equals(t.getVocabularyName()))
+						.findFirst();
+
+					if (!itm.isPresent() || Objects.isNull(itm.get().getTypeCode())) {
+						return null;
+					} else {
+						final String typeCode = itm.get().getTypeCode();
+						return Optional
+							.ofNullable(vocs.lookupTermBySynonym(OPENAIRE_META_RESOURCE_TYPE, typeCode))
+							.orElseThrow(
+								() -> new IllegalStateException("unable to find a synonym for '" + typeCode + "' in " +
+									OPENAIRE_META_RESOURCE_TYPE));
+					}
+				} else {
+					throw new IllegalStateException("vocabulary '" + OPENAIRE_META_RESOURCE_TYPE + "' not available");
+				}
+			})
+			.orElse(null);
+	}
+
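+	/**
+	 * Applies datasource-specific workarounds that should be removed as soon as possible.
+	 *
+	 * @param value the Oaf object to patch in place
+	 * @param <T> the actual Oaf subtype
+	 * @return the same object, possibly modified
+	 */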
+	public static <T extends Oaf> T dedicatedUglyHacks(T value) {
+		if (value instanceof OafEntity) {
+			if (value instanceof Result) {
+				final Result r = (Result) value;
+
+				// Fix for AMS Acta
+				Optional
+					.ofNullable(r.getInstance())
+					.map(
+						instance -> instance
+							.stream()
+							.filter(
+								i -> Optional
+									.ofNullable(i.getHostedby())
+									.map(KeyValue::getKey)
+									.map(dsId -> dsId.equals("10|re3data_____::4cc76bed7ce2fb95fd8e7a2dfde16016"))
+									.orElse(false)))
+					.ifPresent(instance -> instance.forEach(i -> {
+						if (Optional
+							.ofNullable(i.getPid())
+							.map(pid -> pid.stream().noneMatch(p -> p.getValue().startsWith("10.6092/unibo/amsacta")))
+							.orElse(false)) {
+							i.setHostedby(UNKNOWN_REPOSITORY);
+						}
+					}));
+			}
+		}
+		return value;
+	}
+
 }
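
Review note on `urlFilter`: a minimal sketch of the three rejection rules introduced in this file (malformed URL, empty or root path, blocked host). The class name, the method name `keep` and the shortened host list are illustrative only; the exact-match check against `INVALID_URLS` is left out for brevity.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class UrlFilterSketch {

	// Illustrative subset of the hosts listed in the patch.
	static final Set<String> BLOCKED_HOSTS = new HashSet<>(
		Arrays.asList("researchgate.net", "www.researchgate.net"));

	static boolean keep(String u) {
		try {
			final URL url = new URL(u);
			final String path = url.getPath();
			// a URL pointing at the bare host carries no record-level information
			if (path.isEmpty() || "/".equals(path)) {
				return false;
			}
			return !BLOCKED_HOSTS.contains(url.getHost());
		} catch (MalformedURLException ex) {
			// anything that does not parse as a URL is dropped
			return false;
		}
	}

	public static void main(String[] args) {
		System.out.println(keep("https://example.org/record/1")); // true
		System.out.println(keep("https://example.org/")); // false
		System.out.println(keep("https://www.researchgate.net/publication/1")); // false
	}
}
```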
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/IdentifierFactory.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/IdentifierFactory.java
@ -0,0 +1,294 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import static com.google.common.base.Preconditions.checkArgument;
+import static eu.dnetlib.dhp.schema.common.ModelConstants.*;
+
+import java.io.Serializable;
+import java.nio.charset.StandardCharsets;
+import java.security.MessageDigest;
+import java.util.*;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import org.apache.commons.codec.binary.Hex;
+import org.apache.commons.lang3.StringUtils;
+
+import com.google.common.collect.HashBiMap;
+import com.google.common.collect.Maps;
+
+import eu.dnetlib.dhp.schema.common.ModelSupport;
+import eu.dnetlib.dhp.schema.oaf.*;
+
+/**
+ * Factory class for OpenAIRE identifiers in the Graph
+ */
+public class IdentifierFactory implements Serializable {
+
+	public static final String ID_SEPARATOR = "::";
+	public static final String ID_PREFIX_SEPARATOR = "|";
+
+	public static final int ID_PREFIX_LEN = 12;
+
+	/**
+	 * Declares the associations PID_TYPE -> [DATASOURCE ID, NAME] considered authoritative for that PID_TYPE.
+	 * The id of the record (source_::id) will be rewritten as (pidType_::id).
+	 */
+	public static final Map<PidType, HashBiMap<String, String>> PID_AUTHORITY = Maps.newHashMap();
+
+	static {
+		PID_AUTHORITY.put(PidType.doi, HashBiMap.create());
+		PID_AUTHORITY.get(PidType.doi).put(CROSSREF_ID, "Crossref");
+		PID_AUTHORITY.get(PidType.doi).put(DATACITE_ID, "Datacite");
+		PID_AUTHORITY.get(PidType.doi).put(ZENODO_OD_ID, "ZENODO");
+		PID_AUTHORITY.get(PidType.doi).put(ZENODO_R3_ID, "Zenodo");
+
+		PID_AUTHORITY.put(PidType.pmc, HashBiMap.create());
+		PID_AUTHORITY.get(PidType.pmc).put(EUROPE_PUBMED_CENTRAL_ID, "Europe PubMed Central");
+		PID_AUTHORITY.get(PidType.pmc).put(PUBMED_CENTRAL_ID, "PubMed Central");
+
+		PID_AUTHORITY.put(PidType.pmid, HashBiMap.create());
+		PID_AUTHORITY.get(PidType.pmid).put(EUROPE_PUBMED_CENTRAL_ID, "Europe PubMed Central");
+		PID_AUTHORITY.get(PidType.pmid).put(PUBMED_CENTRAL_ID, "PubMed Central");
+
+		PID_AUTHORITY.put(PidType.arXiv, HashBiMap.create());
+		PID_AUTHORITY.get(PidType.arXiv).put(ARXIV_ID, "arXiv.org e-Print Archive");
+
+		PID_AUTHORITY.put(PidType.w3id, HashBiMap.create());
+		PID_AUTHORITY.get(PidType.w3id).put(ROHUB_ID, "ROHub");
+	}
+
+	/**
+	 * Declares the associations PID_TYPE -> [DATASOURCE ID, PID SUBSTRING] considered as delegated authority for that
+	 * PID_TYPE. For example, Zenodo is delegated to mint DOIs that contain the word 'zenodo'.
+	 *
+	 * If a record with the same id (same pid) comes from 2 data sources, the one coming from a delegated source wins. E.g. Zenodo records win over those from Datacite.
+	 * See also https://code-repo.d4science.org/D-Net/dnet-hadoop/pulls/187 and the class dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/OafMapperUtils.java
+	 */
+	public static final Map<PidType, Map<String, String>> DELEGATED_PID_AUTHORITY = Maps.newHashMap();
+
+	static {
+		DELEGATED_PID_AUTHORITY.put(PidType.doi, new HashMap<>());
+		DELEGATED_PID_AUTHORITY.get(PidType.doi).put(ZENODO_OD_ID, "zenodo");
+		DELEGATED_PID_AUTHORITY.get(PidType.doi).put(ZENODO_R3_ID, "zenodo");
+		DELEGATED_PID_AUTHORITY.put(PidType.w3id, new HashMap<>());
+		DELEGATED_PID_AUTHORITY.get(PidType.w3id).put(ROHUB_ID, "ro-id");
+	}
+
+	/**
+	 * Declares the associations PID_TYPE -> [DATASOURCE ID, NAME] whose records are considered enrichment for the graph.
+	 * Their OpenAIRE ID is built from the declared PID type. They are merged with the corresponding record,
+	 * identified by the same OpenAIRE id.
+	 */
+	public static final Map<PidType, HashBiMap<String, String>> ENRICHMENT_PROVIDER = Maps.newHashMap();
+
+	static {
+		ENRICHMENT_PROVIDER.put(PidType.doi, HashBiMap.create());
+		ENRICHMENT_PROVIDER.get(PidType.doi).put(OPEN_APC_ID, OPEN_APC_NAME);
+	}
+
+	public static Set<String> delegatedAuthorityDatasourceIds() {
+		return DELEGATED_PID_AUTHORITY
+			.values()
+			.stream()
+			.flatMap(m -> m.keySet().stream())
+			.collect(Collectors.toCollection(HashSet::new));
+	}
+
+	public static List<StructuredProperty> getPids(List<StructuredProperty> pid, KeyValue collectedFrom) {
+		return pidFromInstance(pid, collectedFrom, true).distinct().collect(Collectors.toList());
+	}
+
+	public static <T extends Result> String createDOIBoostIdentifier(T entity) {
+		if (entity == null)
+			return null;
+
+		StructuredProperty pid = null;
+		if (entity.getPid() != null) {
+			pid = entity
+				.getPid()
+				.stream()
+				.filter(Objects::nonNull)
+				.filter(s -> s.getQualifier() != null && "doi".equalsIgnoreCase(s.getQualifier().getClassid()))
+				.filter(CleaningFunctions::pidFilter)
+				.findAny()
+				.orElse(null);
+		} else {
+			if (entity.getInstance() != null) {
+				pid = entity
+					.getInstance()
+					.stream()
+					.filter(i -> i.getPid() != null)
+					.flatMap(i -> i.getPid().stream())
+					.filter(CleaningFunctions::pidFilter)
+					.findAny()
+					.orElse(null);
+			}
+		}
+		if (pid != null)
+			return idFromPid(entity, pid, true);
+		return null;
+	}
+
+	/**
+	 * Creates an identifier from the most relevant PID (if available) provided by a known PID authority in the given
+	 * entity T. Returns entity.id when none of the PIDs meets the selection criteria.
+	 *
+	 * @param entity the entity providing PIDs and a default ID.
+	 * @param <T> the specific entity type. Currently Organization and Result subclasses are supported.
+	 * @param md5 indicates whether the PID value should be hashed.
+	 * @return an identifier from the most relevant PID, entity.id otherwise
+	 */
+	public static <T extends OafEntity> String createIdentifier(T entity, boolean md5) {
+
+		checkArgument(StringUtils.isNoneBlank(entity.getId()), "missing entity identifier");
+
+		final Map<String, Set<StructuredProperty>> pids = extractPids(entity);
+
+		return pids
+			.values()
+			.stream()
+			.flatMap(Set::stream)
+			.min(new PidComparator<>(entity))
+			.map(
+				min -> Optional
+					.ofNullable(pids.get(min.getQualifier().getClassid()))
+					.map(
+						p -> p
+							.stream()
+							.sorted(new PidValueComparator())
+							.findFirst()
+							.map(s -> idFromPid(entity, s, md5))
+							.orElseGet(entity::getId))
+					.orElseGet(entity::getId))
+			.orElseGet(entity::getId);
+	}
+
+	private static <T extends OafEntity> Map<String, Set<StructuredProperty>> extractPids(T entity) {
+		if (entity instanceof Result) {
+			return Optional
+				.ofNullable(((Result) entity).getInstance())
+				.map(IdentifierFactory::mapPids)
+				.orElse(new HashMap<>());
+		} else {
+			return entity
+				.getPid()
+				.stream()
+				.map(CleaningFunctions::normalizePidValue)
+				.filter(CleaningFunctions::pidFilter)
+				.collect(
+					Collectors
+						.groupingBy(
+							p -> p.getQualifier().getClassid(),
+							Collectors.mapping(p -> p, Collectors.toCollection(HashSet::new))));
+		}
+	}
+
+	private static Map<String, Set<StructuredProperty>> mapPids(List<Instance> instance) {
+		return instance
+			.stream()
+			.map(i -> pidFromInstance(i.getPid(), i.getCollectedfrom(), false))
+			.flatMap(Function.identity())
+			.collect(
+				Collectors
+					.groupingBy(
+						p -> p.getQualifier().getClassid(),
+						Collectors.mapping(p -> p, Collectors.toCollection(HashSet::new))));
+	}
+
+	private static Stream<StructuredProperty> pidFromInstance(List<StructuredProperty> pid, KeyValue collectedFrom,
+		boolean mapHandles) {
+		return Optional
+			.ofNullable(pid)
+			.map(
+				pp -> pp
+					.stream()
+					// filter away PIDs provided by a DS that is not considered an authority for the
+					// given PID Type
+					.filter(p -> shouldFilterPidByCriteria(collectedFrom, p, mapHandles))
+					.map(CleaningFunctions::normalizePidValue)
+					.filter(p -> isNotFromDelegatedAuthority(collectedFrom, p))
+					.filter(CleaningFunctions::pidFilter))
+			.orElse(Stream.empty());
+	}
+
+	private static boolean shouldFilterPidByCriteria(KeyValue collectedFrom, StructuredProperty p, boolean mapHandles) {
+		final PidType pType = PidType.tryValueOf(p.getQualifier().getClassid());
+
+		if (Objects.isNull(collectedFrom)) {
+			return false;
+		}
+
+		boolean isEnrich = Optional
+			.ofNullable(ENRICHMENT_PROVIDER.get(pType))
+			.map(
+				enrich -> enrich.containsKey(collectedFrom.getKey())
+					|| enrich.containsValue(collectedFrom.getValue()))
+			.orElse(false);
+
+		boolean isAuthority = Optional
+			.ofNullable(PID_AUTHORITY.get(pType))
+			.map(
+				authorities -> authorities.containsKey(collectedFrom.getKey())
+					|| authorities.containsValue(collectedFrom.getValue()))
+			.orElse(false);
+
+		return (mapHandles && pType.equals(PidType.handle)) || isEnrich || isAuthority;
+	}
+
+	private static boolean isNotFromDelegatedAuthority(KeyValue collectedFrom, StructuredProperty p) {
+		final PidType pType = PidType.tryValueOf(p.getQualifier().getClassid());
+
+		final Map<String, String> da = DELEGATED_PID_AUTHORITY.get(pType);
+		if (Objects.isNull(da)) {
+			return true;
+		}
+		if (!da.containsKey(collectedFrom.getKey())) {
+			return true;
+		}
+		return StringUtils.contains(p.getValue(), da.get(collectedFrom.getKey()));
+	}
+
+	/**
+	 * @see IdentifierFactory#createIdentifier(OafEntity, boolean)
+	 */
+	public static <T extends OafEntity> String createIdentifier(T entity) {
+
+		return createIdentifier(entity, true);
+	}
+
+	private static <T extends OafEntity> String idFromPid(T entity, StructuredProperty s, boolean md5) {
+		return idFromPid(ModelSupport.getIdPrefix(entity.getClass()), s.getQualifier().getClassid(), s.getValue(), md5);
+	}
+
+	public static String idFromPid(String numericPrefix, String pidType, String pidValue, boolean md5) {
+		return new StringBuilder()
+			.append(numericPrefix)
+			.append(ID_PREFIX_SEPARATOR)
+			.append(createPrefix(pidType))
+			.append(ID_SEPARATOR)
+			.append(md5 ? md5(pidValue) : pidValue)
+			.toString();
+	}
+
+	// create the prefix (length = 12)
+	private static String createPrefix(String pidType) {
+		StringBuilder prefix = new StringBuilder(StringUtils.left(pidType, ID_PREFIX_LEN));
+		while (prefix.length() < ID_PREFIX_LEN) {
+			prefix.append("_");
+		}
+		return prefix.substring(0, ID_PREFIX_LEN);
+	}
+
+	public static String md5(final String s) {
+		try {
+			final MessageDigest md = MessageDigest.getInstance("MD5");
+			md.update(s.getBytes(StandardCharsets.UTF_8));
+			return new String(Hex.encodeHex(md.digest()));
+		} catch (final Exception e) {
+			return null;
+		}
+	}
+
+}
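
Review note on the identifier layout: the sketch below reproduces, with the JDK only, how `idFromPid` and `createPrefix` assemble an id as `<numeric prefix>|<pid type padded to 12 chars>::<value or md5>`. The class name is hypothetical, and the numeric prefix `50` is just an example value, since `ModelSupport.getIdPrefix` is not part of this diff.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class IdLayoutSketch {

	static final int ID_PREFIX_LEN = 12;

	// Truncates or right-pads the PID type with '_' to exactly 12 characters.
	static String createPrefix(String pidType) {
		final StringBuilder prefix = new StringBuilder(
			pidType.length() > ID_PREFIX_LEN ? pidType.substring(0, ID_PREFIX_LEN) : pidType);
		while (prefix.length() < ID_PREFIX_LEN) {
			prefix.append('_');
		}
		return prefix.toString();
	}

	// Hex-encoded MD5 of the UTF-8 bytes; fails loudly instead of returning null.
	static String md5(String s) {
		try {
			final byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
			final StringBuilder hex = new StringBuilder();
			for (byte b : digest) {
				hex.append(String.format("%02x", b));
			}
			return hex.toString();
		} catch (NoSuchAlgorithmException e) {
			throw new IllegalStateException("MD5 not available", e);
		}
	}

	static String idFromPid(String numericPrefix, String pidType, String pidValue, boolean hash) {
		return numericPrefix + "|" + createPrefix(pidType) + "::" + (hash ? md5(pidValue) : pidValue);
	}

	public static void main(String[] args) {
		// unhashed form, e.g. 50|doi_________::10.1234/abc
		System.out.println(idFromPid("50", "doi", "10.1234/abc", false));
		// hashed form: same prefix followed by 32 hex characters
		System.out.println(idFromPid("50", "doi", "10.1234/abc", true));
	}
}
```

Unlike the patched `md5`, which returns `null` on any exception, the sketch throws; that is a deliberate difference for illustration, not a description of the committed behaviour.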
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/MergeComparator.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/MergeComparator.java
@ -0,0 +1,78 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import java.util.Comparator;
+import java.util.HashSet;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+// 
+// Source code recreated from a .class file by IntelliJ IDEA
+// (powered by FernFlower decompiler)
+//
+import eu.dnetlib.dhp.schema.common.EntityType;
+import eu.dnetlib.dhp.schema.oaf.KeyValue;
+import eu.dnetlib.dhp.schema.oaf.Oaf;
+import eu.dnetlib.dhp.schema.oaf.OafEntity;
+import eu.dnetlib.dhp.schema.oaf.Result;
+
+public class MergeComparator implements Comparator<Oaf> {
+	public MergeComparator() {
+	}
+
+	public int compare(Oaf left, Oaf right) {
+		// nulls compare as smaller, so they come first in ascending order
+		if (left == null && right == null) {
+			return 0;
+		} else if (left == null) {
+			return -1;
+		} else if (right == null) {
+			return 1;
+		}
+
+		// invisible
+		if (left.getDataInfo() != null && Boolean.TRUE.equals(left.getDataInfo().getInvisible())) {
+			if (right.getDataInfo() != null && Boolean.FALSE.equals(right.getDataInfo().getInvisible())) {
+				return -1;
+			}
+		}
+
+		// collectedfrom
+		HashSet<String> lCf = getCollectedFromIds(left);
+		HashSet<String> rCf = getCollectedFromIds(right);
+		if (lCf.contains("10|openaire____::081b82f96300b6a6e3d282bad31cb6e2")
+			&& !rCf.contains("10|openaire____::081b82f96300b6a6e3d282bad31cb6e2")) {
+			return -1;
+		} else if (!lCf.contains("10|openaire____::081b82f96300b6a6e3d282bad31cb6e2")
+			&& rCf.contains("10|openaire____::081b82f96300b6a6e3d282bad31cb6e2")) {
+			return 1;
+		}
+
+		SubEntityType lClass = SubEntityType.fromClass(left.getClass());
+		SubEntityType rClass = SubEntityType.fromClass(right.getClass());
+		return lClass.ordinal() - rClass.ordinal();
+
+	}
+
+	protected HashSet<String> getCollectedFromIds(Oaf left) {
+		return (HashSet) Optional.ofNullable(left.getCollectedfrom()).map((cf) -> {
+			return (HashSet) cf.stream().map(KeyValue::getKey).collect(Collectors.toCollection(HashSet::new));
+		}).orElse(new HashSet());
+	}
+
+	enum SubEntityType {
+		publication, dataset, software, otherresearchproduct, datasource, organization, project;
+
+		/**
+		 * Resolves the SubEntityType from the simple name of the given class.
+		 *
+		 * @param clazz the class to resolve
+		 * @param <T> actual Oaf subclass
+		 * @return the SubEntityType associated to the given class
+		 */
+		public static <T extends Oaf> SubEntityType fromClass(Class<T> clazz) {
+			return valueOf(clazz.getSimpleName().toLowerCase());
+		}
+	}
+
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/MergeEntitiesComparator.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/MergeEntitiesComparator.java
@ -0,0 +1,106 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import java.util.*;
+
+import eu.dnetlib.dhp.schema.common.ModelConstants;
+import eu.dnetlib.dhp.schema.oaf.Oaf;
+import eu.dnetlib.dhp.schema.oaf.OafEntity;
+import eu.dnetlib.dhp.schema.oaf.Result;
+
+public class MergeEntitiesComparator implements Comparator<Oaf> {
+	static final List<String> PID_AUTHORITIES = Arrays
+		.asList(
+			ModelConstants.ARXIV_ID,
+			ModelConstants.PUBMED_CENTRAL_ID,
+			ModelConstants.EUROPE_PUBMED_CENTRAL_ID,
+			ModelConstants.DATACITE_ID,
+			ModelConstants.CROSSREF_ID);
+
+	static final List<String> RESULT_TYPES = Arrays
+		.asList(
+			ModelConstants.ORP_RESULTTYPE_CLASSID,
+			ModelConstants.SOFTWARE_RESULTTYPE_CLASSID,
+			ModelConstants.DATASET_RESULTTYPE_CLASSID,
+			ModelConstants.PUBLICATION_RESULTTYPE_CLASSID);
+
+	public static final Comparator<Oaf> INSTANCE = new MergeEntitiesComparator();
+
+	@Override
+	public int compare(Oaf left, Oaf right) {
+		if (left == null && right == null)
+			return 0;
+		if (left == null)
+			return -1;
+		if (right == null)
+			return 1;
+
+		int res = 0;
+
+		// pid authority
+		int cfp1 = Optional
+			.ofNullable(left.getCollectedfrom())
+			.map(
+				cf -> cf
+					.stream()
+					.map(kv -> PID_AUTHORITIES.indexOf(kv.getKey()))
+					.max(Integer::compare)
+					.orElse(-1))
+			.orElse(-1);
+		int cfp2 = Optional
+			.ofNullable(right.getCollectedfrom())
+			.map(
+				cf -> cf
+					.stream()
+					.map(kv -> PID_AUTHORITIES.indexOf(kv.getKey()))
+					.max(Integer::compare)
+					.orElse(-1))
+			.orElse(-1);
+
+		if (cfp1 >= 0 && cfp1 > cfp2) {
+			return 1;
+		} else if (cfp2 >= 0 && cfp2 > cfp1) {
+			return -1;
+		}
+
+		// trust
+		if (left.getDataInfo() != null && right.getDataInfo() != null) {
+			res = left.getDataInfo().getTrust().compareTo(right.getDataInfo().getTrust());
+		}
+
+		// result type
+		if (res == 0) {
+			if (left instanceof Result && right instanceof Result) {
+				Result r1 = (Result) left;
+				Result r2 = (Result) right;
+
+				final String c1 = r1.getResulttype() == null ? null : r1.getResulttype().getClassid();
+				final String c2 = r2.getResulttype() == null ? null : r2.getResulttype().getClassid();
+				// a missing result type sorts first; if both are missing, fall through to the id comparison
+				if (c1 == null && c2 != null) {
+					return -1;
+				} else if (c1 != null && c2 == null) {
+					return 1;
+				} else if (c1 != null) {
+					int rt1 = RESULT_TYPES.indexOf(c1);
+					int rt2 = RESULT_TYPES.indexOf(c2);
+					if (rt1 >= 0 && rt1 > rt2) {
+						return 1;
+					} else if (rt2 >= 0 && rt2 > rt1) {
+						return -1;
+					}
+				}
+			}
+		}
+
+		// id
+		if (res == 0) {
+			if (left instanceof OafEntity && right instanceof OafEntity) {
+				res = ((OafEntity) left).getId().compareTo(((OafEntity) right).getId());
+			}
+		}
+
+		return res;
+	}
+
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/MergeUtils.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/MergeUtils.java
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/ModelHardLimits.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/ModelHardLimits.java
@ -0,0 +1,27 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+public class ModelHardLimits {
+
+	private ModelHardLimits() {
+	}
+
+	public static final String LAYOUT = "index";
+	public static final String INTERPRETATION = "openaire";
+	public static final String SEPARATOR = "-";
+
+	public static final int MAX_EXTERNAL_ENTITIES = 50;
+	public static final int MAX_AUTHORS = 200;
+	public static final int MAX_AUTHOR_FULLNAME_LENGTH = 1000;
+	public static final int MAX_TITLE_LENGTH = 5000;
+	public static final int MAX_TITLES = 10;
+	public static final int MAX_ABSTRACTS = 10;
+	public static final int MAX_ABSTRACT_LENGTH = 150000;
+	public static final int MAX_RELATED_ABSTRACT_LENGTH = 500;
+	public static final int MAX_INSTANCES = 10;
+
+	public static String getCollectionName(String format) {
+		return format + SEPARATOR + LAYOUT + SEPARATOR + INTERPRETATION;
+	}
+
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/OafMapperUtils.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/OafMapperUtils.java
@ -14,8 +14,6 @@ import java.util.stream.Collectors;
 import org.apache.commons.lang3.StringUtils;

 import eu.dnetlib.dhp.schema.common.AccessRightComparator;
-import eu.dnetlib.dhp.schema.common.ModelConstants;
-import eu.dnetlib.dhp.schema.common.ModelSupport;
 import eu.dnetlib.dhp.schema.oaf.*;

 public class OafMapperUtils {
@ -23,65 +21,6 @@ public class OafMapperUtils {
 	private OafMapperUtils() {
 	}

-	public static Oaf merge(final Oaf left, final Oaf right) {
-		if (ModelSupport.isSubClass(left, OafEntity.class)) {
-			return mergeEntities((OafEntity) left, (OafEntity) right);
-		} else if (ModelSupport.isSubClass(left, Relation.class)) {
-			((Relation) left).mergeFrom((Relation) right);
-		} else {
-			throw new IllegalArgumentException("invalid Oaf type:" + left.getClass().getCanonicalName());
-		}
-		return left;
-	}
-
-	public static OafEntity mergeEntities(OafEntity left, OafEntity right) {
-		if (ModelSupport.isSubClass(left, Result.class)) {
-			return mergeResults((Result) left, (Result) right);
-		} else if (ModelSupport.isSubClass(left, Datasource.class)) {
-			left.mergeFrom(right);
-		} else if (ModelSupport.isSubClass(left, Organization.class)) {
-			left.mergeFrom(right);
-		} else if (ModelSupport.isSubClass(left, Project.class)) {
-			left.mergeFrom(right);
-		} else {
-			throw new IllegalArgumentException("invalid OafEntity subtype:" + left.getClass().getCanonicalName());
-		}
-		return left;
-	}
-
-	public static Result mergeResults(Result left, Result right) {
-
-		final boolean leftFromDelegatedAuthority = isFromDelegatedAuthority(left);
-		final boolean rightFromDelegatedAuthority = isFromDelegatedAuthority(right);
-
-		if (leftFromDelegatedAuthority && !rightFromDelegatedAuthority) {
-			return left;
-		}
-		if (!leftFromDelegatedAuthority && rightFromDelegatedAuthority) {
-			return right;
-		}
-
-		if (new ResultTypeComparator().compare(left, right) < 0) {
-			left.mergeFrom(right);
-			return left;
-		} else {
-			right.mergeFrom(left);
-			return right;
-		}
-	}
-
-	private static boolean isFromDelegatedAuthority(Result r) {
-		return Optional
-			.ofNullable(r.getInstance())
-			.map(
-				instance -> instance
-					.stream()
-					.filter(i -> Objects.nonNull(i.getCollectedfrom()))
-					.map(i -> i.getCollectedfrom().getKey())
-					.anyMatch(cfId -> IdentifierFactory.delegatedAuthorityDatasourceIds().contains(cfId)))
-			.orElse(false);
-	}
-
 	public static KeyValue keyValue(final String k, final String v) {
 		final KeyValue kv = new KeyValue();
 		kv.setKey(k);
@ -141,6 +80,28 @@ public class OafMapperUtils {
 			.collect(Collectors.toList());
 	}

+	public static InstanceTypeMapping instanceTypeMapping(String originalType, String code, String label,
+		String vocabularyName) {
+		final InstanceTypeMapping m = new InstanceTypeMapping();
+		m.setVocabularyName(vocabularyName);
+		m.setOriginalType(originalType);
+		m.setTypeCode(code);
+		m.setTypeLabel(label);
+		return m;
+	}
+
+	public static InstanceTypeMapping instanceTypeMapping(String originalType, Qualifier term) {
+		return instanceTypeMapping(originalType, term.getClassid(), term.getClassname(), term.getSchemeid());
+	}
+
+	public static InstanceTypeMapping instanceTypeMapping(String originalType) {
+		return instanceTypeMapping(originalType, null, null, null);
+	}
+
+	public static InstanceTypeMapping instanceTypeMapping(String originalType, String vocabularyName) {
+		return instanceTypeMapping(originalType, null, null, vocabularyName);
+	}
+
 	public static Qualifier unknown(final String schemeid, final String schemename) {
 		return qualifier(UNKNOWN, "Unknown", schemeid, schemename);
 	}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/OrganizationPidComparator.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/OrganizationPidComparator.java
@ -0,0 +1,46 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import java.util.Comparator;
+
+import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
+
+public class OrganizationPidComparator implements Comparator<StructuredProperty> {
+
+	@Override
+	public int compare(StructuredProperty left, StructuredProperty right) {
+		if (left == null) {
+			return right == null ? 0 : -1;
+		} else if (right == null) {
+			return 1;
+		}
+
+		PidType lClass = PidType.tryValueOf(left.getQualifier().getClassid());
+		PidType rClass = PidType.tryValueOf(right.getQualifier().getClassid());
+
+		if (lClass.equals(rClass))
+			return 0;
+
+		if (lClass.equals(PidType.openorgs))
+			return -1;
+		if (rClass.equals(PidType.openorgs))
+			return 1;
+
+		if (lClass.equals(PidType.GRID))
+			return -1;
+		if (rClass.equals(PidType.GRID))
+			return 1;
+
+		if (lClass.equals(PidType.mag_id))
+			return -1;
+		if (rClass.equals(PidType.mag_id))
+			return 1;
+
+		if (lClass.equals(PidType.urn))
+			return -1;
+		if (rClass.equals(PidType.urn))
+			return 1;
+
+		return 0;
+	}
+}
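Review note: the if-chain in `OrganizationPidComparator` encodes a fixed precedence (`openorgs`, `GRID`, `mag_id`, `urn`, then everything else as a tie). The sketch below restates that precedence as a rank lookup on plain strings. It is an illustration only: `OrganizationPidRankSketch` is a hypothetical class, and it leaves out the null handling and the `PidType.tryValueOf` mapping of the real comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OrganizationPidRankSketch {

    // Precedence taken from OrganizationPidComparator: earlier entries sort first.
    private static final List<String> PRECEDENCE = Arrays.asList("openorgs", "GRID", "mag_id", "urn");

    // Types outside the list share the same (last) rank, so they compare as equal.
    public static int rank(String pidType) {
        int i = PRECEDENCE.indexOf(pidType);
        return i >= 0 ? i : PRECEDENCE.size();
    }

    public static int compare(String left, String right) {
        return Integer.compare(rank(left), rank(right));
    }

    public static List<String> sorted(List<String> pidTypes) {
        List<String> copy = new ArrayList<>(pidTypes);
        copy.sort(OrganizationPidRankSketch::compare); // List.sort is stable
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList("urn", "ROR", "openorgs", "GRID")));
        // prints [openorgs, GRID, urn, ROR]
    }
}
```

A rank table keeps the ordering in one place, which may be easier to extend than paired `if` statements. Whether that trade-off suits this codebase is a design decision for the authors.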
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PidBlacklist.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PidBlacklist.java
@ -0,0 +1,8 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import java.util.HashMap;
+import java.util.HashSet;
+
+public class PidBlacklist extends HashMap<String, HashSet<String>> {
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PidBlacklistProvider.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PidBlacklistProvider.java
@ -0,0 +1,40 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import java.io.IOException;
+import java.util.HashSet;
+import java.util.Optional;
+import java.util.Set;
+
+import org.apache.commons.io.IOUtils;
+
+import com.fasterxml.jackson.databind.ObjectMapper;
+
+public class PidBlacklistProvider {
+
+	private static final PidBlacklist blacklist;
+
+	static {
+		try {
+			String json = IOUtils.toString(IdentifierFactory.class.getResourceAsStream("pid_blacklist.json"));
+			blacklist = new ObjectMapper().readValue(json, PidBlacklist.class);
+
+		} catch (IOException e) {
+			throw new RuntimeException(e);
+		}
+	}
+
+	public static PidBlacklist getBlacklist() {
+		return blacklist;
+	}
+
+	public static Set<String> getBlacklist(String pidType) {
+		return Optional
+			.ofNullable(getBlacklist().get(pidType))
+			.orElse(new HashSet<>());
+	}
+
+	private PidBlacklistProvider() {
+	}
+
+}
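Review note: `getBlacklist(String pidType)` returns an empty set for unknown PID types instead of `null`. The sketch below shows that lookup contract on a plain in-memory map. `BlacklistLookupSketch` and the entry `10.0000/placeholder` are hypothetical; the real provider loads its content from `pid_blacklist.json`, which is not shown in this diff.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BlacklistLookupSketch {

    // Stand-in for the map that PidBlacklistProvider loads from pid_blacklist.json.
    private static final Map<String, Set<String>> BLACKLIST = new HashMap<>();

    static {
        // Hypothetical entry, not taken from the real resource file.
        BLACKLIST.put("doi", new HashSet<>(Collections.singletonList("10.0000/placeholder")));
    }

    // Unknown PID types yield an empty set rather than null.
    // Variation on the original: getOrDefault with a shared immutable empty set,
    // instead of Optional.ofNullable(...).orElse(new HashSet<>()).
    public static Set<String> getBlacklist(String pidType) {
        return BLACKLIST.getOrDefault(pidType, Collections.emptySet());
    }

    public static boolean isBlacklisted(String pidType, String value) {
        return getBlacklist(pidType).contains(value);
    }

    public static void main(String[] args) {
        System.out.println(isBlacklisted("doi", "10.0000/placeholder")); // prints true
        System.out.println(getBlacklist("xxx").isEmpty()); // prints true
    }
}
```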
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PidComparator.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PidComparator.java
@ -0,0 +1,48 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import java.util.Comparator;
+
+import eu.dnetlib.dhp.schema.common.ModelSupport;
+import eu.dnetlib.dhp.schema.oaf.OafEntity;
+import eu.dnetlib.dhp.schema.oaf.Organization;
+import eu.dnetlib.dhp.schema.oaf.Result;
+import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
+
+public class PidComparator<T extends OafEntity> implements Comparator<StructuredProperty> {
+
+	private final T entity;
+
+	public PidComparator(T entity) {
+		this.entity = entity;
+	}
+
+	@Override
+	public int compare(StructuredProperty left, StructuredProperty right) {
+
+		if (left == null && right == null)
+			return 0;
+		if (left == null)
+			return 1;
+		if (right == null)
+			return -1;
+
+		if (ModelSupport.isSubClass(entity, Result.class)) {
+			return compareResultPids(left, right);
+		}
+		if (ModelSupport.isSubClass(entity, Organization.class)) {
+			return compareOrganizationPids(left, right);
+		}
+
+		// Else (but unlikely), lexicographical ordering will do.
+		return left.getQualifier().getClassid().compareTo(right.getQualifier().getClassid());
+	}
+
+	private int compareResultPids(StructuredProperty left, StructuredProperty right) {
+		return new ResultPidComparator().compare(left, right);
+	}
+
+	private int compareOrganizationPids(StructuredProperty left, StructuredProperty right) {
+		return new OrganizationPidComparator().compare(left, right);
+	}
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PidType.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PidType.java
@ -0,0 +1,79 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import org.apache.commons.lang3.EnumUtils;
+
+public enum PidType {
+
+	/**
+	 * The DOI syntax shall be made up of a DOI prefix and a DOI suffix separated by a forward slash.
+	 *
+	 * There is no defined limit on the length of the DOI name, or of the DOI prefix or DOI suffix.
+	 *
+	 * The DOI name is case-insensitive and can incorporate any printable characters from the legal graphic characters
+	 * of Unicode. Further constraints on character use (e.g. use of language-specific alphanumeric characters) can be
+	 * defined for an application by the ISO 26324 Registration Authority.
+	 *
+	 *
+	 * DOI prefix: The DOI prefix shall be composed of a directory indicator followed by a registrant code.
+	 * These two components shall be separated by a full stop (period). The directory indicator shall be "10" and
+	 * distinguishes the entire set of character strings (prefix and suffix) as digital object identifiers within the
+	 * resolution system.
+	 *
+	 * Registrant code: The second element of the DOI prefix shall be the registrant code. The registrant code is a
+	 * unique string assigned to a registrant.
+	 *
+	 * DOI suffix: The DOI suffix shall consist of a character string of any length chosen by the registrant.
+	 * Each suffix shall be unique to the prefix element that precedes it. The unique suffix can be a sequential number,
+	 * or it might incorporate an identifier generated from or based on another system used by the registrant
+	 * (e.g. ISAN, ISBN, ISRC, ISSN, ISTC, ISNI; in such cases, a preferred construction for such a suffix can be
+	 * specified, as in Example 1).
+	 *
+	 * Source: https://www.doi.org/doi_handbook/2_Numbering.html#2.2
+	 */
+	doi,
+
+	/**
+	 * PubMed Unique Identifier (PMID)
+	 *
+	 * This field is a 1-to-8 digit accession number with no leading zeros. It is present on all records and is the
+	 * accession number for managing and disseminating records. PMIDs are not reused after records are deleted.
+	 *
+	 * Beginning in February 2012 PMIDs include extensions following a decimal point to account for article versions
+	 * (e.g., 21804956.2). All citations are considered version 1 until replaced. The extended PMID is not displayed
+	 * on the MEDLINE format.
+	 *
+	 * View the citation in abstract format in PubMed to access additional versions when available (see the article in
+	 * the Jan-Feb 2012 NLM Technical Bulletin).
+	 *
+	 * Source: https://www.nlm.nih.gov/bsd/mms/medlineelements.html#pmid
+	 */
+	pmid,
+
+	/**
+	 * This field contains the unique identifier for the cited article in PubMed Central. The identifier begins with the
+	 * prefix PMC.
+	 *
+	 * Source: https://www.nlm.nih.gov/bsd/mms/medlineelements.html#pmc
+	 */
+	pmc, handle, arXiv, nct, pdb, w3id,
+
+	// Organization
+	openorgs, ROR, GRID, PIC, ISNI, Wikidata, FundRef, corda, corda_h2020, mag_id, urn,
+
+	// Used by dedup
+	undefined, original;
+
+	public static boolean isValid(String type) {
+		return EnumUtils.isValidEnum(PidType.class, type);
+	}
+
+	public static PidType tryValueOf(String s) {
+		try {
+			return PidType.valueOf(s);
+		} catch (Exception e) {
+			return PidType.original;
+		}
+	}
+
+}
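Review note: `tryValueOf` never throws; any name that is not an exact, case-sensitive match for a constant falls back to `original`. The sketch below reproduces that behavior on a reduced copy of the enum. `PidTypeFallbackSketch` is a hypothetical wrapper, and only a few constants are kept.

```java
public class PidTypeFallbackSketch {

    // Reduced copy of the PidType enum from this change; only a few constants are kept.
    public enum PidType {
        doi, pmid, pmc, handle, original;

        public static PidType tryValueOf(String s) {
            try {
                return PidType.valueOf(s);
            } catch (Exception e) {
                // Covers IllegalArgumentException (unknown name) and NullPointerException (null name).
                return PidType.original;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(PidType.tryValueOf("doi"));  // prints doi
        System.out.println(PidType.tryValueOf("DOI"));  // prints original (lookup is case-sensitive)
        System.out.println(PidType.tryValueOf("isbn")); // prints original
        System.out.println(PidType.tryValueOf(null));   // prints original
    }
}
```

One consequence worth keeping in mind when reading the comparators: two different unrecognized PID types both map to `original`, so they compare as equal.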
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PidValueComparator.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PidValueComparator.java
@ -0,0 +1,33 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import java.util.Comparator;
+import java.util.Optional;
+
+import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
+
+public class PidValueComparator implements Comparator<StructuredProperty> {
+
+	@Override
+	public int compare(StructuredProperty left, StructuredProperty right) {
+
+		if (left == null && right == null)
+			return 0;
+		if (left == null)
+			return 1;
+		if (right == null)
+			return -1;
+
+		StructuredProperty l = CleaningFunctions.normalizePidValue(left);
+		StructuredProperty r = CleaningFunctions.normalizePidValue(right);
+
+		return Optional
+			.ofNullable(l.getValue())
+			.map(
+				lv -> Optional
+					.ofNullable(r.getValue())
+					.map(lv::compareTo)
+					.orElse(-1))
+			.orElse(1);
+	}
+}
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/RefereedComparator.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/RefereedComparator.java
@ -0,0 +1,46 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import java.util.Comparator;
+
+import eu.dnetlib.dhp.schema.oaf.Qualifier;
+
+/**
+ * Comparator for sorting the values of the dnet:review_levels vocabulary. It implements the following ordering:
+ *
+ * peerReviewed (0001) > nonPeerReviewed (0002) > UNKNOWN (0000)
+ */
+public class RefereedComparator implements Comparator<Qualifier> {
+
+	@Override
+	public int compare(Qualifier left, Qualifier right) {
+		if (left == null || left.getClassid() == null) {
+			return (right == null || right.getClassid() == null) ? 0 : -1;
+		} else if (right == null || right.getClassid() == null) {
+			return 1;
+		}
+
+		String lClass = left.getClassid();
+		String rClass = right.getClassid();
+
+		if (lClass.equals(rClass))
+			return 0;
+
+		if ("0001".equals(lClass))
+			return -1;
+		if ("0001".equals(rClass))
+			return 1;
+
+		if ("0002".equals(lClass))
+			return -1;
+		if ("0002".equals(rClass))
+			return 1;
+
+		if ("0000".equals(lClass))
+			return -1;
+		if ("0000".equals(rClass))
+			return 1;
+
+		return 0;
+	}
+}
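Review note: the sketch below applies the same decision sequence as `RefereedComparator` to bare classid strings, to make the resulting sort order visible. `ReviewLevelOrderSketch` is hypothetical and works on `String` rather than `Qualifier`. Note that, as in the comparator above, a missing value sorts before every known level.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReviewLevelOrderSketch {

    // Same decision sequence as RefereedComparator, applied to bare classid strings.
    public static int compare(String l, String r) {
        if (l == null) {
            return r == null ? 0 : -1;
        } else if (r == null) {
            return 1;
        }
        if (l.equals(r)) {
            return 0;
        }
        // peerReviewed (0001), then nonPeerReviewed (0002), then UNKNOWN (0000).
        for (String code : new String[] { "0001", "0002", "0000" }) {
            if (code.equals(l)) {
                return -1;
            }
            if (code.equals(r)) {
                return 1;
            }
        }
        return 0;
    }

    public static List<String> sorted(List<String> classids) {
        List<String> copy = new ArrayList<>(classids);
        copy.sort(ReviewLevelOrderSketch::compare);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList("0000", "0002", null, "0001")));
        // prints [null, 0001, 0002, 0000]
    }
}
```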
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/ResultPidComparator.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/ResultPidComparator.java
@ -0,0 +1,56 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import java.util.Comparator;
+
+import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
+
+public class ResultPidComparator implements Comparator<StructuredProperty> {
+
+	@Override
+	public int compare(StructuredProperty left, StructuredProperty right) {
+
+		PidType lClass = PidType.tryValueOf(left.getQualifier().getClassid());
+		PidType rClass = PidType.tryValueOf(right.getQualifier().getClassid());
+
+		if (lClass.equals(rClass))
+			return 0;
+
+		if (lClass.equals(PidType.doi))
+			return -1;
+		if (rClass.equals(PidType.doi))
+			return 1;
+
+		if (lClass.equals(PidType.pmid))
+			return -1;
+		if (rClass.equals(PidType.pmid))
+			return 1;
+
+		if (lClass.equals(PidType.pmc))
+			return -1;
+		if (rClass.equals(PidType.pmc))
+			return 1;
+
+		if (lClass.equals(PidType.handle))
+			return -1;
+		if (rClass.equals(PidType.handle))
+			return 1;
+
+		if (lClass.equals(PidType.arXiv))
+			return -1;
+		if (rClass.equals(PidType.arXiv))
+			return 1;
+
+		if (lClass.equals(PidType.nct))
+			return -1;
+		if (rClass.equals(PidType.nct))
+			return 1;
+
+		if (lClass.equals(PidType.pdb))
+			return -1;
+		if (rClass.equals(PidType.pdb))
+			return 1;
+
+		return 0;
+	}
+}
--- a/dhp-common/src/main/java/eu/dnetlib/pace/common/PaceCommonUtils.java
+++ b/dhp-common/src/main/java/eu/dnetlib/pace/common/PaceCommonUtils.java
@ -0,0 +1,101 @@
+
+package eu.dnetlib.pace.common;
+
+import java.nio.charset.StandardCharsets;
+import java.text.Normalizer;
+import java.util.Set;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.commons.lang3.StringUtils;
+
+import com.google.common.base.Splitter;
+import com.google.common.collect.Iterables;
+import com.google.common.collect.Sets;
+import com.ibm.icu.text.Transliterator;
+
+/**
+ * Set of common functions for the framework
+ *
+ * @author claudio
+ */
+public class PaceCommonUtils {
+
+	// transliterator
+	protected static Transliterator transliterator = Transliterator.getInstance("Any-Eng");
+
+	protected static final String aliases_from = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎àáâäæãåāèéêëēėęəîïíīįìôöòóœøōõûüùúūßśšłžźżçćčñń";
+	protected static final String aliases_to = "0123456789+-=()n0123456789+-=()aaaaaaaaeeeeeeeeiiiiiioooooooouuuuussslzzzcccnn";
+
+	protected static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
+
+	protected static String fixAliases(final String s) {
+		final StringBuilder sb = new StringBuilder();
+
+		s.chars().forEach(ch -> {
+			final int i = StringUtils.indexOf(aliases_from, ch);
+			sb.append(i >= 0 ? aliases_to.charAt(i) : (char) ch);
+		});
+
+		return sb.toString();
+	}
+
+	protected static String transliterate(final String s) {
+		try {
+			return transliterator.transliterate(s);
+		} catch (Exception e) {
+			return s;
+		}
+	}
+
+	public static String normalize(final String s) {
+		return fixAliases(transliterate(nfd(unicodeNormalization(s))))
+			.toLowerCase()
+			// keep these regexes as separate calls: compacting them into a single expression would cause a
+			// StackOverflowError on large input strings
+			.replaceAll("[^ \\w]+", "")
+			.replaceAll("(\\p{InCombiningDiacriticalMarks})+", "")
+			.replaceAll("(\\p{Punct})+", " ")
+			.replaceAll("(\\d)+", " ")
+			.replaceAll("(\\n)+", " ")
+			.trim();
+	}
+
+	public static String nfd(final String s) {
+		return Normalizer.normalize(s, Normalizer.Form.NFD);
+	}
+
+	public static String unicodeNormalization(final String s) {
+
+		Matcher m = hexUnicodePattern.matcher(s);
+		StringBuffer buf = new StringBuffer(s.length());
+		while (m.find()) {
+			String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
+			m.appendReplacement(buf, Matcher.quoteReplacement(ch));
+		}
+		m.appendTail(buf);
+		return buf.toString();
+	}
+
+	public static Set<String> loadFromClasspath(final String classpath) {
+
+		Transliterator transliterator = Transliterator.getInstance("Any-Eng");
+
+		final Set<String> h = Sets.newHashSet();
+		try {
+			for (final String s : IOUtils
+				.readLines(PaceCommonUtils.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
+				h.add(fixAliases(transliterator.transliterate(s))); // transliteration of the stopwords
+			}
+		} catch (final Throwable e) {
+			return Sets.newHashSet();
+		}
+		return h;
+	}
+
+	protected static Iterable<String> tokens(final String s, final int maxTokens) {
+		return Iterables.limit(Splitter.on(" ").omitEmptyStrings().trimResults().split(s), maxTokens);
+	}
+
+}
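Review note: `unicodeNormalization` decodes literal `\uXXXX` escape sequences (a backslash, `u`, and four hex digits) found in the input text. The sketch below isolates that step using the same pattern and loop as the method above; the class name `UnicodeEscapeSketch` and the sample inputs are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapeSketch {

    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern HEX_UNICODE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    public static String decode(String s) {
        Matcher m = HEX_UNICODE.matcher(s);
        // StringBuffer keeps the sketch compatible with Java 8's Matcher API.
        StringBuffer buf = new StringBuffer(s.length());
        while (m.find()) {
            // Parse the four hex digits and turn them into the corresponding character.
            String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
            // quoteReplacement guards against '$' and backslash in the decoded character.
            m.appendReplacement(buf, Matcher.quoteReplacement(ch));
        }
        m.appendTail(buf);
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("\\u0041BC \\u0064ef")); // prints ABC def
    }
}
```

Sequences with fewer than four hex digits do not match the pattern and pass through unchanged.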
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/Person.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/Person.java
@ -12,7 +12,7 @@ import com.google.common.collect.Iterables;
 import com.google.common.collect.Lists;
 import com.google.common.hash.Hashing;

-import eu.dnetlib.pace.common.AbstractPaceFunctions;
+import eu.dnetlib.pace.common.PaceCommonUtils;
 import eu.dnetlib.pace.util.Capitalise;
 import eu.dnetlib.pace.util.DotAbbreviations;

@ -86,7 +86,7 @@ public class Person {

 	private List<String> splitTerms(final String s) {
 		if (particles == null) {
-			particles = AbstractPaceFunctions.loadFromClasspath("/eu/dnetlib/pace/config/name_particles.txt");
+			particles = PaceCommonUtils.loadFromClasspath("/eu/dnetlib/pace/config/name_particles.txt");
 		}

 		final List<String> list = Lists.newArrayList();
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/Capitalise.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/Capitalise.java
@ -15,4 +15,4 @@ public class Capitalise implements Function<String, String> {
 	public String apply(final String s) {
 		return WordUtils.capitalize(s.toLowerCase(), DELIM);
 	}
-};
+}
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/DotAbbreviations.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/DotAbbreviations.java
@ -8,4 +8,4 @@ public class DotAbbreviations implements Function<String, String> {
 	public String apply(String s) {
 		return s.length() == 1 ? s + "." : s;
 	}
-};
+}
--- a/dhp-common/src/main/resources/eu/dnetlib/dhp/oa/merge/group_graph_entities_parameters.json
+++ b/dhp-common/src/main/resources/eu/dnetlib/dhp/oa/merge/group_graph_entities_parameters.json
@ -28,5 +28,11 @@
    "paramLongName": "filterInvisible",
    "paramDescription": "if true filters out invisible entities",
    "paramRequired": true
+  },
+  {
+    "paramName": "isu",
+    "paramLongName": "isLookupUrl",
+    "paramDescription": "URL of the ISLookup Service",
+    "paramRequired": true
  }
 ]
--- a/dhp-common/src/main/resources/eu/dnetlib/dhp/oozie/run_sql_parameters.json
+++ b/dhp-common/src/main/resources/eu/dnetlib/dhp/oozie/run_sql_parameters.json
@ -0,0 +1,20 @@
+[
+  {
+    "paramName": "issm",
+    "paramLongName": "isSparkSessionManaged",
+    "paramDescription": "when true, stops the SparkSession after job execution",
+    "paramRequired": false
+  },
+  {
+    "paramName": "hmu",
+    "paramLongName": "hiveMetastoreUris",
+    "paramDescription": "the Hive metastore URIs",
+    "paramRequired": true
+  },
+  {
+    "paramName": "sql",
+    "paramLongName": "sql",
+    "paramDescription": "SQL script to execute",
+    "paramRequired": true
+  }
+]
--- a/dhp-pace-core/src/main/resources/eu/dnetlib/pace/config/name_particles.txt
+++ b/dhp-pace-core/src/main/resources/eu/dnetlib/pace/config/name_particles.txt
--- a/dhp-common/src/main/scala/eu/dnetlib/dhp/application/SparkScalaApplication.scala
+++ b/dhp-common/src/main/scala/eu/dnetlib/dhp/application/SparkScalaApplication.scala
@ -1,5 +1,8 @@
 package eu.dnetlib.dhp.application

+import eu.dnetlib.dhp.common.Constants
+import eu.dnetlib.dhp.utils.DHPUtils.writeHdfsFile
+
 import scala.io.Source

 /** This is the main Interface SparkApplication
@ -70,4 +73,13 @@ abstract class AbstractScalaApplication(
      .getOrCreate()
  }

+  def reportTotalSize(targetPath: String, outputBasePath: String): Unit = {
+    val totalItems = spark.read.text(targetPath).count()
+    writeHdfsFile(
+      spark.sparkContext.hadoopConfiguration,
+      s"$totalItems",
+      outputBasePath + Constants.MDSTORE_SIZE_PATH
+    )
+  }
+
 }
--- a/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/BlackListProviderTest.java
+++ b/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/BlackListProviderTest.java
@ -0,0 +1,21 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import java.util.Set;
+
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+class BlackListProviderTest {
+
+	@Test
+	void blackListTest() {
+
+		Assertions.assertNotNull(PidBlacklistProvider.getBlacklist());
+		Assertions.assertNotNull(PidBlacklistProvider.getBlacklist().get("doi"));
+		Assertions.assertTrue(PidBlacklistProvider.getBlacklist().get("doi").size() > 0);
+		final Set<String> xxx = PidBlacklistProvider.getBlacklist("xxx");
+		Assertions.assertNotNull(xxx);
+		Assertions.assertEquals(0, xxx.size());
+	}
+}
--- a/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/IdentifierFactoryTest.java
+++ b/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/IdentifierFactoryTest.java
@ -0,0 +1,87 @@
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+
+import java.io.IOException;
+
+import org.apache.commons.io.IOUtils;
+import org.junit.jupiter.api.Test;
+
+import com.fasterxml.jackson.databind.DeserializationFeature;
+import com.fasterxml.jackson.databind.ObjectMapper;
+
+import eu.dnetlib.dhp.schema.oaf.Publication;
+
+class IdentifierFactoryTest {
+
+	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
+		.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
+
+	@Test
+	void testCreateIdentifierForPublication() throws IOException {
+
+		verifyIdentifier(
+			"publication_doi1.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true);
+
+		verifyIdentifier(
+			"publication_doi2.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true);
+
+		verifyIdentifier(
+			"publication_doi3.json", "50|pmc_________::94e4cb08c93f8733b48e2445d04002ac", true);
+
+		verifyIdentifier(
+			"publication_doi4.json", "50|od______2852::38861c44e6052a8d49f59a4c39ba5e66", true);
+
+		verifyIdentifier(
+			"publication_doi5.json", "50|doi_________::3bef95c0ca26dd55451fc8839ea69d27", true);
+
+		verifyIdentifier(
+			"publication_pmc1.json", "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", true);
+
+		verifyIdentifier(
+			"publication_pmc2.json", "50|pmc_________::94e4cb08c93f8733b48e2445d04002ac", true);
+
+		verifyIdentifier(
+			"publication_openapc.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true);
+
+		final String defaultID = "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f";
+		verifyIdentifier("publication_3.json", defaultID, true);
+		verifyIdentifier("publication_4.json", defaultID, true);
+		verifyIdentifier("publication_5.json", defaultID, true);
+
+	}
+
+	@Test
+	void testCreateIdentifierForPublicationNoHash() throws IOException {
+
+		verifyIdentifier("publication_doi1.json", "50|doi_________::10.1016/j.cmet.2010.03.013", false);
+		verifyIdentifier("publication_doi2.json", "50|doi_________::10.1016/j.cmet.2010.03.013", false);
+		verifyIdentifier("publication_pmc1.json", "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", false);
+		verifyIdentifier(
+			"publication_urn1.json", "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", false);
+
+		final String defaultID = "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f";
+		verifyIdentifier("publication_3.json", defaultID, false);
+		verifyIdentifier("publication_4.json", defaultID, false);
+		verifyIdentifier("publication_5.json", defaultID, false);
+	}
+
+	@Test
+	void testCreateIdentifierForROHub() throws IOException {
+		verifyIdentifier(
+			"orp-rohub.json", "50|w3id________::afc7592914ae190a50570db90f55f9c2", true);
+	}
+
+	protected void verifyIdentifier(String filename, String expectedID, boolean md5) throws IOException {
+		final String json = IOUtils.toString(getClass().getResourceAsStream(filename));
+		final Publication pub = OBJECT_MAPPER.readValue(json, Publication.class);
+
+		String id = IdentifierFactory.createIdentifier(pub, md5);
+		System.out.println(id);
+		assertNotNull(id);
+		assertEquals(expectedID, id);
+	}
+
+}
--- a/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/MergeUtilsTest.java
+++ b/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/MergeUtilsTest.java
@ -0,0 +1,129 @
+
+package eu.dnetlib.dhp.schema.oaf.utils;
+
+import static org.junit.jupiter.api.Assertions.*;
+
+import java.io.IOException;
+import java.lang.reflect.InvocationTargetException;
+import java.util.HashSet;
+import java.util.List;
+import java.util.stream.Collectors;
+
+import org.apache.commons.beanutils.BeanUtils;
+import org.apache.commons.io.IOUtils;
+import org.junit.jupiter.api.Test;
+
+import com.fasterxml.jackson.databind.DeserializationFeature;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.google.common.collect.Lists;
+
+import eu.dnetlib.dhp.schema.common.ModelConstants;
+import eu.dnetlib.dhp.schema.common.ModelSupport;
+import eu.dnetlib.dhp.schema.oaf.*;
+
+public class MergeUtilsTest {
+
+	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
+		.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
+
+	@Test
+	void testMergePubs_new() throws IOException {
+		Publication pt = read("publication_test.json", Publication.class);
+		Publication p1 = read("publication_test.json", Publication.class);
+
+		assertEquals(1, pt.getCollectedfrom().size());
+		assertEquals(ModelConstants.CROSSREF_ID, pt.getCollectedfrom().get(0).getKey());
+
+		Instance i = new Instance();
+		i.setUrl(Lists.newArrayList("https://..."));
+		p1.getInstance().add(i);
+
+		Publication ptp1 = MergeUtils.mergePublication(pt, p1);
+
+		assertNotNull(ptp1.getInstance());
+		assertEquals(2, ptp1.getInstance().size());
+
+	}
+
+	@Test
+	void testMergePubs() throws IOException {
+		Publication p1 = read("publication_1.json", Publication.class);
+		Publication p2 = read("publication_2.json", Publication.class);
+		Dataset d1 = read("dataset_1.json", Dataset.class);
+		Dataset d2 = read("dataset_2.json", Dataset.class);
+
+		assertEquals(1, p1.getCollectedfrom().size());
+		assertEquals(ModelConstants.CROSSREF_ID, p1.getCollectedfrom().get(0).getKey());
+		assertEquals(1, d2.getCollectedfrom().size());
+		assertFalse(cfId(d2.getCollectedfrom()).contains(ModelConstants.CROSSREF_ID));
+
+		assertEquals(1, p2.getCollectedfrom().size());
+		assertFalse(cfId(p2.getCollectedfrom()).contains(ModelConstants.CROSSREF_ID));
+		assertEquals(1, d1.getCollectedfrom().size());
+		assertTrue(cfId(d1.getCollectedfrom()).contains(ModelConstants.CROSSREF_ID));
+
+		final Result p1d2 = MergeUtils.checkedMerge(p1, d2, true);
+		assertEquals(ModelConstants.PUBLICATION_RESULTTYPE_CLASSID, p1d2.getResulttype().getClassid());
+		assertTrue(p1d2 instanceof Publication);
+		assertEquals(p1.getId(), p1d2.getId());
+	}
+
+	@Test
+	void testMergePubs_1() throws IOException {
+		Publication p2 = read("publication_2.json", Publication.class);
+		Dataset d1 = read("dataset_1.json", Dataset.class);
+
+		final Result p2d1 = MergeUtils.checkedMerge(p2, d1, true);
+		assertEquals(ModelConstants.DATASET_RESULTTYPE_CLASSID, p2d1.getResulttype().getClassid());
+		assertTrue(p2d1 instanceof Dataset);
+		assertEquals(d1.getId(), p2d1.getId());
+		assertEquals(2, p2d1.getCollectedfrom().size());
+	}
+
+	@Test
+	void testMergePubs_2() throws IOException {
+		Publication p1 = read("publication_1.json", Publication.class);
+		Publication p2 = read("publication_2.json", Publication.class);
+
+		Result p1p2 = MergeUtils.checkedMerge(p1, p2, true);
+		assertTrue(p1p2 instanceof Publication);
+		assertEquals(p1.getId(), p1p2.getId());
+		assertEquals(2, p1p2.getCollectedfrom().size());
+	}
+
+	@Test
+	void testDelegatedAuthority_1() throws IOException {
+		Dataset d1 = read("dataset_2.json", Dataset.class);
+		Dataset d2 = read("dataset_delegated.json", Dataset.class);
+
+		assertEquals(1, d2.getCollectedfrom().size());
+		assertTrue(cfId(d2.getCollectedfrom()).contains(ModelConstants.ZENODO_OD_ID));
+
+		Result res = (Result) MergeUtils.merge(d1, d2, true);
+
+		assertEquals(d2, res);
+	}
+
+	@Test
+	void testDelegatedAuthority_2() throws IOException {
+		Dataset p1 = read("publication_1.json", Dataset.class);
+		Dataset d2 = read("dataset_delegated.json", Dataset.class);
+
+		assertEquals(1, d2.getCollectedfrom().size());
+		assertTrue(cfId(d2.getCollectedfrom()).contains(ModelConstants.ZENODO_OD_ID));
+
+		Result res = (Result) MergeUtils.merge(p1, d2, true);
+
+		assertEquals(d2, res);
+	}
+
+	protected HashSet<String> cfId(List<KeyValue> collectedfrom) {
+		return collectedfrom.stream().map(KeyValue::getKey).collect(Collectors.toCollection(HashSet::new));
+	}
+
+	protected <T extends Result> T read(String filename, Class<T> clazz) throws IOException {
+		final String json = IOUtils.toString(getClass().getResourceAsStream(filename));
+		return OBJECT_MAPPER.readValue(json, clazz);
+	}
+
+}
--- a/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/OafMapperUtilsTest.java
+++ b/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/OafMapperUtilsTest.java
@ -149,7 +149,7 @@ class OafMapperUtilsTest {
 	void testDate() {
 		final String date = GraphCleaningFunctions.cleanDate("23-FEB-1998");
 		assertNotNull(date);
-		System.out.println(date);
+		assertEquals("1998-02-23", date);
 	}

 	@Test
@ -166,8 +166,8 @@ class OafMapperUtilsTest {

 		assertEquals(
 			ModelConstants.PUBLICATION_RESULTTYPE_CLASSID,
-			OafMapperUtils
-				.mergeResults(p1, d2)
+			MergeUtils
+				.mergeResult(p1, d2)
 				.getResulttype()
 				.getClassid());

@ -178,10 +178,10 @@ class OafMapperUtilsTest {

 		assertEquals(
 			ModelConstants.DATASET_RESULTTYPE_CLASSID,
-			OafMapperUtils
-				.mergeResults(p2, d1)
-				.getResulttype()
-				.getClassid());
+			((Result) MergeUtils
+				.merge(p2, d1))
+					.getResulttype()
+					.getClassid());
 	}

 	@Test
@ -192,7 +192,7 @@ class OafMapperUtilsTest {
 		assertEquals(1, d2.getCollectedfrom().size());
 		assertTrue(cfId(d2.getCollectedfrom()).contains(ModelConstants.ZENODO_OD_ID));

-		Result res = OafMapperUtils.mergeResults(d1, d2);
+		Result res = MergeUtils.mergeResult(d1, d2);

 		assertEquals(d2, res);

--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/enrichment.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/enrichment.json
@ -0,0 +1,12 @@
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0000/ra.v2i3.114::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"4.65008652949e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0000/ra.v2i3.114"}]}]}
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0001/(aj).v3i6.458::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"4.01810569717e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0001/(aj).v3i6.458"}]}]}
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0001/1587::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.39172290649e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0001/1587"}]}]}
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0001/462::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"6.33235333753e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.36"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.00285265116e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0001/462"}]}]}
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0001/731::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"4.01810569717e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0001/731"}]}]}
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0001/ijllis.v9i4.2066.g2482::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"8.48190886761e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0001/ijllis.v9i4.2066.g2482"}]}]}
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0118/alfahim.v3i1.140::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"9.88840807598e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0118/alfahim.v3i1.140"}]}]}
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0166/fk2.stagefigshare.6442896.v3::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"7.28336930301e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0166/fk2.stagefigshare.6442896.v3"}]}]}
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0301/jttb.v2i1.64::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"7.28336930301e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0301/jttb.v2i1.64"}]}]}
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0809/seruni.v1i1.567::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"2.62959564033e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0809/seruni.v1i1.567"}]}]}
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0809/seruni.v2i1.765::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"9.40178571921e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0559872"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"3.67659957614e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0809/seruni.v2i1.765"}]}]}
+{"dataInfo":{"deletedbyinference":false,"inferred":true,"invisible":false,"provenanceaction":{"classid":"sysimport:enrich","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"}},"id":"unresolved::10.0901/jkip.v7i3.485::doi","instance":[{"measures":[{"id":"influence","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"5.91019644836e-09"}]},{"id":"popularity_alt","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"0.0"}]},{"id":"popularity","unit":[{"dataInfo":{"deletedbyinference":false,"inferenceprovenance":"update","inferred":true,"invisible":false,"provenanceaction":{"classid":"measure:bip","classname":"Inferred by OpenAIRE","schemeid":"dnet:provenanceActions","schemename":"dnet:provenanceActions"},"trust":""},"key":"score","value":"6.26204125721e-09"}]}],"pid":[{"qualifier":{"classid":"doi","classname":"Digital Object Identifier","schemeid":"dnet:pid_types","schemename":"dnet:pid_types"},"value":"10.0901/jkip.v7i3.485"}]}]}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/orp-rohub.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/orp-rohub.json
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_3.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_3.json
@ -0,0 +1 @@
+{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f","pid":[{"qualifier":{"classid":"scp-number"},"value":"79953761260"}]}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_4.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_4.json
@ -0,0 +1 @@
+{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f","pid":[]}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_5.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_5.json
@ -0,0 +1 @@
+{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f"}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_apc.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_apc.json
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_apc2.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_apc2.json
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_doi1.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_doi1.json
@ -0,0 +1,33 @@
+{
+  "id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
+  "instance": [
+    {
+      "collectedfrom": {
+        "key": "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2",
+        "value": "Crossref"
+      },
+      "pid": [
+        {
+          "qualifier": {"classid": "doi"},
+          "value": "10.1016/j.cmet.2010.03.013"
+        }
+      ]
+    },
+    {
+      "pid": [
+        {
+          "qualifier": {"classid": "urn"},
+          "value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
+        },
+        {
+          "qualifier": {"classid": "scp-number"},
+          "value": "79953761260"
+        },
+        {
+          "qualifier": {"classid": "pmc"},
+          "value": "21459329"
+        }
+      ]
+    }
+  ]
+}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_doi2.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_doi2.json
@ -0,0 +1,37 @@
+{
+  "id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
+  "instance": [
+    {
+      "collectedfrom": {
+        "key": "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2",
+        "value": "Crossref"
+      },
+      "pid": [
+        {
+          "qualifier": {"classid": "doi"},
+          "value": "10.1016/j.cmet.2010.03.013"
+        }
+      ]
+    },
+    {
+      "collectedfrom": {
+        "key": "10|opendoar____::8b6dd7db9af49e67306feb59a8bdc52c",
+        "value": "Europe PubMed Central"
+      },
+      "pid": [
+        {
+          "qualifier": {"classid": "urn"},
+          "value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
+        },
+        {
+          "qualifier": {"classid": "scp-number"},
+          "value": "79953761260"
+        },
+        {
+          "qualifier": {"classid": "pmc"},
+          "value": "21459329"
+        }
+      ]
+    }
+  ]
+}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_doi3.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_doi3.json
@ -0,0 +1,37 @@
+{
+  "id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
+  "instance": [
+    {
+      "collectedfrom": {
+        "key": "10|opendoar____::358aee4cc897452c00244351e4d91f69",
+        "value": "Zenodo"
+      },
+      "pid": [
+        {
+          "qualifier": {"classid": "doi"},
+          "value": "10.1016/j.cmet.2010.03.013"
+        }
+      ]
+    },
+    {
+      "collectedfrom": {
+        "key": "10|opendoar____::8b6dd7db9af49e67306feb59a8bdc52c",
+        "value": "Europe PubMed Central"
+      },
+      "pid": [
+        {
+          "qualifier": {"classid": "urn"},
+          "value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
+        },
+        {
+          "qualifier": {"classid": "scp-number"},
+          "value": "79953761260"
+        },
+        {
+          "qualifier": {"classid": "pmc"},
+          "value": "21459329"
+        }
+      ]
+    }
+  ]
+}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_doi4.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_doi4.json
@ -0,0 +1,37 @@
+{
+  "id": "50|od______2852::38861c44e6052a8d49f59a4c39ba5e66",
+  "instance": [
+    {
+      "collectedfrom": {
+        "key": "10|opendoar____::358aee4cc897452c00244351e4d91f69",
+        "value": "Zenodo"
+      },
+      "pid": [
+        {
+          "qualifier": {"classid": "doi"},
+          "value": "10.1016/j.cmet.2010.03.013"
+        },
+        {
+          "qualifier": {"classid": "handle"},
+          "value": "11012/83840"
+        }
+      ]
+    },
+    {
+      "collectedfrom": {
+        "key": "10|opendoar____::2852",
+        "value": "Digital library of Brno University of Technology"
+      },
+      "pid": [
+        {
+          "qualifier": {"classid": "pmc"},
+          "value": "21459329"
+        },
+        {
+          "qualifier": {"classid": "handle"},
+          "value": "11012/83840"
+        }
+      ]
+    }
+  ]
+}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_doi5.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_doi5.json
@ -0,0 +1,37 @@
+{
+  "id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
+  "instance": [
+    {
+      "collectedfrom": {
+        "key": "10|opendoar____::358aee4cc897452c00244351e4d91f69",
+        "value": "Zenodo"
+      },
+      "pid": [
+        {
+          "qualifier": {"classid": "doi"},
+          "value": "10.5281/zenodo.5121485"
+        }
+      ]
+    },
+    {
+      "collectedfrom": {
+        "key": "10|opendoar____::8b6dd7db9af49e67306feb59a8bdc52c",
+        "value": "Europe PubMed Central"
+      },
+      "pid": [
+        {
+          "qualifier": {"classid": "urn"},
+          "value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
+        },
+        {
+          "qualifier": {"classid": "scp-number"},
+          "value": "79953761260"
+        },
+        {
+          "qualifier": {"classid": "pmc"},
+          "value": "21459329"
+        }
+      ]
+    }
+  ]
+}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_irish_tender_1.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_irish_tender_1.json
@ -0,0 +1,3 @@
+{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", "resulttype" : { "classid" : "publication" }, "pid":[{"qualifier":{"classid":"doi"},"value":"10.1016/j.cmet.2011.03.013"},{"qualifier":{"classid":"urn"},"value":"urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"},{"qualifier":{"classid":"scp-number"},"value":"79953761260"},{"qualifier":{"classid":"pmc"},"value":"21459329"}], "collectedfrom" : [ { "key" : "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2", "value" : "Crossref"} ], "isGreen": null, "openAccessColor":  "gold", "isInDiamondJournal": null, "publiclyFunded": null}
+{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g", "resulttype" : { "classid" : "publication" }, "isGreen": true, "openAccessColor":  "gold", "isInDiamondJournal": true, "publiclyFunded": false }
+{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1h", "resulttype" : { "classid" : "publication" }, "isGreen": false, "openAccessColor":  null, "isInDiamondJournal": true, "publiclyFunded": false }
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_irish_tender_2.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_irish_tender_2.json
@ -0,0 +1,3 @@
+{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", "resulttype" : { "classid" : "publication" }, "pid":[{"qualifier":{"classid":"doi"},"value":"10.1016/j.cmet.2011.03.013"},{"qualifier":{"classid":"urn"},"value":"urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"},{"qualifier":{"classid":"scp-number"},"value":"79953761260"},{"qualifier":{"classid":"pmc"},"value":"21459329"}], "collectedfrom" : [ { "key" : "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2", "value" : "Crossref"} ], "isGreen": null, "openAccessColor":  "gold", "isInDiamondJournal": null, "publiclyFunded": null}
+{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g", "resulttype" : { "classid" : "publication" }, "isGreen": true, "openAccessColor":  "bronze", "isInDiamondJournal": true, "publiclyFunded": false }
+{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1h", "resulttype" : { "classid" : "publication" }, "isGreen": false, "openAccessColor":  null, "isInDiamondJournal": true, "publiclyFunded": false }
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_openapc.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_openapc.json
@ -0,0 +1,31 @@
+{
+  "id": "50|openapc_____::000023f9cb6e3a247c764daec4273cbc",
+  "resulttype": {
+    "classid": "publication"
+  },
+  "instance": [
+    {
+      "collectedfrom": {
+        "key": "10|apc_________::e2b1600b229fc30663c8a1f662debddf",
+        "value": "OpenAPC Global Initiative"
+      },
+      "pid": [
+        {
+          "qualifier": {"classid": "doi"},
+          "value": "10.1016/j.cmet.2010.03.013"
+        },
+        {
+          "qualifier": {"classid": "pmc"},
+          "value": "21459329"
+        },
+        {
+          "qualifier": {"classid": "pmid"},
+          "value": "25811027"
+        }
+      ],
+      "url":["https://doi.org/10.1155/2015/439379"]
+    }
+  ]
+}
+
+
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_pmc1.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_pmc1.json
@ -0,0 +1,17 @@
+{
+  "id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
+  "pid": [
+    {
+      "qualifier": {"classid": "urn"},
+      "value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
+    },
+    {
+      "qualifier": {"classid": "scp-number"},
+      "value": "79953761260"
+    },
+    {
+      "qualifier": {"classid": "pmc"},
+      "value": "21459329"
+    }
+  ]
+}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_pmc2.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_pmc2.json
@ -0,0 +1,21 @@
+{
+  "id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
+  "instance": [
+    {
+      "collectedfrom": {
+        "key": "10|opendoar____::8b6dd7db9af49e67306feb59a8bdc52c",
+        "value": "Europe PubMed Central"
+      },
+      "pid": [
+        {
+          "qualifier": {"classid": "doi"},
+          "value": "10.1016/j.cmet.2010.03.013"
+        },
+        {
+          "qualifier":{"classid":"pmc"},
+          "value":"21459329"
+        }
+      ]
+    }
+  ]
+}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_test.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_test.json
@ -0,0 +1,428 @@
+{
+  "author": [
+    {
+      "affiliation": null,
+      "fullname": "Deymier, Ghislaine",
+      "name": "Ghislaine",
+      "pid": [],
+      "rank": 1,
+      "surname": "Deymier"
+    },
+    {
+      "affiliation": null,
+      "fullname": "Gaschet, Frédéric",
+      "name": "Frédéric",
+      "pid": [],
+      "rank": 2,
+      "surname": "Gaschet"
+    },
+    {
+      "affiliation": null,
+      "fullname": "Pouyanne, Guillaume",
+      "name": "Guillaume",
+      "pid": [],
+      "rank": 3,
+      "surname": "Pouyanne"
+    }
+  ],
+  "bestaccessright": {
+    "classid": "OPEN",
+    "classname": "Open Access",
+    "schemeid": "dnet:access_modes",
+    "schemename": "dnet:access_modes"
+  },
+  "collectedfrom": [
+    {
+      "key" : "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2",
+      "value" : "Crossref"
+    }
+  ],
+  "context": [],
+  "contributor": [],
+  "country": [],
+  "coverage": [],
+  "dataInfo": {
+    "deletedbyinference": false,
+    "inferred": false,
+    "invisible": false,
+    "provenanceaction": {
+      "classid": "sysimport:crosswalk:repository",
+      "classname": "Harvested",
+      "schemeid": "dnet:provenanceActions",
+      "schemename": "dnet:provenanceActions"
+    },
+    "trust": "0.9"
+  },
+  "dateofacceptance": {
+    "dataInfo": {
+      "deletedbyinference": false,
+      "inferred": false,
+      "invisible": false,
+      "provenanceaction": {
+        "classid": "sysimport:crosswalk:repository",
+        "classname": "Harvested",
+        "schemeid": "dnet:provenanceActions",
+        "schemename": "dnet:provenanceActions"
+      },
+      "trust": "0.9"
+    },
+    "value": "2013-11-30"
+  },
+  "dateofcollection": "2024-02-28T00:22:13+0000",
+  "dateoftransformation": "2024-03-06T08:43:13.253Z",
+  "description": [
+    {
+      "dataInfo": {
+        "deletedbyinference": false,
+        "inferred": false,
+        "invisible": false,
+        "provenanceaction": {
+          "classid": "sysimport:crosswalk:repository",
+          "classname": "Harvested",
+          "schemeid": "dnet:provenanceActions",
+          "schemename": "dnet:provenanceActions"
+        },
+        "trust": "0.9"
+      },
+      "value": "For analyzing the reciprocal interaction between urban sprawl and car use, research has first focused on the link between urban density and mobility. By looking for a reduction in energy consumption, cities have favoured a compact planning development. Then reflection has broadened from the simple density to the wider, multi-dimensional concept of urban form. This controversy has led to a renewal of analysis in term of the costs of urban growth, notably by comparing the costs of \"compact\" and \"sprawled\" development. The idea is to compare the mobility costs of different urban forms. However, most often because of a lack of data, such studies are scarce. This paper suggests an innovative method to compute mobility costs at an infra-urban scale : The Spatialized Travel Account (STA). It is based on the CERTU's travel account methodology at a metropolitan scale. It puts forward an accurate estimate of the mobility costs for each transport mode (individual and public) and for each type of payer (households, firms, local authorities...). In order to test the relationships between mobility costs and urban form, we link the computed costs to morphological characteristics of infra-urban zones, taking in account sociodemographic characteristics of households."
+    },
+    {
+      "dataInfo": {
+        "deletedbyinference": false,
+        "inferred": false,
+        "invisible": false,
+        "provenanceaction": {
+          "classid": "sysimport:crosswalk:repository",
+          "classname": "Harvested",
+          "schemeid": "dnet:provenanceActions",
+          "schemename": "dnet:provenanceActions"
+        },
+        "trust": "0.9"
+      },
+      "value": "L'interaction réciproque entre étalement urbain et usage de l'automobile a conduit la recherche à se focaliser sur le lien entre les densités urbaines et la mobilité. En cherchant à réduire leur consommation d'énergie pour les transports, et donc leurs émissions de Gaz à Effet de Serre, les villes ont alors cherché à planifier la \" ville compacte \", privilégiant notamment la reconstruction de la ville sur elle-même et la densification. Par la suite, la réflexion s'est élargie de la simple densité à la notion de forme urbaine et à toutes ses dimensions. Cette controverse devait conduire à un renouveau des analyses en termes de coûts de la croissance urbaine : le débat reste vif, encore aujourd'hui, sur les coûts comparés de la ville étalée et de la ville compacte. Plus largement, il s'agit d'explorer les coûts des différentes formes urbaines en termes de mobilité. Malgré cela, généralement pour des raisons de disponibilité de données, les études sur le sujet restent extrêmement rares. Cet article propose un outil novateur pour mesurer les coûts de la mobilité à l'échelle intraurbaine : le Compte Déplacements Territorialisé (CDT). Il s'inspire de la méthode développée par le CERTU pour l'établissement des Comptes Déplacements Voyageurs à l'échelle métropolitaine. Le CDT propose, pour chacune des zones de l'agglomération, une estimation précise de l'ensemble des coûts liés aux déplacements de personnes, ventilés par mode de transport (individuels et collectifs) et par type de financeurs (ménages, entreprises, collectivités territoriales, etc.). Nous proposons une application de cette méthode à la controverse sur le lien entre forme urbaine et coûts de la mobilité. Les coûts sont reliés aux caractéristiques morphologiques des zones (en termes de densité et de diversité, notamment), en prenant soin de contrôler les facteurs socio-économiques qui influent traditionnellement sur les comportements de mobilité (taille du ménage, revenu, etc.)."
+    }
+  ],
+  "eoscifguidelines": [],
+  "externalReference": [],
+  "extraInfo": [],
+  "format": [
+    {
+      "dataInfo": {
+        "deletedbyinference": false,
+        "inferred": false,
+        "invisible": false,
+        "provenanceaction": {
+          "classid": "sysimport:crosswalk:repository",
+          "classname": "Harvested",
+          "schemeid": "dnet:provenanceActions",
+          "schemename": "dnet:provenanceActions"
+        },
+        "trust": "0.9"
+      },
+      "value": "application/pdf"
+    }
+  ],
+  "fulltext": [],
+  "id": "50|06cdd3ff4700::4826ac62a11a957fe332e2c291dcfcca",
+  "instance": [
+    {
+      "accessright": {
+        "classid": "OPEN",
+        "classname": "Open Access",
+        "schemeid": "dnet:access_modes",
+        "schemename": "dnet:access_modes"
+      },
+      "alternateIdentifier": [
+        {
+          "dataInfo": {
+            "deletedbyinference": false,
+            "inferred": false,
+            "invisible": false,
+            "provenanceaction": {
+              "classid": "sysimport:crosswalk:repository",
+              "classname": "Harvested",
+              "schemeid": "dnet:provenanceActions",
+              "schemename": "dnet:provenanceActions"
+            },
+            "trust": "0.9"
+          },
+          "qualifier": {
+            "classid": "doi",
+            "classname": "Digital Object Identifier",
+            "schemeid": "dnet:pid_types",
+            "schemename": "dnet:pid_types"
+          },
+          "value": "10.46298/cst.12132"
+        }
+      ],
+      "collectedfrom": {
+        "key": "10|openaire____::6824b298c96ba906a3e6a70593affbf5",
+        "value": "Episciences"
+      },
+      "dateofacceptance": {
+        "dataInfo": {
+          "deletedbyinference": false,
+          "inferred": false,
+          "invisible": false,
+          "provenanceaction": {
+            "classid": "sysimport:crosswalk:repository",
+            "classname": "Harvested",
+            "schemeid": "dnet:provenanceActions",
+            "schemename": "dnet:provenanceActions"
+          },
+          "trust": "0.9"
+        },
+        "value": "2013-11-30"
+      },
+      "hostedby": {
+        "key": "10|openaire____::6824b298c96ba906a3e6a70593affbf5",
+        "value": "Episciences"
+      },
+      "instanceTypeMapping": [
+        {
+          "originalType": "http://purl.org/coar/resource_type/c_6501",
+          "typeCode": "http://purl.org/coar/resource_type/c_6501",
+          "typeLabel": "journal article",
+          "vocabularyName": "openaire::coar_resource_types_3_1"
+        },
+        {
+          "originalType": "http://purl.org/coar/resource_type/c_6501",
+          "typeCode": "Article",
+          "typeLabel": "Article",
+          "vocabularyName": "openaire::user_resource_types"
+        }
+      ],
+      "instancetype": {
+        "classid": "0001",
+        "classname": "Article",
+        "schemeid": "dnet:publication_resource",
+        "schemename": "dnet:publication_resource"
+      },
+      "license": {
+        "dataInfo": {
+          "deletedbyinference": false,
+          "inferred": false,
+          "invisible": false,
+          "provenanceaction": {
+            "classid": "sysimport:crosswalk:repository",
+            "classname": "Harvested",
+            "schemeid": "dnet:provenanceActions",
+            "schemename": "dnet:provenanceActions"
+          },
+          "trust": "0.9"
+        },
+        "value": "CC BY NC SA"
+      },
+      "pid": [],
+      "refereed": {
+        "classid": "0002",
+        "classname": "nonPeerReviewed",
+        "schemeid": "dnet:review_levels",
+        "schemename": "dnet:review_levels"
+      },
+      "url": [
+        "https://doi.org/10.46298/cst.12132",
+        "https://cst.episciences.org/12132"
+      ]
+    }
+  ],
+  "language": {
+    "classid": "fra/fre",
+    "classname": "French",
+    "schemeid": "dnet:languages",
+    "schemename": "dnet:languages"
+  },
+  "lastupdatetimestamp": 1710636106633,
+  "metaResourceType": {
+    "classid": "Research Literature",
+    "classname": "Research Literature",
+    "schemeid": "openaire::meta_resource_types",
+    "schemename": "openaire::meta_resource_types"
+  },
+  "originalId": [
+    "oai:episciences.org:cst:12132",
+    "50|06cdd3ff4700::4826ac62a11a957fe332e2c291dcfcca"
+  ],
+  "pid": [],
+  "publisher": {
+    "dataInfo": {
+      "deletedbyinference": false,
+      "inferred": false,
+      "invisible": false,
+      "provenanceaction": {
+        "classid": "sysimport:crosswalk:repository",
+        "classname": "Harvested",
+        "schemeid": "dnet:provenanceActions",
+        "schemename": "dnet:provenanceActions"
+      },
+      "trust": "0.9"
+    },
+    "value": "episciences.org"
+  },
+  "relevantdate": [
+    {
+      "dataInfo": {
+        "deletedbyinference": false,
+        "inferred": false,
+        "invisible": false,
+        "provenanceaction": {
+          "classid": "sysimport:crosswalk:repository",
+          "classname": "Harvested",
+          "schemeid": "dnet:provenanceActions",
+          "schemename": "dnet:provenanceActions"
+        },
+        "trust": "0.9"
+      },
+      "qualifier": {
+        "classid": "Accepted",
+        "classname": "Accepted",
+        "schemeid": "dnet:dataCite_date",
+        "schemename": "dnet:dataCite_date"
+      },
+      "value": "2024-02-11"
+    },
+    {
+      "dataInfo": {
+        "deletedbyinference": false,
+        "inferred": false,
+        "invisible": false,
+        "provenanceaction": {
+          "classid": "sysimport:crosswalk:repository",
+          "classname": "Harvested",
+          "schemeid": "dnet:provenanceActions",
+          "schemename": "dnet:provenanceActions"
+        },
+        "trust": "0.9"
+      },
+      "qualifier": {
+        "classid": "issued",
+        "classname": "issued",
+        "schemeid": "dnet:dataCite_date",
+        "schemename": "dnet:dataCite_date"
+      },
+      "value": "2013-11-30"
+    },
+    {
+      "dataInfo": {
+        "deletedbyinference": false,
+        "inferred": false,
+        "invisible": false,
+        "provenanceaction": {
+          "classid": "sysimport:crosswalk:repository",
+          "classname": "Harvested",
+          "schemeid": "dnet:provenanceActions",
+          "schemename": "dnet:provenanceActions"
+        },
+        "trust": "0.9"
+      },
+      "qualifier": {
+        "classid": "available",
+        "classname": "available",
+        "schemeid": "dnet:dataCite_date",
+        "schemename": "dnet:dataCite_date"
+      },
+      "value": "2013-11-30"
+    }
+  ],
+  "resourcetype": {
+    "classid": "journal article",
+    "classname": "journal article",
+    "schemeid": "dnet:dataCite_resource",
+    "schemename": "dnet:dataCite_resource"
+  },
+  "resulttype": {
+    "classid": "publication",
+    "classname": "publication",
+    "schemeid": "dnet:result_typologies",
+    "schemename": "dnet:result_typologies"
+  },
+  "source": [],
+  "subject": [
+    {
+      "dataInfo": {
+        "deletedbyinference": false,
+        "inferred": false,
+        "invisible": false,
+        "provenanceaction": {
+          "classid": "sysimport:crosswalk:repository",
+          "classname": "Harvested",
+          "schemeid": "dnet:provenanceActions",
+          "schemename": "dnet:provenanceActions"
+        },
+        "trust": "0.9"
+      },
+      "qualifier": {
+        "classid": "keyword",
+        "classname": "keyword",
+        "schemeid": "dnet:subject_classification_typologies",
+        "schemename": "dnet:subject_classification_typologies"
+      },
+      "value": "JEL: H - Public Economics/H.H7 - State and Local Government • Intergovernmental Relations/H.H7.H72 - State and Local Budget and Expenditures"
+    },
+    {
+      "dataInfo": {
+        "deletedbyinference": false,
+        "inferred": false,
+        "invisible": false,
+        "provenanceaction": {
+          "classid": "sysimport:crosswalk:repository",
+          "classname": "Harvested",
+          "schemeid": "dnet:provenanceActions",
+          "schemename": "dnet:provenanceActions"
+        },
+        "trust": "0.9"
+      },
+      "qualifier": {
+        "classid": "keyword",
+        "classname": "keyword",
+        "schemeid": "dnet:subject_classification_typologies",
+        "schemename": "dnet:subject_classification_typologies"
+      },
+      "value": "Local public finance"
+    },
+    {
+      "dataInfo": {
+        "deletedbyinference": false,
+        "inferred": false,
+        "invisible": false,
+        "provenanceaction": {
+          "classid": "sysimport:crosswalk:repository",
+          "classname": "Harvested",
+          "schemeid": "dnet:provenanceActions",
+          "schemename": "dnet:provenanceActions"
+        },
+        "trust": "0.9"
+      },
+      "qualifier": {
+        "classid": "keyword",
+        "classname": "keyword",
+        "schemeid": "dnet:subject_classification_typologies",
+        "schemename": "dnet:subject_classification_typologies"
+      },
+      "value": "JEL: R - Urban, Rural, Regional, Real Estate, and Transportation Economics/R.R5 - Regional Government Analysis/R.R5.R51 - Finance in Urban and Rural Economies"
+    }
+  ],
+  "title": [
+    {
+      "dataInfo": {
+        "deletedbyinference": false,
+        "inferred": false,
+        "invisible": false,
+        "provenanceaction": {
+          "classid": "sysimport:crosswalk:repository",
+          "classname": "Harvested",
+          "schemeid": "dnet:provenanceActions",
+          "schemename": "dnet:provenanceActions"
+        },
+        "trust": "0.9"
+      },
+      "qualifier": {
+        "classid": "main title",
+        "classname": "main title",
+        "schemeid": "dnet:dataCite_title",
+        "schemename": "dnet:dataCite_title"
+      },
+      "value": "Urban form and the costs of daily mobility. The spatialized travel account tool and its application to the Bordeaux metropolitan area"
+    }
+  ]
+}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_urn1.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publication_urn1.json
@ -0,0 +1,23 @@
+{
+  "id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f",
+  "pid": [
+    {
+      "qualifier": {
+        "classid": "urn"
+      },
+      "value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
+    },
+    {
+      "qualifier": {
+        "classid": "scp-number"
+      },
+      "value": "79953761260"
+    },
+    {
+      "qualifier": {
+        "classid": "pmcid"
+      },
+      "value": "21459329"
+    }
+  ]
+}
--- a/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publications.json
+++ b/dhp-common/src/test/resources/eu/dnetlib/dhp/schema/oaf/utils/publications.json
--- a/dhp-pace-core/pom.xml
+++ b/dhp-pace-core/pom.xml
@ -49,6 +49,12 @@
 	</build>

 	<dependencies>
+		<dependency>
+			<groupId>eu.dnetlib.dhp</groupId>
+			<artifactId>dhp-common</artifactId>
+			<version>${project.version}</version>
+		</dependency>
+
 		<dependency>
 			<groupId>edu.cmu</groupId>
 			<artifactId>secondstring</artifactId>
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/AbstractClusteringFunction.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/AbstractClusteringFunction.java
@ -14,9 +14,9 @@ import eu.dnetlib.pace.config.Config;

 public abstract class AbstractClusteringFunction extends AbstractPaceFunctions implements ClusteringFunction {

-	protected Map<String, Integer> params;
+	protected Map<String, Object> params;

-	public AbstractClusteringFunction(final Map<String, Integer> params) {
+	public AbstractClusteringFunction(final Map<String, Object> params) {
 		this.params = params;
 	}

@ -27,7 +27,7 @@ public abstract class AbstractClusteringFunction extends AbstractPaceFunctions i
 		return fields
 			.stream()
 			.filter(f -> !f.isEmpty())
-			.map(this::normalize)
+			.map(s -> normalize(s))
 			.map(s -> filterAllStopWords(s))
 			.map(s -> doApply(conf, s))
 			.map(c -> filterBlacklisted(c, ngramBlacklist))
@ -36,11 +36,24 @@ public abstract class AbstractClusteringFunction extends AbstractPaceFunctions i
 			.collect(Collectors.toCollection(HashSet::new));
 	}

-	public Map<String, Integer> getParams() {
+	public Map<String, Object> getParams() {
 		return params;
 	}

 	protected Integer param(String name) {
-		return params.get(name);
+		Object val = params.get(name);
+		if (val == null)
+			return null;
+		if (val instanceof Number) {
+			return ((Number) val).intValue();
+		}
+		return Integer.parseInt(val.toString());
+	}
+
+	protected int paramOrDefault(String name, int i) {
+		Integer res = param(name);
+		if (res == null)
+			res = i;
+		return res;
 	}
 }
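Review note: widening `params` from `Map<String, Integer>` to `Map<String, Object>` means numeric settings can now arrive as numbers or as strings, next to non-numeric entries such as the jpath values read by `JSONListClustering`. The following is a minimal stand-alone sketch of the coercion done by `param` and `paramOrDefault`. The class name `ParamCoercionSketch` and the sample keys `ratio` and `jpath_classid` are hypothetical and are not part of the change set.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone class; it only mirrors the param/paramOrDefault logic shown in the diff.
class ParamCoercionSketch {

	// Returns the parameter as an Integer, or null when the key is absent.
	static Integer param(Map<String, Object> params, String name) {
		Object val = params.get(name);
		if (val == null)
			return null;
		if (val instanceof Number)
			return ((Number) val).intValue(); // truncates fractional values
		return Integer.parseInt(val.toString()); // throws NumberFormatException on non-numeric text
	}

	// Falls back to dflt when the key is absent.
	static int paramOrDefault(Map<String, Object> params, String name, int dflt) {
		Integer res = param(params, name);
		return res == null ? dflt : res;
	}

	public static void main(String[] args) {
		Map<String, Object> params = new HashMap<>();
		params.put("mod", 10); // numeric value
		params.put("max", "2"); // numeric value serialized as a string
		params.put("ratio", 3.9); // sample key, assumed for illustration
		params.put("jpath_classid", "$.pid[*].qualifier.classid"); // sample key, assumed for illustration

		System.out.println(param(params, "mod")); // 10
		System.out.println(param(params, "max")); // 2
		System.out.println(param(params, "ratio")); // 3
		System.out.println(paramOrDefault(params, "windowSize", 4)); // 4
	}
}
```

Calling `param` on a non-numeric entry (here `jpath_classid`) raises a `NumberFormatException`, so callers should only request keys they expect to be numeric.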
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/Acronyms.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/Acronyms.java
@ -13,7 +13,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("acronyms")
 public class Acronyms extends AbstractClusteringFunction {

-	public Acronyms(Map<String, Integer> params) {
+	public Acronyms(Map<String, Object> params) {
 		super(params);
 	}

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/ClusteringFunction.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/ClusteringFunction.java
@ -11,6 +11,6 @@ public interface ClusteringFunction {

 	public Collection<String> apply(Config config, List<String> fields);

-	public Map<String, Integer> getParams();
+	public Map<String, Object> getParams();

 }
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/ImmutableFieldValue.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/ImmutableFieldValue.java
@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("immutablefieldvalue")
 public class ImmutableFieldValue extends AbstractClusteringFunction {

-	public ImmutableFieldValue(final Map<String, Integer> params) {
+	public ImmutableFieldValue(final Map<String, Object> params) {
 		super(params);
 	}

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/JSONListClustering.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/JSONListClustering.java
@ -0,0 +1,69 @@
+
+package eu.dnetlib.pace.clustering;
+
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import org.apache.commons.lang3.StringUtils;
+
+import com.jayway.jsonpath.Configuration;
+import com.jayway.jsonpath.DocumentContext;
+import com.jayway.jsonpath.JsonPath;
+import com.jayway.jsonpath.Option;
+
+import eu.dnetlib.pace.common.AbstractPaceFunctions;
+import eu.dnetlib.pace.config.Config;
+import eu.dnetlib.pace.util.MapDocumentUtil;
+
+@ClusteringClass("jsonlistclustering")
+public class JSONListClustering extends AbstractPaceFunctions implements ClusteringFunction {
+
+	private Map<String, Object> params;
+
+	public JSONListClustering(Map<String, Object> params) {
+		this.params = params;
+	}
+
+	@Override
+	public Map<String, Object> getParams() {
+		return params;
+	}
+
+	@Override
+	public Collection<String> apply(Config conf, List<String> fields) {
+		return fields
+			.stream()
+			.filter(f -> !f.isEmpty())
+			.map(s -> doApply(conf, s))
+			.filter(StringUtils::isNotBlank)
+			.collect(Collectors.toCollection(HashSet::new));
+	}
+
+	private String doApply(Config conf, String json) {
+		// builds the string used for comparisons from the jpath entries in the parameters
+		StringBuilder st = new StringBuilder();
+		final DocumentContext documentContext = JsonPath
+			.using(Configuration.defaultConfiguration().addOptions(Option.SUPPRESS_EXCEPTIONS))
+			.parse(json);
+
+		// for each path in the param list
+		for (String key : params.keySet().stream().filter(k -> k.contains("jpath")).collect(Collectors.toList())) {
+			String path = params.get(key).toString();
+			String value = MapDocumentUtil.getJPathString(path, documentContext);
+			if (value == null || value.isEmpty())
+				value = "";
+			st.append(value);
+			st.append(" ");
+		}
+
+		if (st.length() > 0) st.setLength(st.length() - 1); // drop the trailing separator; no-op when no jpath param exists
+
+		if (StringUtils.isBlank(st)) {
+			return "1";
+		}
+		return st.toString();
+	}
+}
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/KeywordsClustering.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/KeywordsClustering.java
@ -11,7 +11,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("keywordsclustering")
 public class KeywordsClustering extends AbstractClusteringFunction {

-	public KeywordsClustering(Map<String, Integer> params) {
+	public KeywordsClustering(Map<String, Object> params) {
 		super(params);
 	}

@ -19,8 +19,8 @@ public class KeywordsClustering extends AbstractClusteringFunction {
 	protected Collection<String> doApply(final Config conf, String s) {

 		// takes city codes and keywords codes without duplicates
-		Set<String> keywords = getKeywords(s, conf.translationMap(), params.getOrDefault("windowSize", 4));
-		Set<String> cities = getCities(s, params.getOrDefault("windowSize", 4));
+		Set<String> keywords = getKeywords(s, conf.translationMap(), paramOrDefault("windowSize", 4));
+		Set<String> cities = getCities(s, paramOrDefault("windowSize", 4));

 		// list of combination to return as result
 		final Collection<String> combinations = new LinkedHashSet<String>();
@ -28,7 +28,7 @@ public class KeywordsClustering extends AbstractClusteringFunction {
 		for (String keyword : keywordsToCodes(keywords, conf.translationMap())) {
 			for (String city : citiesToCodes(cities)) {
 				combinations.add(keyword + "-" + city);
-				if (combinations.size() >= params.getOrDefault("max", 2)) {
+				if (combinations.size() >= paramOrDefault("max", 2)) {
 					return combinations;
 				}
 			}
@ -42,8 +42,8 @@ public class KeywordsClustering extends AbstractClusteringFunction {
 		return fields
 			.stream()
 			.filter(f -> !f.isEmpty())
-			.map(this::cleanup)
-			.map(this::normalize)
+			.map(KeywordsClustering::cleanup)
+			.map(KeywordsClustering::normalize)
 			.map(s -> filterAllStopWords(s))
 			.map(s -> doApply(conf, s))
 			.map(c -> filterBlacklisted(c, ngramBlacklist))
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/LastNameFirstInitial.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/LastNameFirstInitial.java
@ -16,7 +16,7 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {

 	private boolean DEFAULT_AGGRESSIVE = true;

-	public LastNameFirstInitial(final Map<String, Integer> params) {
+	public LastNameFirstInitial(final Map<String, Object> params) {
 		super(params);
 	}

@ -25,7 +25,7 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {
 		return fields
 			.stream()
 			.filter(f -> !f.isEmpty())
-			.map(this::normalize)
+			.map(LastNameFirstInitial::normalize)
 			.map(s -> doApply(conf, s))
 			.map(c -> filterBlacklisted(c, ngramBlacklist))
 			.flatMap(c -> c.stream())
@ -33,8 +33,7 @@ public class LastNameFirstInitial extends AbstractClusteringFunction {
 			.collect(Collectors.toCollection(HashSet::new));
 	}

-	@Override
-	protected String normalize(final String s) {
+	public static String normalize(final String s) {
 		return fixAliases(transliterate(nfd(unicodeNormalization(s))))
 			// do not compact the regexes in a single expression, would cause StackOverflowError in case of large input
 			// strings
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/LowercaseClustering.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/LowercaseClustering.java
@ -15,7 +15,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("lowercase")
 public class LowercaseClustering extends AbstractClusteringFunction {

-	public LowercaseClustering(final Map<String, Integer> params) {
+	public LowercaseClustering(final Map<String, Object> params) {
 		super(params);
 	}

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/NgramPairs.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/NgramPairs.java
@ -12,11 +12,11 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("ngrampairs")
 public class NgramPairs extends Ngrams {

-	public NgramPairs(Map<String, Integer> params) {
+	public NgramPairs(Map<String, Object> params) {
 		super(params, false);
 	}

-	public NgramPairs(Map<String, Integer> params, boolean sorted) {
+	public NgramPairs(Map<String, Object> params, boolean sorted) {
 		super(params, sorted);
 	}

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/Ngrams.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/Ngrams.java
@ -10,11 +10,11 @@ public class Ngrams extends AbstractClusteringFunction {

 	private final boolean sorted;

-	public Ngrams(Map<String, Integer> params) {
+	public Ngrams(Map<String, Object> params) {
 		this(params, false);
 	}

-	public Ngrams(Map<String, Integer> params, boolean sorted) {
+	public Ngrams(Map<String, Object> params, boolean sorted) {
 		super(params);
 		this.sorted = sorted;
 	}
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/NumAuthorsTitleSuffixPrefixChain.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/NumAuthorsTitleSuffixPrefixChain.java
@ -0,0 +1,113 @@
+
+package eu.dnetlib.pace.clustering;
+
+import java.util.*;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+
+import com.google.common.base.Splitter;
+import com.google.common.collect.Sets;
+
+import eu.dnetlib.pace.config.Config;
+
+@ClusteringClass("numAuthorsTitleSuffixPrefixChain")
+public class NumAuthorsTitleSuffixPrefixChain extends AbstractClusteringFunction {
+
+	public NumAuthorsTitleSuffixPrefixChain(Map<String, Object> params) {
+		super(params);
+	}
+
+	@Override
+	public Collection<String> apply(Config conf, List<String> fields) {
+
+		try {
+			int num_authors = Math.min(Integer.parseInt(fields.get(0)), 21); // SIZE threshold is 20, +1
+
+			if (num_authors > 0) {
+				return super.apply(conf, fields.subList(1, fields.size()))
+					.stream()
+					.map(s -> num_authors + "-" + s)
+					.collect(Collectors.toList());
+			}
+		} catch (NumberFormatException e) {
+			// missing or null authors array
+		}
+
+		return Collections.emptyList();
+	}
+
+	@Override
+	protected Collection<String> doApply(Config conf, String s) {
+		return suffixPrefixChain(cleanup(s), param("mod"));
+	}
+
+	private Collection<String> suffixPrefixChain(String s, int mod) {
+		// create the list of words from the string (remove short words)
+		List<String> wordsList = Arrays
+			.stream(s.split(" "))
+			.filter(si -> si.length() > 3)
+			.collect(Collectors.toList());
+
+		final int words = wordsList.size();
+		final int letters = s.length();
+
+		// create the prefix: number of words / mod
+		String prefix = words / mod + "-";
+
+		return doSuffixPrefixChain(wordsList, prefix);
+
+	}
+
+	private Collection<String> doSuffixPrefixChain(List<String> wordsList, String prefix) {
+
+		Set<String> set = Sets.newLinkedHashSet();
+		switch (wordsList.size()) {
+			case 0:
+				break;
+			case 1:
+				set.add(wordsList.get(0));
+				break;
+			case 2:
+				set
+					.add(
+						prefix +
+							suffix(wordsList.get(0), 3) +
+							prefix(wordsList.get(1), 3));
+
+				set
+					.add(
+						prefix +
+							prefix(wordsList.get(0), 3) +
+							suffix(wordsList.get(1), 3));
+
+				break;
+			default:
+				set
+					.add(
+						prefix +
+							suffix(wordsList.get(0), 3) +
+							prefix(wordsList.get(1), 3) +
+							suffix(wordsList.get(2), 3));
+
+				set
+					.add(
+						prefix +
+							prefix(wordsList.get(0), 3) +
+							suffix(wordsList.get(1), 3) +
+							prefix(wordsList.get(2), 3));
+				break;
+		}
+
+		return set;
+
+	}
+
+	private String suffix(String s, int len) {
+		return s.substring(s.length() - len);
+	}
+
+	private String prefix(String s, int len) {
+		return s.substring(0, len);
+	}
+
+}
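Review note: a worked example of the keys `NumAuthorsTitleSuffixPrefixChain` appears to produce. This is a simplified sketch under stated assumptions: it skips `cleanup`, normalization, stop-word filtering and the ngram blacklist applied by the real `apply` chain, and it returns an ordered list where the real code collects through a `HashSet`. The class name `NumAuthorsKeySketch` and the helpers `chain`, `head`, `tail` and `keys` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-alone class; it imitates the key layout "<numAuthors>-<words/mod>-<chain>".
class NumAuthorsKeySketch {

	static String head(String s) {
		return s.substring(0, 3);
	}

	static String tail(String s) {
		return s.substring(s.length() - 3);
	}

	// Builds the suffix/prefix chain over the first (up to) three words longer than 3 characters.
	static Set<String> chain(String s, int mod) {
		List<String> words = Arrays
			.stream(s.split(" "))
			.filter(w -> w.length() > 3)
			.collect(Collectors.toList());
		String prefix = words.size() / mod + "-";
		Set<String> keys = new LinkedHashSet<>();
		if (words.size() == 1) {
			keys.add(words.get(0)); // a single word is used as is, without prefix
		} else if (words.size() == 2) {
			keys.add(prefix + tail(words.get(0)) + head(words.get(1)));
			keys.add(prefix + head(words.get(0)) + tail(words.get(1)));
		} else if (words.size() >= 3) {
			keys.add(prefix + tail(words.get(0)) + head(words.get(1)) + tail(words.get(2)));
			keys.add(prefix + head(words.get(0)) + tail(words.get(1)) + head(words.get(2)));
		}
		return keys;
	}

	// Prepends the author count, capped at 21, to every chain key; no keys without authors.
	static List<String> keys(int numAuthors, String title, int mod) {
		List<String> out = new ArrayList<>();
		if (numAuthors <= 0)
			return out;
		int capped = Math.min(numAuthors, 21);
		for (String k : chain(title, mod))
			out.add(capped + "-" + k);
		return out;
	}

	public static void main(String[] args) {
		String title = "urban form and the costs of daily mobility";
		System.out.println(keys(3, title, 10)); // [3-0-banforsts, 3-0-urbormcos]
		System.out.println(keys(50, title, 10)); // [21-0-banforsts, 21-0-urbormcos]
		System.out.println(keys(0, title, 10)); // []
	}
}
```

Two records therefore share a cluster key only when they have the same capped author count and matching title fragments, which is presumably the intent of putting the author count in front of the title chain.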
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/PersonClustering.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/PersonClustering.java
@ -17,11 +17,11 @@ import eu.dnetlib.pace.model.Person;
@ClusteringClass("personClustering")
 public class PersonClustering extends AbstractPaceFunctions implements ClusteringFunction {

-	private Map<String, Integer> params;
+	private Map<String, Object> params;

 	private static final int MAX_TOKENS = 5;

-	public PersonClustering(final Map<String, Integer> params) {
+	public PersonClustering(final Map<String, Object> params) {
 		this.params = params;
 	}

@ -77,7 +77,7 @@ public class PersonClustering extends AbstractPaceFunctions implements Clusterin
 //	}

 	@Override
-	public Map<String, Integer> getParams() {
+	public Map<String, Object> getParams() {
 		return params;
 	}

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/PersonHash.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/PersonHash.java
@ -15,7 +15,7 @@ public class PersonHash extends AbstractClusteringFunction {

 	private boolean DEFAULT_AGGRESSIVE = false;

-	public PersonHash(final Map<String, Integer> params) {
+	public PersonHash(final Map<String, Object> params) {
 		super(params);
 	}

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/RandomClusteringFunction.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/RandomClusteringFunction.java
@ -8,7 +8,7 @@ import eu.dnetlib.pace.config.Config;

 public class RandomClusteringFunction extends AbstractClusteringFunction {

-	public RandomClusteringFunction(Map<String, Integer> params) {
+	public RandomClusteringFunction(Map<String, Object> params) {
 		super(params);
 	}

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/SortedNgramPairs.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/SortedNgramPairs.java
@ -1,7 +1,10 @@

 package eu.dnetlib.pace.clustering;

-import java.util.*;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;

 import com.google.common.base.Joiner;
 import com.google.common.base.Splitter;
@ -12,7 +15,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("sortedngrampairs")
 public class SortedNgramPairs extends NgramPairs {

-	public SortedNgramPairs(Map<String, Integer> params) {
+	public SortedNgramPairs(Map<String, Object> params) {
 		super(params, false);
 	}

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/SpaceTrimmingFieldValue.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/SpaceTrimmingFieldValue.java
@ -15,7 +15,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("spacetrimmingfieldvalue")
 public class SpaceTrimmingFieldValue extends AbstractClusteringFunction {

-	public SpaceTrimmingFieldValue(final Map<String, Integer> params) {
+	public SpaceTrimmingFieldValue(final Map<String, Object> params) {
 		super(params);
 	}

@ -25,7 +25,7 @@ public class SpaceTrimmingFieldValue extends AbstractClusteringFunction {

 		res
 			.add(
-				StringUtils.isBlank(s) ? RandomStringUtils.random(getParams().get("randomLength"))
+				StringUtils.isBlank(s) ? RandomStringUtils.random(param("randomLength"))
 					: s.toLowerCase().replaceAll("\\s+", ""));

 		return res;
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/SuffixPrefix.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/SuffixPrefix.java
@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("suffixprefix")
 public class SuffixPrefix extends AbstractClusteringFunction {

-	public SuffixPrefix(Map<String, Integer> params) {
+	public SuffixPrefix(Map<String, Object> params) {
 		super(params);
 	}

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/UrlClustering.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/UrlClustering.java
@ -15,12 +15,17 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("urlclustering")
 public class UrlClustering extends AbstractPaceFunctions implements ClusteringFunction {

-	protected Map<String, Integer> params;
+	protected Map<String, Object> params;

-	public UrlClustering(final Map<String, Integer> params) {
+	public UrlClustering(final Map<String, Object> params) {
 		this.params = params;
 	}

+	@Override
+	public Map<String, Object> getParams() {
+		return params;
+	}
+
 	@Override
 	public Collection<String> apply(final Config conf, List<String> fields) {
 		try {
@ -35,11 +40,6 @@ public class UrlClustering extends AbstractPaceFunctions implements ClusteringFu
 		}
 	}

-	@Override
-	public Map<String, Integer> getParams() {
-		return null;
-	}
-
 	private URL asUrl(String value) {
 		try {
 			return new URL(value);
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/WordsStatsSuffixPrefixChain.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/WordsStatsSuffixPrefixChain.java
@ -11,7 +11,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("wordsStatsSuffixPrefixChain")
 public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {

-	public WordsStatsSuffixPrefixChain(Map<String, Integer> params) {
+	public WordsStatsSuffixPrefixChain(Map<String, Object> params) {
 		super(params);
 	}

@ -20,7 +20,7 @@ public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {
 		return suffixPrefixChain(s, param("mod"));
 	}

-	private Collection<String> suffixPrefixChain(String s, int mod) {
+	static Collection<String> suffixPrefixChain(String s, int mod) {

 		// create the list of words from the string (remove short words)
 		List<String> wordsList = Arrays
@ -38,7 +38,7 @@ public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {

 	}

-	private Collection<String> doSuffixPrefixChain(List<String> wordsList, String prefix) {
+	private static Collection<String> doSuffixPrefixChain(List<String> wordsList, String prefix) {

 		Set<String> set = Sets.newLinkedHashSet();
 		switch (wordsList.size()) {
@ -80,12 +80,16 @@ public class WordsStatsSuffixPrefixChain extends AbstractClusteringFunction {

 	}

-	private String suffix(String s, int len) {
+	private static String suffix(String s, int len) {
 		return s.substring(s.length() - len);
 	}

-	private String prefix(String s, int len) {
+	private static String prefix(String s, int len) {
 		return s.substring(0, len);
 	}

+	public static void main(String[] args) {
+		String title = "MY LIFE AS A BOSON: THE STORY OF \"THE HIGGS\"".toLowerCase();
+		System.out.println(suffixPrefixChain(title, 10));
+	}
 }
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/WordsSuffixPrefix.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/WordsSuffixPrefix.java
@ -12,7 +12,7 @@ import eu.dnetlib.pace.config.Config;
@ClusteringClass("wordssuffixprefix")
 public class WordsSuffixPrefix extends AbstractClusteringFunction {

-	public WordsSuffixPrefix(Map<String, Integer> params) {
+	public WordsSuffixPrefix(Map<String, Object> params) {
 		super(params);
 	}

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/common/AbstractPaceFunctions.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/common/AbstractPaceFunctions.java
@ -4,7 +4,6 @@ package eu.dnetlib.pace.common;
 import java.io.IOException;
 import java.io.StringWriter;
 import java.nio.charset.StandardCharsets;
-import java.text.Normalizer;
 import java.util.*;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
@ -14,20 +13,15 @@ import org.apache.commons.io.IOUtils;
 import org.apache.commons.lang3.StringUtils;

 import com.google.common.base.Joiner;
-import com.google.common.base.Splitter;
-import com.google.common.collect.Iterables;
-import com.google.common.collect.Lists;
 import com.google.common.collect.Sets;
 import com.ibm.icu.text.Transliterator;

-import eu.dnetlib.pace.clustering.NGramUtils;
-
 /**
 * Set of common functions for the framework
 *
 * @author claudio
 */
-public abstract class AbstractPaceFunctions {
+public class AbstractPaceFunctions extends PaceCommonUtils {

 	// city map to be used when translating the city names into codes
 	private static Map<String, String> cityMap = AbstractPaceFunctions
@ -42,9 +36,6 @@ public abstract class AbstractPaceFunctions {
 	protected static Set<String> stopwords_it = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_it.txt");
 	protected static Set<String> stopwords_pt = loadFromClasspath("/eu/dnetlib/pace/config/stopwords_pt.txt");

-	// transliterator
-	protected static Transliterator transliterator = Transliterator.getInstance("Any-Eng");
-
 	// blacklist of ngrams: to avoid generic keys
 	protected static Set<String> ngramBlacklist = loadFromClasspath("/eu/dnetlib/pace/config/ngram_blacklist.txt");

@ -52,8 +43,6 @@ public abstract class AbstractPaceFunctions {
 	public static final Pattern HTML_REGEX = Pattern.compile("<[^>]*>");

 	private static final String alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ";
-	private static final String aliases_from = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎àáâäæãåāèéêëēėęəîïíīįìôöòóœøōõûüùúūßśšłžźżçćčñń";
-	private static final String aliases_to = "0123456789+-=()n0123456789+-=()aaaaaaaaeeeeeeeeiiiiiioooooooouuuuussslzzzcccnn";

 	// doi prefix for normalization
 	public static final Pattern DOI_PREFIX = Pattern.compile("(https?:\\/\\/dx\\.doi\\.org\\/)|(doi:)");
@ -62,11 +51,14 @@ public abstract class AbstractPaceFunctions {

 	private static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");

-	protected String concat(final List<String> l) {
+	private static Pattern romanNumberPattern = Pattern
+		.compile("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$");
+
+	protected static String concat(final List<String> l) {
 		return Joiner.on(" ").skipNulls().join(l);
 	}

-	protected String cleanup(final String s) {
+	public static String cleanup(final String s) {
 		final String s1 = HTML_REGEX.matcher(s).replaceAll("");
 		final String s2 = unicodeNormalization(s1.toLowerCase());
 		final String s3 = nfd(s2);
@ -82,7 +74,7 @@ public abstract class AbstractPaceFunctions {
 		return s12;
 	}

-	protected String fixXML(final String a) {
+	protected static String fixXML(final String a) {

 		return a
 			.replaceAll("&ndash;", " ")
@ -91,7 +83,7 @@ public abstract class AbstractPaceFunctions {
 			.replaceAll("&minus;", " ");
 	}

-	protected boolean checkNumbers(final String a, final String b) {
+	protected static boolean checkNumbers(final String a, final String b) {
 		final String numbersA = getNumbers(a);
 		final String numbersB = getNumbers(b);
 		final String romansA = getRomans(a);
@ -99,7 +91,7 @@ public abstract class AbstractPaceFunctions {
 		return !numbersA.equals(numbersB) || !romansA.equals(romansB);
 	}

-	protected String getRomans(final String s) {
+	protected static String getRomans(final String s) {
 		final StringBuilder sb = new StringBuilder();
 		for (final String t : s.split(" ")) {
 			sb.append(isRoman(t) ? t : "");
@ -107,13 +99,12 @@ public abstract class AbstractPaceFunctions {
 		return sb.toString();
 	}

-	protected boolean isRoman(final String s) {
-		return s
-			.replaceAll("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", "qwertyuiop")
-			.equals("qwertyuiop");
+	protected static boolean isRoman(final String s) {
+		Matcher m = romanNumberPattern.matcher(s);
+		return m.matches() && m.hitEnd();
 	}

-	protected String getNumbers(final String s) {
+	protected static String getNumbers(final String s) {
 		final StringBuilder sb = new StringBuilder();
 		for (final String t : s.split(" ")) {
 			sb.append(isNumber(t) ? t : "");
@ -121,33 +112,14 @@ public abstract class AbstractPaceFunctions {
 		return sb.toString();
 	}

-	public boolean isNumber(String strNum) {
+	public static boolean isNumber(String strNum) {
 		if (strNum == null) {
 			return false;
 		}
 		return numberPattern.matcher(strNum).matches();
 	}

-	protected static String fixAliases(final String s) {
-		final StringBuilder sb = new StringBuilder();
-
-		s.chars().forEach(ch -> {
-			final int i = StringUtils.indexOf(aliases_from, ch);
-			sb.append(i >= 0 ? aliases_to.charAt(i) : (char) ch);
-		});
-
-		return sb.toString();
-	}
-
-	protected static String transliterate(final String s) {
-		try {
-			return transliterator.transliterate(s);
-		} catch (Exception e) {
-			return s;
-		}
-	}
-
-	protected String removeSymbols(final String s) {
+	protected static String removeSymbols(final String s) {
 		final StringBuilder sb = new StringBuilder();

 		s.chars().forEach(ch -> {
@ -157,33 +129,16 @@ public abstract class AbstractPaceFunctions {
 		return sb.toString().replaceAll("\\s+", " ");
 	}

-	protected boolean notNull(final String s) {
+	protected static boolean notNull(final String s) {
 		return s != null;
 	}

-	protected String normalize(final String s) {
-		return fixAliases(transliterate(nfd(unicodeNormalization(s))))
-			.toLowerCase()
-			// do not compact the regexes in a single expression, would cause StackOverflowError in case of large input
-			// strings
-			.replaceAll("[^ \\w]+", "")
-			.replaceAll("(\\p{InCombiningDiacriticalMarks})+", "")
-			.replaceAll("(\\p{Punct})+", " ")
-			.replaceAll("(\\d)+", " ")
-			.replaceAll("(\\n)+", " ")
-			.trim();
-	}
-
-	public String nfd(final String s) {
-		return Normalizer.normalize(s, Normalizer.Form.NFD);
-	}
-
-	public String utf8(final String s) {
+	public static String utf8(final String s) {
 		byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
 		return new String(bytes, StandardCharsets.UTF_8);
 	}

-	public String unicodeNormalization(final String s) {
+	public static String unicodeNormalization(final String s) {

 		Matcher m = hexUnicodePattern.matcher(s);
 		StringBuffer buf = new StringBuffer(s.length());
@ -195,7 +150,7 @@ public abstract class AbstractPaceFunctions {
 		return buf.toString();
 	}

-	protected String filterStopWords(final String s, final Set<String> stopwords) {
+	protected static String filterStopWords(final String s, final Set<String> stopwords) {
 		final StringTokenizer st = new StringTokenizer(s);
 		final StringBuilder sb = new StringBuilder();
 		while (st.hasMoreTokens()) {
@ -208,7 +163,7 @@ public abstract class AbstractPaceFunctions {
 		return sb.toString().trim();
 	}

-	public String filterAllStopWords(String s) {
+	public static String filterAllStopWords(String s) {

 		s = filterStopWords(s, stopwords_en);
 		s = filterStopWords(s, stopwords_de);
@ -221,7 +176,8 @@ public abstract class AbstractPaceFunctions {
 		return s;
 	}

-	protected Collection<String> filterBlacklisted(final Collection<String> set, final Set<String> ngramBlacklist) {
+	protected static Collection<String> filterBlacklisted(final Collection<String> set,
+		final Set<String> ngramBlacklist) {
 		final Set<String> newset = Sets.newLinkedHashSet();
 		for (final String s : set) {
 			if (!ngramBlacklist.contains(s)) {
@ -231,22 +187,6 @@ public abstract class AbstractPaceFunctions {
 		return newset;
 	}

-	public static Set<String> loadFromClasspath(final String classpath) {
-
-		Transliterator transliterator = Transliterator.getInstance("Any-Eng");
-
-		final Set<String> h = Sets.newHashSet();
-		try {
-			for (final String s : IOUtils
-				.readLines(NGramUtils.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
-				h.add(fixAliases(transliterator.transliterate(s))); // transliteration of the stopwords
-			}
-		} catch (final Throwable e) {
-			return Sets.newHashSet();
-		}
-		return h;
-	}
-
 	public static Map<String, String> loadMapFromClasspath(final String classpath) {

 		Transliterator transliterator = Transliterator.getInstance("Any-Eng");
@ -268,7 +208,7 @@ public abstract class AbstractPaceFunctions {
 		return m;
 	}

-	public String removeKeywords(String s, Set<String> keywords) {
+	public static String removeKeywords(String s, Set<String> keywords) {

 		s = " " + s + " ";
 		for (String k : keywords) {
@ -278,39 +218,35 @@ public abstract class AbstractPaceFunctions {
 		return s.trim();
 	}

-	public double commonElementsPercentage(Set<String> s1, Set<String> s2) {
+	public static double commonElementsPercentage(Set<String> s1, Set<String> s2) {

 		double longer = Math.max(s1.size(), s2.size());
 		return (double) s1.stream().filter(s2::contains).count() / longer;
 	}

 	// convert the set of keywords to codes
-	public Set<String> toCodes(Set<String> keywords, Map<String, String> translationMap) {
+	public static Set<String> toCodes(Set<String> keywords, Map<String, String> translationMap) {
 		return keywords.stream().map(s -> translationMap.get(s)).collect(Collectors.toSet());
 	}

-	public Set<String> keywordsToCodes(Set<String> keywords, Map<String, String> translationMap) {
+	public static Set<String> keywordsToCodes(Set<String> keywords, Map<String, String> translationMap) {
 		return toCodes(keywords, translationMap);
 	}

-	public Set<String> citiesToCodes(Set<String> keywords) {
+	public static Set<String> citiesToCodes(Set<String> keywords) {
 		return toCodes(keywords, cityMap);
 	}

-	protected String firstLC(final String s) {
+	protected static String firstLC(final String s) {
 		return StringUtils.substring(s, 0, 1).toLowerCase();
 	}

-	protected Iterable<String> tokens(final String s, final int maxTokens) {
-		return Iterables.limit(Splitter.on(" ").omitEmptyStrings().trimResults().split(s), maxTokens);
-	}
-
-	public String normalizePid(String pid) {
+	public static String normalizePid(String pid) {
 		return DOI_PREFIX.matcher(pid.toLowerCase()).replaceAll("");
 	}

 	// get the list of keywords into the input string
-	public Set<String> getKeywords(String s1, Map<String, String> translationMap, int windowSize) {
+	public static Set<String> getKeywords(String s1, Map<String, String> translationMap, int windowSize) {

 		String s = s1;

@ -340,7 +276,7 @@ public abstract class AbstractPaceFunctions {
 		return codes;
 	}

-	public Set<String> getCities(String s1, int windowSize) {
+	public static Set<String> getCities(String s1, int windowSize) {
 		return getKeywords(s1, cityMap, windowSize);
 	}
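
Note on the `isRoman` change above: the hunk replaces a `replaceAll` call, which recompiles the regular expression on every invocation, with a pattern compiled once. A minimal, self-contained sketch of that idea follows. The class name `RomanNumeralCheck` is hypothetical and not part of the codebase; only the regular expression is taken from the diff.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class RomanNumeralCheck {

    // Compiled once and reused on every call; same expression as in the diff.
    private static final Pattern ROMAN = Pattern
        .compile("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$");

    static boolean isRoman(String s) {
        // matches() succeeds only if the whole input matches the pattern
        return ROMAN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String token : new String[] { "XIV", "MMXXIV", "IIII", "boson", "" }) {
            System.out.println("'" + token + "' -> " + isRoman(token));
        }
    }
}
```

Because every group in the expression is optional, the empty string also matches, so a caller that must reject empty tokens needs its own guard.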

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/ClusteringDef.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/ClusteringDef.java
@ -18,7 +18,7 @@ public class ClusteringDef implements Serializable {

 	private List<String> fields;

-	private Map<String, Integer> params;
+	private Map<String, Object> params;

 	public ClusteringDef() {
 	}
@ -43,11 +43,11 @@ public class ClusteringDef implements Serializable {
 		this.fields = fields;
 	}

-	public Map<String, Integer> getParams() {
+	public Map<String, Object> getParams() {
 		return params;
 	}

-	public void setParams(final Map<String, Integer> params) {
+	public void setParams(final Map<String, Object> params) {
 		this.params = params;
 	}

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/FieldDef.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/FieldDef.java
@ -2,6 +2,7 @@
 package eu.dnetlib.pace.model;

 import java.io.Serializable;
+import java.util.HashSet;
 import java.util.List;

 import com.fasterxml.jackson.core.JsonProcessingException;
@ -36,6 +37,16 @@ public class FieldDef implements Serializable {
 	 */
 	private int length = -1;

+	private HashSet<String> filter;
+
+	private boolean sorted;
+
+	public boolean isSorted() {
+		return sorted;
+	}
+
+	private String clean;
+
 	public FieldDef() {
 	}

@ -91,6 +102,30 @@ public class FieldDef implements Serializable {
 		this.path = path;
 	}

+	public HashSet<String> getFilter() {
+		return filter;
+	}
+
+	public void setFilter(HashSet<String> filter) {
+		this.filter = filter;
+	}
+
+	public boolean getSorted() {
+		return sorted;
+	}
+
+	public void setSorted(boolean sorted) {
+		this.sorted = sorted;
+	}
+
+	public String getClean() {
+		return clean;
+	}
+
+	public void setClean(String clean) {
+		this.clean = clean;
+	}
+
 	@Override
 	public String toString() {
 		try {
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/SparkDeduper.scala
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/SparkDeduper.scala
@ -5,9 +5,9 @@ import eu.dnetlib.pace.util.{BlockProcessor, SparkReporter}
 import org.apache.spark.SparkContext
 import org.apache.spark.sql.catalyst.expressions.Literal
 import org.apache.spark.sql.expressions._
-import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.functions.{col, desc, expr, lit, udf}
 import org.apache.spark.sql.types._
-import org.apache.spark.sql.{Column, Dataset, Row, functions}
+import org.apache.spark.sql.{Column, Dataset, Row, SaveMode, functions}

 import java.util.function.Predicate
 import java.util.stream.Collectors
@ -80,6 +80,8 @@ case class SparkDeduper(conf: DedupConfig) extends Serializable {
        .withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(inputColumns: _*))))
        // Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
        .withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName))))
+       // .withColumn("count", functions.max("position").over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName)).rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing) ))
+       // .filter("count > 1")

      if (df_with_clustering_keys == null)
        df_with_clustering_keys = ds
@ -88,20 +90,44 @@ case class SparkDeduper(conf: DedupConfig) extends Serializable {
    }

    //TODO: analytics
+    /*df_with_clustering_keys.groupBy(col("clustering"), col("key"))
+      .agg(expr("max(count) AS size"))
+      .orderBy(desc("size"))
+      .show*/

    val df_with_blocks = df_with_clustering_keys
-      // filter out rows with position exceeding the maxqueuesize parameter
-      .filter(col("position").leq(conf.getWf.getQueueMaxSize))
-      .groupBy("clustering", "key")
+      // split the clustering block into smaller blocks of queuemaxsize
+      .groupBy(col("clustering"), col("key"), functions.floor(col("position").divide(lit(conf.getWf.getQueueMaxSize))))
      .agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
      .filter(functions.size(new Column("block")).gt(1))
+       .union(
+        //adjacency blocks
+        df_with_clustering_keys
+          // filter out leading and trailing elements
+          .filter(col("position").gt(conf.getWf.getSlidingWindowSize/2))
+          //.filter(col("position").lt(col("count").minus(conf.getWf.getSlidingWindowSize/2)))
+          // create small blocks of records on "the border" of maxqueuesize: getSlidingWindowSize/2 elements before and after
+          .filter(
+            col("position").mod(conf.getWf.getQueueMaxSize).lt(conf.getWf.getSlidingWindowSize/2) // slice of the start of block
+            || col("position").mod(conf.getWf.getQueueMaxSize).gt(conf.getWf.getQueueMaxSize - (conf.getWf.getSlidingWindowSize/2)) //slice of the end of the block
+          )
+          .groupBy(col("clustering"), col("key"), functions.floor((col("position") + lit(conf.getWf.getSlidingWindowSize/2)).divide(lit(conf.getWf.getQueueMaxSize))))
+          .agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
+          .filter(functions.size(new Column("block")).gt(1))
+      )

    df_with_blocks
  }

  def clusterValuesUDF(cd: ClusteringDef) = {
    udf[mutable.WrappedArray[String], mutable.WrappedArray[Any]](values => {
-      values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala)
+      val valueList = values.flatMap {
+        case a: mutable.WrappedArray[Any] => a.map(_.toString)
+        case s: Any => Seq(s.toString)
+      }.asJava
+
+      mutable.WrappedArray.make(cd.clusteringFunction().apply(conf, valueList).toArray())
+
    })
  }
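
Note on the block splitting above: instead of discarding rows whose position exceeds the maximum queue size, the new code groups rows into consecutive blocks and adds small "adjacency" blocks that straddle each block border. The arithmetic can be sketched in plain Java as below. The class name and the two constants are assumptions chosen for illustration (the real values come from `conf.getWf`), and the sketch assumes both settings are integers so that `/ 2` is an integer division.

```java
// Hypothetical sketch of the position arithmetic used in the Spark expressions.
class BlockSplitSketch {

    static final int QUEUE_MAX_SIZE = 10;     // assumed value for illustration
    static final int SLIDING_WINDOW_SIZE = 4; // assumed value for illustration
    static final int HALF = SLIDING_WINDOW_SIZE / 2;

    // floor(position / queueMaxSize): index of the main block
    static int mainBlock(int position) {
        return position / QUEUE_MAX_SIZE;
    }

    // rows near a block border, excluding the very first rows of the key
    static boolean inBorderSlice(int position) {
        int offset = position % QUEUE_MAX_SIZE;
        return position > HALF && (offset < HALF || offset > QUEUE_MAX_SIZE - HALF);
    }

    // floor((position + half) / queueMaxSize): index of the adjacency block
    static int borderBlock(int position) {
        return (position + HALF) / QUEUE_MAX_SIZE;
    }

    public static void main(String[] args) {
        for (int position = 1; position <= 22; position++) {
            String border = inBorderSlice(position) ? "border block " + borderBlock(position) : "-";
            System.out.println(position + ": main block " + mainBlock(position) + ", " + border);
        }
    }
}
```

With these assumed values, positions 9, 10 and 11 share an adjacency block even though position 9 falls in main block 0 and positions 10 and 11 in main block 1, which is how records on either side of a border still get compared.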

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/SparkModel.scala
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/SparkModel.scala
@ -1,13 +1,16 @@
 package eu.dnetlib.pace.model

 import com.jayway.jsonpath.{Configuration, JsonPath}
+import eu.dnetlib.pace.common.AbstractPaceFunctions
 import eu.dnetlib.pace.config.{DedupConfig, Type}
 import eu.dnetlib.pace.util.MapDocumentUtil
+import org.apache.commons.lang3.StringUtils
 import org.apache.spark.sql.catalyst.encoders.RowEncoder
 import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
 import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
 import org.apache.spark.sql.{Dataset, Row}

+import java.util.Locale
 import java.util.regex.Pattern
 import scala.collection.JavaConverters._

@ -60,7 +63,7 @@ case class SparkModel(conf: DedupConfig) {
    values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)

    schema.fieldNames.zipWithIndex.foldLeft(values) {
-      case ((res, (fname, index))) => {
+      case ((res, (fname, index))) =>
        val fdef = conf.getPace.getModelMap.get(fname)

        if (fdef != null) {
@ -96,13 +99,52 @@ case class SparkModel(conf: DedupConfig) {
            case Type.DoubleArray =>
              MapDocumentUtil.getJPathArray(fdef.getPath, json)
          }
+
+          val filter = fdef.getFilter
+
+          if (StringUtils.isNotBlank(fdef.getClean)) {
+            res(index) = res(index) match {
+              case x: Seq[String] => x.map(clean(_, fdef.getClean)).toSeq
+              case _ => clean(res(index).toString, fdef.getClean)
+            }
+          }
+
+          if (filter != null && !filter.isEmpty) {
+            res(index) = res(index) match {
+              case x: String if filter.contains(x.toLowerCase(Locale.ROOT)) => null
+              case x: Seq[String] => x.filter(s => !filter.contains(s.toLowerCase(Locale.ROOT))).toSeq
+              case _ => res(index)
+            }
+          }
+
+          if (fdef.getSorted) {
+            res(index) = res(index) match {
+              case x: Seq[String] => x.sorted.toSeq
+              case _ => res(index)
+            }
+          }
        }

        res
-      }
    }

    new GenericRowWithSchema(values, schema)
  }
+
+  def clean(value: String, cleantype: String) : String = {
+    val res = cleantype match {
+      case "title" => AbstractPaceFunctions.cleanup(value)
+      case _ => value
+    }
+
+//    if (!res.equals(AbstractPaceFunctions.normalize(value))) {
+//      println(res)
+//      println(AbstractPaceFunctions.normalize(value))
+//      println()
+//    }
+
+    res
+  }
+
 }

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/AuthorsMatch.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/AuthorsMatch.java
@ -1,8 +1,10 @@

 package eu.dnetlib.pace.tree;

+import java.util.ArrayList;
 import java.util.List;
 import java.util.Map;
+import java.util.function.BiFunction;
 import java.util.stream.Collectors;

 import com.wcohen.ss.AbstractStringDistance;
@ -11,6 +13,7 @@ import eu.dnetlib.pace.config.Config;
 import eu.dnetlib.pace.model.Person;
 import eu.dnetlib.pace.tree.support.AbstractListComparator;
 import eu.dnetlib.pace.tree.support.ComparatorClass;
+import eu.dnetlib.pace.util.AuthorMatchers;

 @ComparatorClass("authorsMatch")
 public class AuthorsMatch extends AbstractListComparator {
@ -23,7 +26,6 @@ public class AuthorsMatch extends AbstractListComparator {
 	private String MODE; // full or surname
 	private int SIZE_THRESHOLD;
 	private String TYPE; // count or percentage
-	private int common;

 	public AuthorsMatch(Map<String, String> params) {
 		super(params, new com.wcohen.ss.JaroWinkler());
@ -35,7 +37,6 @@ public class AuthorsMatch extends AbstractListComparator {
 		FULLNAME_THRESHOLD = Double.parseDouble(params.getOrDefault("fullname_th", "0.9"));
 		SIZE_THRESHOLD = Integer.parseInt(params.getOrDefault("size_th", "20"));
 		TYPE = params.getOrDefault("type", "percentage");
-		common = 0;
 	}

 	protected AuthorsMatch(double w, AbstractStringDistance ssalgo) {
@ -43,23 +44,40 @@ public class AuthorsMatch extends AbstractListComparator {
 	}

 	@Override
-	public double compare(final List<String> a, final List<String> b, final Config conf) {
-
-		if (a.isEmpty() || b.isEmpty())
+	public double compare(final List<String> left, final List<String> right, final Config conf) {
+		if (left.isEmpty() || right.isEmpty())
 			return -1;

-		if (a.size() > SIZE_THRESHOLD || b.size() > SIZE_THRESHOLD)
+		if (left.size() > SIZE_THRESHOLD || right.size() > SIZE_THRESHOLD)
 			return 1.0;

-		List<Person> aList = a.stream().map(author -> new Person(author, false)).collect(Collectors.toList());
+		Double threshold = getDoubleParam("threshold");
+		int maxMiss = Integer.MAX_VALUE;
+
+		if (threshold != null && threshold >= 0.0 && threshold <= 1.0 && left.size() == right.size()) {
+			maxMiss = (int) Math.floor((1 - threshold) * Math.max(left.size(), right.size()));
+		}
+
+		int common = 0;
+
+		List<String> a = new ArrayList<>(left);
+		List<String> b = new ArrayList<>(right);
+
+		common += AuthorMatchers
+			.removeMatches(a, b, (BiFunction<String, String, Object>) AuthorMatchers::matchEqualsIgnoreCase)
+			.size() / 2;
+		common += AuthorMatchers
+			.removeMatches(a, b, (BiFunction<String, String, Object>) AuthorMatchers::matchOrderedTokenAndAbbreviations)
+			.size() / 2;
+
 		List<Person> bList = b.stream().map(author -> new Person(author, false)).collect(Collectors.toList());

-		common = 0;
 		// compare each element of List1 with each element of List2
-		for (Person p1 : aList)
+		int alreadyMatched = common;
+		for (int i = 0; i < a.size(); i++) {
+			Person p1 = new Person(a.get(i), false);

 			for (Person p2 : bList) {
-
 				// both persons are inaccurate
 				if (!p1.isAccurate() && !p2.isAccurate()) {
 					// compare just normalized fullnames
@ -118,11 +136,15 @@ public class AuthorsMatch extends AbstractListComparator {
 					}

 				}
-
 			}

+			if (i - common - alreadyMatched > maxMiss) {
+				return 0.0;
+			}
+		}
+
 		// normalization factor to compute the score
-		int normFactor = aList.size() == bList.size() ? aList.size() : (aList.size() + bList.size() - common);
+		int normFactor = left.size() == right.size() ? left.size() : (left.size() + right.size() - common);

 		if (TYPE.equals("percentage")) {
 			return (double) common / normFactor;
@ -153,5 +175,4 @@ public class AuthorsMatch extends AbstractListComparator {
 	public String normalization(String s) {
 		return normalize(utf8(cleanup(s)));
 	}
-
 }
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/CountryMatch.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/CountryMatch.java
@ -0,0 +1,48 @@
+
+package eu.dnetlib.pace.tree;
+
+import java.util.Map;
+
+import com.wcohen.ss.AbstractStringDistance;
+
+import eu.dnetlib.pace.config.Config;
+import eu.dnetlib.pace.tree.support.AbstractStringComparator;
+import eu.dnetlib.pace.tree.support.ComparatorClass;
+
+@ComparatorClass("countryMatch")
+public class CountryMatch extends AbstractStringComparator {
+
+	public CountryMatch(Map<String, String> params) {
+		super(params, new com.wcohen.ss.JaroWinkler());
+	}
+
+	public CountryMatch(final double weight) {
+		super(weight, new com.wcohen.ss.JaroWinkler());
+	}
+
+	protected CountryMatch(final double weight, final AbstractStringDistance ssalgo) {
+		super(weight, ssalgo);
+	}
+
+	@Override
+	public double distance(final String a, final String b, final Config conf) {
+		if (a.isEmpty() || b.isEmpty()) {
+			return -1.0; // return -1 if a field is missing
+		}
+		if (a.equalsIgnoreCase("unknown") || b.equalsIgnoreCase("unknown")) {
+			return -1.0; // return -1 if a country is UNKNOWN
+		}
+
+		return a.equals(b) ? 1.0 : 0;
+	}
+
+	@Override
+	public double getWeight() {
+		return super.weight;
+	}
+
+	@Override
+	protected double normalize(final double d) {
+		return d;
+	}
+}
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/InstanceTypeMatch.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/InstanceTypeMatch.java
@ -23,14 +23,18 @@ public class InstanceTypeMatch extends AbstractListComparator {

 		// jolly types
 		translationMap.put("Conference object", "*");
+		translationMap.put("Research", "*");
 		translationMap.put("Other literature type", "*");
 		translationMap.put("Unknown", "*");
+		translationMap.put("UNKNOWN", "*");

 		// article types
 		translationMap.put("Article", "Article");
+		translationMap.put("Journal", "Article");
 		translationMap.put("Data Paper", "Article");
 		translationMap.put("Software Paper", "Article");
 		translationMap.put("Preprint", "Article");
+		translationMap.put("Part of book or chapter of book", "Article");

 		// thesis types
 		translationMap.put("Thesis", "Thesis");
@ -76,5 +80,4 @@ public class InstanceTypeMatch extends AbstractListComparator {
 	protected double normalize(final double d) {
 		return d;
 	}
-
 }
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/LevensteinTitle.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/LevensteinTitle.java
@ -3,6 +3,7 @@ package eu.dnetlib.pace.tree;

 import java.util.Map;

+import org.apache.commons.lang3.StringUtils;
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;

@ -30,16 +31,25 @@ public class LevensteinTitle extends AbstractStringComparator {
 	}

 	@Override
-	public double distance(final String a, final String b, final Config conf) {
-		final String ca = cleanup(a);
-		final String cb = cleanup(b);
-
+	public double distance(final String ca, final String cb, final Config conf) {
 		final boolean check = checkNumbers(ca, cb);

 		if (check)
 			return 0.5;

-		return normalize(ssalgo.score(ca, cb), ca.length(), cb.length());
+		Double threshold = getDoubleParam("threshold");
+
+		// reduce Levenshtein algo complexity when target threshold is known
+		if (threshold != null && threshold >= 0.0 && threshold <= 1.0) {
+			int maxdistance = (int) Math.floor((1 - threshold) * Math.max(ca.length(), cb.length()));
+			int score = StringUtils.getLevenshteinDistance(ca, cb, maxdistance);
+			if (score == -1) {
+				return 0;
+			}
+			return normalize(score, ca.length(), cb.length());
+		} else {
+			return normalize(StringUtils.getLevenshteinDistance(ca, cb), ca.length(), cb.length());
+		}
 	}

 	private double normalize(final double score, final int la, final int lb) {
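
Note on the thresholded distance above: when the target similarity is known, the largest edit distance that can still reach it is `floor((1 - threshold) * maxLength)`, and the computation can stop as soon as that bound is exceeded. The diff relies on `StringUtils.getLevenshteinDistance(ca, cb, maxdistance)` from Commons Lang, which returns -1 in that case. The sketch below reimplements the idea with plain Java so it runs without dependencies. The class name, the row-minimum cutoff, and the `1 - distance / maxLength` normalization are assumptions for illustration, not the library's implementation or the project's `normalize` method.

```java
// Hypothetical, dependency-free sketch of a bounded Levenshtein comparison.
class BoundedLevenshteinSketch {

    // Returns the edit distance, or -1 once it is certain to exceed maxDistance.
    static int distance(String a, String b, int maxDistance) {
        if (Math.abs(a.length() - b.length()) > maxDistance) {
            return -1; // the length gap alone already costs too many edits
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            if (rowMin > maxDistance) {
                return -1; // row minima never decrease, so the bound cannot be met
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()] <= maxDistance ? prev[b.length()] : -1;
    }

    // Similarity in [0, 1]; 0.0 when the threshold cannot be reached.
    static double similarity(String a, String b, double threshold) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        int maxDistance = (int) Math.floor((1 - threshold) * longest);
        int d = distance(a, b, maxDistance);
        return d < 0 ? 0.0 : 1.0 - (double) d / longest;
    }

    public static void main(String[] args) {
        System.out.println(similarity("kitten", "sitting", 0.5));
        System.out.println(similarity("kitten", "sitting", 0.9));
    }
}
```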
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/MaxLengthMatch.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/MaxLengthMatch.java
@ -0,0 +1,29 @@
+
+package eu.dnetlib.pace.tree;
+
+import java.util.Map;
+
+import eu.dnetlib.pace.config.Config;
+import eu.dnetlib.pace.tree.support.AbstractStringComparator;
+import eu.dnetlib.pace.tree.support.ComparatorClass;
+
+@ComparatorClass("maxLengthMatch")
+public class MaxLengthMatch extends AbstractStringComparator {
+
+	private final int limit;
+
+	public MaxLengthMatch(Map<String, String> params) {
+		super(params);
+
+		limit = Integer.parseInt(params.getOrDefault("limit", "200"));
+	}
+
+	@Override
+	public double compare(String a, String b, final Config conf) {
+		return a.length() < limit && b.length() < limit ? 1.0 : -1.0;
+	}
+
+	protected String toString(final Object object) {
+		return toFirstString(object);
+	}
+}
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/support/AbstractComparator.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/support/AbstractComparator.java
@ -127,4 +127,14 @@ public abstract class AbstractComparator<T> extends AbstractPaceFunctions implem
 		return this.weight;
 	}

+	public Double getDoubleParam(String name) {
+		String svalue = params.get(name);
+
+		try {
+			return Double.parseDouble(svalue);
+		} catch (Throwable t) {
+			// parameter missing or not a number: fall through and return null
+		}
+
+		return null;
+	}
 }
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/AuthorMatchers.scala
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/AuthorMatchers.scala
@ -0,0 +1,112 @@
+package eu.dnetlib.pace.util
+
+import java.util.Locale
+import java.util.regex.Pattern
+import scala.util.control.Breaks.{break, breakable}
+
+object AuthorMatchers {
+  val SPLIT_REGEX = Pattern.compile("[\\s,\\.]+")
+
+  val WORD_DIFF = 2
+
+  def matchEqualsIgnoreCase(a1: String, a2: String): Boolean = {
+    if (a1 == null || a2 == null)
+      false
+    else
+      a1 == a2 || a1.toLowerCase(Locale.ROOT).equals(a2.toLowerCase(Locale.ROOT))
+  }
+
+  def matchOtherNames(fullName: String, otherNames: Seq[String]): Boolean = {
+    if (otherNames != null) {
+      otherNames.exists(matchEqualsIgnoreCase(fullName, _))
+    } else {
+      false
+    }
+  }
+
+  def matchOrderedTokenAndAbbreviations(a1: String, a2: String): Boolean = {
+    val p1: Array[String] = SPLIT_REGEX.split(a1.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).sorted
+    val p2: Array[String] = SPLIT_REGEX.split(a2.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).sorted
+
+    if (p1.length < 2 || p2.length < 2) return false
+    if (Math.abs(p1.length - p2.length) > WORD_DIFF) return false // use alternative comparison algo
+
+    var p1Idx: Int = 0
+    var p2Idx: Int = 0
+    var shortMatches: Int = 0
+    var longMatches: Int = 0
+    while (p1Idx < p1.length && p2Idx < p2.length) {
+      val e1: String = p1(p1Idx)
+      val c1: Char = e1.charAt(0)
+      val e2: String = p2(p2Idx)
+      val c2: Char = e2.charAt(0)
+      if (c1 < c2) p1Idx += 1
+      else if (c1 > c2) p2Idx += 1
+      else {
+        var res: Boolean = false
+        if (e1.length != 1 && e2.length != 1) {
+          res = e1 == e2
+          if (res)
+            longMatches += 1
+        } else {
+          res = true
+          shortMatches += 1
+        }
+        if (res) {
+          p1Idx += 1
+          p2Idx += 1
+        } else {
+          val diff: Int = e1.compareTo(e2)
+          if (diff < 0) p1Idx += 1
+          else if (diff > 0) p2Idx += 1
+        }
+      }
+    }
+    longMatches > 0 && (shortMatches + longMatches) == Math.min(p1.length, p2.length)
+  }
+
+  def removeMatches(
+                     graph_authors: java.util.List[String],
+                     orcid_authors: java.util.List[String],
+                     matchingFunc: java.util.function.BiFunction[String,String,Boolean]
+                   ) : java.util.List[String] = {
+    removeMatches(graph_authors, orcid_authors, (a, b) => matchingFunc(a,b))
+  }
+
+
+  def removeMatches(
+                                       graph_authors: java.util.List[String],
+                                       orcid_authors: java.util.List[String],
+                                       matchingFunc: (String, String) => Boolean
+                                     ) : java.util.List[String]  = {
+    val matched = new java.util.ArrayList[String]()
+
+    if (graph_authors != null && !graph_authors.isEmpty) {
+      val ait = graph_authors.iterator
+
+      while (ait.hasNext) {
+        val author = ait.next()
+        val oit = orcid_authors.iterator
+
+        breakable {
+          while (oit.hasNext) {
+            val orcid = oit.next()
+
+            if (matchingFunc(author, orcid)) {
+              ait.remove()
+              oit.remove()
+
+              matched.add(author)
+              matched.add(orcid)
+
+              break()
+            }
+          }
+        }
+      }
+    }
+
+    matched
+  }
+
+}
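
Note on `removeMatches` above: the method pairs each entry of the first list with the first entry of the second list that satisfies the matching function, removes both from their lists, and returns the matched values, so the caller can derive the number of pairs as `size() / 2`. A Java sketch of the same loop follows. The class name and the use of `BiPredicate` are assumptions for illustration; the Scala original takes a `BiFunction` or a Scala function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical Java rendering of the pair-removal loop.
class MatchRemovalSketch {

    // Removes matched pairs from both (mutable) lists and returns them interleaved.
    static List<String> removeMatches(List<String> left, List<String> right,
        BiPredicate<String, String> matcher) {
        List<String> matched = new ArrayList<>();
        Iterator<String> lit = left.iterator();
        while (lit.hasNext()) {
            String l = lit.next();
            Iterator<String> rit = right.iterator();
            while (rit.hasNext()) {
                String r = rit.next();
                if (matcher.test(l, r)) {
                    lit.remove(); // each entry can take part in at most one pair
                    rit.remove();
                    matched.add(l);
                    matched.add(r);
                    break;
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>(Arrays.asList("Smith, John", "Doe, Jane"));
        List<String> b = new ArrayList<>(Arrays.asList("Roe, Richard", "smith, john"));
        int common = removeMatches(a, b, String::equalsIgnoreCase).size() / 2;
        System.out.println(common + " pair(s) matched; left over: " + a + " and " + b);
    }
}
```

Both input lists are modified in place, which is why `AuthorsMatch.compare` copies `left` and `right` into new `ArrayList` instances before calling it.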
--- a/Show More
+++ b/Show More
@ -0,0 +1 @@
+{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f","pid":[{"qualifier":{"classid":"scp-number"},"value":"79953761260"}]}