Compare commits


460 Commits

Author SHA1 Message Date
Claudio Atzori 242d647146 cleanup & docs 2023-10-12 12:23:44 +02:00
Claudio Atzori af3ffad6c4 [AMF] docs 2023-10-12 10:07:52 +02:00
Claudio Atzori ba5475ed4c Merge pull request 'Fix cleaning of Pmid where parsing of numbers stopped at first non-leading 0 (zero) character' (#345) from fix_truncated_pmid into master
Reviewed-on: D-Net/dnet-hadoop#345
2023-10-06 14:19:49 +02:00
Giambattista Bloisi 2c235e82ad Fix cleaning of Pmid where parsing of numbers stopped at first non-leading '0' character 2023-10-06 12:35:54 +02:00
Claudio Atzori 4ac06c9e37 Merge pull request 'Fix bug in conversion from dedup json model to Spark Dataset of Rows (instanceTypeMatch no longer working)' (#339) from fix_dedupfailsonmatchinginstances into master
Reviewed-on: D-Net/dnet-hadoop#339
2023-10-02 11:34:20 +02:00
Claudio Atzori fa692b3629 Merge branch 'master' into fix_dedupfailsonmatchinginstances 2023-10-02 11:28:16 +02:00
Claudio Atzori ef02648399 Merge pull request 'fixed dedup configuration management in the Broker workflow' (#341) from fix_8997 into master
Reviewed-on: D-Net/dnet-hadoop#341
2023-10-02 11:03:50 +02:00
Claudio Atzori d13bb534f0 Merge branch 'master' into fix_8997 2023-10-02 11:03:18 +02:00
Giambattista Bloisi 775c3f704a Fix bug in conversion from dedup json model to Spark Dataset of Rows: list of strings contained the json escaped representation of the value instead of the plain value, this caused instanceTypeMatch failures because of the leading and trailing double quotes 2023-09-27 22:30:47 +02:00
Sandro La Bruzzo 9c3ab11d5b Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-hadoop 2023-09-25 15:29:19 +02:00
Sandro La Bruzzo 423ef30676 minor fix on the aggregation of uniprot and pdb 2023-09-25 15:28:58 +02:00
Giambattista Bloisi 7152d47f84 Use asScala to convert java List to Scala Sequence 2023-09-20 16:14:27 +02:00
Claudio Atzori 4853c19b5e code formatting 2023-09-20 15:53:21 +02:00
Giambattista Bloisi 1f226d1dce Fix defect #8997: GenerateEventsJob is generating huge amounts of logs because broker entity similarity calculation consistently failed 2023-09-20 15:42:00 +02:00
Alessia Bardi 6186cdc2cc Use v5 of the UNIBI Gold ISSN list in test 2023-09-19 14:47:01 +02:00
Alessia Bardi d94b9bebf7 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-09-19 13:38:45 +02:00
Alessia Bardi 19abba8fa7 tests for d4science catalog 2023-09-19 13:38:25 +02:00
Claudio Atzori c2f179800c Merge pull request 'Run CC and RAM sequentially in dhp-impact-indicators WF' (#338) from run_cc_and_ram_sequentially into master
Reviewed-on: D-Net/dnet-hadoop#338
2023-09-13 08:52:53 +02:00
Serafeim Chatzopoulos 2aed5a74be Run CC and RAM sequentially in dhp-impact-indicators WF 2023-09-12 22:31:50 +03:00
Claudio Atzori 4dc4862011 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-09-12 14:34:34 +02:00
Claudio Atzori dc80ab14d3 [graph dedup] consistency wf should not remove the relations while dispatching the entities 2023-09-12 14:34:28 +02:00
Alessia Bardi 77a2199837 updated test for EOSC community 2023-09-08 11:05:49 +02:00
Claudio Atzori 265180bfd2 added Archive ouverte UNIGE (ETHZ.UNIGENF, opendoar____::1400) to the Datacite hostedBy_map 2023-09-07 11:20:35 +02:00
Claudio Atzori da0e9828f7 resolved conflicts for PR#337 2023-09-06 11:28:46 +02:00
Claudio Atzori 9f5d16624c Merge pull request '[graph raw] datainfo.invisible set as true only for entities' (#336) from invisible_relations into beta
Reviewed-on: D-Net/dnet-hadoop#336
2023-09-04 16:14:47 +02:00
Claudio Atzori adec6692ca Merge branch 'beta' into invisible_relations 2023-09-04 16:13:06 +02:00
Claudio Atzori 15666e86a8 added collectedfrom to the affiliation relations imported from Crossref 2023-09-04 15:56:06 +02:00
Claudio Atzori 7d6bd4f20b Merge pull request 'Fix import of affiliations relations from Crossref' (#335) from 8876_fix_crossref_affiliation_relations_import into beta
Reviewed-on: D-Net/dnet-hadoop#335
2023-09-04 15:19:58 +02:00
Claudio Atzori 5b06c9d06f [graph raw] datainfo.invisible set as true only for entities 2023-09-04 15:15:24 +02:00
Serafeim Chatzopoulos 7de0164c26 Fix import of affiliations relations from Crossref 2023-09-04 16:04:41 +03:00
Claudio Atzori 488d9a1cea Merge pull request 'Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set it is defaulted to 1Gb' (#331) from consistencywf_memoryoverhead_conf into beta
Reviewed-on: D-Net/dnet-hadoop#331
2023-08-29 16:31:36 +02:00
Giambattista Bloisi 6b1c05d118 Add sparkExecutorMemoryOverhead workflow config to set off-heap memory for Spark actions. If not explicitly set it is defaulted to 1Gb 2023-08-29 16:04:19 +02:00
Claudio Atzori bf35280ea6 code formatting 2023-08-29 11:11:00 +02:00
Claudio Atzori 0515d81c7c Merge pull request 'Rewrite SparkPropagateRelation exploiting Dataframe API' (#330) from propagate_relation_rewrite into beta
Reviewed-on: D-Net/dnet-hadoop#330
2023-08-29 10:47:14 +02:00
Claudio Atzori 58665a246c Merge branch 'beta' into propagate_relation_rewrite 2023-08-29 10:47:02 +02:00
Claudio Atzori f437be80ad [impact indicators] adjusted paths in the bip ranker wf parameters 2023-08-29 09:03:03 +02:00
Giambattista Bloisi d012aec0b3 Revert PropagateRelation's argument name from outputPath to graphOutputPath in consistency workflow (#8964) 2023-08-28 22:44:54 +02:00
Giambattista Bloisi a860e19423 Fix: ensure all relations are written out, not only those managed by dedup 2023-08-28 15:36:02 +02:00
Giambattista Bloisi 0d7b2bf83d Rewrite SparkPropagateRelation exploiting Dataframe API 2023-08-28 10:34:54 +02:00
Miriam Baglioni 9c8b41475a Merge pull request '8172_impact_indicators_workflow' (#284) from 8172_impact_indicators_workflow into beta
Reviewed-on: D-Net/dnet-hadoop#284
2023-08-14 15:50:48 +02:00
Serafeim Chatzopoulos 97c1ba8918 Merge actionsets of results and projects 2023-08-11 15:56:53 +03:00
Miriam Baglioni 35b8deb2c6 Merge pull request 'DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag' (#329) from dispatch_filter_invisible_entities into beta
Reviewed-on: D-Net/dnet-hadoop#329
2023-08-10 12:56:18 +02:00
Giambattista Bloisi 95cd2b9b1e Make filterInvisible a mandatory parameter of DispatchEntitiesSparkJob
Make filterInvisible a mandatory parameter of both dedup/consistency and graph/group oozie workflows
2023-08-10 11:53:48 +02:00
Giambattista Bloisi fab9920271 DispatchEntitiesSparkJob: manage all entity types together, support filtering by dataInfo.invisible flag 2023-08-09 15:41:43 +02:00
Miriam Baglioni c25ac21e5e Merge pull request 'graph cleaning, suggestions from ticket 8898' (#325) from cleaning_8898 into beta
Reviewed-on: D-Net/dnet-hadoop#325
2023-08-08 11:14:19 +02:00
Miriam Baglioni c334fe2438 Merge pull request 'Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities' (#328) from cleanup_relations_after_dedup into beta
Reviewed-on: D-Net/dnet-hadoop#328
2023-08-08 09:49:12 +02:00
Miriam Baglioni 0e2f855807 Merge pull request 'Updates Promotion DBs' (#321) from antonis.lempesis/dnet-hadoop:beta into beta
Reviewed-on: D-Net/dnet-hadoop#321
2023-08-07 12:09:16 +02:00
Miriam Baglioni 18fbe52b20 Merge pull request 'Import affiliation relations from Crossref' (#320) from 8876 into beta
Reviewed-on: D-Net/dnet-hadoop#320
2023-08-07 10:45:30 +02:00
Giambattista Bloisi 97b6d1dc45 Filter ids by dataInfo.deletedbyinference and DataInfo.invisible flags
Filter relations also by dataInfo.invisible flag
2023-08-07 10:24:11 +02:00
Giambattista Bloisi af49424b59 Add a "CleanRelation" action after the PropagateRelation to filter out all relations that have been deleted by inference or that are pointing to dangling entities 2023-08-04 14:27:39 +02:00
Claudio Atzori 0bc74e2000 code formatting 2023-08-02 11:52:10 +02:00
Claudio Atzori 7180911ded [graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests 2023-08-02 11:44:14 +02:00
Claudio Atzori b9dddbfe54 rule out records with NULL dataInfo, except for Relations 2023-07-31 17:53:54 +02:00
Claudio Atzori da1727f93f rule out records with NULL dataInfo, except for Relations 2023-07-31 17:52:56 +02:00
Claudio Atzori 11ffb9bd68 rule out records with NULL dataInfo 2023-07-31 12:35:33 +02:00
Claudio Atzori ccac6a7f75 rule out records with NULL dataInfo 2023-07-31 12:35:05 +02:00
Serafeim Chatzopoulos 7cefe2665b Remove unnecessary classes 2023-07-28 19:14:39 +03:00
Serafeim Chatzopoulos 26a92ce762 Merge branch '8876' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8876 2023-07-28 19:03:57 +03:00
Serafeim Chatzopoulos ebfba38ab6 Add changes from code review 2023-07-28 19:03:47 +03:00
Serafeim Chatzopoulos eb8684a8cf Merge branch 'beta' into 8876 2023-07-28 13:39:33 +02:00
Claudio Atzori 1275a07d45 Merge pull request '[graph indexing] expand the instance level fulltext in the XML records' (#326) from instance_fulltext_xml into beta
Reviewed-on: D-Net/dnet-hadoop#326
2023-07-27 15:02:07 +02:00
Claudio Atzori a72b9e96ac expand the instance level fulltext in the XML records 2023-07-27 14:57:38 +02:00
Claudio Atzori d512df8612 code formatting 2023-07-26 09:14:08 +02:00
Claudio Atzori d8435a6512 inverted condition 2023-07-25 17:39:57 +02:00
Claudio Atzori 59764145bb cherry picked & fixed commit 270df939c4 2023-07-25 17:39:00 +02:00
Claudio Atzori 270df939c4 partial implementation of the suggestions from https://support.openaire.eu/issues/8898 2023-07-25 17:29:50 +02:00
Claudio Atzori 8c63e4a864 Merge pull request 'Refactor Dedup using Spark Dataframe API, initial support for scala 2.12 and Spark 3.4' (#324) from dedup-with-dataframe-2 into beta
Reviewed-on: D-Net/dnet-hadoop#324
2023-07-25 10:17:17 +02:00
Giambattista Bloisi e64c2854a3 Refactor Dedup process to use Spark Dataframe API and intermediate representation with Row interface
JsonPath cache contention fixed by using a ConcurrentHashMap
Blacklist filtering performance improvement
Minor performance improvements when evaluating similarity
Sorting in clustered elements is deterministic (by ordering and identity field, instead of ordering field only)
2023-07-24 15:36:24 +02:00
Giambattista Bloisi bb5b845e3c Use scala.binary.version property to resolve scala maven dependencies
Ensure consistent usage of maven properties
Profile for compiling with scala 2.12 and Spark 3.4
2023-07-24 11:13:48 +02:00
Claudio Atzori 002b24e06f Merge pull request '[graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests' (#315) from pid_cleaning into beta
Reviewed-on: D-Net/dnet-hadoop#315
2023-07-24 10:49:44 +02:00
Claudio Atzori c754397a19 Merge branch 'beta' into pid_cleaning 2023-07-24 10:49:31 +02:00
Claudio Atzori f0678cda09 Merge pull request 'fix_beta_tests' (#323) from fix_beta_tests into beta
Reviewed-on: D-Net/dnet-hadoop#323
2023-07-24 10:47:35 +02:00
Serafeim Chatzopoulos 3a0f09774a Add script to find score limits 2023-07-21 17:55:41 +03:00
Ilias Kanellos 06b9b71c4e Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 17:42:49 +03:00
Ilias Kanellos 2374f445a9 Produce additional bip update specific files 2023-07-21 17:42:46 +03:00
Serafeim Chatzopoulos cb0f3c50f6 Format workflow.xml 2023-07-21 16:07:10 +03:00
Serafeim Chatzopoulos c64e5e588f Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 15:27:02 +03:00
Serafeim Chatzopoulos 2cc5b1a39b Fixes in workflow.xml 2023-07-21 15:26:50 +03:00
Ilias Kanellos 0f96af5d56 Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-07-21 13:42:35 +03:00
Ilias Kanellos 03da965162 Format bip-score based file without doi references 2023-07-21 13:42:30 +03:00
Giambattista Bloisi f03153823a Update testCitationRelations number of expected citations according to changes made in 0559d8b4 (monodirectional citations) 2023-07-21 10:48:28 +02:00
Giambattista Bloisi 54c1eacef1 SparkJobTest was failing because the testing workingdir was not cleaned up after each test 2023-07-21 10:42:24 +02:00
Giambattista Bloisi 5e15f20e6e Fix entityMerger that was excluding the authors of the first entity in the list to merge 2023-07-21 00:46:54 +02:00
Giambattista Bloisi 0210a14e43 Ignore timestamp differences in PromoteActionPayloadForGraphTableJobTest 2023-07-20 23:45:57 +02:00
Giambattista Bloisi dba34505de Fix SparkStatsTest bug where parquet tables were incorrectly read as text files leading to unpredictable count() values 2023-07-19 14:24:52 +02:00
Giambattista Bloisi e47ed1fdb2 Use DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES in json mapper to avoid that tests fail if they encounter unmapped properties 2023-07-19 14:21:40 +02:00
Giambattista Bloisi 38dfebfbe6 Disable MdStoreClientTest test as it requires a local mongodb running and it does not perform any assertions 2023-07-19 14:18:56 +02:00
Claudio Atzori 373a5f2c83 Merge pull request 'Master branch updates from beta July 2023' (#317) from master_july23 into master
Reviewed-on: D-Net/dnet-hadoop#317
2023-07-18 18:22:04 +02:00
Serafeim Chatzopoulos db4ca43ee8 Resolve conflict 2023-07-18 18:38:26 +03:00
Serafeim Chatzopoulos be320ba3c1 Indentation fixes 2023-07-17 16:04:21 +03:00
Serafeim Chatzopoulos bc1a4611aa Minor changes 2023-07-17 11:17:53 +03:00
Claudio Atzori 8af129b0c7 merged stats promotion step from antonis/promotion-prod-only 2023-07-13 15:03:28 +02:00
dimitrispie 706092bc19 Update updateProductionViews.sh 2023-07-13 15:48:12 +03:00
dimitrispie aedd279f78 Updates Promotion DBs
- Add a step for promoting the split monitor DBs
2023-07-13 15:35:46 +03:00
dimitrispie 76901a25f9 Updates Promotion DBs
- Add a step for promoting the split monitor DBs
2023-07-12 22:49:08 +03:00
Giambattista Bloisi ef493681d9 Merge pull request 'Import dnet-pace-core module in this project and use it after renaming to dhp-pace-core' (#319) from beta_with_pace_core into beta
Reviewed-on: D-Net/dnet-hadoop#319
2023-07-11 14:03:15 +02:00
Serafeim Chatzopoulos 4eba14a80e Add oozie workflow 2023-07-06 21:07:50 +03:00
Serafeim Chatzopoulos c2998a14e8 Add basic tests for affiliation relations 2023-07-06 20:28:16 +03:00
Serafeim Chatzopoulos bc7b00bcd1 Add bi-directional affiliation relations 2023-07-06 18:29:15 +03:00
Serafeim Chatzopoulos 12528ed2ef Refactor PrepareAffiliationRelations.java to use OafMapperUtils common functions 2023-07-06 18:08:33 +03:00
Serafeim Chatzopoulos bbc245696e Prepare actionsets for BIP affiliations 2023-07-06 15:56:12 +03:00
Ilias Kanellos 0c433eccdd Fix scores & Workflow 2023-07-06 15:06:28 +03:00
Ilias Kanellos d5c39a1059 Fix map scores to doi 2023-07-06 15:04:48 +03:00
Ilias Kanellos 772d5f0aab Make PR and AttRank serial 2023-07-06 13:47:51 +03:00
Giambattista Bloisi 801da2fd4a New sources formatted by maven plugin 2023-07-06 10:28:53 +02:00
Giambattista Bloisi bd3fcf869a rename dnet-pace-core into dhp-pace-core module and use it as dependency in other modules 2023-07-06 10:02:23 +02:00
Serafeim Chatzopoulos 347a889b20 Read affiliation relations 2023-07-06 00:51:01 +03:00
Giambattista Bloisi 3b35db5fbd Import dnet-pace-core module from dnet-dedup repository 2023-07-05 22:23:06 +02:00
Miriam Baglioni 8dcd028eed [UsageCount] fixed typo in attribute name for datasource table 2023-07-01 16:07:22 +02:00
Miriam Baglioni 7738372125 [UsageCount] fixed typo in attribute name for datasource table 2023-06-30 18:56:41 +02:00
Claudio Atzori f3a85e224b merged from branch beta the bulk tagging (single step, negative constraints), the cleaning workflow (single step, pid type based cleaning), instance level fulltext 2023-06-28 13:33:57 +02:00
Claudio Atzori 4ef0f2ec26 added dependency commons-validator:commons-validator:1.7 2023-06-28 13:32:01 +02:00
Claudio Atzori 288ec0b7d6 [doiboost] merged workflow from branch beta 2023-06-28 09:15:37 +02:00
Claudio Atzori 5f32edd9bf adopting dhp-schema:3.17.1 2023-06-27 16:57:17 +02:00
Claudio Atzori e10ce92fe5 [stats wf] merged workflows from branch beta 2023-06-27 14:32:48 +02:00
Claudio Atzori b93e1541aa Merge pull request 'update sql query to return distinct pids' (#301) from distinct_pids_from_openorgs into master
Reviewed-on: D-Net/dnet-hadoop#301
2023-06-27 12:24:47 +02:00
Claudio Atzori d029bf0b94 Merge branch 'master' into distinct_pids_from_openorgs 2023-06-27 12:24:35 +02:00
Claudio Atzori 0f5a819f44 [graph cleaning] fixed regex behaviour for cleaning ROR and GRID identifiers, added tests 2023-06-23 16:10:49 +02:00
Serafeim Chatzopoulos 60f25b780d Minor fixes in workflow.xml and job.properties 2023-06-23 12:51:50 +03:00
Michele Artini 009d7f312f fixed a datasource Id 2023-06-21 16:17:34 +02:00
Giambattista Bloisi 758e662ab8 Revert "Remove duplicated code and ensure that load and initialization is done through "DedupConfig.load" method"
This reverts commit 485f9d18cb.
2023-06-19 13:08:10 +02:00
Giambattista Bloisi 485f9d18cb Remove duplicated code and ensure that load and initialization is done through "DedupConfig.load" method 2023-06-19 13:00:02 +02:00
Claudio Atzori 6210f6ee48 Merge pull request 'Precompile blacklists patterns before evaluating clustering criteria' (#1) from optimized-clustering into master
Reviewed-on: D-Net/dnet-dedup#1
2023-06-19 12:43:49 +02:00
Giambattista Bloisi b0ade43608 Precompile blacklists patterns before evaluating clustering criteria
Enable Junit 5 tests in maven builds
Make path comparisons platform-independent
Read String resource files assuming they are encoded in UTF-8
Fix a few test conditions
2023-06-16 09:41:11 +02:00
Michele Artini a92206dab5 re-added the name of a column (pid) 2023-06-13 11:43:10 +02:00
Alessia Bardi 118e72d7db Updated officialname of pangaea in hostedbymap for Datacite to avoid duplicate entries in the source filter of the portal 2023-06-06 14:39:12 +02:00
Alessia Bardi 5befd93d7d test records for Solr indexing 2023-06-06 14:34:33 +02:00
Michele Artini cae92cf811 update sql query to return distinct pids 2023-06-06 14:06:06 +02:00
Ilias Kanellos a1b9187039 Fix syntax error on workflow.xml 2023-05-23 17:17:12 +03:00
Ilias Kanellos 6a7e370a21 Remove unnecessary counts in graph creation 2023-05-23 16:48:58 +03:00
Ilias Kanellos ec4e010687 End after rankings | Create graph debugged 2023-05-23 16:44:04 +03:00
Claudio Atzori 654ffcba60 Merge pull request '[UsageCount] addition of usagecount for Projects and datasources' (#296) from master_datasource_project_usagecounts into master
Reviewed-on: D-Net/dnet-hadoop#296
2023-05-22 16:13:24 +02:00
Claudio Atzori db625e548d [UsageCount] addition of usagecount for Projects and datasources 2023-05-22 15:00:46 +02:00
Alessia Bardi 04141fe259 tests for records from D4Science catalogues 2023-05-19 14:28:24 +02:00
Ilias Kanellos 38020e242a Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-05-16 17:34:53 +03:00
Ilias Kanellos 3d69f33c84 Fix selection of columns in graph creation 2023-05-16 17:34:42 +03:00
Ilias Kanellos 3c38f7ba6f Fix selection of columns in graph creation 2023-05-16 17:32:53 +03:00
Serafeim Chatzopoulos 8ef718c363 Fix workflow application path 2023-05-16 16:28:48 +03:00
Serafeim Chatzopoulos 26328e2a0d Move job.properties 2023-05-16 14:39:53 +03:00
Serafeim Chatzopoulos 4eec3e7052 Add jobTracker, nameNode && spark2Lib as global params in oozie wf 2023-05-15 22:28:48 +03:00
Serafeim Chatzopoulos b83135c252 Add missing kill nodes in workflow.xml 2023-05-15 19:55:35 +03:00
Serafeim Chatzopoulos 45f2aa0867 Move end node ... at the end in workflow.xml 2023-05-15 17:52:20 +03:00
Serafeim Chatzopoulos 12a57e1f58 Resolve conflicts 2023-05-15 16:20:11 +03:00
Serafeim Chatzopoulos 82e2a96f51 Resolve conflicts 2023-05-15 15:53:12 +03:00
Serafeim Chatzopoulos b8e8c959fe Update workflow.xml && job.properties 2023-05-15 15:50:23 +03:00
Ilias Kanellos 4a905932a3 Spark properties from job.properties 2023-05-15 15:24:22 +03:00
Serafeim Chatzopoulos 07818131ef Update documentation 2023-05-15 13:04:44 +03:00
Ilias Kanellos 1788ac2d4d Correct filtering for MAG records 2023-05-12 12:55:43 +03:00
Ilias Kanellos 5ddbb4ad10 Spark properties no longer hardcoded 2023-05-11 15:36:47 +03:00
Ilias Kanellos 3de35fd6a3 Produce 5 classes of ranking scores 2023-05-11 14:42:25 +03:00
Ilias Kanellos 90332439ad Remove deletion of synonym folder 2023-04-28 13:45:19 +03:00
Ilias Kanellos a98da54896 Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-04-28 13:23:49 +03:00
Ilias Kanellos 09485fbee3 Fixed unicode bug. Workflow ends after first script 2023-04-28 13:09:13 +03:00
Serafeim Chatzopoulos 614cc1089b Add separate folder for results && project actionsets 2023-04-27 12:37:15 +03:00
Serafeim Chatzopoulos 815a4ddbba Add actionset creation for project bip indicators in workflow 2023-04-26 20:40:06 +03:00
Serafeim Chatzopoulos ee04cf92bf Add actionsets for project impact indicators 2023-04-26 20:23:46 +03:00
Alessia Bardi b88f009d9f combined level 4 and 6 for the demo 2023-04-24 12:10:33 +02:00
Alessia Bardi 5ffe82ffd8 aligned to current DMF index layout on production 2023-04-24 12:09:55 +02:00
Alessia Bardi 1c173642f0 removed level5 from test records 2023-04-24 09:32:32 +02:00
Alessia Bardi 382f46a8e4 tests to generate the XML records for the index for the EDITH demo on digital twins, integrating output from the FoS classifier 2023-04-21 16:46:30 +02:00
Serafeim Chatzopoulos 23f58a86f1 Change jar param in project impact indicators action 2023-04-18 12:26:01 +03:00
Miriam Baglioni 24c41806ac [ZenodoApiClientTest] change test to mirror change in the implementation 2023-04-18 09:08:09 +02:00
Miriam Baglioni 087b5a7973 [ZenodoAPIClient] new version of the API to connect to Zenodo (changed the http client) 2023-04-17 18:59:22 +02:00
Michele De Bonis cb595c87bb implementation of the support for authors deduplication: cosinesimilarity comparator and double array json parser 2023-04-17 11:06:27 +02:00
Claudio Atzori 688e3b7936 added eoscifguidelines in the result view; removed compute statistics statements 2023-04-11 11:45:56 +02:00
Claudio Atzori 2e465915b4 [graph to Solr] using dedicated sparkExecutorCores, sparkExecutorMemory, sparkDriverMemory in convert_to_xml 2023-04-11 10:43:44 +02:00
Serafeim Chatzopoulos 7256c8d3c7 Add script for aggregating impact indicators at the project level 2023-04-07 16:30:12 +03:00
Claudio Atzori 4a4ca634f0 Merge pull request 'advConstraintsInBeta' (#288) from advConstraintsInBeta into master
Reviewed-on: D-Net/dnet-hadoop#288
2023-04-06 15:24:23 +02:00
Miriam Baglioni c6a7602b3e refactoring after compilation 2023-04-06 14:45:01 +02:00
Miriam Baglioni 831055a1fc change of the property for test purposes, addition of two new verbs, and fix of issue for advanced constraints 2023-04-06 14:41:32 +02:00
Miriam Baglioni cf3d0f4f83 fixed issue on bulktagging for the advanced constraints 2023-04-06 12:17:35 +02:00
Claudio Atzori 4f67225fbc Merge pull request 'doiboostMappingExtention' (#286) from doiboostMappingExtention into master
Reviewed-on: D-Net/dnet-hadoop#286
2023-04-06 09:25:08 +02:00
Claudio Atzori e093f04874 Merge pull request 'AdvancedConstraint' (#285) from advConstraintsInBeta into master
Reviewed-on: D-Net/dnet-hadoop#285
2023-04-06 09:24:54 +02:00
Miriam Baglioni c5a9f39141 Extended the association project - result in the mapping from CrossRef 2023-04-05 16:48:36 +02:00
Miriam Baglioni ecc05fe0f3 Added the code for the advancedConstraint implementation during the bulkTagging 2023-04-05 16:40:29 +02:00
Claudio Atzori 42442ccd39 Merge pull request 'updated the order of the compatibilities' (#275) from compatibility_order into master
Reviewed-on: D-Net/dnet-hadoop#275
2023-04-05 12:44:14 +02:00
Michele De Bonis 297eb207a5 minor change in the author match which now can compute count and percentage 2023-04-04 17:10:37 +02:00
Miriam Baglioni 9a9cc6a1dd changed the way the tar archive is built to support renaming in case we need to change .tt.gz into .json.gz 2023-04-04 11:40:58 +02:00
Serafeim Chatzopoulos 102aa5ab81 Add dependency to dhp-aggregation 2023-03-21 19:25:29 +02:00
Serafeim Chatzopoulos f3e5abf63b Merge branch '8172_impact_indicators_workflow' of https://code-repo.d4science.org/D-Net/dnet-hadoop into 8172_impact_indicators_workflow 2023-03-21 18:26:09 +02:00
Serafeim Chatzopoulos 3e8a4cf952 Rearrange resources folder structure 2023-03-21 18:25:55 +02:00
Serafeim Chatzopoulos f992ecb657 Checkout BIP-Ranker during 'prepare-package' && add it in the oozie-package.tar.gz 2023-03-21 18:03:55 +02:00
Ilias Kanellos 9dc8f0f05f Add ActionSet step 2023-03-21 16:14:15 +02:00
Ilias Kanellos b5c252865c Add filtering based on citation source 2023-03-20 15:38:36 +02:00
Serafeim Chatzopoulos 720fd19b39 Add dhp-impact-indicators workflow files 2023-03-14 19:28:27 +02:00
Serafeim Chatzopoulos c6e39b7f33 Add dhp-impact-indicators 2023-03-14 18:50:54 +02:00
Michele Artini 200098b683 updated the order of the compatibilities 2023-02-22 11:52:59 +01:00
Michele Artini 9c1df15071 null values in date range conditions 2023-02-13 16:05:58 +01:00
Miriam Baglioni 32870339f5 refactoring after compile 2023-02-13 13:06:48 +01:00
Miriam Baglioni 7184cc0804 [FoS] added check for null on level1 subject 2023-02-13 13:03:49 +01:00
Miriam Baglioni 7473093c84 [FoS] changed the default separator from comma to tab to solve the issue in subject value split 2023-02-10 15:34:52 +01:00
Miriam Baglioni 5f0906be60 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-02-02 17:13:14 +01:00
Michele De Bonis 6a6c266dde implementation of author dedup configuration and lnfi clustering function 2023-01-31 11:53:10 +01:00
Claudio Atzori 1b37516578 [bulk tagging] better node naming 2023-01-20 16:11:26 +01:00
Claudio Atzori c1e2460293 [cleaning] the datasource master-duplicate fixup should not be brought to production yet 2023-01-20 09:20:26 +01:00
Claudio Atzori 3800361033 [country propagation] fixes error 'cannot resolve countrySet given input columns: []' when there is no prepared information driving the propagation process for a given result type 2023-01-19 15:57:43 +01:00
Michele Artini 699736addc NPE prevention 2023-01-11 13:14:44 +01:00
Claudio Atzori f86e19b282 code formatting 2023-01-11 09:53:19 +01:00
Michele Artini d40e20f437 Considering instance pids and alternative identifiers 2023-01-11 09:37:34 +01:00
Michele Artini 4953ae5649 fixed an invalid char 2023-01-11 08:35:53 +01:00
Miriam Baglioni c60d3a2b46 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2023-01-09 17:28:27 +01:00
Claudio Atzori 7becdaf31d Merge pull request 'Workaround to use new version of intellij on Master' (#266) from master_intellij into master
Reviewed-on: D-Net/dnet-hadoop#266
2022-12-23 10:32:21 +01:00
Miriam Baglioni b713132db7 [Cleaning] adding missing classes 2022-12-21 12:49:08 +01:00
Miriam Baglioni 11f2b470d3 [Cleaning] adding missing classes 2022-12-21 12:42:19 +01:00
Sandro La Bruzzo 91c70b15a5 updated lines function to its implementation linesWithSeparators.map(l => l.stripLineEnd); in this way we force the scala plugin compiler to consider this pipeline scala code and not the java.string.lines() pipeline 2022-12-21 11:14:42 +01:00
Claudio Atzori f910b7379d [cleaning] recovering missing resources from D-Net/dnet-hadoop#265 2022-12-21 09:26:34 +01:00
Claudio Atzori 33bdad104e [cleaning] align parameter names 2022-12-20 21:43:59 +01:00
Claudio Atzori 5816ded93f code formatting 2022-12-20 10:41:40 +01:00
Claudio Atzori 46972f8393 [orcid propagation] skip empty directory 2022-12-20 10:28:22 +01:00
Claudio Atzori da85ca697d Merge pull request 'cleanCountryOnMaster' (#265) from cleanCountryOnMaster into master
Reviewed-on: D-Net/dnet-hadoop#265
2022-12-16 15:58:44 +01:00
Miriam Baglioni 059e100ec7 [Clean Country] moving other resources for testing purposes 2022-12-16 15:48:21 +01:00
Miriam Baglioni fc95a550c3 [Clean Country] moving other resources for testing purposes 2022-12-16 15:46:32 +01:00
Miriam Baglioni 6901ac91b1 [Clean Country] moving source and resources to master 2022-12-16 15:42:49 +01:00
Claudio Atzori 08c4588d47 Merge pull request 'Changes from beta stats wf to prod' (#264) from antonis.lempesis/dnet-hadoop:beta into master
Reviewed-on: D-Net/dnet-hadoop#264
2022-12-07 15:56:22 +01:00
Miriam Baglioni 29d3da85f1 [EOSC DUMP] added resources needed for the review as test 2022-11-25 17:16:20 +01:00
Miriam Baglioni 33a2b1b5dc [Bulk Tag] fixed typo in test configuration 2022-11-23 11:31:17 +01:00
Miriam Baglioni c6df8327b3 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2022-11-23 11:26:57 +01:00
Miriam Baglioni 935aa367d8 [BulkTag] removed commented code 2022-11-23 11:16:39 +01:00
Miriam Baglioni 43aedbdfe5 [BulkTag] changed verb name in configuration 2022-11-23 11:14:23 +01:00
Miriam Baglioni b6da9b67ff [BulkTag] fixed typo in annotation for verb name 2022-11-23 11:13:58 +01:00
Michele De Bonis 14f6346676 implementation of the new software configuration 2022-11-22 17:48:34 +01:00
Claudio Atzori a34c8b6f81 Merge branch 'master' of https://code-repo.d4science.org/D-Net/dnet-hadoop 2022-11-22 10:22:31 +01:00
Miriam Baglioni 122e75aa17 fixed conflicts 2022-11-21 18:13:12 +01:00
Miriam Baglioni cee7a45b1d [Bulk Tag Datasource] fixed issue with verb name and add new test for neanias selection for orcid 2022-11-21 18:10:20 +01:00
Michele De Bonis 9fee2ed611 minor changes 2022-11-21 14:35:46 +01:00
Claudio Atzori ed64618235 increased spark.sql.shuffle.partitions in the last join phase of the result (publication) to community through semantic relation propagation 2022-11-18 16:06:51 +01:00
Claudio Atzori 8742934843 added spark.sql.shuffle.partitions in the last join phase of the result to community through semantic relation propagation 2022-11-18 11:32:22 +01:00
Claudio Atzori 13cc592f39 code formatting 2022-11-15 09:37:57 +01:00
Claudio Atzori af15b1e48d [eosc tag] extending criteria for Jupyter Notebook (adding to ORP the same constraint) 2022-11-14 18:30:43 +01:00
Claudio Atzori eb45ba7af0 extended mapping from ODF relations (PR#251) 2022-11-14 18:26:13 +01:00
Claudio Atzori a929dc5fee integrated changes for mapping ROHub contents in the Graph 2022-11-14 18:15:35 +01:00
Miriam Baglioni 5f9383b2d9 [EOSC TAG] remove redundant check for jupyter notebook 2022-11-11 14:06:19 +01:00
Miriam Baglioni b18bbca8af [EOSC TAG] adding search in orp for jupyter notebook criteria 2022-11-11 12:42:58 +01:00
dimitrispie 55fa3b2a17 Hive memory parameters 2022-11-03 15:21:04 +01:00
Claudio Atzori 80c5e0f637 code formatting 2022-09-27 12:51:51 +02:00
Claudio Atzori c01d528ab2 suppressing hyper verbose spark logs during unit test execution 2022-09-23 15:19:50 +02:00
Claudio Atzori e6d788d27a [stats wf] adding missing changes lost in PR#248 2022-09-23 14:38:42 +02:00
Claudio Atzori 930f118673 fixed semantic (subreltype) for ServiceOrganization relations 2022-09-22 16:24:44 +02:00
Claudio Atzori b2c3071e72 Merge branch 'master' into beta2master_sept_2022 2022-09-22 14:39:15 +02:00
Claudio Atzori 10ec074f79 Merge remote-tracking branch 'antonis.lempesis/beta' into beta2master_sept_2022 2022-09-22 14:12:19 +02:00
Claudio Atzori 7225fe9cbe integrated changes from discard-non-wellformed 2022-09-22 10:06:07 +02:00
Miriam Baglioni 869e129288 [EOSC BulkTag] refactoring 2022-09-20 16:13:18 +02:00
Miriam Baglioni 840465958b [EOSC BulkTag] filtering out the datasources registered in the eosc with compatibility different from 3.0, 4.0 for literature, data and CRIS to add the context eosc to the results 2022-09-20 10:30:41 +02:00
Claudio Atzori bdc8f993d0 [Patch Hosted By] check also the presence of datasource.officialname.value 2022-09-19 15:28:03 +02:00
Miriam Baglioni ec87149cb3 [Patch Hosted By] added fix to avoid NPE error when datasource official name is not provided. Removing datasources if no officialname has been provided 2022-09-19 14:06:52 +02:00
Miriam Baglioni b42e2c9df6 [Patch Hosted By] added fix to avoid NPE error when datasource official name is not provided 2022-09-19 12:30:32 +02:00
Miriam Baglioni 1329aa8479 [EOSC BulkTag] modified test to remove association of result to eosc when eoscifguidelines are set 2022-09-19 11:59:48 +02:00
Miriam Baglioni a0ee1a8640 [EOSC BulkTag] remove addition of eosc context for result with eosc if guidelines set 2022-09-19 11:44:10 +02:00
Claudio Atzori 96062164f9 Merge pull request '[Aggregator graph|master] Discard invalid records' (#245) from discard-non-wellformed into master
Reviewed-on: D-Net/dnet-hadoop#245
2022-09-19 09:48:16 +02:00
Claudio Atzori 35bb7c423f updated dhp-schemas version to 2.12.1 2022-09-16 16:13:15 +02:00
Claudio Atzori fd87571506 code formatting 2022-09-16 16:05:03 +02:00
Claudio Atzori c527112e33 Merge commit 'ff6f789b6d9be0567b6ad72f8a0e75fe3f52726a' into beta2master_sept_2022 2022-09-16 15:59:10 +02:00
Claudio Atzori 65209359bc Merge commit 'b5f7bd30be7f7adaaa28170740da0484b50a77ed' into beta2master_sept_2022 2022-09-16 15:58:11 +02:00
Claudio Atzori d72a64ded3 Merge commit '690be4482fc84327dc7617acbc8d976d559df512' into beta2master_sept_2022 2022-09-16 15:57:44 +02:00
Claudio Atzori 3e8499ce47 Merge commit '71b069ca90a2f7ec09d64241c60917d3636fc81e' into beta2master_sept_2022 2022-09-16 15:57:20 +02:00
Claudio Atzori 61aacb3271 Merge commit '1203378441dc6d8e8435cacd42e76e11746f6d1b' into beta2master_sept_2022 2022-09-16 15:56:55 +02:00
Claudio Atzori dbb567251a merged 853c996fa2 2022-09-16 15:56:28 +02:00
Claudio Atzori c7e8ad853e Merge commit '2b5f8c9c9a3611c57ee5febfe262a455a39ad801' into beta2master_sept_2022 2022-09-16 15:55:04 +02:00
Claudio Atzori 0849ebfd80 merged a11eb38065 2022-09-16 15:54:32 +02:00
Claudio Atzori 281239249e Merge commit 'b7c387c21f946adbc9da90ded95166205195edb0' into beta2master_sept_2022 2022-09-16 15:49:20 +02:00
Claudio Atzori 45fc5e12be Merge commit 'cb7c07c54e59675e8dffe42b7f2a13f16c956068' into beta2master_sept_2022 2022-09-16 15:48:55 +02:00
Claudio Atzori 1c05aaaa2e Merge commit '3418ce50ac9b28fed4fa949919e6c8208738cdcf' into beta2master_sept_2022 2022-09-16 15:48:36 +02:00
Claudio Atzori 01d5ad6361 Merge commit 'd85ba3c1a9d7f0e80565742161ff6c9ecffd52b7' into beta2master_sept_2022 2022-09-16 15:48:16 +02:00
Claudio Atzori d872d1cdd9 Merge commit 'a4815f6bec87f05be8cd740d236707949a0f746e' into beta2master_sept_2022 2022-09-16 15:47:49 +02:00
Claudio Atzori ab0efecab4 Merge commit '84598c75356cf580de6c81653a9351e9b8173639' into beta2master_sept_2022 2022-09-16 15:47:05 +02:00
Claudio Atzori 725c3c68d0 Merge commit '844f6eb46533cdd4be3210401b10401322079640' into beta2master_sept_2022 2022-09-16 15:46:40 +02:00
Claudio Atzori 300ae6221c Merge commit '32cee1f619eb30d2e2ac6083435b76b1aba7db09' into beta2master_sept_2022 2022-09-16 15:45:57 +02:00
Claudio Atzori 0ec2eaba35 Merge commit 'c1f2ffc53dc41f1fac3855b2d2df7d6a5ea15e3e' into beta2master_sept_2022 2022-09-16 15:45:27 +02:00
Claudio Atzori a387807d43 Merge commit 'b78889a0ce27a79c7ab2d8da05b118ee4f1bcb36' into beta2master_sept_2022 2022-09-16 15:44:17 +02:00
Claudio Atzori 2abe2bc137 Merge commit '08ce2cadc2d84aa982726e429c280a905536a715' into beta2master_sept_2022 2022-09-16 15:43:49 +02:00
Claudio Atzori a07c876922 Merge commit '27a91841e7fa2a1b615b4d1e161d606db5bead96' into beta2master_sept_2022 2022-09-16 15:43:02 +02:00
Claudio Atzori cbd48bc645 Merge commit 'efd96e7e664e4139321e35e8d172b884ba4b61a1' into beta2master_sept_2022 2022-09-16 15:38:56 +02:00
miconis 9ddd24ba36 implementation of comparators and clustering function for the author deduplication 2022-04-19 10:18:09 +02:00
miconis 97a32faf9b test implementation for the new fdup version 2022-04-13 09:48:56 +02:00
miconis 10172553ab [maven-release-plugin] prepare for next development iteration 2022-03-15 15:06:18 +01:00
miconis bd919ac98d [maven-release-plugin] prepare release dnet-dedup-4.1.12 2022-03-15 15:06:12 +01:00
miconis a965233dd0 bug fix in the normalization of a legalname, city map updated and transliteration support added 2022-03-15 14:59:13 +01:00
miconis ac9708e31b [maven-release-plugin] prepare for next development iteration 2022-03-09 13:43:48 +01:00
miconis a5a6054039 [maven-release-plugin] prepare release dnet-dedup-4.1.11 2022-03-09 13:43:44 +01:00
miconis 3bc07c5881 bug fix in the AuthorMatch, implementation of the concat function in the model creation with jpath query 2022-03-09 12:53:09 +01:00
miconis 699612dd17 implementation of the size threshold on authors list match 2022-03-08 16:49:28 +01:00
miconis 8f07f0c537 [maven-release-plugin] prepare for next development iteration 2022-01-13 17:22:16 +01:00
miconis 620e35db28 [maven-release-plugin] prepare release dnet-dedup-4.1.10 2022-01-13 17:22:12 +01:00
miconis 2ff97781d2 minor change 2022-01-13 17:20:20 +01:00
miconis 1ff6a3dc11 [maven-release-plugin] prepare for next development iteration 2022-01-13 15:15:05 +01:00
miconis 003bcf1699 [maven-release-plugin] prepare release dnet-dedup-4.1.9 2022-01-13 15:15:00 +01:00
miconis 2f1ba56f61 bug fix in the authormatch comparator, implementation of tests 2022-01-13 11:58:28 +01:00
miconis cea8440153 [maven-release-plugin] prepare for next development iteration 2021-12-30 13:11:57 +01:00
miconis eb48d31ea6 [maven-release-plugin] prepare release dnet-dedup-4.1.8 2021-12-30 13:11:52 +01:00
miconis a224bf70a4 implementation of new comparators for publication dedup configuration update 2021-12-27 17:35:02 +01:00
miconis 8f1db32921 implementation of the instance type comparator and its tests 2021-11-04 15:20:57 +01:00
miconis fbb1b66bfb dedup test implementation & graph drawing tools 2021-09-13 14:53:19 +02:00
miconis 1144d50a11 [maven-release-plugin] prepare for next development iteration 2021-05-03 16:09:56 +02:00
miconis f33a18ca9d [maven-release-plugin] prepare release dnet-dedup-4.1.7 2021-05-03 16:09:08 +02:00
miconis 4bce4f2e8e minor change: version updated 2021-05-03 16:05:39 +02:00
miconis c6266242e3 Merge branch 'master' of code-repo.d4science.org:D-Net/dnet-dedup 2021-05-03 15:38:00 +02:00
miconis 4988e9f80d implementation of cross comparison for different fields, addition of clustering mechanism to collapse keys from different clustering functions on the same cluster 2021-05-03 15:37:41 +02:00
Claudio Atzori 58d013e24f [maven-release-plugin] prepare for next development iteration 2021-04-12 16:12:15 +02:00
Claudio Atzori 3a7336157b [maven-release-plugin] prepare release dnet-dedup-4.0.6 2021-04-12 16:12:10 +02:00
miconis ed0d5d3e1d implementation of the wf to dedup entities, addition of the module to run the wf on the cluster 2020-12-04 15:41:31 +01:00
miconis 72116446ec [maven-release-plugin] prepare for next development iteration 2020-09-29 12:06:38 +02:00
miconis 05a03d97cd [maven-release-plugin] prepare release dnet-dedup-4.0.5 2020-09-29 12:06:35 +02:00
miconis 2a01022712 minor changes 2020-09-29 12:05:50 +02:00
miconis dd34e371d7 fixed error in the treeprocessor: it used th=-1 as default value, now it uses th=1 2020-09-29 12:01:25 +02:00
miconis 19c3c90d7b fixed error in the block processor: entities with orderField=null were not considered 2020-09-19 17:43:41 +02:00
Sandro La Bruzzo a109ebe287 fixed NPE 2020-08-06 10:27:05 +02:00
miconis a5a3ea24f8 [maven-release-plugin] prepare for next development iteration 2020-07-16 18:59:25 +02:00
miconis 840fe8f4d3 [maven-release-plugin] prepare release dnet-dedup-4.0.4 2020-07-16 18:59:22 +02:00
miconis 07ab904d60 implementation of the clustering function for the suffixprefix chain 2020-07-16 18:57:55 +02:00
Claudio Atzori eaf7defe0c [maven-release-plugin] prepare for next development iteration 2020-07-15 17:57:09 +02:00
Claudio Atzori ff2c8eba12 [maven-release-plugin] prepare release dnet-dedup-4.0.3 2020-07-15 17:57:04 +02:00
Claudio Atzori 7cc3742a26 removed maven release.property 2020-07-15 17:52:27 +02:00
Claudio Atzori 14611ea450 reverted to 4.0.3-SNAPSHOT 2020-07-15 17:37:36 +02:00
Claudio Atzori 9f20f23870 Revert "wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files"
This reverts commit 51d91fa520.
2020-07-15 17:35:56 +02:00
Claudio Atzori 9efcd8e245 Revert "reverted to 4.0.3-SNAPSHOT"
This reverts commit ec97983ce1.
2020-07-15 17:28:37 +02:00
Claudio Atzori ba493f9ab8 [maven-release-plugin] rollback the release of dnet-dedup-4.0.3 2020-07-15 17:24:43 +02:00
Claudio Atzori 6c98d4c436 [maven-release-plugin] prepare release dnet-dedup-4.0.3 2020-07-15 17:24:25 +02:00
Claudio Atzori ec97983ce1 reverted to 4.0.3-SNAPSHOT 2020-07-15 17:20:12 +02:00
Claudio Atzori 51d91fa520 wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files 2020-07-15 17:13:45 +02:00
Claudio Atzori b79ea97107 Revert "wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files"
This reverts commit d2861950ac.
2020-07-15 17:11:46 +02:00
Claudio Atzori 92aadbfc7b [maven-release-plugin] prepare release dnet-dedup-4.0.3 2020-07-15 17:04:20 +02:00
Claudio Atzori d2861950ac wordssuffixprefix: adjust the token length according to the number of words; removed maven release temporary files 2020-07-15 16:49:47 +02:00
miconis 244a037a90 implementation of a class to test the clustering functions 2020-07-12 10:13:54 +02:00
miconis 7aa2001a8b [maven-release-plugin] prepare for next development iteration 2020-07-02 17:06:38 +02:00
miconis c72055f543 [maven-release-plugin] prepare release dnet-dedup-4.0.2 2020-07-02 17:06:36 +02:00
miconis f933fd33e0 implemented new function for clustering 2020-07-02 17:04:17 +02:00
miconis 411d1cc24f implementation of the test for the dedup and addition of new support classes 2020-06-11 10:46:46 +02:00
miconis 48c094f599 [maven-release-plugin] prepare for next development iteration 2020-04-24 14:39:01 +02:00
miconis 4365ba41c9 [maven-release-plugin] prepare release dnet-dedup-4.0.1 2020-04-24 14:38:58 +02:00
miconis 6e9b27f37d implementation of the mechanism to truncate the string and the lists 2020-04-24 14:36:42 +02:00
Sandro La Bruzzo 8e4211708e [maven-release-plugin] prepare for next development iteration 2020-02-10 12:51:04 +01:00
Sandro La Bruzzo 24e2ab9092 [maven-release-plugin] prepare release dnet-dedup-4.0.0 2020-02-10 12:50:45 +01:00
Sandro La Bruzzo 46727f5c76 upgraded maven version of commons-lang 2020-02-10 12:38:40 +01:00
miconis 5c8f6febee minor changes in comparators 2020-01-24 10:01:11 +01:00
miconis 4dce785375 update in the implementation of the tree: addition of new logic aggregations and statistics 2020-01-14 11:42:43 +02:00
miconis b3748b8d77 minor changes 2019-12-18 16:20:35 +01:00
miconis b21b1b8f61 implementation of new aggregation in the tree node processing 2019-12-18 16:19:36 +01:00
miconis 20fcfe6328 implementation of new aggregation in the tree node processing 2019-12-18 16:19:26 +01:00
Sandro La Bruzzo d924f28b93 fixed wrong use of jspath 2019-12-18 09:29:44 +01:00
miconis 84aaa65501 implementation of new json comparator and update of the publication configuration 2019-12-17 09:16:26 +01:00
Sandro La Bruzzo 5c01ae4c92 merged JqMapping branch into tree2 2019-12-13 11:30:02 +01:00
Sandro La Bruzzo 35008fdbf9 fix stuff 2019-12-06 15:28:30 +01:00
Sandro La Bruzzo 16c670a5d5 Improved deduplication 2019-12-05 14:14:25 +01:00
miconis 49f9beb4a8 implementation of romansmatch and re-implementation of the getNumber function. New terms in the translation map and update of the configuration 2019-11-28 16:54:44 +01:00
miconis f791730330 addition of one term to the translation maps in the configurations 2019-11-27 15:48:37 +01:00
miconis d2278fe358 minor change in the citymatch 2019-11-21 10:54:02 +01:00
miconis 8c0d346005 the param map has been updated: now it accepts string parameters 2019-11-21 09:37:56 +01:00
miconis ddd40540aa jarowinklernormalizedname split into 3 different comparators: citymatch, keywordmatch and jarowinkler. Implementation of the TreeStatistic support functions 2019-11-20 10:45:00 +01:00
miconis c687956371 code cleaning and implementation of the TreeDedup + minor changes 2019-11-14 10:01:21 +01:00
miconis 0973899865 code cleaning, distribution of the classes in packages and implementation of the new configuration 2019-11-07 12:47:12 +01:00
miconis 30a873265f put the last modification of the master branch into the tree2. Addition of the configuration as parameter of the comparator. This is to allow the comparator to access it 2019-10-29 16:38:42 +01:00
miconis 1beb776691 minor changes 2019-10-29 15:58:21 +01:00
miconis 075f741d28 [maven-release-plugin] prepare for next development iteration 2019-10-24 11:34:19 +02:00
miconis ced4bcdd59 [maven-release-plugin] prepare release dnet-dedup-3.0.15 2019-10-24 11:34:12 +02:00
miconis 13f93e6055 Revert "[maven-release-plugin] prepare release dnet-dedup-3.0.15"
This reverts commit cf93515d94.
2019-10-24 11:23:01 +02:00
miconis cf93515d94 [maven-release-plugin] prepare release dnet-dedup-3.0.15 2019-10-24 11:17:07 +02:00
miconis 285ec3ca17 release rollback 2019-10-24 11:11:07 +02:00
miconis 5f249fd56c minor changes 2019-10-23 16:37:20 +02:00
miconis c9863debfa minor changes and configuration updates (synonym field added) 2019-10-23 16:31:45 +02:00
miconis 5499ca17c3 minor changes 2019-10-08 16:49:07 +02:00
miconis 50b7a12b3f normalization of the term in the translation map added 2019-10-08 15:13:45 +02:00
miconis 26b383fea2 translation map moved in json configuration, support for synonyms added in the configuration, now the configuration is argument of conditions, distancealgos and clusteringfunctions 2019-10-08 14:53:52 +02:00
Claudio Atzori 07355d2811 [maven-release-plugin] prepare for next development iteration 2019-09-25 10:39:46 +02:00
Claudio Atzori 254eb46809 [maven-release-plugin] prepare release dnet-dedup-3.0.14 2019-09-25 10:39:39 +02:00
Claudio Atzori 74c6462b49 updated translation map and some tests 2019-09-25 10:15:13 +02:00
miconis aed81e4cfa translation map updated 2019-09-25 09:53:06 +02:00
miconis afd2b398d5 optimize imports 2019-08-09 15:42:41 +02:00
miconis d71dae5fd2 implementation of the conditions in tree nodes. get rid of the conditions part of the configuration 2019-08-09 15:41:49 +02:00
miconis a5c5d2f01b implementation of the decision tree. It takes the place of the distance algos; necessaryConditions and sufficientConditions are still there. The model contains only path, type and name of the field. ignoreMissing is still in the model because it is used by the conditions. 2019-08-09 10:08:34 +02:00
miconis f2136e1024 code refactoring: useless module removed 2019-08-07 15:16:59 +02:00
miconis 8c867101ef addition of a fixSpecial function to address the problem with special character in organization names, addition of new terms in translation maps 2019-08-06 17:06:05 +02:00
miconis 4502b44337 addition of the BlockUtils class for meta-blocking, implementation of a new local test with edge filtering example 2019-08-06 12:09:34 +02:00
miconis cffb712a99 Merge branch 'master' of https://github.com/dnet-team/dnet-dedup 2019-07-19 17:10:53 +02:00
miconis a85576c27e restyling of the JaroWinklerNormalizedName comparator, now it is optimized. Addition of some translations in the translation maps, addition of a clustering based on keywords in organizations legalnames 2019-07-19 17:10:29 +02:00
Claudio Atzori 6cb846331a [maven-release-plugin] prepare for next development iteration 2019-07-08 11:12:52 +02:00
Claudio Atzori c04d2232c2 [maven-release-plugin] prepare release dnet-dedup-3.0.13 2019-07-08 11:12:45 +02:00
miconis fb5e38db26 Merge branch 'master' of https://github.com/dnet-team/dnet-dedup 2019-07-08 11:02:29 +02:00
miconis 3c6f8d1e44 bug fixing in the keywordsclustering class 2019-07-08 11:01:49 +02:00
Claudio Atzori a69022617d [maven-release-plugin] prepare for next development iteration 2019-07-08 10:11:24 +02:00
Claudio Atzori c6baeb93d4 [maven-release-plugin] prepare release dnet-dedup-3.0.12 2019-07-08 10:11:17 +02:00
miconis f5de20a508 [maven-release-plugin] rollback the release of dnet-dedup-3.0.12 2019-07-08 10:00:48 +02:00
miconis ba50aa8654 [maven-release-plugin] prepare for next development iteration 2019-07-08 09:48:10 +02:00
miconis 7065110a21 [maven-release-plugin] prepare release dnet-dedup-3.0.12 2019-07-08 09:48:03 +02:00
miconis 15bec5e876 addition of doi normalization in PidMatch comparator, addition of keywordsclustering (clustering based on terms in the translation maps for the organizations), minor changes 2019-07-08 09:44:02 +02:00
Claudio Atzori 2dcffb965f [maven-release-plugin] prepare for next development iteration 2019-06-19 10:02:39 +02:00
Claudio Atzori 85126c59f7 [maven-release-plugin] prepare release dnet-dedup-3.0.11 2019-06-19 10:02:32 +02:00
Claudio Atzori 15d7b584f3 optimized classpath resolvers 2019-06-19 10:01:35 +02:00
Claudio Atzori ff4956def9 [maven-release-plugin] prepare for next development iteration 2019-06-18 14:46:34 +02:00
Claudio Atzori eb5ce312a3 [maven-release-plugin] prepare release dnet-dedup-3.0.10 2019-06-18 14:46:27 +02:00
Claudio Atzori f2bc665403 avoid division by zero: in case of missing values, return undefined response 2019-06-18 14:45:15 +02:00
Claudio Atzori e3f86b92c8 cleanup 2019-06-18 14:44:42 +02:00
miconis 54e4d0af04 exact match condition gives undefined if a field is missing; ignoremissing semantics changed: now it performs the comparison in any case if =true; if =false, it gives -1 in case of missing 2019-06-18 14:05:31 +02:00
miconis e8db8f2abb implementation of the integration test, addition of document blocks to group entities after clustering 2019-05-21 16:38:26 +02:00
Claudio Atzori f7a3bdf3f8 [maven-release-plugin] prepare for next development iteration 2019-04-03 12:35:00 +02:00
Claudio Atzori 98c179c8fb [maven-release-plugin] prepare release dnet-dedup-3.0.9 2019-04-03 12:34:52 +02:00
miconis 3e61a90c8f [maven-release-plugin] rollback the release of dnet-dedup-3.0.9 2019-04-03 12:27:28 +02:00
miconis 15fb9eb883 [maven-release-plugin] prepare for next development iteration 2019-04-03 12:26:05 +02:00
miconis a1ff4daa7f [maven-release-plugin] prepare release dnet-dedup-3.0.9 2019-04-03 12:25:56 +02:00
miconis 1d29bae47c branch cities merged into master 2019-04-03 12:22:33 +02:00
miconis 7e7018c51f addition of a sparktester test, implementation of 2 different classes for testing in dnet-dedup-test module, addition of new terms in the vocabulary and change in the implementation of the JaroWinklerNormalizedName comparator 2019-04-03 09:40:14 +02:00
miconis 4bd5a9beee minor changes 2019-03-26 15:48:21 +01:00
Michele De Bonis 662448e584 update of the comparator for legalnames of organizations 2019-03-21 14:27:27 +01:00
Claudio Atzori f2394fcd9f [maven-release-plugin] prepare for next development iteration 2019-02-18 09:09:14 +01:00
Claudio Atzori 722431dde1 [maven-release-plugin] prepare release dnet-dedup-3.0.8 2019-02-18 09:09:07 +01:00
Claudio Atzori 470c4b0f20 default configuration includes configurationId 2019-02-18 09:07:23 +01:00
Claudio Atzori ccb7e83196 [maven-release-plugin] prepare for next development iteration 2019-02-17 12:56:19 +01:00
Claudio Atzori 7d8e62d4cc [maven-release-plugin] prepare release dnet-dedup-3.0.7 2019-02-17 12:56:11 +01:00
Claudio Atzori 968cd47436 replace existing attributes when loading default configuration 2019-02-17 12:48:25 +01:00
Michele De Bonis 0735f3a822 implementation of the test classes and minor changes 2019-02-08 12:56:47 +01:00
Michele De Bonis 7a8d28991f implementation of the decision tree for the deduplication of the authors, implementation of multiple comparators to be used in a tree node and definition of the proto for person entity 2018-12-20 09:54:41 +01:00
Michele De Bonis 39613dbbd6 implementation of the decisional tree, addition of the dnet-openaire-data-protos module, definition of the person proto, blockprocessor and paceconfig modified with addition of support for the tree processing 2018-12-12 16:30:03 +01:00
Claudio Atzori f1c68d8ba3 apply limits (length, size) to pace Fields 2018-11-20 10:51:38 +01:00
Claudio Atzori c5979ffe18 [maven-release-plugin] prepare for next development iteration 2018-11-19 17:41:45 +01:00
Claudio Atzori 9869dff1d2 [maven-release-plugin] prepare release dnet-dedup-3.0.6 2018-11-19 17:41:37 +01:00
Claudio Atzori c2d4cb3ba6 added new properties to FieldDef (size, length) to limit the information mapped onto each MapDocument 2018-11-19 17:37:57 +01:00
Claudio Atzori 394fcafd41 [maven-release-plugin] prepare for next development iteration 2018-11-17 09:13:16 +01:00
Claudio Atzori 397554130c [maven-release-plugin] prepare release dnet-dedup-3.0.5 2018-11-17 09:13:09 +01:00
Claudio Atzori 0dfb2ea600 added distance function for software titles 2018-11-17 09:11:38 +01:00
Michele De Bonis 3d4372ced9 addition of cities check 2018-11-16 16:11:03 +01:00
Claudio Atzori 55a9b4f501 [maven-release-plugin] prepare for next development iteration 2018-11-16 09:18:00 +01:00
Claudio Atzori 35ab630493 [maven-release-plugin] prepare release dnet-dedup-3.0.4 2018-11-16 09:17:53 +01:00
Claudio Atzori 399e4bc80f default (empty) configuration should be aligned with the updated model 2018-11-15 16:52:56 +01:00
Claudio Atzori 59bab8dba4 less verbose logging 2018-11-13 09:07:45 +01:00
Claudio Atzori 478ad72cb8 propagate exceptions in case of serialization errors, removed configuration pretty printing, removed unused class ScoredResult 2018-11-12 15:52:18 +01:00
Claudio Atzori f7616c7a8a [maven-release-plugin] prepare for next development iteration 2018-11-12 14:23:36 +01:00
Claudio Atzori df4b871c8b [maven-release-plugin] prepare release dnet-dedup-3.0.3 2018-11-12 14:23:29 +01:00
Michele De Bonis 72a9b3139e Merge branch 'master' of https://github.com/dnet-team/dnet-dedup 2018-11-12 14:11:26 +01:00
Michele De Bonis b5062f5429 configuration file updated, addition of condition on domain 2018-11-12 14:11:15 +01:00
Claudio Atzori 2a509b18fa [maven-release-plugin] prepare for next development iteration 2018-11-12 12:46:50 +01:00
Claudio Atzori e247218987 [maven-release-plugin] prepare release dnet-dedup-3.0.2 2018-11-12 12:46:42 +01:00
Claudio Atzori b7bc7f0401 getting rid of spark libs from dnet-pace-core 2018-11-12 12:46:06 +01:00
Claudio Atzori 3dacba37ea [maven-release-plugin] prepare for next development iteration 2018-11-12 11:40:42 +01:00
Claudio Atzori 8cc2517f5d [maven-release-plugin] prepare release dnet-dedup-3.0.1 2018-11-12 11:40:34 +01:00
Claudio Atzori 851ae5eec3 [maven-release-plugin] rollback the release of dnet-dedup-3.0.1 2018-11-12 11:39:07 +01:00
Claudio Atzori f283d58a6e [maven-release-plugin] prepare release dnet-dedup-3.0.1 2018-11-12 11:38:52 +01:00
Claudio Atzori 6d09041288 [maven-release-plugin] rollback the release of dnet-dedup-3.0.1 2018-11-12 11:28:28 +01:00
Claudio Atzori 46cee13596 [maven-release-plugin] prepare for next development iteration 2018-11-12 11:24:06 +01:00
Claudio Atzori e1c69ad24e [maven-release-plugin] prepare release dnet-dedup-3.0.1 2018-11-12 11:23:57 +01:00
Michele De Bonis b247a86e69 configuration files changed: dedupRun instead of run, assertion updated in tests 2018-11-06 11:02:00 +01:00
Michele De Bonis 4c8485d0bb deleted useless imports 2018-11-06 09:48:22 +01:00
Michele De Bonis 748189af10 implementation of JaroWinklerNormalizedName, addition of various stopwords in different languages and configuration test 2018-11-05 17:22:59 +01:00
Claudio Atzori e296f7a81c added DiffPatchMatch utility. Resumed commented tests! 2018-10-31 10:49:11 +01:00
Michele De Bonis dc41b76643 serialization test added. useless getter methods ignored by json serialization 2018-10-29 16:16:11 +01:00
Michele De Bonis ea36007d1f DedupConf parsed using Jackson library 2018-10-29 11:13:55 +01:00
Michele De Bonis 8b4762bf54 implementation of the toString methonds changed: from Gson to Jackson 2018-10-26 14:55:59 +02:00
Michele De Bonis 3cf3dc1934 modification in the initialization of clustering functions, distance algos and conditions. 2018-10-25 15:15:40 +02:00
Michele De Bonis 1cbbc3f15a update in the discovery of clustering, conditions and distance functions (annotated with custom annotations) 2018-10-24 12:09:41 +02:00
Claudio Atzori 4d379c2227 revised PidMatch implementation, cleanup 2018-10-20 08:38:19 +02:00
Claudio Atzori 3197f26691 [maven-release-plugin] prepare for next development iteration 2018-10-18 12:17:34 +02:00
Claudio Atzori 63815be2d6 [maven-release-plugin] prepare release dnet-dedup-3.0.0 2018-10-18 12:17:27 +02:00
Claudio Atzori ed14476b06 [maven-release-plugin] rollback the release of dnet-dedup-3.0.0 2018-10-18 12:13:03 +02:00
Claudio Atzori 82d5dce114 [maven-release-plugin] prepare release dnet-dedup-3.0.0 2018-10-18 12:12:45 +02:00
Claudio Atzori 4f29124607 [maven-release-plugin] rollback the release of dnet-dedup-3.0.0 2018-10-18 12:00:45 +02:00
Claudio Atzori 5a48937ae1 [maven-release-plugin] prepare for next development iteration 2018-10-18 11:58:43 +02:00
Claudio Atzori 5aec80345f [maven-release-plugin] prepare release dnet-dedup-3.0.0 2018-10-18 11:58:36 +02:00
Claudio Atzori 1b46966383 updated maven project structure 2018-10-18 11:56:26 +02:00
Michele De Bonis 72ebf7c0f3 update of the spark test 2018-10-18 10:12:44 +02:00
Sandro La Bruzzo 1bb5c26e6d Added first Spark implementation of dedup 2018-10-11 15:19:20 +02:00
Sandro La Bruzzo d1c73bcf90 Added First Implementation of Spark Test 2018-10-02 17:07:17 +02:00
Sandro La Bruzzo 476c3d7b07 added d-net pace core module and ignored target folder 2018-10-02 10:37:54 +02:00
162 changed files with 7823 additions and 2557 deletions

1
.gitignore vendored
View File

@ -26,3 +26,4 @@ spark-warehouse
/**/*.log
/**/.factorypath
/**/.scalafmt.conf
/.java-version

128
README.md
View File

@ -1,2 +1,128 @@
# dnet-hadoop
Dnet-hadoop is the project that defined all the OOZIE workflows for the OpenAIRE Graph construction, processing, provisioning.
Dnet-hadoop is the project that defines all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing, and provisioning.
How to build, package and run oozie workflows
====================
Oozie-installer is a utility for building, uploading and running oozie workflows. In practice, it creates a `*.tar.gz`
package that contains the resources defining a workflow and some helper scripts.
This module is automatically executed when running:
`mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app`
on a module having set:
```
<parent>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-workflows</artifactId>
</parent>
```
in the `pom.xml` file. The `oozie-package` profile initializes oozie workflow packaging, while the `workflow.source.dir` property points to
a workflow (notice: this is not a relative path but a classpath to the directory usually holding the `oozie_app` subdirectory).
The outcome of this packaging is an `oozie-package.tar.gz` file containing all the resources required to run the Oozie workflow:
- jar packages
- workflow definitions
- job properties
- maintenance scripts
Required properties
====================
In order to include the proper workflow within the package, the `workflow.source.dir` property has to be set. It can be provided
by setting the `-Dworkflow.source.dir=some/job/dir` maven parameter.
In order to define the full set of cluster environment properties, one should create a `~/.dhp/application.properties` file with
the following properties:
- `dhp.hadoop.frontend.user.name` - your user name on the hadoop cluster and frontend machine
- `dhp.hadoop.frontend.host.name` - frontend host name
- `dhp.hadoop.frontend.temp.dir` - frontend directory for temporary files
- `dhp.hadoop.frontend.port.ssh` - frontend machine ssh port
- `oozieServiceLoc` - oozie service location, required by the `run_workflow.sh` script executing the oozie job
- `nameNode` - name node address
- `jobTracker` - job tracker address
- `oozie.execution.log.file.location` - location of the file that will be created when executing the oozie job; it contains the output
produced by the `run_workflow.sh` script (needed to obtain the oozie job id)
- `maven.executable` - mvn command location; requires parameterization due to a different setup of the CI cluster
- `sparkDriverMemory` - amount of memory assigned to the driver of spark jobs
- `sparkExecutorMemory` - amount of memory assigned to the executors of spark jobs
- `sparkExecutorCores` - number of cores assigned to the executors of spark jobs
All values will be overridden by the ones from `job.properties` and, optionally, `job-override.properties` stored in the module's
main folder.
To override properties from `job.properties`, a `job-override.properties` file can be created in the main module directory
(the one containing the `pom.xml` file), defining all the new properties that will override the existing ones.
Alternatively, those properties can be provided one by one as command line `-D` arguments.
Properties overriding order is the following:
1. `pom.xml` defined properties (located in the project root dir)
2. `~/.dhp/application.properties` defined properties
3. `${workflow.source.dir}/job.properties`
4. `job-override.properties` (located in the project root dir)
5. `maven -Dparam=value`
where the maven `-Dparam` property overrides all the other ones.
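For reference, a `~/.dhp/application.properties` file built from the properties above might look as follows; all values are illustrative placeholders, not defaults:
```
dhp.hadoop.frontend.user.name=jsmith
dhp.hadoop.frontend.host.name=hadoop-frontend.example.org
dhp.hadoop.frontend.temp.dir=/home/jsmith/tmp
dhp.hadoop.frontend.port.ssh=22
oozieServiceLoc=http://hadoop-frontend.example.org:11000/oozie
nameNode=hdfs://namenode.example.org:8020
jobTracker=yarnRM
oozie.execution.log.file.location=target/extract-and-run-on-remote-host.log
maven.executable=mvn
sparkDriverMemory=4G
sparkExecutorMemory=6G
sparkExecutorCores=2
```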
Workflow definition requirements
====================
`workflow.source.dir` property should point to the following directory structure:
[${workflow.source.dir}]
|
|-job.properties (optional)
|
\-[oozie_app]
  |
  \-workflow.xml
This property can be set using the maven `-D` switch.
`[oozie_app]` is the default directory name; however, it can be set to any value, as long as the `oozieAppDir` property is
provided with the directory name as its value.
Sub-workflows are supported as well, and sub-workflow directories should be nested within the `[oozie_app]` directory; a minimal example of the workflow descriptor itself is sketched below.
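For orientation only, a minimal `oozie_app/workflow.xml` satisfying the structure above could look like the following sketch (a single hypothetical filesystem action; the workflow name and action body are illustrative):
```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="example_wf">
    <start to="make_dir"/>
    <action name="make_dir">
        <fs>
            <mkdir path="${nameNode}/tmp/example_wf"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```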
Creating oozie installer step-by-step
=====================================
The automated oozie-installer steps are the following:
1. creating jar packages: `*.jar` and `*tests.jar`, along with copying all dependencies into `target/dependencies`
2. reading properties from maven, `~/.dhp/application.properties`, `job.properties` and `job-override.properties`
3. invoking the priming mechanism linking resources from the `import.txt` file (currently resolving sub-workflow resources)
4. assembling shell scripts for preparing the Hadoop filesystem, uploading the Oozie application and starting the workflow
5. copying the whole `${workflow.source.dir}` content to `target/${oozie.package.file.name}`
6. generating an updated `job.properties` file in `target/${oozie.package.file.name}` based on maven,
`~/.dhp/application.properties`, `job.properties` and `job-override.properties`
7. creating a `lib` directory (or multiple directories, one for each nested sub-workflow directory) and copying the jar packages
created at step (1) into each of them
8. bundling the whole `${oozie.package.file.name}` directory into a single tar.gz package
Uploading oozie package and running workflow on cluster
=======================================================
In order to simplify the deployment and execution process, two dedicated profiles were introduced:
- `deploy`
- `run`
to be used along with the `oozie-package` profile, e.g. by providing the `-Poozie-package,deploy,run` maven parameters.
The `deploy` profile supplements the packaging process with:
1) uploading the oozie package via scp to the `/home/${user.name}/oozie-packages` directory on the `${dhp.hadoop.frontend.host.name}` machine
2) extracting the uploaded package
3) uploading the oozie content to the hadoop cluster HDFS location defined in the `oozie.wf.application.path` property (generated dynamically by the maven build process, based on the `${dhp.hadoop.frontend.user.name}` and `workflow.source.dir` properties)
The `run` profile introduces:
1) executing the oozie application uploaded to the HDFS cluster by the `deploy` step; it triggers the `run_workflow.sh` script, providing the runtime properties defined in the `job.properties` file.
Notice: ssh access to the frontend machine has to be configured at the system level, and it is preferable to set up key-based authentication in order to simplify remote operations.
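Putting it together, a complete build-deploy-run invocation therefore looks like (reusing the placeholder path from above):
`mvn package -Poozie-package,deploy,run -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app`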

View File

@ -62,6 +62,11 @@
</build>
<dependencies>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-pace-core</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
@ -124,12 +129,6 @@
<dependency>
<groupId>eu.dnetlib</groupId>
<artifactId>cnr-rmi-api</artifactId>
<exclusions>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
@ -150,11 +149,6 @@
<artifactId>okhttp</artifactId>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-pace-core</artifactId>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
@ -167,7 +161,7 @@
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-schemas_${scala.binary.version}</artifactId>
<artifactId>${dhp-schemas.artifact}</artifactId>
</dependency>
<dependency>

View File

@ -11,25 +11,18 @@ import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.oaf.Oaf;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.common.ModelSupport;
public class DispatchEntitiesSparkJob {
private static final Logger log = LoggerFactory.getLogger(DispatchEntitiesSparkJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
@ -54,44 +47,51 @@ public class DispatchEntitiesSparkJob {
String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath);
String graphTableClassName = parser.get("graphTableClassName");
log.info("graphTableClassName: {}", graphTableClassName);
@SuppressWarnings("unchecked")
Class<? extends OafEntity> entityClazz = (Class<? extends OafEntity>) Class.forName(graphTableClassName);
boolean filterInvisible = Boolean.parseBoolean(parser.get("filterInvisible"));
log.info("filterInvisible: {}", filterInvisible);
SparkConf conf = new SparkConf();
runWithSparkSession(
conf,
isSparkSessionManaged,
spark -> {
HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration());
dispatchEntities(spark, inputPath, entityClazz, outputPath);
});
spark -> dispatchEntities(spark, inputPath, outputPath, filterInvisible));
}
private static <T extends Oaf> void dispatchEntities(
private static void dispatchEntities(
SparkSession spark,
String inputPath,
Class<T> clazz,
String outputPath) {
String outputPath,
boolean filterInvisible) {
spark
.read()
.textFile(inputPath)
.filter((FilterFunction<String>) s -> isEntityType(s, clazz))
.map((MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"), Encoders.STRING())
.map(
(MapFunction<String, T>) value -> OBJECT_MAPPER.readValue(value, clazz),
Encoders.bean(clazz))
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(outputPath);
Dataset<String> df = spark.read().textFile(inputPath);
ModelSupport.oafTypes.entrySet().parallelStream().forEach(entry -> {
String entityType = entry.getKey();
Class<?> clazz = entry.getValue();
final String entityPath = outputPath + "/" + entityType;
if (!entityType.equalsIgnoreCase("relation")) {
HdfsSupport.remove(entityPath, spark.sparkContext().hadoopConfiguration());
Dataset<Row> entityDF = spark
.read()
.schema(Encoders.bean(clazz).schema())
.json(
df
.filter((FilterFunction<String>) s -> s.startsWith(clazz.getName()))
.map(
(MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"),
Encoders.STRING()));
if (filterInvisible) {
entityDF = entityDF.filter("dataInfo.invisible != true");
}
entityDF
.write()
.mode(SaveMode.Overwrite)
.option("compression", "gzip")
.json(entityPath);
}
});
}
private static <T extends Oaf> boolean isEntityType(final String s, final Class<T> clazz) {
return StringUtils.substringBefore(s, "|").equals(clazz.getName());
}
}

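A note on the change above: both the old and the new implementation rely on the same intermediate record format, where each text line carries the fully qualified entity class name, a `|` separator and the JSON payload. A minimal sketch of that convention (the sample line is hypothetical, not taken from the patch):
```java
import org.apache.commons.lang3.StringUtils;

public class RecordPrefixSketch {
	public static void main(String[] args) {
		// hypothetical input line in the "<class name>|<json>" convention
		String line = "eu.dnetlib.dhp.schema.oaf.Publication|{\"id\":\"50|doi_________::abc\"}";
		String entityClass = StringUtils.substringBefore(line, "|"); // routing key
		String json = StringUtils.substringAfter(line, "|"); // payload; inner '|' chars are preserved
		System.out.println(entityClass + " -> " + json);
	}
}
```
Splitting on the first `|` only is what keeps OpenAIRE identifiers such as `50|doi_________::abc` intact inside the payload.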
View File

@ -6,14 +6,16 @@ import java.util.regex.Pattern;
public class FundRefCleaningRule {
public static String clean(final String fundrefId) {
public static final Pattern PATTERN = Pattern.compile("\\d+");
String s = fundrefId
public static String clean(final String fundRefId) {
String s = fundRefId
.toLowerCase()
.replaceAll("\\s", "");
Matcher m = Pattern.compile("\\d+").matcher(s);
if (m.matches()) {
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return m.group();
} else {
return "";

View File

@ -13,11 +13,7 @@ import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.github.sisyphsu.dateparser.DateParserUtils;
import com.google.common.collect.Lists;
import com.google.common.collect.Sets;
@ -39,6 +35,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
public static final String TITLE_FILTER_REGEX = String.format("(%s)|\\W|\\d", TITLE_TEST);
public static final int TITLE_FILTER_RESIDUAL_LENGTH = 5;
private static final String NAME_CLEANING_REGEX = "[\\r\\n\\t\\s]+";
public static <T extends Oaf> T cleanContext(T value, String contextId, String verifyParam) {
if (ModelSupport.isSubClass(value, Result.class)) {
@ -228,7 +225,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
}
public static <T extends Oaf> boolean filter(T value) {
if (Boolean.TRUE
if (!(value instanceof Relation) && (Boolean.TRUE
.equals(
Optional
.ofNullable(value)
@ -239,15 +236,16 @@ public class GraphCleaningFunctions extends CleaningFunctions {
d -> Optional
.ofNullable(d.getInvisible())
.orElse(true))
.orElse(true))
.orElse(true))) {
.orElse(false))
.orElse(true)))) {
return true;
}
if (value instanceof Datasource) {
// nothing to evaluate here
} else if (value instanceof Project) {
// nothing to evaluate here
final Project p = (Project) value;
return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
} else if (value instanceof Organization) {
// nothing to evaluate here
} else if (value instanceof Relation) {
@ -294,6 +292,13 @@ public class GraphCleaningFunctions extends CleaningFunctions {
} else if (value instanceof Result) {
Result r = (Result) value;
if (Objects.nonNull(r.getFulltext())
&& (ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
r.setFulltext(null);
}
if (Objects.nonNull(r.getDateofacceptance())) {
Optional<String> date = cleanDateField(r.getDateofacceptance());
if (date.isPresent()) {
@ -318,8 +323,18 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.collect(Collectors.toList()));
}
if (Objects.nonNull(r.getPublisher()) && StringUtils.isBlank(r.getPublisher().getValue())) {
r.setPublisher(null);
if (Objects.nonNull(r.getPublisher())) {
if (StringUtils.isBlank(r.getPublisher().getValue())) {
r.setPublisher(null);
} else {
r
.getPublisher()
.setValue(
r
.getPublisher()
.getValue()
.replaceAll(NAME_CLEANING_REGEX, " "));
}
}
if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) {
r
@ -486,6 +501,11 @@ public class GraphCleaningFunctions extends CleaningFunctions {
i.setDateofacceptance(null);
}
}
if (StringUtils.isNotBlank(i.getFulltext()) &&
(ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
i.setFulltext(null);
}
}
}
if (Objects.isNull(r.getBestaccessright())
@ -510,6 +530,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(Objects::nonNull)
.filter(a -> StringUtils.isNotBlank(a.getFullname()))
.filter(a -> StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")))
.map(GraphCleaningFunctions::cleanupAuthor)
.collect(Collectors.toList()));
boolean nullRank = r
@ -604,6 +625,35 @@ public class GraphCleaningFunctions extends CleaningFunctions {
return value;
}
private static Author cleanupAuthor(Author author) {
if (StringUtils.isNotBlank(author.getFullname())) {
author
.setFullname(
author
.getFullname()
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
}
if (StringUtils.isNotBlank(author.getName())) {
author
.setName(
author
.getName()
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
}
if (StringUtils.isNotBlank(author.getSurname())) {
author
.setSurname(
author
.getSurname()
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
}
return author;
}
private static Optional<String> cleanDateField(Field<String> dateofacceptance) {
return Optional
.ofNullable(dateofacceptance)

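A minimal sketch of the whitespace normalization that the publisher and author hunks above introduce via `NAME_CLEANING_REGEX` (the regex is taken from the patch; the sample value is hypothetical):
```java
public class NameCleaningSketch {
	private static final String NAME_CLEANING_REGEX = "[\\r\\n\\t\\s]+";

	public static void main(String[] args) {
		String raw = "Societa'\tEditrice \n Il Mulino";
		// any run of whitespace characters collapses into a single blank
		System.out.println(raw.replaceAll(NAME_CLEANING_REGEX, " "));
		// -> "Societa' Editrice Il Mulino"
	}
}
```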
View File

@ -6,13 +6,19 @@ import java.util.regex.Pattern;
public class GridCleaningRule {
public static final Pattern PATTERN = Pattern.compile("(?<grid>\\d{4,6}\\.[0-9a-z]{1,2})");
public static String clean(String grid) {
String s = grid
.replaceAll("\\s", "")
.toLowerCase();
Matcher m = Pattern.compile("\\d{4,6}\\.[0-9a-z]{1,2}").matcher(s);
return m.matches() ? "grid." + m.group() : "";
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return "grid." + m.group("grid");
}
return "";
}
}
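This and the following identifier cleaning rules (ISNI, PIC, PMC, PMID) all apply the same refactor: the `Pattern` is compiled once into a static constant, and `Matcher.find()` replaces `Matcher.matches()` so the identifier is extracted even when it is embedded in a longer string. A minimal sketch of the behavioural difference, reusing the GRID pattern from the patch (inputs are hypothetical):
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVsMatchesSketch {
	// compiled once, as in the patched rules
	private static final Pattern GRID = Pattern.compile("(?<grid>\\d{4,6}\\.[0-9a-z]{1,2})");

	public static void main(String[] args) {
		String s = "grid.493784.5";
		System.out.println(GRID.matcher(s).matches()); // false: the whole string must match
		Matcher m = GRID.matcher(s);
		if (m.find()) { // true: locates the embedded id
			System.out.println(m.group("grid")); // 493784.5
		}
	}
}
```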

View File

@ -7,10 +7,12 @@ import java.util.regex.Pattern;
// https://www.wikidata.org/wiki/Property:P213
public class ISNICleaningRule {
public static final Pattern PATTERN = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])");
public static String clean(final String isni) {
Matcher m = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])").matcher(isni);
if (m.matches()) {
Matcher m = PATTERN.matcher(isni);
if (m.find()) {
return String.join("", m.group(1), m.group(2), m.group(3), m.group(4));
} else {
return "";

View File

@ -6,10 +6,12 @@ import java.util.regex.Pattern;
public class PICCleaningRule {
public static final Pattern PATTERN = Pattern.compile("\\d{9}");
public static String clean(final String pic) {
Matcher m = Pattern.compile("\\d{9}").matcher(pic);
if (m.matches()) {
Matcher m = PATTERN.matcher(pic);
if (m.find()) {
return m.group();
} else {
return "";

View File

@ -1,13 +1,24 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PmcCleaningRule {
public static final Pattern PATTERN = Pattern.compile("PMC\\d{1,8}");
public static String clean(String pmc) {
String s = pmc
.replaceAll("\\s", "")
.toUpperCase();
return s.matches("^PMC\\d{1,8}$") ? s : "";
final Matcher m = PATTERN.matcher(s);
if (m.find()) {
return m.group();
}
return "";
}
}

View File

@ -1,16 +1,25 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// https://researchguides.stevens.edu/c.php?g=442331&p=6577176
public class PmidCleaningRule {
public static final Pattern PATTERN = Pattern.compile("0*(\\d{1,8})");
public static String clean(String pmid) {
String s = pmid
.toLowerCase()
.replaceAll("\\s", "")
.trim()
.replaceAll("^0+", "");
return s.matches("^\\d{1,8}$") ? s : "";
.replaceAll("\\s", "");
final Matcher m = PATTERN.matcher(s);
if (m.find()) {
return m.group(1);
}
return "";
}
}

View File

@ -7,12 +7,21 @@ import java.util.regex.Pattern;
// https://ror.readme.io/docs/ror-identifier-pattern
public class RorCleaningRule {
public static final String ROR_PREFIX = "https://ror.org/";
private static final Pattern PATTERN = Pattern.compile("(?<ror>0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2})");
public static String clean(String ror) {
String s = ror
.replaceAll("\\s", "")
.toLowerCase();
Matcher m = Pattern.compile("0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2}").matcher(s);
return m.matches() ? "https://ror.org/" + m.group() : "";
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return ROR_PREFIX + m.group("ror");
}
return "";
}
}

View File

@ -18,9 +18,9 @@
"paramRequired": true
},
{
"paramName": "c",
"paramLongName": "graphTableClassName",
"paramDescription": "the graph entity class name",
"paramName": "fi",
"paramLongName": "filterInvisible",
"paramDescription": "if true filters out invisible entities",
"paramRequired": true
}
]

View File

@ -50,10 +50,13 @@ object ScholixUtils extends Serializable {
}
}
def extractRelationDate(summary: ScholixResource): String = {
summary.getPublicationDate
def extractRelationDate(summary: ScholixSummary): String = {
if (summary.getDate == null || summary.getDate.isEmpty)
null
else {
summary.getDate.get(0)
}
}
def inverseRelationShip(rel: ScholixRelationship): ScholixRelationship = {
@ -141,7 +144,11 @@ object ScholixUtils extends Serializable {
s.setRelationship(inverseRelationShip(scholix.getRelationship))
s.setSource(scholix.getTarget)
s.setTarget(scholix.getSource)
updateId(s)
s.setIdentifier(
DHPUtils.md5(
s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
)
)
s
}
@ -180,21 +187,6 @@ object ScholixUtils extends Serializable {
} else List()
}
def updateId(scholix: Scholix): Scholix = {
scholix.setIdentifier(
generateIdentifier(
scholix.getSource.getDnetIdentifier,
scholix.getTarget.getDnetIdentifier,
scholix.getRelationship.getName
)
)
scholix
}
def generateIdentifier(sourceId: String, targetId: String, relation: String): String = {
DHPUtils.md5(s"$sourceId::$relation::$targetId")
}
def generateCompleteScholix(scholix: Scholix, target: ScholixSummary): Scholix = {
val s = new Scholix
s.setPublicationDate(scholix.getPublicationDate)
@ -203,7 +195,11 @@ object ScholixUtils extends Serializable {
s.setRelationship(scholix.getRelationship)
s.setSource(scholix.getSource)
s.setTarget(generateScholixResourceFromSummary(target))
updateId(s)
s.setIdentifier(
DHPUtils.md5(
s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
)
)
s
}
@ -215,7 +211,11 @@ object ScholixUtils extends Serializable {
s.setRelationship(scholix.getRelationship)
s.setSource(scholix.getSource)
s.setTarget(target)
updateId(s)
s.setIdentifier(
DHPUtils.md5(
s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
)
)
s
}
@ -232,7 +232,7 @@ object ScholixUtils extends Serializable {
if (summaryObject.getAuthor != null && !summaryObject.getAuthor.isEmpty) {
val l: List[ScholixEntityId] =
summaryObject.getAuthor.asScala.map(a => new ScholixEntityId(a, null)).take(100).toList
summaryObject.getAuthor.asScala.map(a => new ScholixEntityId(a, null)).toList
if (l.nonEmpty)
r.setCreator(l.asJava)
}
@ -241,7 +241,7 @@ object ScholixUtils extends Serializable {
r.setPublicationDate(summaryObject.getDate.get(0))
if (summaryObject.getPublisher != null && !summaryObject.getPublisher.isEmpty) {
val plist: List[ScholixEntityId] =
summaryObject.getPublisher.asScala.map(p => new ScholixEntityId(p, null)).take(100).toList
summaryObject.getPublisher.asScala.map(p => new ScholixEntityId(p, null)).toList
if (plist.nonEmpty)
r.setPublisher(plist.asJava)
@ -260,7 +260,6 @@ object ScholixUtils extends Serializable {
"complete"
)
)
.take(100)
.toList
if (l.nonEmpty)
@ -270,38 +269,38 @@ object ScholixUtils extends Serializable {
r
}
// def scholixFromSource(relation: Relation, source: ScholixResource): Scholix = {
// if (relation == null || source == null)
// return null
// val s = new Scholix
// var l: List[ScholixEntityId] = extractCollectedFrom(relation)
// if (l.isEmpty)
// l = extractCollectedFrom(source)
// if (l.isEmpty)
// return null
// s.setLinkprovider(l.asJava)
// var d = extractRelationDate(relation)
// if (d == null)
// d = source.getPublicationDate
//
// s.setPublicationDate(d)
//
// if (source.getPublisher != null && !source.getPublisher.isEmpty) {
// s.setPublisher(source.getPublisher)
// }
//
// val semanticRelation = relations.getOrElse(relation.getRelClass.toLowerCase, null)
// if (semanticRelation == null)
// return null
// s.setRelationship(
// new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
// )
// s.setSource(source)
//
// s
// }
def scholixFromSource(relation: Relation, source: ScholixResource): Scholix = {
if (relation == null || source == null)
return null
val s = new Scholix
var l: List[ScholixEntityId] = extractCollectedFrom(relation)
if (l.isEmpty)
l = extractCollectedFrom(source)
if (l.isEmpty)
return null
s.setLinkprovider(l.asJava)
var d = extractRelationDate(relation)
if (d == null)
d = source.getPublicationDate
s.setPublicationDate(d)
if (source.getPublisher != null && !source.getPublisher.isEmpty) {
s.setPublisher(source.getPublisher)
}
val semanticRelation = relations.getOrElse(relation.getRelClass.toLowerCase, null)
if (semanticRelation == null)
return null
s.setRelationship(
new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
)
s.setSource(source)
s
}
def scholixFromSource(relation: Relation, source: ScholixSummary): Scholix = {
if (relation == null || source == null)
return null
@ -323,8 +322,11 @@ object ScholixUtils extends Serializable {
s.setPublicationDate(d)
if (source.getPublisher != null && !source.getPublisher.isEmpty) {
source.getPublisher
val l: List[ScholixEntityId] = source.getPublisher.asScala.toList
val l: List[ScholixEntityId] = source.getPublisher.asScala
.map { p =>
new ScholixEntityId(p, null)
}(collection.breakOut)
if (l.nonEmpty)
s.setPublisher(l.asJava)
}
@ -335,7 +337,7 @@ object ScholixUtils extends Serializable {
s.setRelationship(
new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
)
s.setSource(source)
s.setSource(generateScholixResourceFromSummary(source))
s
}

View File

@ -15,7 +15,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;
public class MdStoreClientTest {
@Test
// @Test
public void testMongoCollection() throws IOException {
final MdstoreClient client = new MdstoreClient("mongodb://localhost:27017", "mdstore");

View File

@ -0,0 +1,18 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class GridCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("grid.493784.5", GridCleaningRule.clean("grid.493784.5"));
assertEquals("grid.493784.5x", GridCleaningRule.clean("grid.493784.5x"));
assertEquals("grid.493784.5x", GridCleaningRule.clean("493784.5x"));
assertEquals("", GridCleaningRule.clean("493x784.5x"));
}
}

View File

@ -0,0 +1,19 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class ISNICleaningRuleTest {
@Test
void testCleaning() {
assertEquals("0000000463436020", ISNICleaningRule.clean("0000 0004 6343 6020"));
assertEquals("0000000463436020", ISNICleaningRule.clean("0000000463436020"));
assertEquals("", ISNICleaningRule.clean("Q30256598"));
assertEquals("0000000493403529", ISNICleaningRule.clean("ISNI:0000000493403529"));
assertEquals("000000008614884X", ISNICleaningRule.clean("0000 0000 8614 884X"));
}
}

View File

@ -0,0 +1,19 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PICCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("887624982", PICCleaningRule.clean("887624982"));
assertEquals("", PICCleaningRule.clean("887 624982"));
assertEquals("887624982", PICCleaningRule.clean(" 887624982 "));
assertEquals("887624982", PICCleaningRule.clean(" 887624982x "));
assertEquals("887624982", PICCleaningRule.clean(" 88762498200 "));
}
}

View File

@ -0,0 +1,19 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PmcCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("PMC1234", PmcCleaningRule.clean("PMC1234"));
assertEquals("PMC1234", PmcCleaningRule.clean(" PMC1234"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC12345678"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC123456789"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC 12345678"));
}
}

View File

@ -0,0 +1,24 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PmidCleaningRuleTest {
@Test
void testCleaning() {
// leading zeros are removed
assertEquals("1234", PmidCleaningRule.clean("01234"));
// tolerant to spaces in the middle
assertEquals("1234567", PmidCleaningRule.clean("0123 4567"));
// stop parsing at first not numerical char
assertEquals("123", PmidCleaningRule.clean("0123x4567"));
// invalid id leading to empty result
assertEquals("", PmidCleaningRule.clean("abc"));
// valid id with zeroes in the number
assertEquals("20794075", PmidCleaningRule.clean("20794075"));
}
}

View File

@ -0,0 +1,17 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class RorCleaningRuleTest {
@Test
void testCleaning() {
assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("https://ror.org/05rpz9w55"));
assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("05rpz9w55"));
assertEquals("", RorCleaningRule.clean("05rpz9w_55"));
}
}

View File

@ -7,7 +7,7 @@
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp</artifactId>
<version>1.2.5-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
<relativePath>../pom.xml</relativePath>
</parent>
<groupId>eu.dnetlib.dhp</groupId>

View File

@ -16,8 +16,9 @@ public class NGramUtils extends AbstractPaceFunctions {
.loadFromClasspath("/eu/dnetlib/pace/config/stopwords_en.txt");
public static String cleanupForOrdering(String s) {
String result = NGRAMUTILS.filterStopWords(NGRAMUTILS.normalize(s), stopwords);
return result.isEmpty() ? result : result.replace(" ", "");
return (NGRAMUTILS.filterStopWords(NGRAMUTILS.normalize(s), stopwords) + StringUtils.repeat(" ", SIZE))
.substring(0, SIZE)
.replaceAll(" ", "");
}
}

View File

@ -2,7 +2,6 @@
package eu.dnetlib.pace.clustering;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

View File

@ -13,7 +13,19 @@ import eu.dnetlib.pace.config.Config;
public class SortedNgramPairs extends NgramPairs {
public SortedNgramPairs(Map<String, Integer> params) {
super(params, true);
super(params, false);
}
@Override
protected Collection<String> doApply(Config conf, String s) {
final List<String> tokens = Lists.newArrayList(Splitter.on(" ").omitEmptyStrings().trimResults().split(s));
Collections.sort(tokens);
return ngramPairs(
Lists.newArrayList(getNgrams(Joiner.on(" ").join(tokens), param("ngramLen"), param("max") * 2, 1, 2)),
param("max"));
}
}

View File

@ -49,18 +49,18 @@ public abstract class AbstractPaceFunctions {
protected static Set<String> ngramBlacklist = loadFromClasspath("/eu/dnetlib/pace/config/ngram_blacklist.txt");
// html regex for normalization
public final Pattern HTML_REGEX = Pattern.compile("<[^>]*>");
public static final Pattern HTML_REGEX = Pattern.compile("<[^>]*>");
private static final String alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ";
private static final String aliases_from = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎àáâäæãåāèéêëēėęəîïíīįìôöòóœøōõûüùúūßśšłžźżçćčñń";
private static final String aliases_to = "0123456789+-=()n0123456789+-=()aaaaaaaaeeeeeeeeiiiiiioooooooouuuuussslzzzcccnn";
// doi prefix for normalization
public final Pattern DOI_PREFIX = Pattern.compile("(https?:\\/\\/dx\\.doi\\.org\\/)|(doi:)");
public static final Pattern DOI_PREFIX = Pattern.compile("(https?:\\/\\/dx\\.doi\\.org\\/)|(doi:)");
private Pattern numberPattern = Pattern.compile("-?\\d+(\\.\\d+)?");
private static Pattern numberPattern = Pattern.compile("-?\\d+(\\.\\d+)?");
private Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
private static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
protected String concat(final List<String> l) {
return Joiner.on(" ").skipNulls().join(l);
@ -130,10 +130,12 @@ public abstract class AbstractPaceFunctions {
protected static String fixAliases(final String s) {
final StringBuilder sb = new StringBuilder();
for (final char ch : Lists.charactersOf(s)) {
s.chars().forEach(ch -> {
final int i = StringUtils.indexOf(aliases_from, ch);
sb.append(i >= 0 ? aliases_to.charAt(i) : ch);
}
sb.append(i >= 0 ? aliases_to.charAt(i) : (char) ch);
});
return sb.toString();
}
@ -148,9 +150,10 @@ public abstract class AbstractPaceFunctions {
protected String removeSymbols(final String s) {
final StringBuilder sb = new StringBuilder();
for (final char ch : Lists.charactersOf(s)) {
sb.append(StringUtils.contains(alpha, ch) ? ch : " ");
}
s.chars().forEach(ch -> {
sb.append(StringUtils.contains(alpha, ch) ? (char) ch : ' ');
});
return sb.toString().replaceAll("\\s+", " ");
}
@ -234,7 +237,8 @@ public abstract class AbstractPaceFunctions {
final Set<String> h = Sets.newHashSet();
try {
for (final String s : IOUtils.readLines(NGramUtils.class.getResourceAsStream(classpath))) {
for (final String s : IOUtils
.readLines(NGramUtils.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
h.add(fixAliases(transliterator.transliterate(s))); // transliteration of the stopwords
}
} catch (final Throwable e) {
@ -249,7 +253,8 @@ public abstract class AbstractPaceFunctions {
final Map<String, String> m = new HashMap<>();
try {
for (final String s : IOUtils.readLines(AbstractPaceFunctions.class.getResourceAsStream(classpath))) {
for (final String s : IOUtils
.readLines(AbstractPaceFunctions.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
// string is like this: code;word1;word2;word3
String[] line = s.split(";");
String value = line[0];
@ -342,7 +347,7 @@ public abstract class AbstractPaceFunctions {
public static <T> String readFromClasspath(final String filename, final Class<T> clazz) {
final StringWriter sw = new StringWriter();
try {
IOUtils.copy(clazz.getResourceAsStream(filename), sw);
IOUtils.copy(clazz.getResourceAsStream(filename), sw, StandardCharsets.UTF_8);
return sw.toString();
} catch (final IOException e) {
throw new RuntimeException("cannot load resource from classpath: " + filename);

View File

@ -4,7 +4,6 @@ package eu.dnetlib.pace.config;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import eu.dnetlib.pace.model.ClusteringDef;
import eu.dnetlib.pace.model.FieldDef;

View File

@ -13,7 +13,8 @@ import eu.dnetlib.pace.clustering.NGramUtils;
public class RowDataOrderingComparator implements Comparator<Row> {
/** The comparator field. */
private int comparatorField;
private final int comparatorField;
private final int identityFieldPosition;
/**
* Instantiates a new map document comparator.
@ -21,8 +22,9 @@ public class RowDataOrderingComparator implements Comparator<Row> {
* @param comparatorField
* the comparator field
*/
public RowDataOrderingComparator(final int comparatorField) {
public RowDataOrderingComparator(final int comparatorField, int identityFieldPosition) {
this.comparatorField = comparatorField;
this.identityFieldPosition = identityFieldPosition;
}
/*
@ -51,7 +53,10 @@ public class RowDataOrderingComparator implements Comparator<Row> {
int res = to1.compareTo(to2);
if (res == 0) {
return o1.compareTo(o2);
res = o1.compareTo(o2);
if (res == 0) {
return d1.getString(identityFieldPosition).compareTo(d2.getString(identityFieldPosition));
}
}
return res;

View File

@ -1,644 +0,0 @@
package eu.dnetlib.pace.model
import com.jayway.jsonpath.{Configuration, JsonPath, Option}
import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.tree.support.TreeProcessor
import eu.dnetlib.pace.util.MapDocumentUtil.truncateValue
import eu.dnetlib.pace.util.{BlockProcessor, MapDocumentUtil, SparkReporter}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.{Column, Dataset, Encoder, Encoders, Row, functions}
import org.apache.spark.sql.catalyst.expressions.{GenericRowWithSchema, Literal}
import org.apache.spark.sql.expressions.{Aggregator, MutableAggregationBuffer, UserDefinedAggregateFunction, UserDefinedFunction, Window}
import org.apache.spark.sql.types.{ArrayType, DataType, DataTypes, Metadata, StructField, StructType}
import java.util
import java.util.function.Predicate
import java.util.regex.Pattern
import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.spark.sql.functions.{col, lit, udf}
import java.util.Collections
import java.util.stream.Collectors
case class SparkDedupConfig(conf: DedupConfig, numPartitions: Int) extends Serializable {
private val URL_REGEX: Pattern = Pattern.compile("^\\s*(http|https|ftp)\\://.*")
private val CONCAT_REGEX: Pattern = Pattern.compile("\\|\\|\\|")
private val urlFilter = (s: String) => URL_REGEX.matcher(s).matches
val modelExtractor: (Dataset[String] => Dataset[Row]) = df => {
df.withColumn("mapDocument", rowFromJsonUDF.apply(df.col(df.columns(0))))
.withColumn("identifier", new Column("mapDocument.identifier"))
//.repartition(new Column("identifier"))
.dropDuplicates("identifier")
.select("mapDocument.*")
df.map(r => rowFromJson(r))(RowEncoder(rowDataType))
.dropDuplicates("identifier")
}
val generateClusters: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
val df_with_keys = conf
.clusterings()
.asScala
.foldLeft(df_with_filters)((res, cd) => {
res.withColumn(
cd.getName + "_clustered",
functions.explode_outer(
clusterValuesUDF(cd).apply(
functions.array(
cd.getFields.asScala
.map(f => res.col(if (conf.blacklists.containsKey(f)) f.concat("_filtered") else f)): _*
)
)
)
)
})
// filter blacklisted values
// create one column per cluster prefix
// GROUPING sets approach
val tempTable = this.getClass.getSimpleName + "__generateClusters";
df_with_keys.createOrReplaceTempView(this.getClass.getSimpleName + "__generateClusters")
val keys = conf.clusterings().asScala.map(_.getName + "_clustered").mkString(",")
val fields = rowDataType.fieldNames.mkString(",")
// Using SQL because GROUPING SETS are not available through Scala/Java DSL
df_with_keys.sqlContext.sql(
("SELECT coalesce(" + keys + ") as key, sort_array(collect_sort_slice(" + fields + ")) as block FROM " + tempTable + " WHERE coalesce(" + keys + ") IS NOT NULL GROUP BY GROUPING SETS (" + keys + ") HAVING size(block) > 1")
)
}
val generateClustersWithDFAPI: (Dataset[Row] => Dataset[Row]) = df => {
System.out.println(conf.getWf.getEntityType + "::" +conf.getWf.getSubEntityType)
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
var relBlocks: Dataset[Row] = null
import scala.collection.JavaConversions._
for (cd <- conf.clusterings()) {
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
val tmp: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
/*.select((Seq(rowDataType.fieldNames: _*) ++ Seq("key")).map(col): _*)
.groupByKey(r => r.getAs[String]("key"))(Encoders.STRING)
.agg(collectSortSliceAggregator.toColumn)
.toDF("key", "block")
.select(col("block.block").as("block"))*/
System.out.println(cd.getName)
val ds = tmp.groupBy("key")
// .agg(functions.sort_array(collectSortSliceUDAF(rowDataType.fieldNames.map(col): _*)).as("block"))
.agg(functions.collect_set(functions.struct(rowDataType.fieldNames.map(col): _*)).as("block"))
//.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
//df_with_filters.printSchema()
//ds.printSchema()
if (relBlocks == null) relBlocks = ds
else relBlocks = relBlocks.union(ds)
}
// System.out.println()
relBlocks
}
val generateClustersWithWindows: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
var relBlocks: Dataset[Row] = null
import scala.collection.JavaConversions._
for (cd <- conf.clusterings()) {
System.out.println(conf.getWf.getEntityType + "::" + conf.getWf.getSubEntityType+ ": " + cd.getName + " " + cd.toString)
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
// Add 'key' column with the value generated by the given clustering definition
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(conf.getWf.getOrderField))))
// filter out rows with position exceeding the maxqueuesize parameter
.filter(col("position").leq(conf.getWf.getQueueMaxSize))
.groupBy("key")
.agg(functions.collect_set(functions.struct(rowDataType.fieldNames.map(col): _*)).as("block"))
.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
if (relBlocks == null) relBlocks = ds
else relBlocks = relBlocks.union(ds)
}
relBlocks
}
val generateClustersWithDFAPIMerged: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
import scala.collection.JavaConversions._
val keys = conf.clusterings().foldLeft(null : Column)((res, cd) => {
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
if (res != null)
functions.array_union(res, clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*)))
else
clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))
})
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(keys))
.select((Seq(rowDataType.fieldNames: _*) ++ Seq("key")).map(col): _*)
.groupByKey(r => r.getAs[String]("key"))(Encoders.STRING)
.agg(collectSortSliceAggregator.toColumn)
.toDF("key", "block")
.select(col("block.block").as("block"))
/*.groupBy("key")
.agg(collectSortSliceUDAF(rowDataType.fieldNames.map(col): _*).as("block"))*/
.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
ds
}
val generateClustersWithRDDReduction: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
var relBlocks: Dataset[Row] = null
import scala.collection.JavaConversions._
for (cd <- conf.clusterings()) {
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
val ds: Dataset[Row] = df.sparkSession.createDataFrame(df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
.select(col("key"), functions.array(functions.struct(rowDataType.fieldNames.map(col): _*).as("value")))
.rdd.keyBy(_.getString(0))
.reduceByKey((a, b) => {
val b1 = a.getSeq[Row](1)
val b2 = b.getSeq[Row](1)
if (b1.size + b2.size > conf.getWf.getQueueMaxSize)
Row(a.get(0), b1.union(b2).sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize))
else
Row(a.get(0), b1.union(b2))
})
.map(_._2)
.filter(k => k.getSeq(1).size > 1),
new StructType().add(StructField("key", DataTypes.StringType)).add(StructField("block", ArrayType(rowDataType)))
)
if (relBlocks == null) relBlocks = ds
else relBlocks = relBlocks.union(ds)
}
relBlocks
}
val printAnalytics: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
var relBlocks: Dataset[Row] = null
import scala.collection.JavaConversions._
for (cd <- conf.clusterings()) {
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
// Add 'key' column with the value generated by the given clustering definition
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(conf.getWf.getOrderField)))
// filter out rows with position exceeding the maxqueuesize parameter
.filter(col("position").lt(conf.getWf.getQueueMaxSize))
// inner join to compute all combination of rows to compare
// note the condition on position to obtain 'windowing': given a row this is compared at most with the next
// SlidingWindowSize rows following the sort order
val dsWithMatch = ds.as("l").join(ds.as("r"),
col("l.key").equalTo(col("r.key")),
"inner"
)
.filter((col("l.position").lt(col("r.position")))
&& (col("r.position").lt(col("l.position").plus(lit(conf.getWf.getSlidingWindowSize)))))
// Add match column with the result of comparison
// dsWithMatch.show(false)
if (relBlocks == null)
relBlocks = dsWithMatch
else
relBlocks = relBlocks.union(dsWithMatch)
}
System.out.println(conf.getWf.getEntityType + "::" + conf.getWf.getSubEntityType)
System.out.println("Total number of comparations: " + relBlocks.count())
df
}
val generateAndProcessClustersWithJoins: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
var relBlocks: Dataset[Row] = null
import scala.collection.JavaConversions._
for (cd <- conf.clusterings()) {
val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
for (fName <- cd.getFields()) {
if (conf.blacklists.containsKey(fName))
columns.add(new Column(fName + "_filtered"))
else
columns.add(new Column(fName))
}
// Add 'key' column with the value generated by the given clustering definition
val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
// Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(conf.getWf.getOrderField)))
// filter out rows with position exceeding the maxqueuesize parameter
.filter(col("position").lt(conf.getWf.getQueueMaxSize))
// inner join to compute all combination of rows to compare
// note the condition on position to obtain 'windowing': given a row this is compared at most with the next
// SlidingWindowSize rows following the sort order
val dsWithMatch = ds.as("l").join(ds.as("r"),
col("l.key").equalTo(col("r.key")),
"inner"
)
.filter((col("l.position").lt(col("r.position")))
&& (col("r.position").lt(col("l.position").plus(lit(conf.getWf.getSlidingWindowSize)))))
// Add match column with the result of comparison
.withColumn("match", udf[Boolean, Row, Row]((a, b) => {
val treeProcessor = new TreeProcessor(conf)
treeProcessor.compare(a, b)
}).apply(functions.struct(rowDataType.fieldNames.map(s => col("l.".concat(s))): _*), functions.struct(rowDataType.fieldNames.map(s => col("r.".concat(s))): _*)))
.filter(col("match").equalTo(true))
.select(col("l.identifier").as("from"), col("r.identifier").as("to"))
// dsWithMatch.show(false)
if (relBlocks == null)
relBlocks = dsWithMatch
else
relBlocks = relBlocks.union(dsWithMatch)
}
val res = relBlocks
//.select(col("l.identifier").as("from"), col("r.identifier").as("to"))
//.repartition()
.distinct()
// res.show(false)
res.select(functions.struct("from", "to"))
}
val processClusters: (Dataset[Row] => Dataset[Row]) = df => {
val entity = conf.getWf.getEntityType
df.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
.withColumn("relations", processBlock(df.sqlContext.sparkContext).apply(new Column("block")))
.select(functions.explode(new Column("relations")).as("relation"))
//.repartition(new Column("relation"))
.dropDuplicates("relation")
}
val rowDataType: StructType = {
// val unordered = conf.getPace.getModel.asScala.foldLeft(
// new StructType()
// )((resType, fdef) => {
// resType.add(fdef.getType match {
// case Type.List | Type.JSON =>
// StructField(fdef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
// case Type.DoubleArray =>
// StructField(fdef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
// case _ =>
// StructField(fdef.getName, DataTypes.StringType, true, Metadata.empty)
// })
// })
//
// conf.getPace.getModel.asScala.filterNot(_.getName.equals(conf.getWf.getOrderField)).foldLeft(
// new StructType()
// .add(unordered(conf.getWf.getOrderField))
// .add(StructField("identifier", DataTypes.StringType, false, Metadata.empty))
// )((resType, fdef) => resType.add(unordered(fdef.getName)))
val identifier = new FieldDef()
identifier.setName("identifier")
identifier.setType(Type.String)
(conf.getPace.getModel.asScala ++ Seq(identifier)).sortBy(_.getName)
.foldLeft(
new StructType()
)((resType, fdef) => {
resType.add(fdef.getType match {
case Type.List | Type.JSON =>
StructField(fdef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
case Type.DoubleArray =>
StructField(fdef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
case _ =>
StructField(fdef.getName, DataTypes.StringType, true, Metadata.empty)
})
})
}
val identityFieldPosition: Int = rowDataType.fieldIndex("identifier")
val orderingFieldPosition: Int = rowDataType.fieldIndex(conf.getWf.getOrderField)
def rowFromJson(json: String) : Row = {
val documentContext =
JsonPath.using(Configuration.defaultConfiguration.addOptions(Option.SUPPRESS_EXCEPTIONS)).parse(json)
val values = new Array[Any](rowDataType.size)
values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
rowDataType.fieldNames.zipWithIndex.foldLeft(values) {
case ((res, (fname, index))) => {
val fdef = conf.getPace.getModelMap.get(fname)
if (fdef != null) {
res(index) = fdef.getType match {
case Type.String | Type.Int =>
MapDocumentUtil.truncateValue(
MapDocumentUtil.getJPathString(fdef.getPath, documentContext),
fdef.getLength
)
case Type.URL =>
var uv = MapDocumentUtil.getJPathString(fdef.getPath, documentContext)
if (!urlFilter(uv)) uv = ""
uv
case Type.List | Type.JSON =>
MapDocumentUtil.truncateList(
MapDocumentUtil.getJPathList(fdef.getPath, documentContext, fdef.getType),
fdef.getSize
).toArray
case Type.StringConcat =>
val jpaths = CONCAT_REGEX.split(fdef.getPath)
truncateValue(
jpaths
.map(jpath => MapDocumentUtil.getJPathString(jpath, documentContext))
.mkString(" "),
fdef.getLength
)
case Type.DoubleArray =>
MapDocumentUtil.getJPathArray(fdef.getPath, json)
}
}
res
}
}
new GenericRowWithSchema(values, rowDataType)
}
val rowFromJsonUDF = udf(rowFromJson(_), rowDataType)
def filterColumnUDF(fdef: FieldDef): UserDefinedFunction = {
val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
if (blacklist == null) {
throw new IllegalArgumentException("Column: " + fdef.getName + " does not have any filter")
} else {
fdef.getType match {
case Type.List | Type.JSON =>
udf[Array[String], Array[String]](values => {
values.filter((v: String) => !blacklist.test(v))
})
case _ =>
udf[String, String](v => {
if (blacklist.test(v)) ""
else v
})
}
}
}
def clusterValuesUDF(cd: ClusteringDef) = {
udf[mutable.WrappedArray[String], mutable.WrappedArray[Object]](values => {
values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala).map(cd.getName.concat(_))
})
}
def processBlock(implicit sc: SparkContext) = {
val accumulators = SparkReporter.constructAccumulator(conf, sc)
udf[Array[Tuple2[String, String]], mutable.WrappedArray[Row]](block => {
val reporter = new SparkReporter(accumulators)
val mapDocuments = block.asJava.stream
.sorted(new RowDataOrderingComparator(orderingFieldPosition))
.limit(conf.getWf.getQueueMaxSize)
.collect(Collectors.toList[Row]())
new BlockProcessor(conf, identityFieldPosition, orderingFieldPosition).processSortedRows(mapDocuments, reporter)
reporter.getRelations.asScala.toArray
}).asNondeterministic()
}
val collectSortSliceAggregator : Aggregator[Row,Seq[Row], Row] = new Aggregator[Row, Seq[Row], Row] () {
override def zero: Seq[Row] = Seq[Row]()
override def reduce(buffer: Seq[Row], input: Row): Seq[Row] = {
merge(buffer, Seq(input))
}
override def merge(buffer: Seq[Row], toMerge: Seq[Row]): Seq[Row] = {
val newBlock = buffer ++ toMerge
if (newBlock.size > conf.getWf.getQueueMaxSize)
newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
else
newBlock
}
override def finish(reduction: Seq[Row]): Row = {
Row(reduction.toArray)
}
override def bufferEncoder: Encoder[Seq[Row]] = Encoders.kryo[Seq[Row]]
override def outputEncoder: Encoder[Row] = RowEncoder.apply(new StructType().add("block", DataTypes.createArrayType(rowDataType), nullable = true))
}
val collectSortSliceUDAF : UserDefinedAggregateFunction = new UserDefinedAggregateFunction {
override def inputSchema: StructType = rowDataType
override def bufferSchema: StructType = {
new StructType().add("block", DataTypes.createArrayType(rowDataType), nullable = true)
}
override def dataType: DataType = DataTypes.createArrayType(rowDataType)
override def deterministic: Boolean = true
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = Seq[Row]()
}
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val newBlock = buffer.getSeq[Row](0) ++ Seq(input)
if (newBlock.size > conf.getWf.getQueueMaxSize)
buffer(0) = newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
else
buffer(0) = newBlock
}
override def merge(buffer: MutableAggregationBuffer, row: Row): Unit = {
val newBlock = buffer.getSeq[Row](0) ++ row.getSeq[Row](0)
if (newBlock.size > conf.getWf.getQueueMaxSize)
buffer(0) = newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
else
buffer(0) = newBlock
}
override def evaluate(buffer: Row): Any = {
buffer.getSeq[Row](0)
}
}
}

View File

@ -0,0 +1,131 @@
package eu.dnetlib.pace.model
import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.util.{BlockProcessor, SparkReporter}
import org.apache.spark.SparkContext
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.{col, lit, udf}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, Dataset, Row, functions}
import java.util.function.Predicate
import java.util.stream.Collectors
import scala.collection.JavaConversions._
import scala.collection.JavaConverters._
import scala.collection.mutable
case class SparkDeduper(conf: DedupConfig) extends Serializable {
val model: SparkModel = SparkModel(conf)
val dedup: (Dataset[Row] => Dataset[Row]) = df => {
df.transform(filterAndCleanup)
.transform(generateClustersWithCollect)
.transform(processBlocks)
}
val filterAndCleanup: (Dataset[Row] => Dataset[Row]) = df => {
val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
if (conf.blacklists.containsKey(fdef.getName)) {
res.withColumn(
fdef.getName + "_filtered",
filterColumnUDF(fdef).apply(new Column(fdef.getName))
)
} else {
res
}
})
df_with_filters
}
def filterColumnUDF(fdef: FieldDef): UserDefinedFunction = {
val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
if (blacklist == null) {
throw new IllegalArgumentException("Column: " + fdef.getName + " does not have any filter")
} else {
fdef.getType match {
case Type.List | Type.JSON =>
udf[Array[String], Array[String]](values => {
values.filter((v: String) => !blacklist.test(v))
})
case _ =>
udf[String, String](v => {
if (blacklist.test(v)) ""
else v
})
}
}
}
val generateClustersWithCollect: (Dataset[Row] => Dataset[Row]) = df_with_filters => {
var df_with_clustering_keys: Dataset[Row] = null
for ((cd, idx) <- conf.clusterings().zipWithIndex) {
val inputColumns = cd.getFields().foldLeft(Seq[Column]())((acc, fName) => {
val column = if (conf.blacklists.containsKey(fName))
Seq(col(fName + "_filtered"))
else
Seq(col(fName))
acc ++ column
})
// Add 'key' column with the value generated by the given clustering definition
val ds: Dataset[Row] = df_with_filters
.withColumn("clustering", lit(cd.getName + "::" + idx))
.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(inputColumns: _*))))
// Add a 'position' column holding the row's rank within the set of rows sharing the same key, ordered by the sorting value
.withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName))))
if (df_with_clustering_keys == null)
df_with_clustering_keys = ds
else
df_with_clustering_keys = df_with_clustering_keys.union(ds)
}
//TODO: analytics
val df_with_blocks = df_with_clustering_keys
// filter out rows with position exceeding the maxqueuesize parameter
.filter(col("position").leq(conf.getWf.getQueueMaxSize))
.groupBy("clustering", "key")
.agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
.filter(functions.size(new Column("block")).gt(1))
df_with_blocks
}
def clusterValuesUDF(cd: ClusteringDef) = {
udf[mutable.WrappedArray[String], mutable.WrappedArray[Any]](values => {
values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala)
})
}
val processBlocks: (Dataset[Row] => Dataset[Row]) = df => {
df.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
.withColumn("relations", processBlock(df.sqlContext.sparkContext).apply(new Column("block")))
.select(functions.explode(new Column("relations")).as("relation"))
}
def processBlock(implicit sc: SparkContext) = {
val accumulators = SparkReporter.constructAccumulator(conf, sc)
udf[Array[(String, String)], mutable.WrappedArray[Row]](block => {
val reporter = new SparkReporter(accumulators)
val mapDocuments = block.asJava.stream()
.sorted(new RowDataOrderingComparator(model.orderingFieldPosition, model.identityFieldPosition))
.limit(conf.getWf.getQueueMaxSize)
.collect(Collectors.toList[Row]())
new BlockProcessor(conf, model.identityFieldPosition, model.orderingFieldPosition).processSortedRows(mapDocuments, reporter)
reporter.getRelations.asScala.toArray
}).asNondeterministic()
}
}

View File

@ -0,0 +1,108 @@
package eu.dnetlib.pace.model
import com.jayway.jsonpath.{Configuration, JsonPath}
import eu.dnetlib.pace.config.{DedupConfig, Type}
import eu.dnetlib.pace.util.MapDocumentUtil
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
import org.apache.spark.sql.{Dataset, Row}
import java.util.regex.Pattern
import scala.collection.JavaConverters._
case class SparkModel(conf: DedupConfig) {
private val URL_REGEX: Pattern = Pattern.compile("^\\s*(http|https|ftp)\\://.*")
private val CONCAT_REGEX: Pattern = Pattern.compile("\\|\\|\\|")
val identifierFieldName = "identifier"
val orderingFieldName = if (!conf.getWf.getOrderField.isEmpty) conf.getWf.getOrderField else identifierFieldName
val schema: StructType = {
// create an implicit identifier field
val identifier = new FieldDef()
identifier.setName(identifierFieldName)
identifier.setType(Type.String)
// Construct a Spark StructType representing the schema of the model
(Seq(identifier) ++ conf.getPace.getModel.asScala)
.foldLeft(
new StructType()
)((resType, fieldDef) => {
resType.add(fieldDef.getType match {
case Type.List | Type.JSON =>
StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
case Type.DoubleArray =>
StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
case _ =>
StructField(fieldDef.getName, DataTypes.StringType, true, Metadata.empty)
})
})
}
val identityFieldPosition: Int = schema.fieldIndex(identifierFieldName)
val orderingFieldPosition: Int = schema.fieldIndex(orderingFieldName)
val parseJsonDataset: (Dataset[String] => Dataset[Row]) = df => {
df.map(r => rowFromJson(r))(RowEncoder(schema))
}
def rowFromJson(json: String): Row = {
val documentContext =
JsonPath.using(Configuration.defaultConfiguration.addOptions(com.jayway.jsonpath.Option.SUPPRESS_EXCEPTIONS)).parse(json)
val values = new Array[Any](schema.size)
values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
schema.fieldNames.zipWithIndex.foldLeft(values) {
case ((res, (fname, index))) => {
val fdef = conf.getPace.getModelMap.get(fname)
if (fdef != null) {
res(index) = fdef.getType match {
case Type.String | Type.Int =>
MapDocumentUtil.truncateValue(
MapDocumentUtil.getJPathString(fdef.getPath, documentContext),
fdef.getLength
)
case Type.URL =>
var uv = MapDocumentUtil.getJPathString(fdef.getPath, documentContext)
if (!URL_REGEX.matcher(uv).matches)
uv = ""
uv
case Type.List | Type.JSON =>
MapDocumentUtil.truncateList(
MapDocumentUtil.getJPathList(fdef.getPath, documentContext, fdef.getType),
fdef.getSize
).asScala
case Type.StringConcat =>
val jpaths = CONCAT_REGEX.split(fdef.getPath)
MapDocumentUtil.truncateValue(
jpaths
.map(jpath => MapDocumentUtil.getJPathString(jpath, documentContext))
.mkString(" "),
fdef.getLength
)
case Type.DoubleArray =>
MapDocumentUtil.getJPathArray(fdef.getPath, json)
}
}
res
}
}
new GenericRowWithSchema(values, schema)
}
}

View File

@ -1,11 +1,8 @@
package eu.dnetlib.pace.tree;
import java.util.List;
import java.util.Map;
import com.google.common.base.Joiner;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;

View File

@ -5,7 +5,6 @@ import java.util.Map;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.tree.support.AbstractComparator;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;

View File

@ -4,7 +4,6 @@ package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractComparator;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;

View File

@ -4,7 +4,6 @@ package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractComparator;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;

View File

@ -4,8 +4,6 @@ package eu.dnetlib.pace.tree;
import java.util.List;
import java.util.Map;
import com.google.common.collect.Lists;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractListComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;

View File

@ -4,7 +4,6 @@ package eu.dnetlib.pace.tree;
import java.util.Map;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.tree.support.AbstractComparator;
import eu.dnetlib.pace.tree.support.AbstractStringComparator;
import eu.dnetlib.pace.tree.support.ComparatorClass;
@ -44,22 +43,25 @@ public class StringContainsMatch extends AbstractStringComparator {
STRING = STRING.toLowerCase();
}
switch (AGGREGATOR) {
case "AND":
if (ca.contains(STRING) && cb.contains(STRING))
return 1.0;
break;
case "OR":
if (ca.contains(STRING) || cb.contains(STRING))
return 1.0;
break;
case "XOR":
if (ca.contains(STRING) ^ cb.contains(STRING))
return 1.0;
break;
default:
return 0.0;
if (AGGREGATOR != null) {
switch (AGGREGATOR) {
case "AND":
if (ca.contains(STRING) && cb.contains(STRING))
return 1.0;
break;
case "OR":
if (ca.contains(STRING) || cb.contains(STRING))
return 1.0;
break;
case "XOR":
if (ca.contains(STRING) ^ cb.contains(STRING))
return 1.0;
break;
default:
return 0.0;
}
}
return 0.0;
}
}

View File

@ -1,7 +1,6 @@
package eu.dnetlib.pace.tree.support;
import java.util.Collections;
import java.util.List;
import java.util.Map;

View File

@ -4,11 +4,9 @@ package eu.dnetlib.pace.tree.support;
import java.util.List;
import java.util.Map;
import com.google.common.collect.Lists;
import com.wcohen.ss.AbstractStringDistance;
import eu.dnetlib.pace.config.Config;
import eu.dnetlib.pace.config.Type;
abstract public class AbstractListComparator extends AbstractComparator<List<String>> {
protected AbstractListComparator(Map<String, String> params) {

View File

@ -1,7 +1,6 @@
package eu.dnetlib.pace.tree.support;
import java.util.AbstractList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

View File

@ -2,8 +2,6 @@
package eu.dnetlib.pace.util;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import org.apache.commons.lang3.StringUtils;
@ -13,7 +11,6 @@ import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StringType;
import org.apache.spark.sql.types.StructType;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.config.WfConfig;

View File

@ -18,6 +18,7 @@ package eu.dnetlib.pace.util;
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* Diff Match and Patch
* Copyright 2018 The diff-match-patch Authors.

View File

@ -2,20 +2,20 @@
package eu.dnetlib.pace.util;
import java.math.BigDecimal;
import java.util.*;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.Option;
import com.jayway.jsonpath.spi.cache.Cache;
import com.jayway.jsonpath.spi.cache.CacheProvider;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.config.Type;
import eu.dnetlib.pace.model.*;
import net.minidev.json.JSONArray;
public class MapDocumentUtil {
@ -23,47 +23,20 @@ public class MapDocumentUtil {
public static final String URL_REGEX = "^(http|https|ftp)\\://.*";
public static Predicate<String> urlFilter = s -> s.trim().matches(URL_REGEX);
public static List<String> getJPathList(String path, String json, Type type) {
if (type == Type.List)
return JsonPath
.using(
Configuration
.defaultConfiguration()
.addOptions(Option.ALWAYS_RETURN_LIST, Option.SUPPRESS_EXCEPTIONS))
.parse(json)
.read(path);
Object jresult;
List<String> result = new ArrayList<>();
try {
jresult = JsonPath.read(json, path);
} catch (Throwable e) {
return result;
}
if (jresult instanceof JSONArray) {
((JSONArray) jresult).forEach(it -> {
try {
result.add(new ObjectMapper().writeValueAsString(it));
} catch (JsonProcessingException e) {
}
});
return result;
}
if (jresult instanceof LinkedHashMap) {
try {
result.add(new ObjectMapper().writeValueAsString(jresult));
} catch (JsonProcessingException e) {
static {
CacheProvider.setCache(new Cache() {
private final ConcurrentHashMap<String, JsonPath> jsonPathCache = new ConcurrentHashMap<>();
@Override
public JsonPath get(String key) {
return jsonPathCache.get(key);
}
return result;
}
if (jresult instanceof String) {
result.add((String) jresult);
}
return result;
@Override
public void put(String key, JsonPath value) {
jsonPathCache.put(key, value);
}
});
}
public static String getJPathString(final String jsonPath, final String json) {
@ -144,6 +117,11 @@ public class MapDocumentUtil {
return result;
}
if (type == Type.List && jresult instanceof List) {
((List<?>) jresult).forEach(x -> result.add(x.toString()));
return result;
}
if (jresult instanceof JSONArray) {
((JSONArray) jresult).forEach(it -> {
try {

View File

@ -10,7 +10,6 @@ import org.apache.spark.SparkContext;
import org.apache.spark.util.LongAccumulator;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.util.Reporter;
import scala.Serializable;
import scala.Tuple2;

View File

@ -2,14 +2,12 @@
package eu.dnetlib.pace.clustering;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.junit.jupiter.api.*;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import com.google.common.collect.Sets;
import eu.dnetlib.pace.AbstractPaceTest;
import eu.dnetlib.pace.common.AbstractPaceFunctions;

View File

@ -6,6 +6,7 @@ import static org.junit.jupiter.api.Assertions.assertEquals;
import java.util.*;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.TestInstance;
@ -22,14 +23,18 @@ public class ComparatorTest extends AbstractPaceTest {
@BeforeAll
public void setup() {
conf = DedupConfig
.load(readFromClasspath("/eu/dnetlib/pace/config/organization.current.conf.json", ComparatorTest.class));
}
@BeforeEach
public void beforeEachTest() {
params = new HashMap<>();
params.put("weight", "1.0");
params.put("surname_th", "0.99");
params.put("name_th", "0.95");
params.put("jpath_value", "$.value");
params.put("jpath_classid", "$.qualifier.classid");
conf = DedupConfig
.load(readFromClasspath("/eu/dnetlib/pace/config/organization.current.conf.json", ComparatorTest.class));
}
@Test
@ -63,7 +68,10 @@ public class ComparatorTest extends AbstractPaceTest {
.distance(
"Politechniki Warszawskiej (Warsaw University of Technology)", "Warsaw University of Technology",
conf));
assertEquals(-1.0, cityMatch.distance("Allen (United States)", "United States Military Academy", conf));
// failing because 'Allen' is a transliterated Greek stopword
// assertEquals(-1.0, cityMatch.distance("Allen (United States)", "United States Military Academy", conf));
assertEquals(-1.0, cityMatch.distance("Washington (United States)", "United States Military Academy", conf));
}
@Test
@ -78,7 +86,7 @@ public class ComparatorTest extends AbstractPaceTest {
assertEquals(1.0, keywordMatch.distance("Polytechnic University of Turin", "POLITECNICO DI TORINO", conf));
assertEquals(1.0, keywordMatch.distance("Istanbul Commerce University", "İstanbul Ticarət Universiteti", conf));
assertEquals(1.0, keywordMatch.distance("Franklin College", "Concordia College", conf));
assertEquals(0.5, keywordMatch.distance("University of Georgia", "Georgia State University", conf));
assertEquals(2.0 / 3.0, keywordMatch.distance("University of Georgia", "Georgia State University", conf));
assertEquals(0.5, keywordMatch.distance("University College London", "University of London", conf));
assertEquals(0.5, keywordMatch.distance("Washington State University", "University of Washington", conf));
assertEquals(-1.0, keywordMatch.distance("Allen (United States)", "United States Military Academy", conf));
@ -112,7 +120,7 @@ public class ComparatorTest extends AbstractPaceTest {
public void stringContainsMatchTest() {
params.put("string", "openorgs");
params.put("bool", "XOR");
params.put("aggregator", "XOR");
params.put("caseSensitive", "false");
StringContainsMatch stringContainsMatch = new StringContainsMatch(params);
@ -120,7 +128,7 @@ public class ComparatorTest extends AbstractPaceTest {
assertEquals(0.0, stringContainsMatch.distance("openorgs", "openorgs", conf));
params.put("string", "openorgs");
params.put("bool", "AND");
params.put("aggregator", "AND");
params.put("caseSensitive", "false");
stringContainsMatch = new StringContainsMatch(params);

View File

@ -6,7 +6,8 @@ import static org.junit.jupiter.api.Assertions.assertEquals;
import java.util.HashMap;
import java.util.Map;
import org.junit.jupiter.api.*;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import eu.dnetlib.pace.model.Person;
import jdk.nashorn.internal.ir.annotations.Ignore;

View File

@ -0,0 +1,72 @@
# Action Management Framework
This module implements the Oozie workflow for the integration of pre-built contents into the OpenAIRE Graph.
Such contents can be
* brand new, non-existing records to be introduced as nodes of the graph
* updates (or enrichments) for records that already exist in the graph (e.g. a new subject term for a publication)
* relations among existing nodes
The actionset contents are organised into logical containers; each of them can contain multiple versions of the contents and is characterised by
* a name
* an identifier
* the paths on HDFS where each version of the contents is stored
Each version is then characterised by
* the creation date
* the last update date
* the indication whether it is the latest version or an expired one, candidate for garbage collection
## ActionSet serialization
Each actionset version contains records compliant with the graph internal data model, i.e. subclasses of `eu.dnetlib.dhp.schema.oaf.Oaf`,
defined in the external schemas module
```
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>${dhp-schemas.artifact}</artifactId>
<version>${dhp-schemas.version}</version>
</dependency>
```
When the actionset contains relationships, the model class to use is `eu.dnetlib.dhp.schema.oaf.Relation`; when it
contains entities, the class is `eu.dnetlib.dhp.schema.oaf.OafEntity` or one of its subclasses
`Datasource`, `Organization`, `Project`, `Result` (or one of its subclasses `Publication`, `Dataset`, etc.).
Then, each OpenAIRE Graph model class instance must be wrapped using the class `eu.dnetlib.dhp.schema.action.AtomicAction`, a generic
container that defines two attributes
* `T payload`: the OpenAIRE Graph class instance containing the data;
* `Class<T> clazz`: the class of the instance contained in the payload.
Each AtomicAction can then be serialised in JSON format using `com.fasterxml.jackson.databind.ObjectMapper` from
```
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>${dhp.jackson.version}</version>
</dependency>
```
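For illustration, here is a minimal sketch of the wrapping step; the `Publication` instance and its identifier are hypothetical placeholders:
```
// sketch: wrap a (placeholder) Publication into an AtomicAction and serialise it
ObjectMapper mapper = new ObjectMapper();

Publication publication = new Publication();
publication.setId("50|doi_________::0123456789abcdef0123456789abcdef"); // placeholder id

AtomicAction<Publication> aa = new AtomicAction<>(Publication.class, publication);
String json = mapper.writeValueAsString(aa); // carries both the clazz and the payload attributes
```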
The JSON serialization must then be stored as a GZip-compressed sequence file (`org.apache.hadoop.mapred.SequenceFileOutputFormat`).
Such a file contains a set of key/value tuples, both of type `org.apache.hadoop.io.Text`, where
* the `key` must be set to the canonical name of the class contained in the `AtomicAction`;
* the `value` must be set to the JSON serialization of the AtomicAction.
The following snippet provides an example of how to create an actionset version of Relation records:
```
rels // JavaRDD<Relation>
.map(relation -> new AtomicAction<Relation>(Relation.class, relation))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
```
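Reading an actionset version back follows the inverse path. A sketch, mirroring the pattern used in the module's tests (`sc` is a `JavaSparkContext`, `OBJECT_MAPPER` a Jackson `ObjectMapper`):
```
// sketch: read an actionset version back into a JavaRDD<Relation>
JavaRDD<Relation> rels = sc
    .sequenceFile(outputPath, Text.class, Text.class)
    .map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
    .map(aa -> (Relation) aa.getPayload());
```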

View File

@ -63,10 +63,6 @@
<groupId>eu.dnetlib</groupId>
<artifactId>dnet-openaireplus-mapping-utils</artifactId>
</exclusion>
<exclusion>
<groupId>eu.dnetlib</groupId>
<artifactId>dnet-index-solr-common</artifactId>
</exclusion>
<exclusion>
<groupId>saxonica</groupId>
<artifactId>saxon</artifactId>

View File

@ -20,6 +20,7 @@ import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException;
@ -33,7 +34,8 @@ import eu.dnetlib.dhp.schema.oaf.*;
public class PromoteActionPayloadForGraphTableJob {
private static final Logger logger = LoggerFactory.getLogger(PromoteActionPayloadForGraphTableJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils

View File

@ -31,6 +31,7 @@ import org.mockito.Mock;
import org.mockito.Mockito;
import org.mockito.junit.jupiter.MockitoExtension;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.ISClient;
@ -46,7 +47,8 @@ public class PartitionActionSetsByPayloadTypeJobTest {
private static Configuration configuration;
private static SparkSession spark;
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
private static final StructType ATOMIC_ACTION_SCHEMA = StructType$.MODULE$
.apply(

View File

@ -25,6 +25,7 @@ import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.common.ModelSupport;
@ -41,7 +42,8 @@ public class PromoteActionPayloadForGraphTableJobTest {
private Path inputActionPayloadRootDir;
private Path outputDir;
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
@BeforeAll
public static void beforeAll() {
@ -154,6 +156,10 @@ public class PromoteActionPayloadForGraphTableJobTest {
List<? extends Oaf> actualOutputRows = readGraphTableFromJobOutput(outputGraphTableDir.toString(), rowClazz)
.collectAsList()
.stream()
.map(s -> {
s.setLastupdatetimestamp(0L);
return s;
})
.sorted(Comparator.comparingInt(Object::hashCode))
.collect(Collectors.toList());
String expectedOutputGraphTableJsonDumpPath = resultFileLocation(strategy, rowClazz, actionPayloadClazz);
@ -166,6 +172,10 @@ public class PromoteActionPayloadForGraphTableJobTest {
expectedOutputGraphTableJsonDumpFile.toString(), rowClazz)
.collectAsList()
.stream()
.map(s -> {
s.setLastupdatetimestamp(0L);
return s;
})
.sorted(Comparator.comparingInt(Object::hashCode))
.collect(Collectors.toList());
assertIterableEquals(expectedOutputRows, actualOutputRows);

View File

@ -79,8 +79,8 @@
</dependency>
<dependency>
<groupId>org.scala-lang.modules</groupId>
<artifactId>scala-xml_2.12</artifactId>
<version>2.1.0</version>
<artifactId>scala-xml_${scala.binary.version}</artifactId>
<version>${scala-xml.version}</version>
</dependency>
<dependency>

View File

@ -11,6 +11,7 @@ import org.apache.spark.sql.SparkSession;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.dhp.schema.oaf.Subject;
@ -93,4 +94,9 @@ public class Constants {
return s;
}
public static void removeOutputDir(SparkSession spark, String path) {
HdfsSupport.remove(path, spark.sparkContext().hadoopConfiguration());
}
}

View File

@ -0,0 +1,162 @@
package eu.dnetlib.dhp.actionmanager.bipaffiliations;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.Dataset;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.Constants;
import eu.dnetlib.dhp.actionmanager.ror.GenerateRorActionSetJob;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import scala.Tuple2;
/**
* Creates action sets for Crossref affiliation relations inferred by BIP!
*/
public class PrepareAffiliationRelations implements Serializable {
private static final Logger log = LoggerFactory.getLogger(PrepareAffiliationRelations.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static final String ID_PREFIX = "50|doi_________::";
public static final String BIP_AFFILIATIONS_CLASSID = "result:organization:bipinference";
public static final String BIP_AFFILIATIONS_CLASSNAME = "Affiliation relation inferred by BIP!";
public static final String BIP_INFERENCE_PROVENANCE = "bip:affiliation:crossref";
public static <I extends Result> void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
.toString(
PrepareAffiliationRelations.class
.getResourceAsStream(
"/eu/dnetlib/dhp/actionmanager/bipaffiliations/input_actionset_parameter.json"));
final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);
parser.parseArgument(args);
Boolean isSparkSessionManaged = Constants.isSparkSessionManaged(parser);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
final String inputPath = parser.get("inputPath");
log.info("inputPath {}: ", inputPath);
final String outputPath = parser.get("outputPath");
log.info("outputPath {}: ", outputPath);
SparkConf conf = new SparkConf();
runWithSparkSession(
conf,
isSparkSessionManaged,
spark -> {
Constants.removeOutputDir(spark, outputPath);
prepareAffiliationRelations(spark, inputPath, outputPath);
});
}
private static <I extends Result> void prepareAffiliationRelations(SparkSession spark, String inputPath,
String outputPath) {
// load and parse affiliation relations from HDFS
Dataset<Row> df = spark
.read()
.schema("`DOI` STRING, `Matchings` ARRAY<STRUCT<`RORid`:STRING,`Confidence`:DOUBLE>>")
.json(inputPath);
// unroll nested arrays
df = df
.withColumn("matching", functions.explode(new Column("Matchings")))
.select(
new Column("DOI").as("doi"),
new Column("matching.RORid").as("rorid"),
new Column("matching.Confidence").as("confidence"));
// prepare action sets for affiliation relations
df
.toJavaRDD()
.flatMap((FlatMapFunction<Row, Relation>) row -> {
// DOI to OpenAIRE id
final String paperId = ID_PREFIX
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", row.getAs("doi")));
// ROR id to OpenAIRE id
final String affId = GenerateRorActionSetJob.calculateOpenaireId(row.getAs("rorid"));
Qualifier qualifier = OafMapperUtils
.qualifier(
BIP_AFFILIATIONS_CLASSID,
BIP_AFFILIATIONS_CLASSNAME,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS);
// format data info; setting `confidence` into relation's `trust`
DataInfo dataInfo = OafMapperUtils
.dataInfo(
false,
BIP_INFERENCE_PROVENANCE,
true,
false,
qualifier,
Double.toString(row.getAs("confidence")));
List<KeyValue> collectedfrom = OafMapperUtils.listKeyValues(ModelConstants.CROSSREF_ID, "Crossref");
// return bi-directional relations
return getAffiliationRelationPair(paperId, affId, collectedfrom, dataInfo).iterator();
})
.map(p -> new AtomicAction(Relation.class, p))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
}
private static List<Relation> getAffiliationRelationPair(String paperId, String affId, List<KeyValue> collectedfrom,
DataInfo dataInfo) {
return Arrays
.asList(
OafMapperUtils
.getRelation(
paperId,
affId,
ModelConstants.RESULT_ORGANIZATION,
ModelConstants.AFFILIATION,
ModelConstants.HAS_AUTHOR_INSTITUTION,
collectedfrom,
dataInfo,
null),
OafMapperUtils
.getRelation(
affId,
paperId,
ModelConstants.RESULT_ORGANIZATION,
ModelConstants.AFFILIATION,
ModelConstants.IS_AUTHOR_INSTITUTION_OF,
collectedfrom,
dataInfo,
null));
}
}

View File

@ -6,13 +6,14 @@ import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.io.Serializable;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
@ -24,8 +25,9 @@ import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.bipmodel.BipDeserialize;
import eu.dnetlib.dhp.actionmanager.bipmodel.BipScore;
import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipProjectModel;
import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipResultModel;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.action.AtomicAction;
@ -40,7 +42,6 @@ import scala.Tuple2;
*/
public class SparkAtomicActionScoreJob implements Serializable {
private static final String DOI = "doi";
private static final Logger log = LoggerFactory.getLogger(SparkAtomicActionScoreJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
@ -56,18 +57,17 @@ public class SparkAtomicActionScoreJob implements Serializable {
parser.parseArgument(args);
Boolean isSparkSessionManaged = Optional
.ofNullable(parser.get("isSparkSessionManaged"))
.map(Boolean::valueOf)
.orElse(Boolean.TRUE);
Boolean isSparkSessionManaged = isSparkSessionManaged(parser);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
final String inputPath = parser.get("inputPath");
log.info("inputPath {}: ", inputPath);
final String resultsInputPath = parser.get("resultsInputPath");
log.info("resultsInputPath: {}", resultsInputPath);
final String projectsInputPath = parser.get("projectsInputPath");
log.info("projectsInputPath: {}", projectsInputPath);
final String outputPath = parser.get("outputPath");
log.info("outputPath {}: ", outputPath);
log.info("outputPath: {}", outputPath);
SparkConf conf = new SparkConf();
@ -76,17 +76,45 @@ public class SparkAtomicActionScoreJob implements Serializable {
isSparkSessionManaged,
spark -> {
removeOutputDir(spark, outputPath);
prepareResults(spark, inputPath, outputPath);
JavaPairRDD<Text, Text> resultsRDD = prepareResults(spark, resultsInputPath, outputPath);
JavaPairRDD<Text, Text> projectsRDD = prepareProjects(spark, projectsInputPath, outputPath);
resultsRDD
.union(projectsRDD)
.saveAsHadoopFile(
outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
});
}
private static <I extends Result> void prepareResults(SparkSession spark, String bipScorePath, String outputPath) {
private static <I extends Project> JavaPairRDD<Text, Text> prepareProjects(SparkSession spark, String inputPath,
String outputPath) {
// read input bip project scores
Dataset<BipProjectModel> projectScores = readPath(spark, inputPath, BipProjectModel.class);
return projectScores.map((MapFunction<BipProjectModel, Project>) bipProjectScores -> {
Project project = new Project();
project.setId(bipProjectScores.getProjectId());
project.setMeasures(bipProjectScores.toMeasures());
return project;
}, Encoders.bean(Project.class))
.toJavaRDD()
.map(p -> new AtomicAction(Project.class, p))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))));
}
private static <I extends Result> JavaPairRDD<Text, Text> prepareResults(SparkSession spark, String bipScorePath,
String outputPath) {
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<BipDeserialize> bipDeserializeJavaRDD = sc
JavaRDD<BipResultModel> bipDeserializeJavaRDD = sc
.textFile(bipScorePath)
.map(item -> OBJECT_MAPPER.readValue(item, BipDeserialize.class));
.map(item -> OBJECT_MAPPER.readValue(item, BipResultModel.class));
Dataset<BipScore> bipScores = spark
.createDataset(bipDeserializeJavaRDD.flatMap(entry -> entry.keySet().stream().map(key -> {
@ -96,24 +124,20 @@ public class SparkAtomicActionScoreJob implements Serializable {
return bs;
}).collect(Collectors.toList()).iterator()).rdd(), Encoders.bean(BipScore.class));
bipScores
return bipScores.map((MapFunction<BipScore, Result>) bs -> {
Result ret = new Result();
.map((MapFunction<BipScore, Result>) bs -> {
Result ret = new Result();
ret.setId(bs.getId());
ret.setId(bs.getId());
ret.setMeasures(getMeasure(bs));
ret.setMeasures(getMeasure(bs));
return ret;
}, Encoders.bean(Result.class))
return ret;
}, Encoders.bean(Result.class))
.toJavaRDD()
.map(p -> new AtomicAction(Result.class, p))
.mapToPair(
aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
new Text(OBJECT_MAPPER.writeValueAsString(aa))))
.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);
new Text(OBJECT_MAPPER.writeValueAsString(aa))));
}
private static List<Measure> getMeasure(BipScore value) {
@ -159,12 +183,4 @@ public class SparkAtomicActionScoreJob implements Serializable {
HdfsSupport.remove(path, spark.sparkContext().hadoopConfiguration());
}
public static <R> Dataset<R> readPath(
SparkSession spark, String inputPath, Class<R> clazz) {
return spark
.read()
.textFile(inputPath)
.map((MapFunction<String, R>) value -> OBJECT_MAPPER.readValue(value, clazz), Encoders.bean(clazz));
}
}

View File

@ -0,0 +1,74 @@
package eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers;
import static eu.dnetlib.dhp.actionmanager.Constants.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import com.opencsv.bean.CsvBindByPosition;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Measure;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import lombok.AllArgsConstructor;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.Setter;
@NoArgsConstructor
@AllArgsConstructor
@Getter
@Setter
public class BipProjectModel {
String projectId;
String numOfInfluentialResults;
String numOfPopularResults;
String totalImpulse;
String totalCitationCount;
// each project bip measure has exactly one value, hence one key-value pair
private Measure createMeasure(String measureId, String measureValue) {
KeyValue kv = new KeyValue();
kv.setKey("score");
kv.setValue(measureValue);
kv
.setDataInfo(
OafMapperUtils
.dataInfo(
false,
UPDATE_DATA_INFO_TYPE,
true,
false,
OafMapperUtils
.qualifier(
UPDATE_MEASURE_BIP_CLASS_ID,
UPDATE_CLASS_NAME,
ModelConstants.DNET_PROVENANCE_ACTIONS,
ModelConstants.DNET_PROVENANCE_ACTIONS),
""));
Measure measure = new Measure();
measure.setId(measureId);
measure.setUnit(Collections.singletonList(kv));
return measure;
}
public List<Measure> toMeasures() {
return Arrays
.asList(
createMeasure("numOfInfluentialResults", numOfInfluentialResults),
createMeasure("numOfPopularResults", numOfPopularResults),
createMeasure("totalImpulse", totalImpulse),
createMeasure("totalCitationCount", totalCitationCount));
}
}

View File

@ -1,19 +1,21 @@
package eu.dnetlib.dhp.actionmanager.bipmodel;
package eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import eu.dnetlib.dhp.actionmanager.bipmodel.Score;
/**
* Class that maps the model of the bipFinder! input data.
* Only needed for deserialization purposes
*/
public class BipDeserialize extends HashMap<String, List<Score>> implements Serializable {
public class BipResultModel extends HashMap<String, List<Score>> implements Serializable {
public BipDeserialize() {
public BipResultModel() {
super();
}

View File

@ -24,8 +24,8 @@ import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.bipmodel.BipDeserialize;
import eu.dnetlib.dhp.actionmanager.bipmodel.BipScore;
import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipResultModel;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.common.ModelConstants;
@ -82,9 +82,9 @@ public class PrepareBipFinder implements Serializable {
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<BipDeserialize> bipDeserializeJavaRDD = sc
JavaRDD<BipResultModel> bipDeserializeJavaRDD = sc
.textFile(inputPath)
.map(item -> OBJECT_MAPPER.readValue(item, BipDeserialize.class));
.map(item -> OBJECT_MAPPER.readValue(item, BipResultModel.class));
spark
.createDataset(bipDeserializeJavaRDD.flatMap(entry -> entry.keySet().stream().map(key -> {

View File

@ -6,7 +6,6 @@ import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.io.IOUtils;
@ -23,7 +22,6 @@ import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.actionmanager.project.PrepareProjects;
import eu.dnetlib.dhp.actionmanager.project.utils.model.JsonTopic;
import eu.dnetlib.dhp.actionmanager.project.utils.model.Project;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
/**

View File

@ -168,7 +168,7 @@ public class GenerateRorActionSetJob {
}
private static String calculateOpenaireId(final String rorId) {
public static String calculateOpenaireId(final String rorId) {
return String.format("20|%s::%s", Constants.ROR_NS_PREFIX, DHPUtils.md5(rorId));
}

View File

@ -75,7 +75,7 @@ public class SparkAtomicActionUsageJob implements Serializable {
removeOutputDir(spark, outputPath);
prepareData(dbname, spark, workingPath + "/usageDb", "usage_stats", "result_id");
prepareData(dbname, spark, workingPath + "/projectDb", "project_stats", "id");
prepareData(dbname, spark, workingPath + "/datasourceDb", "datasource_stats", "repositor_id");
prepareData(dbname, spark, workingPath + "/datasourceDb", "datasource_stats", "repository_id");
writeActionSet(spark, workingPath, outputPath);
});
}

View File

@ -0,0 +1,20 @@
[
{
"paramName": "issm",
"paramLongName": "isSparkSessionManaged",
"paramDescription": "when true will stop SparkSession after job execution",
"paramRequired": false
},
{
"paramName": "ip",
"paramLongName": "inputPath",
"paramDescription": "the URL from where to get the programme file",
"paramRequired": true
},
{
"paramName": "o",
"paramLongName": "outputPath",
"paramDescription": "the path of the new ActionSet",
"paramRequired": true
}
]

View File

@ -0,0 +1,35 @@
# --- You can override the following properties (if needed) coming from your ~/.dhp/application.properties ---
# dhp.hadoop.frontend.temp.dir=/home/ilias.kanellos
# dhp.hadoop.frontend.user.name=ilias.kanellos
# dhp.hadoop.frontend.host.name=iis-cdh5-test-gw.ocean.icm.edu.pl
# dhp.hadoop.frontend.port.ssh=22
# oozieServiceLoc=http://iis-cdh5-test-m3:11000/oozie
# jobTracker=yarnRM
# nameNode=hdfs://nameservice1
# oozie.execution.log.file.location = target/extract-and-run-on-remote-host.log
# maven.executable=mvn
# Some memory and driver settings for more demanding tasks
sparkDriverMemory=10G
sparkExecutorMemory=10G
sparkExecutorCores=4
sparkShufflePartitions=7680
# The above is given differently in an example I found online
oozie.action.sharelib.for.spark=spark2
oozieActionShareLibForSpark2=spark2
spark2YarnHistoryServerAddress=http://iis-cdh5-test-gw.ocean.icm.edu.pl:18089
spark2EventLogDir=/user/spark/spark2ApplicationHistory
sparkSqlWarehouseDir=/user/hive/warehouse
hiveMetastoreUris=thrift://iis-cdh5-test-m3.ocean.icm.edu.pl:9083
# This MAY avoid the 'no library used' error
oozie.use.system.libpath=true
# Some stuff copied from openaire's jobs
spark2ExtraListeners=com.cloudera.spark.lineage.NavigatorAppListener
spark2SqlQueryExecutionListeners=com.cloudera.spark.lineage.NavigatorQueryListener
# The following is needed as a property of a workflow
oozie.wf.application.path=${oozieTopWfApplicationPath}
inputPath=/data/bip-affiliations/data.json
outputPath=/tmp/crossref-affiliations-output-v5

View File

@ -0,0 +1,30 @@
<configuration>
<property>
<name>jobTracker</name>
<value>yarnRM</value>
</property>
<property>
<name>nameNode</name>
<value>hdfs://nameservice1</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
<property>
<name>hiveMetastoreUris</name>
<value>thrift://iis-cdh5-test-m3.ocean.icm.edu.pl:9083</value>
</property>
<property>
<name>hiveJdbcUrl</name>
<value>jdbc:hive2://iis-cdh5-test-m3.ocean.icm.edu.pl:10000</value>
</property>
<property>
<name>hiveDbName</name>
<value>openaire</value>
</property>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
</property>
</configuration>

View File

@ -0,0 +1,107 @@
<workflow-app name="BipAffiliations" xmlns="uri:oozie:workflow:0.5">
<parameters>
<property>
<name>inputPath</name>
<description>the path where to find the inferred affiliation relations</description>
</property>
<property>
<name>outputPath</name>
<description>the path where to store the actionset</description>
</property>
<property>
<name>sparkDriverMemory</name>
<description>memory for driver process</description>
</property>
<property>
<name>sparkExecutorMemory</name>
<description>memory for individual executor</description>
</property>
<property>
<name>sparkExecutorCores</name>
<description>number of cores used by single executor</description>
</property>
<property>
<name>oozieActionShareLibForSpark2</name>
<description>oozie action sharelib for spark 2.*</description>
</property>
<property>
<name>spark2ExtraListeners</name>
<value>com.cloudera.spark.lineage.NavigatorAppListener</value>
<description>spark 2.* extra listeners classname</description>
</property>
<property>
<name>spark2SqlQueryExecutionListeners</name>
<value>com.cloudera.spark.lineage.NavigatorQueryListener</value>
<description>spark 2.* sql query execution listeners classname</description>
</property>
<property>
<name>spark2YarnHistoryServerAddress</name>
<description>spark 2.* yarn history server address</description>
</property>
<property>
<name>spark2EventLogDir</name>
<description>spark 2.* event log dir location</description>
</property>
</parameters>
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapreduce.job.queuename</name>
<value>${queueName}</value>
</property>
<property>
<name>oozie.launcher.mapred.job.queue.name</name>
<value>${oozieLauncherQueueName}</value>
</property>
<property>
<name>oozie.action.sharelib.for.spark</name>
<value>${oozieActionShareLibForSpark2}</value>
</property>
</configuration>
</global>
<start to="deleteoutputpath"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="deleteoutputpath">
<fs>
<delete path="${outputPath}"/>
<mkdir path="${outputPath}"/>
<delete path="${workingDir}"/>
<mkdir path="${workingDir}"/>
</fs>
<ok to="atomicactions"/>
<error to="Kill"/>
</action>
<action name="atomicactions">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Produces the atomic actions with the affiliation relations from Crossref inferred by BIP!</name>
<class>eu.dnetlib.dhp.actionmanager.bipaffiliations.PrepareAffiliationRelations</class>
<jar>dhp-aggregation-${projectVersion}.jar</jar>
<spark-opts>
--executor-memory=${sparkExecutorMemory}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.warehouse.dir=${sparkSqlWarehouseDir}
</spark-opts>
<arg>--inputPath</arg><arg>${inputPath}</arg>
<arg>--outputPath</arg><arg>${outputPath}</arg>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

View File

@ -6,9 +6,15 @@
"paramRequired": false
},
{
"paramName": "ip",
"paramLongName": "inputPath",
"paramDescription": "the URL from where to get the programme file",
"paramName": "rip",
"paramLongName": "resultsInputPath",
"paramDescription": "the URL from where to get the input file for results",
"paramRequired": true
},
{
"paramName": "pip",
"paramLongName": "projectsInputPath",
"paramDescription": "the URL from where to get the input file for projects",
"paramRequired": true
},
{

View File

@ -1,4 +1,9 @@
{
"ETHZ.UNIGENF": {
"openaire_id": "opendoar____::1400",
"datacite_name": "Uni Genf",
"official_name": "Archive ouverte UNIGE"
},
"GESIS.RKI": {
"openaire_id": "re3data_____::r3d100010436",
"datacite_name": "Forschungsdatenzentrum am Robert Koch Institut",

View File

@ -222,7 +222,7 @@ object BioDBToOAF {
def uniprotToOAF(input: String): List[Oaf] = {
implicit lazy val formats: DefaultFormats.type = org.json4s.DefaultFormats
lazy val json = parse(input)
val pid = (json \ "pid").extract[String]
val pid = (json \ "pid").extract[String].trim()
val d = new Dataset

View File

@ -18,9 +18,9 @@ import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql._
import org.slf4j.{Logger, LoggerFactory}
import java.io.{ByteArrayInputStream, InputStream}
import java.io.InputStream
import scala.io.Source
//import scala.xml.pull.XMLEventReader
import scala.xml.pull.XMLEventReader
object SparkCreateBaselineDataFrame {
@ -197,8 +197,8 @@ object SparkCreateBaselineDataFrame {
val ds: Dataset[PMArticle] = spark.createDataset(
k.filter(i => i._1.endsWith(".gz"))
.flatMap(i => {
// val xml = new XMLEventReader(Source.fromBytes(i._2.getBytes()))
new PMParser(new ByteArrayInputStream(i._2.getBytes()))
val xml = new XMLEventReader(Source.fromBytes(i._2.getBytes()))
new PMParser(xml)
})
)
ds.map(p => (p.getPmid, p))(Encoders.tuple(Encoders.STRING, PMEncoder))

View File

@ -1,20 +1,11 @@
package eu.dnetlib.dhp.sx.bio.pubmed
import javax.xml.stream.{XMLEventReader, XMLInputFactory, XMLStreamConstants}
import scala.language.postfixOps
import scala.xml.MetaData
//import scala.xml.pull.{EvElemEnd, EvElemStart, EvText, XMLEventReader}
import scala.xml.pull.{EvElemEnd, EvElemStart, EvText, XMLEventReader}
/** @param xml
*/
class PMParser(stream: java.io.InputStream) extends Iterator[PMArticle] {
private val reader: XMLEventReader = {
println("INSTANTIATE READER")
val factory = XMLInputFactory.newInstance()
factory.createXMLEventReader(stream)
}
class PMParser(xml: XMLEventReader) extends Iterator[PMArticle] {
var currentArticle: PMArticle = generateNextArticle()
@ -58,142 +49,85 @@ class PMParser(stream: java.io.InputStream) extends Iterator[PMArticle] {
var currentMonth = "01"
var currentDay = "01"
var currentArticleType: String = null
var sb = new StringBuilder()
var insideChar = false
var complete = false
while (reader.hasNext && !complete) {
val next = reader.nextEvent()
while (xml.hasNext) {
xml.next match {
case EvElemStart(_, label, attrs, _) =>
currNode = label
if (next.isStartElement) {
if (insideChar) {
if (sb.nonEmpty)
println(s"got data ${sb.toString.trim}")
insideChar = false
}
val name = next.asStartElement().getName.getLocalPart
println(s"Start Element $name")
next.asStartElement().getAttributes.forEachRemaining(e => print(e.toString))
label match {
case "PubmedArticle" => currentArticle = new PMArticle
case "Author" => currentAuthor = new PMAuthor
case "Journal" => currentJournal = new PMJournal
case "Grant" => currentGrant = new PMGrant
case "PublicationType" | "DescriptorName" =>
currentSubject = new PMSubject
currentSubject.setMeshId(extractAttributes(attrs, "UI"))
case "ArticleId" => currentArticleType = extractAttributes(attrs, "IdType")
case _ =>
}
case EvElemEnd(_, label) =>
label match {
case "PubmedArticle" => return currentArticle
case "Author" => currentArticle.getAuthors.add(currentAuthor)
case "Journal" => currentArticle.setJournal(currentJournal)
case "Grant" => currentArticle.getGrants.add(currentGrant)
case "PubMedPubDate" =>
if (currentArticle.getDate == null)
currentArticle.setDate(validate_Date(currentYear, currentMonth, currentDay))
case "PubDate" => currentJournal.setDate(s"$currentYear-$currentMonth-$currentDay")
case "DescriptorName" => currentArticle.getSubjects.add(currentSubject)
case "PublicationType" => currentArticle.getPublicationTypes.add(currentSubject)
case _ =>
}
case EvText(text) =>
if (currNode != null && text.trim.nonEmpty)
currNode match {
case "ArticleTitle" => {
if (currentArticle.getTitle == null)
currentArticle.setTitle(text.trim)
else
currentArticle.setTitle(currentArticle.getTitle + text.trim)
}
case "AbstractText" => {
if (currentArticle.getDescription == null)
currentArticle.setDescription(text.trim)
else
currentArticle.setDescription(currentArticle.getDescription + text.trim)
}
case "PMID" => currentArticle.setPmid(text.trim)
case "ArticleId" =>
if ("doi".equalsIgnoreCase(currentArticleType)) currentArticle.setDoi(text.trim)
if ("pmc".equalsIgnoreCase(currentArticleType)) currentArticle.setPmcId(text.trim)
case "Language" => currentArticle.setLanguage(text.trim)
case "ISSN" => currentJournal.setIssn(text.trim)
case "GrantID" => currentGrant.setGrantID(text.trim)
case "Agency" => currentGrant.setAgency(text.trim)
case "Country" => if (currentGrant != null) currentGrant.setCountry(text.trim)
case "Year" => currentYear = text.trim
case "Month" => currentMonth = text.trim
case "Day" => currentDay = text.trim
case "Volume" => currentJournal.setVolume(text.trim)
case "Issue" => currentJournal.setIssue(text.trim)
case "PublicationType" | "DescriptorName" => currentSubject.setValue(text.trim)
case "LastName" => {
if (currentAuthor != null)
currentAuthor.setLastName(text.trim)
}
case "ForeName" =>
if (currentAuthor != null)
currentAuthor.setForeName(text.trim)
case "Title" =>
if (currentJournal.getTitle == null)
currentJournal.setTitle(text.trim)
else
currentJournal.setTitle(currentJournal.getTitle + text.trim)
case _ =>
} else if (next.isEndElement) {
if (insideChar) {
if (sb.nonEmpty)
println(s"got data ${sb.toString.trim}")
insideChar = false
}
val name = next.asEndElement().getName.getLocalPart
println(s"End Element $name")
if (name.equalsIgnoreCase("PubmedArticle")) {
complete = true
println("Condizione di uscita")
}
} else if (next.isCharacters) {
if (!insideChar) {
insideChar = true
sb.clear()
}
val d = next.asCharacters().getData
if (d.trim.nonEmpty)
sb.append(d.trim)
}
case _ =>
}
// next match {
// case _ if (next.isStartElement) =>
// val name = next.asStartElement().getName.getLocalPart
// println(s"Start Element $name")
// case _ if (next.isEndElement) =>
// val name = next.asStartElement().getName.getLocalPart
// println(s"End Element $name")
// case _ if (next.isCharacters) =>
// val c = next.asCharacters()
// val data = c.getData
// println(s"Text value $data")
//
// }
//
//
// reader.next match {
//
// case
//
// case EvElemStart(_, label, attrs, _) =>
// currNode = label
//
// label match {
// case "PubmedArticle" => currentArticle = new PMArticle
// case "Author" => currentAuthor = new PMAuthor
// case "Journal" => currentJournal = new PMJournal
// case "Grant" => currentGrant = new PMGrant
// case "PublicationType" | "DescriptorName" =>
// currentSubject = new PMSubject
// currentSubject.setMeshId(extractAttributes(attrs, "UI"))
// case "ArticleId" => currentArticleType = extractAttributes(attrs, "IdType")
// case _ =>
// }
// case EvElemEnd(_, label) =>
// label match {
// case "PubmedArticle" => return currentArticle
// case "Author" => currentArticle.getAuthors.add(currentAuthor)
// case "Journal" => currentArticle.setJournal(currentJournal)
// case "Grant" => currentArticle.getGrants.add(currentGrant)
// case "PubMedPubDate" =>
// if (currentArticle.getDate == null)
// currentArticle.setDate(validate_Date(currentYear, currentMonth, currentDay))
// case "PubDate" => currentJournal.setDate(s"$currentYear-$currentMonth-$currentDay")
// case "DescriptorName" => currentArticle.getSubjects.add(currentSubject)
// case "PublicationType" => currentArticle.getPublicationTypes.add(currentSubject)
// case _ =>
// }
// case EvText(text) =>
// if (currNode != null && text.trim.nonEmpty)
// currNode match {
// case "ArticleTitle" => {
// if (currentArticle.getTitle == null)
// currentArticle.setTitle(text.trim)
// else
// currentArticle.setTitle(currentArticle.getTitle + text.trim)
// }
// case "AbstractText" => {
// if (currentArticle.getDescription == null)
// currentArticle.setDescription(text.trim)
// else
// currentArticle.setDescription(currentArticle.getDescription + text.trim)
// }
// case "PMID" => currentArticle.setPmid(text.trim)
// case "ArticleId" =>
// if ("doi".equalsIgnoreCase(currentArticleType)) currentArticle.setDoi(text.trim)
// if ("pmc".equalsIgnoreCase(currentArticleType)) currentArticle.setPmcId(text.trim)
// case "Language" => currentArticle.setLanguage(text.trim)
// case "ISSN" => currentJournal.setIssn(text.trim)
// case "GrantID" => currentGrant.setGrantID(text.trim)
// case "Agency" => currentGrant.setAgency(text.trim)
// case "Country" => if (currentGrant != null) currentGrant.setCountry(text.trim)
// case "Year" => currentYear = text.trim
// case "Month" => currentMonth = text.trim
// case "Day" => currentDay = text.trim
// case "Volume" => currentJournal.setVolume(text.trim)
// case "Issue" => currentJournal.setIssue(text.trim)
// case "PublicationType" | "DescriptorName" => currentSubject.setValue(text.trim)
// case "LastName" => {
// if (currentAuthor != null)
// currentAuthor.setLastName(text.trim)
// }
// case "ForeName" =>
// if (currentAuthor != null)
// currentAuthor.setForeName(text.trim)
// case "Title" =>
// if (currentJournal.getTitle == null)
// currentJournal.setTitle(text.trim)
// else
// currentJournal.setTitle(currentJournal.getTitle + text.trim)
// case _ =>
//
// }
// case _ =>
// }
}
null
}

View File

@ -0,0 +1,145 @@
package eu.dnetlib.dhp.actionmanager.bipaffiliations;
import static org.junit.jupiter.api.Assertions.*;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.Relation;
import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
public class PrepareAffiliationRelationsTest {
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
private static SparkSession spark;
private static Path workingDir;
private static final String ID_PREFIX = "50|doi_________::";
private static final Logger log = LoggerFactory
.getLogger(PrepareAffiliationRelationsTest.class);
@BeforeAll
public static void beforeAll() throws IOException {
workingDir = Files.createTempDirectory(PrepareAffiliationRelationsTest.class.getSimpleName());
log.info("Using work dir {}", workingDir);
SparkConf conf = new SparkConf();
conf.setAppName(PrepareAffiliationRelationsTest.class.getSimpleName());
conf.setMaster("local[*]");
conf.set("spark.driver.host", "localhost");
conf.set("hive.metastore.local", "true");
conf.set("spark.ui.enabled", "false");
conf.set("spark.sql.warehouse.dir", workingDir.toString());
conf.set("hive.metastore.warehouse.dir", workingDir.resolve("warehouse").toString());
spark = SparkSession
.builder()
.appName(PrepareAffiliationRelationsTest.class.getSimpleName())
.config(conf)
.getOrCreate();
}
@AfterAll
public static void afterAll() throws IOException {
FileUtils.deleteDirectory(workingDir.toFile());
spark.stop();
}
@Test
void testMatch() throws Exception {
String affiliationRelationsPath = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipaffiliations/doi_to_ror.json")
.getPath();
String outputPath = workingDir.toString() + "/actionSet";
PrepareAffiliationRelations
.main(
new String[] {
"-isSparkSessionManaged", Boolean.FALSE.toString(),
"-inputPath", affiliationRelationsPath,
"-outputPath", outputPath
});
final JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
JavaRDD<Relation> tmp = sc
.sequenceFile(outputPath, Text.class, Text.class)
.map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
.map(aa -> ((Relation) aa.getPayload()));
// for (Relation r : tmp.collect()) {
// System.out.println(
// r.getSource() + "\t" + r.getTarget() + "\t" + r.getRelType() + "\t" + r.getRelClass() + "\t" + r.getSubRelType() + "\t" + r.getValidationDate() + "\t" + r.getDataInfo().getTrust() + "\t" + r.getDataInfo().getInferred()
// );
// }
// count the number of relations
assertEquals(20, tmp.count());
Dataset<Relation> dataset = spark.createDataset(tmp.rdd(), Encoders.bean(Relation.class));
dataset.createOrReplaceTempView("result");
Dataset<Row> execVerification = spark
.sql("select r.relType, r.relClass, r.source, r.target, r.dataInfo.trust from result r");
// verify that we have an equal number of relations in each direction
Assertions
.assertEquals(
10, execVerification
.filter(
"relClass='" + ModelConstants.HAS_AUTHOR_INSTITUTION + "'")
.collectAsList()
.size());
Assertions
.assertEquals(
10, execVerification
.filter(
"relClass='" + ModelConstants.IS_AUTHOR_INSTITUTION_OF + "'")
.collectAsList()
.size());
// check confidence value of a specific relation
String sourceDOI = "10.1061/(asce)0733-9399(2002)128:7(759)";
final String sourceOpenaireId = ID_PREFIX
+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", sourceDOI));
Assertions
.assertEquals(
"0.7071067812", execVerification
.filter(
"source='" + sourceOpenaireId + "'")
.collectAsList()
.get(0)
.getString(4));
}
}
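The doi_to_ror.json fixture below holds 10 DOI-to-ROR matchings spread over 7 records, which is why the test expects 20 relations: each matching fans out into one relation per direction. A hedged sketch of that fan-out (the case classes and the id-mapping function are illustrative placeholders, not the job's real types; the exact trust derivation is assumed to live in the job):

case class Matching(rorId: String, confidence: Double)
case class Rel(source: String, target: String, relClass: String, trust: String)

// one matching -> two symmetric relations, trust carrying the matching confidence
def fanOut(doiOpenaireId: String, m: Matching, rorToOpenaireId: String => String): Seq[Rel] = {
  val orgId = rorToOpenaireId(m.rorId)
  Seq(
    Rel(doiOpenaireId, orgId, "hasAuthorInstitution", m.confidence.toString),
    Rel(orgId, doiOpenaireId, "isAuthorInstitutionOf", m.confidence.toString)
  )
}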

View File

@@ -6,7 +6,8 @@ import static org.junit.jupiter.api.Assertions.*;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import javax.xml.crypto.Data;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.io.Text;
@@ -27,7 +28,9 @@ import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.action.AtomicAction;
import eu.dnetlib.dhp.schema.oaf.Publication;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.oaf.Project;
import eu.dnetlib.dhp.schema.oaf.Result;
public class SparkAtomicActionScoreJobTest {
@@ -37,8 +40,8 @@ public class SparkAtomicActionScoreJobTest {
private static SparkSession spark;
private static Path workingDir;
private static final Logger log = LoggerFactory
.getLogger(SparkAtomicActionScoreJobTest.class);
private static final Logger log = LoggerFactory.getLogger(SparkAtomicActionScoreJobTest.class);
@BeforeAll
public static void beforeAll() throws IOException {
@@ -69,47 +72,64 @@ public class SparkAtomicActionScoreJobTest {
spark.stop();
}
@Test
void testMatch() throws Exception {
String bipScoresPath = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/bip_scores_oid.json")
.getPath();
private void runJob(String resultsInputPath, String projectsInputPath, String outputPath) throws Exception {
SparkAtomicActionScoreJob
.main(
new String[] {
"-isSparkSessionManaged",
Boolean.FALSE.toString(),
"-inputPath",
bipScoresPath,
"-outputPath",
workingDir.toString() + "/actionSet"
"-isSparkSessionManaged", Boolean.FALSE.toString(),
"-resultsInputPath", resultsInputPath,
"-projectsInputPath", projectsInputPath,
"-outputPath", outputPath,
});
}
@Test
void testScores() throws Exception {
String resultsInputPath = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/result_bip_scores.json")
.getPath();
String projectsInputPath = getClass()
.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/project_bip_scores.json")
.getPath();
String outputPath = workingDir.toString() + "/actionSet";
// execute the job to generate the action sets for result and project scores
runJob(resultsInputPath, projectsInputPath, outputPath);
final JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
JavaRDD<Result> tmp = sc
.sequenceFile(workingDir.toString() + "/actionSet", Text.class, Text.class)
JavaRDD<OafEntity> tmp = sc
.sequenceFile(outputPath, Text.class, Text.class)
.map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
.map(aa -> ((Result) aa.getPayload()));
.map(aa -> ((OafEntity) aa.getPayload()));
assertEquals(4, tmp.count());
assertEquals(8, tmp.count());
Dataset<Result> verificationDataset = spark.createDataset(tmp.rdd(), Encoders.bean(Result.class));
Dataset<OafEntity> verificationDataset = spark.createDataset(tmp.rdd(), Encoders.bean(OafEntity.class));
verificationDataset.createOrReplaceTempView("result");
Dataset<Row> execVerification = spark
Dataset<Row> testDataset = spark
.sql(
"Select p.id oaid, mes.id, mUnit.value from result p " +
"lateral view explode(measures) m as mes " +
"lateral view explode(mes.unit) u as mUnit ");
Assertions.assertEquals(12, execVerification.count());
// execVerification.show();
Assertions.assertEquals(28, testDataset.count());
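// 28 = 4 results x 3 result-level scores + 4 projects x 4 project-level scores (per the fixtures)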
assertResultImpactScores(testDataset);
assertProjectImpactScores(testDataset);
}
void assertResultImpactScores(Dataset<Row> testDataset) {
Assertions
.assertEquals(
"6.63451994567e-09", execVerification
"6.63451994567e-09", testDataset
.filter(
"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
"and id = 'influence'")
@@ -119,7 +139,7 @@ public class SparkAtomicActionScoreJobTest {
.getString(0));
Assertions
.assertEquals(
"0.348694533145", execVerification
"0.348694533145", testDataset
.filter(
"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
"and id = 'popularity_alt'")
@@ -129,7 +149,7 @@ public class SparkAtomicActionScoreJobTest {
.getString(0));
Assertions
.assertEquals(
"2.16094680115e-09", execVerification
"2.16094680115e-09", testDataset
.filter(
"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
"and id = 'popularity'")
@@ -137,7 +157,49 @@ public class SparkAtomicActionScoreJobTest {
.collectAsList()
.get(0)
.getString(0));
}
void assertProjectImpactScores(Dataset<Row> testDataset) throws Exception {
Assertions
.assertEquals(
"0", testDataset
.filter(
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
"and id = 'numOfInfluentialResults'")
.select("value")
.collectAsList()
.get(0)
.getString(0));
Assertions
.assertEquals(
"1", testDataset
.filter(
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
"and id = 'numOfPopularResults'")
.select("value")
.collectAsList()
.get(0)
.getString(0));
Assertions
.assertEquals(
"25", testDataset
.filter(
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
"and id = 'totalImpulse'")
.select("value")
.collectAsList()
.get(0)
.getString(0));
Assertions
.assertEquals(
"43", testDataset
.filter(
"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
"and id = 'totalCitationCount'")
.select("value")
.collectAsList()
.get(0)
.getString(0));
}
}
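The project-side assertions above traverse the same measures/unit explosion as the result side. A hedged sketch of how one record of the project_bip_scores.json fixture below could be flattened into the (oaid, id, value) rows the SQL query yields (the field names come straight from the fixture; the Measure/KeyValue wiring inside the job is assumed):

case class ProjectScores(projectId: String,
                         numOfInfluentialResults: Int,
                         numOfPopularResults: Int,
                         totalImpulse: Int,
                         totalCitationCount: Int)

// four score fields -> four (oaid, measure id, value) rows per project
def toRows(p: ProjectScores): Seq[(String, String, String)] = Seq(
  (p.projectId, "numOfInfluentialResults", p.numOfInfluentialResults.toString),
  (p.projectId, "numOfPopularResults", p.numOfPopularResults.toString),
  (p.projectId, "totalImpulse", p.totalImpulse.toString),
  (p.projectId, "totalCitationCount", p.totalCitationCount.toString)
)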

View File

@@ -0,0 +1,7 @@
{"DOI":"10.1061\/(asce)0733-9399(2002)128:7(759)","Matchings":[{"RORid":"https:\/\/ror.org\/03yxnpp24","Confidence":0.7071067812},{"RORid":"https:\/\/ror.org\/01teme464","Confidence":0.89}]}
{"DOI":"10.1105\/tpc.8.3.343","Matchings":[{"RORid":"https:\/\/ror.org\/02k40bc56","Confidence":0.7071067812}]}
{"DOI":"10.1161\/01.cir.0000013305.01850.37","Matchings":[{"RORid":"https:\/\/ror.org\/00qjgza05","Confidence":1}]}
{"DOI":"10.1142\/s021821650200186x","Matchings":[{"RORid":"https:\/\/ror.org\/035xkbk20","Confidence":1},{"RORid":"https:\/\/ror.org\/05apxxy63","Confidence":1}]}
{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(575)","Matchings":[{"RORid":"https:\/\/ror.org\/04j198w64","Confidence":0.82}]}
{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(588)","Matchings":[{"RORid":"https:\/\/ror.org\/03m8km719","Confidence":0.8660254038},{"RORid":"https:\/\/ror.org\/02aze4h65","Confidence":0.87}]}
{"DOI":"10.1161\/hy0202.103001","Matchings":[{"RORid":"https:\/\/ror.org\/057xtrt18","Confidence":0.7071067812}]}

View File

@@ -0,0 +1,4 @@
{"projectId":"40|nsf_________::d93e50d22374a1cf59f6a232413ea027","numOfInfluentialResults":0,"numOfPopularResults":10,"totalImpulse":181,"totalCitationCount":235}
{"projectId":"40|nih_________::1c93debc7085e440f245fbe70b2e8b21","numOfInfluentialResults":14,"numOfPopularResults":17,"totalImpulse":1558,"totalCitationCount":4226}
{"projectId":"40|nih_________::c02a8233e9b60f05bb418f0c9b714833","numOfInfluentialResults":0,"numOfPopularResults":1,"totalImpulse":25,"totalCitationCount":43}
{"projectId":"40|corda_______::d91dcf3a87dd7f72248fab0b8a4ba273","numOfInfluentialResults":2,"numOfPopularResults":3,"totalImpulse":78,"totalCitationCount":178}

View File

@@ -1,15 +1,44 @@
{"pdb": "1CW0", "title": "crystal structure analysis of very short patch repair (vsr) endonuclease in complex with a duplex dna", "authors": ["S.E.Tsutakawa", "H.Jingami", "K.Morikawa"], "doi": "10.1016/S0092-8674(00)81550-0", "pmid": "10612397"}
{"pdb": "2CWW", "title": "crystal structure of thermus thermophilus ttha1280, a putative sam- dependent rna methyltransferase, in complex with s-adenosyl-l- homocysteine", "authors": ["A.A.Pioszak", "K.Murayama", "N.Nakagawa", "A.Ebihara", "S.Kuramitsu", "M.Shirouzu", "S.Yokoyama", "Riken Structural Genomics/proteomics Initiative (Rsgi)"], "doi": "10.1107/S1744309105029842", "pmid": "16511182"}
{"pdb": "6CWE", "title": "structure of alpha-gsa[8,6p] bound by cd1d and in complex with the va14vb8.2 tcr", "authors": ["J.Wang", "D.Zajonc"], "doi": null, "pmid": null}
{"pdb": "5CWS", "title": "crystal structure of the intact chaetomium thermophilum nsp1-nup49- nup57 channel nucleoporin heterotrimer bound to its nic96 nuclear pore complex attachment site", "authors": ["C.J.Bley", "S.Petrovic", "M.Paduch", "V.Lu", "A.A.Kossiakoff", "A.Hoelz"], "doi": "10.1126/SCIENCE.AAC9176", "pmid": "26316600"}
{"pdb": "5CWE", "title": "structure of cyp107l2 from streptomyces avermitilis with lauric acid", "authors": ["T.-V.Pham", "S.-H.Han", "J.-H.Kim", "D.-H.Kim", "L.-W.Kang"], "doi": null, "pmid": null}
{"pdb": "7CW4", "title": "acetyl-coa acetyltransferase from bacillus cereus atcc 14579", "authors": ["J.Hong", "K.J.Kim"], "doi": "10.1016/J.BBRC.2020.09.048", "pmid": "32972748"}
{"pdb": "2CWP", "title": "crystal structure of metrs related protein from pyrococcus horikoshii", "authors": ["K.Murayama", "M.Kato-Murayama", "M.Shirouzu", "S.Yokoyama", "Riken StructuralGenomics/proteomics Initiative (Rsgi)"], "doi": null, "pmid": null}
{"pdb": "2CW7", "title": "crystal structure of intein homing endonuclease ii", "authors": ["H.Matsumura", "H.Takahashi", "T.Inoue", "H.Hashimoto", "M.Nishioka", "S.Fujiwara", "M.Takagi", "T.Imanaka", "Y.Kai"], "doi": "10.1002/PROT.20858", "pmid": "16493661"}
{"pdb": "1CWU", "title": "brassica napus enoyl acp reductase a138g mutant complexed with nad+ and thienodiazaborine", "authors": ["A.Roujeinikova", "J.B.Rafferty", "D.W.Rice"], "doi": "10.1074/JBC.274.43.30811", "pmid": "10521472"}
{"pdb": "3CWN", "title": "escherichia coli transaldolase b mutant f178y", "authors": ["T.Sandalova", "G.Schneider", "A.Samland"], "doi": "10.1074/JBC.M803184200", "pmid": "18687684"}
{"pdb": "1CWL", "title": "human cyclophilin a complexed with 4 4-hydroxy-meleu cyclosporin", "authors": ["V.Mikol", "J.Kallen", "P.Taylor", "M.D.Walkinshaw"], "doi": "10.1006/JMBI.1998.2108", "pmid": "9769216"}
{"pdb": "3CW2", "title": "crystal structure of the intact archaeal translation initiation factor 2 from sulfolobus solfataricus .", "authors": ["E.A.Stolboushkina", "S.V.Nikonov", "A.D.Nikulin", "U.Blaesi", "D.J.Manstein", "R.V.Fedorov", "M.B.Garber", "O.S.Nikonov"], "doi": "10.1016/J.JMB.2008.07.039", "pmid": "18675278"}
{"pdb": "3CW9", "title": "4-chlorobenzoyl-coa ligase/synthetase in the thioester-forming conformation, bound to 4-chlorophenacyl-coa", "authors": ["A.S.Reger", "J.Cao", "R.Wu", "D.Dunaway-Mariano", "A.M.Gulick"], "doi": "10.1021/BI800696Y", "pmid": "18620418"}
{"pdb": "3CWU", "title": "crystal structure of an alka host/guest complex 2'-fluoro-2'-deoxy-1, n6-ethenoadenine:thymine base pair", "authors": ["B.R.Bowman", "S.Lee", "S.Wang", "G.L.Verdine"], "doi": "10.1016/J.STR.2008.04.012", "pmid": "18682218"}
{"pdb": "5CWF", "title": "crystal structure of de novo designed helical repeat protein dhr8", "authors": ["G.Bhabha", "D.C.Ekiert"], "doi": "10.1038/NATURE16162", "pmid": "26675729"}
{"classification": "Signaling protein", "pdb": "5NM4", "deposition_date": "2017-04-05", "title": "A2a adenosine receptor room-temperature structure determined by serial Femtosecond crystallography", "Keywords": ["Oom-temperature", " serial crystallography", " signaling protein"], "authors": ["T.weinert", "R.cheng", "D.james", "D.gashi", "P.nogly", "K.jaeger", "M.hennig", "", "J.standfuss"], "pmid": "28912485", "doi": "10.1038/S41467-017-00630-4"}
{"classification": "Oxidoreductase/oxidoreductase inhibitor", "pdb": "4KN3", "deposition_date": "2013-05-08", "title": "Structure of the y34ns91g double mutant of dehaloperoxidase from Amphitrite ornata with 2,4,6-trichlorophenol", "Keywords": ["Lobin", " oxygen storage", " peroxidase", " oxidoreductase", " oxidoreductase-", "Oxidoreductase inhibitor complex"], "authors": ["C.wang", "L.lovelace", "L.lebioda"], "pmid": "23952341", "doi": "10.1021/BI400627W"}
{"classification": "Transport protein", "pdb": "8HKM", "deposition_date": "2022-11-27", "title": "Ion channel", "Keywords": ["On channel", " transport protein"], "authors": ["D.h.jiang", "J.t.zhang"], "pmid": "37494189", "doi": "10.1016/J.CELREP.2023.112858"}
{"classification": "Signaling protein", "pdb": "6JT1", "deposition_date": "2019-04-08", "title": "Structure of human soluble guanylate cyclase in the heme oxidised State", "Keywords": ["Oluble guanylate cyclase", " signaling protein"], "authors": ["L.chen", "Y.kang", "R.liu", "J.-x.wu"], "pmid": "31514202", "doi": "10.1038/S41586-019-1584-6"}
{"classification": "Immune system", "pdb": "7OW6", "deposition_date": "2021-06-16", "title": "Crystal structure of a tcr in complex with hla-a*11:01 bound to kras G12d peptide (vvvgadgvgk)", "Keywords": ["La", " kras", " tcr", " immune system"], "authors": ["V.karuppiah", "R.a.robinson"], "doi": "10.1038/S41467-022-32811-1"}
{"classification": "Biosynthetic protein", "pdb": "5EQ8", "deposition_date": "2015-11-12", "title": "Crystal structure of medicago truncatula histidinol-phosphate Phosphatase (mthpp) in complex with l-histidinol", "Keywords": ["Istidine biosynthesis", " metabolic pathways", " dimer", " plant", "", "Biosynthetic protein"], "authors": ["M.ruszkowski", "Z.dauter"], "pmid": "26994138", "doi": "10.1074/JBC.M115.708727"}
{"classification": "De novo protein", "pdb": "8CWA", "deposition_date": "2022-05-18", "title": "Solution nmr structure of 8-residue rosetta-designed cyclic peptide D8.21 in cdcl3 with cis/trans switching (tc conformation, 53%)", "Keywords": ["Yclic peptide", " non natural amino acids", " cis/trans", " switch peptides", "", "De novo design", "Membrane permeability", "De novo protein"], "authors": ["T.a.ramelot", "R.tejero", "G.t.montelione"], "pmid": "36041435", "doi": "10.1016/J.CELL.2022.07.019"}
{"classification": "Hydrolase", "pdb": "3R6M", "deposition_date": "2011-03-21", "title": "Crystal structure of vibrio parahaemolyticus yeaz", "Keywords": ["Ctin/hsp70 nucleotide-binding fold", " bacterial resuscitation", " viable", "But non-culturable state", "Resuscitation promoting factor", "Ygjd", "", "Yjee", "Vibrio parahaemolyticus", "Hydrolase"], "authors": ["A.roujeinikova", "I.aydin"], "pmid": "21858042", "doi": "10.1371/JOURNAL.PONE.0023245"}
{"classification": "Hydrolase", "pdb": "2W5J", "deposition_date": "2008-12-10", "title": "Structure of the c14-rotor ring of the proton translocating Chloroplast atp synthase", "Keywords": ["Ydrolase", " chloroplast", " atp synthase", " lipid-binding", " cf(0)", " membrane", "", "Transport", "Formylation", "Energy transduction", "Hydrogen ion transport", "", "Ion transport", "Transmembrane", "Membrane protein"], "authors": ["M.vollmar", "D.schlieper", "M.winn", "C.buechner", "G.groth"], "pmid": "19423706", "doi": "10.1074/JBC.M109.006916"}
{"classification": "De novo protein", "pdb": "4GLU", "deposition_date": "2012-08-14", "title": "Crystal structure of the mirror image form of vegf-a", "Keywords": ["-protein", " covalent dimer", " cysteine knot protein", " growth factor", " de", "Novo protein"], "authors": ["K.mandal", "M.uppalapati", "D.ault-riche", "J.kenney", "J.lowitz", "S.sidhu", "", "S.b.h.kent"], "pmid": "22927390", "doi": "10.1073/PNAS.1210483109"}
{"classification": "Hydrolase/hydrolase inhibitor", "pdb": "3WYL", "deposition_date": "2014-09-01", "title": "Crystal structure of the catalytic domain of pde10a complexed with 5- Methoxy-3-(1-phenyl-1h-pyrazol-5-yl)-1-(3-(trifluoromethyl)phenyl) Pyridazin-4(1h)-one", "Keywords": ["Ydrolase-hydrolase inhibitor complex"], "authors": ["H.oki", "Y.hayano"], "pmid": "25384088", "doi": "10.1021/JM5013648"}
{"classification": "Isomerase", "pdb": "5BOR", "deposition_date": "2015-05-27", "title": "Structure of acetobacter aceti pure-s57c, sulfonate form", "Keywords": ["Cidophile", " pure", " purine biosynthesis", " isomerase"], "authors": ["K.l.sullivan", "T.j.kappock"]}
{"classification": "Hydrolase", "pdb": "1X0C", "deposition_date": "2005-03-17", "title": "Improved crystal structure of isopullulanase from aspergillus niger Atcc 9642", "Keywords": ["Ullulan", " glycoside hydrolase family 49", " glycoprotein", " hydrolase"], "authors": ["M.mizuno", "T.tonozuka", "A.yamamura", "Y.miyasaka", "H.akeboshi", "S.kamitori", "", "A.nishikawa", "Y.sakano"], "pmid": "18155243", "doi": "10.1016/J.JMB.2007.11.098"}
{"classification": "Oxidoreductase", "pdb": "7CUP", "deposition_date": "2020-08-23", "title": "Structure of 2,5-dihydroxypridine dioxygenase from pseudomonas putida Kt2440", "Keywords": ["On-heme dioxygenase", " oxidoreductase"], "authors": ["G.q.liu", "H.z.tang"]}
{"classification": "Ligase", "pdb": "1VCN", "deposition_date": "2004-03-10", "title": "Crystal structure of t.th. hb8 ctp synthetase complex with sulfate Anion", "Keywords": ["Etramer", " riken structural genomics/proteomics initiative", " rsgi", "", "Structural genomics", "Ligase"], "authors": ["M.goto", "Riken structural genomics/proteomics initiative (rsgi)"], "pmid": "15296735", "doi": "10.1016/J.STR.2004.05.013"}
{"classification": "Transferase/transferase inhibitor", "pdb": "6C9V", "deposition_date": "2018-01-28", "title": "Mycobacterium tuberculosis adenosine kinase bound to (2r,3s,4r,5r)-2- (hydroxymethyl)-5-(6-(4-phenylpiperazin-1-yl)-9h-purin-9-yl) Tetrahydrofuran-3,4-diol", "Keywords": ["Ucleoside analog", " complex", " inhibitor", " structural genomics", " psi-2", "", "Protein structure initiative", "Tb structural genomics consortium", "", "Tbsgc", "Transferase-transferase inhibitor complex"], "authors": ["R.a.crespo", "Tb structural genomics consortium (tbsgc)"], "pmid": "31002508", "doi": "10.1021/ACS.JMEDCHEM.9B00020"}
{"classification": "De novo protein", "pdb": "4LPY", "deposition_date": "2013-07-16", "title": "Crystal structure of tencon variant g10", "Keywords": ["Ibronectin type iii fold", " alternate scaffold", " de novo protein"], "authors": ["A.teplyakov", "G.obmolova", "G.l.gilliland"], "pmid": "24375666", "doi": "10.1002/PROT.24502"}
{"classification": "Isomerase", "pdb": "2Y88", "deposition_date": "2011-02-03", "title": "Crystal structure of mycobacterium tuberculosis phosphoribosyl Isomerase (variant d11n) with bound prfar", "Keywords": ["Romatic amino acid biosynthesis", " isomerase", " tim-barrel", " histidine", "Biosynthesis", "Tryptophan biosynthesis"], "authors": ["J.kuper", "A.v.due", "A.geerlof", "M.wilmanns"], "pmid": "21321225", "doi": "10.1073/PNAS.1015996108"}
{"classification": "Unknown function", "pdb": "1SR0", "deposition_date": "2004-03-22", "title": "Crystal structure of signalling protein from sheep(sps-40) at 3.0a Resolution using crystal grown in the presence of polysaccharides", "Keywords": ["Ignalling protein", " involution", " unknown function"], "authors": ["D.b.srivastava", "A.s.ethayathulla", "N.singh", "J.kumar", "S.sharma", "T.p.singh"]}
{"classification": "Dna binding protein", "pdb": "3RH2", "deposition_date": "2011-04-11", "title": "Crystal structure of a tetr-like transcriptional regulator (sama_0099) From shewanella amazonensis sb2b at 2.42 a resolution", "Keywords": ["Na/rna-binding 3-helical bundle", " structural genomics", " joint center", "For structural genomics", "Jcsg", "Protein structure initiative", "Psi-", "Biology", "Dna binding protein"], "authors": ["Joint center for structural genomics (jcsg)"]}
{"classification": "Transferase", "pdb": "2WK5", "deposition_date": "2009-06-05", "title": "Structural features of native human thymidine phosphorylase And in complex with 5-iodouracil", "Keywords": ["Lycosyltransferase", " developmental protein", " angiogenesis", "", "5-iodouracil", "Growth factor", "Enzyme kinetics", "", "Differentiation", "Disease mutation", "Thymidine", "Phosphorylase", "Chemotaxis", "Transferase", "Mutagenesis", "", "Polymorphism"], "authors": ["E.mitsiki", "A.c.papageorgiou", "S.iyer", "N.thiyagarajan", "S.h.prior", "", "D.sleep", "C.finnis", "K.r.acharya"], "pmid": "19555658", "doi": "10.1016/J.BBRC.2009.06.104"}
{"classification": "Hydrolase", "pdb": "3P9Y", "deposition_date": "2010-10-18", "title": "Crystal structure of the drosophila melanogaster ssu72-pctd complex", "Keywords": ["Hosphatase", " cis proline", " lmw ptp-like fold", " rna polymerase ii ctd", "", "Hydrolase"], "authors": ["J.w.werner-allen", "P.zhou"], "pmid": "21159777", "doi": "10.1074/JBC.M110.197129"}
{"classification": "Recombination/dna", "pdb": "6OEO", "deposition_date": "2019-03-27", "title": "Cryo-em structure of mouse rag1/2 nfc complex (dna1)", "Keywords": ["(d)j recombination", " dna transposition", " rag", " scid", " recombination", "", "Recombination-dna complex"], "authors": ["X.chen", "Y.cui", "Z.h.zhou", "W.yang", "M.gellert"], "pmid": "32015552", "doi": "10.1038/S41594-019-0363-2"}
{"classification": "Hydrolase", "pdb": "4ECA", "deposition_date": "1997-02-21", "title": "Asparaginase from e. coli, mutant t89v with covalently bound aspartate", "Keywords": ["Ydrolase", " acyl-enzyme intermediate", " threonine amidohydrolase"], "authors": ["G.j.palm", "J.lubkowski", "A.wlodawer"], "pmid": "8706862", "doi": "10.1016/0014-5793(96)00660-6"}
{"classification": "Transcription/protein binding", "pdb": "3UVX", "deposition_date": "2011-11-30", "title": "Crystal structure of the first bromodomain of human brd4 in complex With a diacetylated histone 4 peptide (h4k12ack16ac)", "Keywords": ["Romodomain", " bromodomain containing protein 4", " cap", " hunk1", " mcap", "", "Mitotic chromosome associated protein", "Peptide complex", "Structural", "Genomics consortium", "Sgc", "Transcription-protein binding complex"], "authors": ["P.filippakopoulos", "S.picaud", "T.keates", "E.ugochukwu", "F.von delft", "", "C.h.arrowsmith", "A.m.edwards", "J.weigelt", "C.bountra", "S.knapp", "Structural", "Genomics consortium (sgc)"], "pmid": "22464331", "doi": "10.1016/J.CELL.2012.02.013"}
{"classification": "Membrane protein", "pdb": "1TLZ", "deposition_date": "2004-06-10", "title": "Tsx structure complexed with uridine", "Keywords": ["Ucleoside transporter", " beta barrel", " uridine", " membrane", "Protein"], "authors": ["J.ye", "B.van den berg"], "pmid": "15272310", "doi": "10.1038/SJ.EMBOJ.7600330"}
{"classification": "Dna binding protein", "pdb": "7AZD", "deposition_date": "2020-11-16", "title": "Dna polymerase sliding clamp from escherichia coli with peptide 20 Bound", "Keywords": ["Ntibacterial drug", " dna binding protein"], "authors": ["C.monsarrat", "G.compain", "C.andre", "I.martiel", "S.engilberge", "V.olieric", "", "P.wolff", "K.brillet", "M.landolfo", "C.silva da veiga", "J.wagner", "G.guichard", "", "D.y.burnouf"], "pmid": "34806883", "doi": "10.1021/ACS.JMEDCHEM.1C00918"}
{"classification": "Transferase", "pdb": "5N3K", "deposition_date": "2017-02-08", "title": "Camp-dependent protein kinase a from cricetulus griseus in complex With fragment like molecule o-guanidino-l-homoserine", "Keywords": ["Ragment", " complex", " transferase", " serine threonine kinase", " camp", "", "Kinase", "Pka"], "authors": ["C.siefker", "A.heine", "G.klebe"]}
{"classification": "Biosynthetic protein", "pdb": "8H52", "deposition_date": "2022-10-11", "title": "Crystal structure of helicobacter pylori carboxyspermidine Dehydrogenase in complex with nadp", "Keywords": ["Arboxyspermidine dehydrogenase", " biosynthetic protein"], "authors": ["K.y.ko", "S.c.park", "S.y.cho", "S.i.yoon"], "pmid": "36283333", "doi": "10.1016/J.BBRC.2022.10.049"}
{"classification": "Metal binding protein", "pdb": "6DYC", "deposition_date": "2018-07-01", "title": "Co(ii)-bound structure of the engineered cyt cb562 variant, ch3", "Keywords": ["Esigned protein", " 4-helix bundle", " electron transport", " metal binding", "Protein"], "authors": ["F.a.tezcan", "J.rittle"], "pmid": "30778140", "doi": "10.1038/S41557-019-0218-9"}
{"classification": "Protein fibril", "pdb": "6A6B", "deposition_date": "2018-06-27", "title": "Cryo-em structure of alpha-synuclein fiber", "Keywords": ["Lpha-syn fiber", " parkinson disease", " protein fibril"], "authors": ["Y.w.li", "C.y.zhao", "F.luo", "Z.liu", "X.gui", "Z.luo", "X.zhang", "D.li", "C.liu", "X.li"], "pmid": "30065316", "doi": "10.1038/S41422-018-0075-X"}
{"classification": "Dna", "pdb": "7D5E", "deposition_date": "2020-09-25", "title": "Left-handed g-quadruplex containing two bulges", "Keywords": ["-quadruplex", " bulge", " dna", " left-handed"], "authors": ["P.das", "A.maity", "K.h.ngo", "F.r.winnerdy", "B.bakalar", "Y.mechulam", "E.schmitt", "", "A.t.phan"], "pmid": "33503265", "doi": "10.1093/NAR/GKAA1259"}
{"classification": "Transferase", "pdb": "3RSY", "deposition_date": "2011-05-02", "title": "Cellobiose phosphorylase from cellulomonas uda in complex with sulfate And glycerol", "Keywords": ["H94", " alpha barrel", " cellobiose phosphorylase", " disaccharide", "Phosphorylase", "Transferase"], "authors": ["A.van hoorebeke", "J.stout", "W.soetaert", "J.van beeumen", "T.desmet", "S.savvides"]}
{"classification": "Oxidoreductase", "pdb": "7MCI", "deposition_date": "2021-04-02", "title": "Mofe protein from azotobacter vinelandii with a sulfur-replenished Cofactor", "Keywords": ["Zotobacter vinelandii", " mofe-protein", " nitrogenase", " oxidoreductase"], "authors": ["W.kang", "C.lee", "Y.hu", "M.w.ribbe"], "doi": "10.1038/S41929-022-00782-7"}
{"classification": "Dna", "pdb": "1XUW", "deposition_date": "2004-10-26", "title": "Structural rationalization of a large difference in rna affinity Despite a small difference in chemistry between two 2'-o-modified Nucleic acid analogs", "Keywords": ["Na mimetic methylcarbamate amide analog", " dna"], "authors": ["R.pattanayek", "L.sethaphong", "C.pan", "M.prhavc", "T.p.prakash", "M.manoharan", "", "M.egli"], "pmid": "15547979", "doi": "10.1021/JA044637K"}
{"classification": "Lyase", "pdb": "7C0D", "deposition_date": "2020-05-01", "title": "Crystal structure of azospirillum brasilense l-2-keto-3-deoxyarabonate Dehydratase (hydroxypyruvate-bound form)", "Keywords": ["-2-keto-3-deoxyarabonate dehydratase", " lyase"], "authors": ["Y.watanabe", "S.watanabe"], "pmid": "32697085", "doi": "10.1021/ACS.BIOCHEM.0C00515"}
{"classification": "Signaling protein", "pdb": "5LYK", "deposition_date": "2016-09-28", "title": "Crystal structure of intracellular b30.2 domain of btn3a1 bound to Citrate", "Keywords": ["30.2", " butyrophilin", " signaling protein"], "authors": ["F.mohammed", "A.t.baker", "M.salim", "B.e.willcox"], "pmid": "28862425", "doi": "10.1021/ACSCHEMBIO.7B00694"}
{"classification": "Toxin", "pdb": "4IZL", "deposition_date": "2013-01-30", "title": "Structure of the n248a mutant of the panton-valentine leucocidin s Component from staphylococcus aureus", "Keywords": ["I-component leucotoxin", " staphylococcus aureus", " s component", "Leucocidin", "Beta-barrel pore forming toxin", "Toxin"], "authors": ["L.maveyraud", "B.j.laventie", "G.prevost", "L.mourey"], "pmid": "24643034", "doi": "10.1371/JOURNAL.PONE.0092094"}
{"classification": "Dna", "pdb": "6F3C", "deposition_date": "2017-11-28", "title": "The cytotoxic [pt(h2bapbpy)] platinum complex interacting with the Cgtacg hexamer", "Keywords": ["Rug-dna complex", " four-way junction", " dna"], "authors": ["M.ferraroni", "C.bazzicalupi", "P.gratteri", "F.papi"], "pmid": "31046177", "doi": "10.1002/ANIE.201814532"}
{"classification": "Signaling protein/inhibitor", "pdb": "4L5M", "deposition_date": "2013-06-11", "title": "Complexe of arno sec7 domain with the protein-protein interaction Inhibitor n-(4-hydroxy-2,6-dimethylphenyl)benzenesulfonamide at ph6.5", "Keywords": ["Ec-7domain", " signaling protein-inhibitor complex"], "authors": ["F.hoh", "J.rouhana"], "pmid": "24112024", "doi": "10.1021/JM4009357"}
{"classification": "Signaling protein", "pdb": "5I6J", "deposition_date": "2016-02-16", "title": "Crystal structure of srgap2 f-barx", "Keywords": ["Rgap2", " f-bar", " fx", " signaling protein"], "authors": ["M.sporny", "J.guez-haddad", "M.n.isupov", "Y.opatowsky"], "pmid": "28333212", "doi": "10.1093/MOLBEV/MSX094"}
{"classification": "Metal binding protein", "pdb": "1Q80", "deposition_date": "2003-08-20", "title": "Solution structure and dynamics of nereis sarcoplasmic calcium binding Protein", "Keywords": ["Ll-alpha", " metal binding protein"], "authors": ["G.rabah", "R.popescu", "J.a.cox", "Y.engelborghs", "C.t.craescu"], "pmid": "15819893", "doi": "10.1111/J.1742-4658.2005.04629.X"}
{"classification": "Transferase", "pdb": "1TW1", "deposition_date": "2004-06-30", "title": "Beta-1,4-galactosyltransferase mutant met344his (m344h-gal-t1) complex With udp-galactose and magnesium", "Keywords": ["Et344his mutation; closed conformation; mn binding", " transferase"], "authors": ["B.ramakrishnan", "E.boeggeman", "P.k.qasba"], "pmid": "15449940", "doi": "10.1021/BI049007+"}
{"classification": "Rna", "pdb": "2PN4", "deposition_date": "2007-04-23", "title": "Crystal structure of hepatitis c virus ires subdomain iia", "Keywords": ["Cv", " ires", " subdoamin iia", " rna", " strontium", " hepatitis"], "authors": ["Q.zhao", "Q.han", "C.r.kissinger", "P.a.thompson"], "pmid": "18391410", "doi": "10.1107/S0907444908002011"}

View File

@@ -1,6 +1,36 @@
{"pid": "Q6GZX4", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 41."}], "title": "Putative transcription factor 001R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": "Q6GZX3", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 42."}], "title": "Uncharacterized protein 002L;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": "Q197F8", "dates": [{"date": "16-JUN-2009", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "11-JUL-2006", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 27."}], "title": "Uncharacterized protein 002R;", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus."], "references": [{"PubMed": "16912294"}, {" DOI": "10.1128/jvi.00464-06"}]}
{"pid": "Q197F7", "dates": [{"date": "16-JUN-2009", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "11-JUL-2006", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 23."}], "title": "Uncharacterized protein 003L;", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus."], "references": [{"PubMed": "16912294"}, {" DOI": "10.1128/jvi.00464-06"}]}
{"pid": "Q6GZX2", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 36."}], "title": "Uncharacterized protein 3R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": "Q6GZX1", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 34."}], "title": "Uncharacterized protein 004R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZX4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 43"}], "title": "Putative transcription factor 001R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZX3", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 45"}], "title": "Uncharacterized protein 002L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197F8", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-02-23", "date_info": "entry version 29"}], "title": "Uncharacterized protein 002R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q197F7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2020-08-12", "date_info": "entry version 23"}], "title": "Uncharacterized protein 003L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZX2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 3R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZX1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 38"}], "title": "Uncharacterized protein 004R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197F5", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-10-12", "date_info": "entry version 32"}], "title": "Uncharacterized protein 005L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZX0", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 47"}], "title": "Uncharacterized protein 005R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q91G88", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2023-06-28", "date_info": "entry version 53"}], "title": "Putative KilA-N domain-containing protein 006L", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
{"pid": " Q6GZW9", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 34"}], "title": "Uncharacterized protein 006R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZW8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 32"}], "title": "Uncharacterized protein 007R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197F3", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 28"}], "title": "Uncharacterized protein 007R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q197F2", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-02-23", "date_info": "entry version 22"}], "title": "Uncharacterized protein 008L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZW6", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 67"}], "title": "Putative helicase 009L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q91G85", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 38"}], "title": "Uncharacterized protein 009R", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
{"pid": " Q6GZW5", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 010R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197E9", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 28"}], "title": "Uncharacterized protein 011L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZW4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 011R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZW3", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 012L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197E7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 37"}], "title": "Uncharacterized protein IIV3-013L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZW2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 30"}], "title": "Uncharacterized protein 013R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZW1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 014R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZW0", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 50"}], "title": "Uncharacterized protein 015R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZV8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 017L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZV7", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 33"}], "title": "Uncharacterized protein 018L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZV6", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 87"}], "title": "Putative serine/threonine-protein kinase 019R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZV5", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 40"}], "title": "Uncharacterized protein 020R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZV4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 021L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197D8", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-12-14", "date_info": "entry version 35"}], "title": "Transmembrane protein 022L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZV2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 33"}], "title": "Uncharacterized protein 023R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197D7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 25"}], "title": "Uncharacterized protein 023R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q6GZV1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 024R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q197D5", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-10-12", "date_info": "entry version 24"}], "title": "Uncharacterized protein 025R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
{"pid": " Q91G70", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2020-08-12", "date_info": "entry version 32"}], "title": "Uncharacterized protein 026R", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
{"pid": " Q6GZU9", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 49"}], "title": "Uncharacterized protein 027R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
{"pid": " Q6GZU8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 55"}], "title": "Uncharacterized protein 028R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}

View File

@@ -14,10 +14,12 @@ import org.junit.jupiter.api.extension.ExtendWith
import org.junit.jupiter.api.{BeforeEach, Test}
import org.mockito.junit.jupiter.MockitoExtension
import java.io.{BufferedReader, FileInputStream, InputStream, InputStreamReader}
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.util.zip.GZIPInputStream
import scala.collection.JavaConverters._
import scala.collection.mutable.ListBuffer
import scala.io.Source
import scala.xml.pull.XMLEventReader
@ExtendWith(Array(classOf[MockitoExtension]))
class BioScholixTest extends AbstractVocabularyTest {
@@ -47,11 +49,11 @@ class BioScholixTest extends AbstractVocabularyTest {
@Test
def testEBIData() = {
val inputXML = getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml")
// new PubmedParser(new GZIPInputStream(new FileInputStream("/Users/sandro/Downloads/pubmed23n1078.xml.gz")))
new PMParser(new GZIPInputStream(new FileInputStream("/Users/sandro/Downloads/pubmed23n1078.xml.gz")))
print("DONE")
val inputXML = Source
.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
.mkString
val xml = new XMLEventReader(Source.fromBytes(inputXML.getBytes()))
new PMParser(xml).foreach(s => println(mapper.writeValueAsString(s)))
}
@Test
@ -87,14 +89,14 @@ class BioScholixTest extends AbstractVocabularyTest {
}
// @Test
// def testParsingPubmedXML(): Unit = {
// val xml = new XMLEventReader(
// Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
// )
// val parser = new PMParser(xml)
// parser.foreach(checkPMArticle)
// }
@Test
def testParsingPubmedXML(): Unit = {
val xml = new XMLEventReader(
Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
)
val parser = new PMParser(xml)
parser.foreach(checkPMArticle)
}
private def checkPubmedPublication(o: Oaf): Unit = {
assertTrue(o.isInstanceOf[Publication])
@ -151,19 +153,19 @@ class BioScholixTest extends AbstractVocabularyTest {
assertTrue(hasOldOpenAIREID)
}
// @Test
// def testPubmedMapping(): Unit = {
//
// val xml = new XMLEventReader(
// Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
// )
// val parser = new PMParser(xml)
// val results = ListBuffer[Oaf]()
// parser.foreach(x => results += PubMedToOaf.convert(x, vocabularies))
//
// results.foreach(checkPubmedPublication)
//
// }
@Test
def testPubmedMapping(): Unit = {
val xml = new XMLEventReader(
Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
)
val parser = new PMParser(xml)
val results = ListBuffer[Oaf]()
parser.foreach(x => results += PubMedToOaf.convert(x, vocabularies))
results.foreach(checkPubmedPublication)
}
@Test
def testPDBToOAF(): Unit = {

View File

@ -2,7 +2,9 @@
package eu.dnetlib.dhp.broker.oa.util;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;
import org.apache.spark.sql.Row;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -11,7 +13,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.broker.objects.OaBrokerMainEntity;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.model.SparkDedupConfig;
import eu.dnetlib.pace.model.SparkDeduper;
import eu.dnetlib.pace.tree.support.TreeProcessor;
public class TrustUtils {
@ -20,18 +22,22 @@ public class TrustUtils {
private static DedupConfig dedupConfig;
private static SparkDedupConfig sparkDedupConfig;
private static SparkDeduper deduper;
private static final ObjectMapper mapper;
static {
mapper = new ObjectMapper();
try {
dedupConfig = mapper
.readValue(
DedupConfig.class.getResourceAsStream("/eu/dnetlib/dhp/broker/oa/dedupConfig/dedupConfig.json"),
DedupConfig.class);
sparkDedupConfig = new SparkDedupConfig(dedupConfig, 1);
dedupConfig = DedupConfig
.load(
IOUtils
.toString(
DedupConfig.class
.getResourceAsStream("/eu/dnetlib/dhp/broker/oa/dedupConfig/dedupConfig.json"),
StandardCharsets.UTF_8));
deduper = new SparkDeduper(dedupConfig);
} catch (final IOException e) {
log.error("Error loading dedupConfig, e");
}
@ -47,8 +53,8 @@ public class TrustUtils {
}
try {
final Row doc1 = sparkDedupConfig.rowFromJson(mapper.writeValueAsString(r2));
final Row doc2 = sparkDedupConfig.rowFromJson(mapper.writeValueAsString(r2));
final Row doc1 = deduper.model().rowFromJson(mapper.writeValueAsString(r1));
final Row doc2 = deduper.model().rowFromJson(mapper.writeValueAsString(r2));
final double score = new TreeProcessor(dedupConfig).computeScore(doc1, doc2);
@ -57,7 +63,7 @@ public class TrustUtils {
return TrustUtils.rescale(score, threshold);
} catch (final Exception e) {
log.error("Error computing score between results", e);
return BrokerConstants.MIN_TRUST;
throw new RuntimeException(e);
}
}
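
The rewritten block computes a pairwise similarity with TreeProcessor and then rescales it against the dedup threshold. The actual rescale implementation is not part of this diff; the sketch below only illustrates one plausible linear mapping, with the MIN_TRUST and MAX_TRUST bounds assumed:

object RescaleSketch {
  val MIN_TRUST = 0.0 // assumed lower bound, standing in for BrokerConstants.MIN_TRUST
  val MAX_TRUST = 1.0 // assumed upper bound

  // map scores in [threshold, 1] linearly onto [MIN_TRUST, MAX_TRUST]
  def rescale(score: Double, threshold: Double): Double =
    if (score >= 1.0) MAX_TRUST
    else if (score < threshold) MIN_TRUST
    else MIN_TRUST + (score - threshold) * (MAX_TRUST - MIN_TRUST) / (1.0 - threshold)

  def main(args: Array[String]): Unit =
    println(rescale(0.95, 0.9)) // 0.5 with the assumed bounds
}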

View File

@ -83,7 +83,7 @@ public class SimpleVariableJobTest {
final long n = spark
.createDataset(inputList, Encoders.STRING())
.filter((FilterFunction<String>) s -> filter(map.get(s)))
.filter((FilterFunction<String>) s -> filter(map.get(s)))
.map((MapFunction<String, String>) String::toLowerCase, Encoders.STRING())
.count();
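
The filter above, applied to a Dataset of strings, captures the driver-side map variable inside its closure. A standalone sketch of the same pattern (session setup and values are illustrative, not taken from the test):

import org.apache.spark.sql.SparkSession

object CapturedMapFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("captured-map").getOrCreate()
    import spark.implicits._
    val map = Map("a" -> 1, "b" -> 2) // captured by the closure and shipped to executors
    val n = spark.createDataset(Seq("a", "b", "c"))
      .filter(s => map.getOrElse(s, 0) > 1) // analogous to filter(map.get(s)) in the test
      .count()
    println(n) // 1
    spark.stop()
  }
}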

View File

@ -41,54 +41,18 @@
</build>
<dependencyManagement>
<dependencies>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-bom</artifactId>
<version>1.16.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-common</artifactId>
<version>${project.version}</version>
<exclusions>
<exclusion>
<artifactId>log4j</artifactId>
<groupId>log4j</groupId>
</exclusion>
<exclusion>
<artifactId>annotations</artifactId>
<groupId>org.jetbrains</groupId>
</exclusion>
<exclusion>
<artifactId>slf4j-api</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-pace-core</artifactId>
<version>${project.version}</version>
<exclusions>
<exclusion>
<artifactId>jsr305</artifactId>
<groupId>com.google.code.findbugs</groupId>
</exclusion>
<exclusion>
<artifactId>javassist</artifactId>
<groupId>org.javassist</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
@ -126,17 +90,14 @@
<groupId>com.arakelian</groupId>
<artifactId>java-jq</artifactId>
</dependency>
<dependency>
<groupId>dom4j</groupId>
<artifactId>dom4j</artifactId>
</dependency>
<dependency>
<groupId>jaxen</groupId>
<artifactId>jaxen</artifactId>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
@ -155,7 +116,6 @@
<version>1.4.200</version>
<scope>test</scope>
</dependency>
</dependencies>

View File

@ -1,103 +0,0 @@
package eu.dnetlib.dhp.oa.dedup
import eu.dnetlib.dhp.application.ArgumentApplicationParser
import eu.dnetlib.dhp.oa.dedup.dsl.{Clustering, Deduper}
import eu.dnetlib.dhp.oa.dedup.model.BlockStats
import eu.dnetlib.dhp.utils.ISLookupClientFactory
import eu.dnetlib.enabling.is.lookup.rmi.{ISLookUpException, ISLookUpService}
import eu.dnetlib.pace.model.{RowDataOrderingComparator, SparkDedupConfig}
import org.apache.commons.io.IOUtils
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.DataTypes
import org.dom4j.DocumentException
import org.slf4j.LoggerFactory
import org.xml.sax.SAXException
import java.io.IOException
import java.util.stream.Collectors
object DSLExample {
private val log = LoggerFactory.getLogger(classOf[DSLExample])
@throws[Exception]
def main(args: Array[String]): Unit = {
val parser = new ArgumentApplicationParser(
IOUtils
.toString(classOf[DSLExample].getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/createBlockStats_parameters.json"))
)
parser.parseArgument(args)
val conf = new SparkConf
new DSLExample(parser, AbstractSparkAction.getSparkSession(conf)).run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")))
}
}
class DSLExample(parser: ArgumentApplicationParser, spark: SparkSession) extends AbstractSparkAction(parser, spark) {
def computeComparisons(blockSize: Long, slidingWindowSize: Long): Long =
if (slidingWindowSize >= blockSize) (slidingWindowSize * (slidingWindowSize - 1)) / 2
else (blockSize - slidingWindowSize + 1) * (slidingWindowSize * (slidingWindowSize - 1)) / 2
@throws[DocumentException]
@throws[IOException]
@throws[ISLookUpException]
@throws[SAXException]
override def run(isLookUpService: ISLookUpService): Unit = {
// read oozie parameters
val graphBasePath = parser.get("graphBasePath")
val isLookUpUrl = parser.get("isLookUpUrl")
val actionSetId = parser.get("actionSetId")
val workingPath = parser.get("workingPath")
val numPartitions : Int = Option(parser.get("numPartitions")).map(_.toInt).getOrElse(AbstractSparkAction.NUM_PARTITIONS)
DSLExample.log.info("graphBasePath: '{}'", graphBasePath)
DSLExample.log.info("isLookUpUrl: '{}'", isLookUpUrl)
DSLExample.log.info("actionSetId: '{}'", actionSetId)
DSLExample.log.info("workingPath: '{}'", workingPath)
// for each dedup configuration
import scala.collection.JavaConversions._
for (dedupConf <- getConfigurations(isLookUpService, actionSetId).subList(0, 1)) {
val subEntity = dedupConf.getWf.getSubEntityValue
DSLExample.log.info("Creating blockstats for: '{}'", subEntity)
val outputPath = DedupUtility.createBlockStatsPath(workingPath, actionSetId, subEntity)
AbstractSparkAction.removeOutputDir(spark, outputPath)
val sparkConfig = SparkDedupConfig(dedupConf, numPartitions)
val inputDF = spark.read
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.transform(sparkConfig.modelExtractor)
val simRels = inputDF
.transform(sparkConfig.generateClusters)
.filter(functions.size(new Column("block")).geq(new Literal(1, DataTypes.IntegerType)))
val deduper = Deduper(inputDF.schema)
.withClustering( Clustering("sortedngrampairs"),
Clustering("sortedngrampairs", Seq("legalname"), Map("max" -> 2, "ngramLen" -> 3)),
Clustering("suffixprefix", Seq("legalname"), Map("max" -> 1, "len" -> 3)),
Clustering("urlclustering", Seq("websiteurl")),
Clustering("keywordsclustering", Seq("fields"), Map("max" -> 2, "windowSize" -> 4))
)
simRels
.map[BlockStats](
(b:Row) => {
val documents = b.getList(1)
val mapDocuments = documents.stream
.sorted(new RowDataOrderingComparator(sparkConfig.orderingFieldPosition))
.limit(dedupConf.getWf.getQueueMaxSize)
.collect(Collectors.toList)
new BlockStats(
b.getString(0),
mapDocuments.size.toLong,
computeComparisons(mapDocuments.size.toLong, dedupConf.getWf.getSlidingWindowSize.toLong)
)
})(Encoders.bean[BlockStats](classOf[BlockStats]))
.write
.mode(SaveMode.Overwrite)
.save(outputPath)
}
}
}
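
The computeComparisons helper above bounds the number of candidate pairs produced per block under the sliding-window strategy. A quick standalone check of the formula, with sizes chosen only for illustration:

object ComputeComparisonsCheck {
  def computeComparisons(blockSize: Long, slidingWindowSize: Long): Long =
    if (slidingWindowSize >= blockSize) (slidingWindowSize * (slidingWindowSize - 1)) / 2
    else (blockSize - slidingWindowSize + 1) * (slidingWindowSize * (slidingWindowSize - 1)) / 2

  def main(args: Array[String]): Unit = {
    println(computeComparisons(100, 10)) // (100 - 10 + 1) * (10 * 9) / 2 = 91 * 45 = 4095
    println(computeComparisons(5, 10))   // window covers the whole block: (10 * 9) / 2 = 45
  }
}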

View File

@ -110,6 +110,10 @@ public class DedupRecordFactory {
// set authors and date
if (ModelSupport.isSubClass(entity, Result.class)) {
Optional
.ofNullable(((Result) entity).getAuthor())
.ifPresent(a -> authors.add(a));
((Result) entity).setAuthor(AuthorMerger.merge(authors));
}

View File

@ -3,12 +3,8 @@ package eu.dnetlib.dhp.oa.dedup;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.spark.SparkContext;
import org.apache.spark.util.LongAccumulator;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;

View File

@ -3,17 +3,13 @@ package eu.dnetlib.dhp.oa.dedup;
import java.io.IOException;
import java.util.Collection;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.catalyst.expressions.Literal;
import org.apache.spark.sql.types.DataTypes;
import org.dom4j.DocumentException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -25,8 +21,7 @@ import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.model.RowDataOrderingComparator;
import eu.dnetlib.pace.model.SparkDedupConfig;
import eu.dnetlib.pace.model.SparkDeduper;
public class SparkBlockStats extends AbstractSparkAction {
@ -90,27 +85,28 @@ public class SparkBlockStats extends AbstractSparkAction {
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);
SparkDeduper deduper = new SparkDeduper(dedupConf);
Dataset<Row> inputDF = spark
Dataset<Row> simRels = spark
.read()
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.transform(sparkConfig.modelExtractor());
Dataset<Row> simRels = inputDF
.transform(sparkConfig.generateClusters())
.filter(functions.size(new Column("block")).geq(new Literal(1, DataTypes.IntegerType)));
.transform(deduper.model().parseJsonDataset())
.transform(deduper.filterAndCleanup())
.transform(deduper.generateClustersWithCollect())
.filter(functions.size(new Column("block")).geq(1));
simRels.map((MapFunction<Row, BlockStats>) b -> {
Collection<Row> documents = b.getList(1);
simRels.map((MapFunction<Row, BlockStats>) row -> {
Collection<Row> mapDocuments = row.getList(row.fieldIndex("block"));
List<Row> mapDocuments = documents
.stream()
.sorted(new RowDataOrderingComparator(sparkConfig.orderingFieldPosition()))
.limit(dedupConf.getWf().getQueueMaxSize())
.collect(Collectors.toList());
/*
* List<Row> mapDocuments = documents .stream() .sorted( new
* RowDataOrderingComparator(deduper.model().orderingFieldPosition(),
* deduper.model().identityFieldPosition())) .limit(dedupConf.getWf().getQueueMaxSize())
* .collect(Collectors.toList());
*/
return new BlockStats(
b.getString(0),
row.getString(row.fieldIndex("key")),
(long) mapDocuments.size(),
computeComparisons(
(long) mapDocuments.size(), (long) dedupConf.getWf().getSlidingWindowSize()));

View File

@ -0,0 +1,78 @@
package eu.dnetlib.dhp.oa.dedup
import eu.dnetlib.dhp.application.ArgumentApplicationParser
import eu.dnetlib.dhp.common.HdfsSupport
import eu.dnetlib.dhp.schema.oaf.Relation
import eu.dnetlib.dhp.utils.ISLookupClientFactory
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService
import org.apache.commons.io.IOUtils
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.slf4j.LoggerFactory
object SparkCleanRelation {
private val log = LoggerFactory.getLogger(classOf[SparkCleanRelation])
@throws[Exception]
def main(args: Array[String]): Unit = {
val parser = new ArgumentApplicationParser(
IOUtils.toString(
classOf[SparkCleanRelation].getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/cleanRelation_parameters.json")
)
)
parser.parseArgument(args)
val conf = new SparkConf
new SparkCleanRelation(parser, AbstractSparkAction.getSparkSession(conf))
.run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")))
}
}
class SparkCleanRelation(parser: ArgumentApplicationParser, spark: SparkSession)
extends AbstractSparkAction(parser, spark) {
override def run(isLookUpService: ISLookUpService): Unit = {
val graphBasePath = parser.get("graphBasePath")
val inputPath = parser.get("inputPath")
val outputPath = parser.get("outputPath")
SparkCleanRelation.log.info("graphBasePath: '{}'", graphBasePath)
SparkCleanRelation.log.info("inputPath: '{}'", inputPath)
SparkCleanRelation.log.info("outputPath: '{}'", outputPath)
AbstractSparkAction.removeOutputDir(spark, outputPath)
val entities =
Seq("datasource", "project", "organization", "publication", "dataset", "software", "otherresearchproduct")
val idsSchema = StructType.fromDDL("`id` STRING, `dataInfo` STRUCT<`deletedbyinference`:BOOLEAN,`invisible`:BOOLEAN>")
val emptyIds = spark.createDataFrame(spark.sparkContext.emptyRDD[Row].setName("empty"),
idsSchema)
val ids = entities
.foldLeft(emptyIds)((ds, entity) => {
val entityPath = graphBasePath + '/' + entity
if (HdfsSupport.exists(entityPath, spark.sparkContext.hadoopConfiguration)) {
ds.union(spark.read.schema(idsSchema).json(entityPath))
} else {
ds
}
})
.filter("dataInfo.deletedbyinference != true AND dataInfo.invisible != true")
.select("id")
.distinct()
val relations = spark.read.schema(Encoders.bean(classOf[Relation]).schema).json(inputPath)
.filter("dataInfo.deletedbyinference != true AND dataInfo.invisible != true")
AbstractSparkAction.save(
relations
.join(ids, col("source") === ids("id"), "leftsemi")
.join(ids, col("target") === ids("id"), "leftsemi"),
outputPath,
SaveMode.Overwrite
)
}
}
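
SparkCleanRelation keeps a relation only when both its source and target ids survive in some entity table, which is exactly what the two leftsemi joins express. A toy standalone sketch of that pruning (the ids and relations are made-up values):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object LeftSemiPruneSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("leftsemi-prune").getOrCreate()
    import spark.implicits._
    val ids = Seq("id1", "id2").toDF("id")
    val rels = Seq(("id1", "id2"), ("id1", "id9"), ("id9", "id2")).toDF("source", "target")
    val kept = rels
      .join(ids, col("source") === ids("id"), "leftsemi") // source must exist
      .join(ids, col("target") === ids("id"), "leftsemi") // target must exist
    kept.show() // only (id1, id2) survives; dangling relations are dropped
    spark.stop()
  }
}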

View File

@ -23,7 +23,7 @@ import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.model.SparkDedupConfig;
import eu.dnetlib.pace.model.SparkDeduper;
public class SparkCreateSimRels extends AbstractSparkAction {
@ -84,20 +84,14 @@ public class SparkCreateSimRels extends AbstractSparkAction {
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);
spark.udf().register("collect_sort_slice", sparkConfig.collectSortSliceUDAF());
SparkDeduper deduper = new SparkDeduper(dedupConf);
Dataset<?> simRels = spark
.read()
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.transform(sparkConfig.modelExtractor()) // Extract fields from input json column according to model
// definition
.transform(sparkConfig.generateClustersWithWindows()) // generate <key,block> pairs according to
// filters, clusters, and model
// definition
.transform(sparkConfig.processClusters()) // process blocks and emits <from,to> pairs of found
// similarities
.transform(deduper.model().parseJsonDataset())
.transform(deduper.dedup())
.distinct()
.map(
(MapFunction<Row, Relation>) t -> DedupUtility
.createSimRel(t.getStruct(0).getString(0), t.getStruct(0).getString(1), entity),

View File

@ -3,13 +3,18 @@ package eu.dnetlib.dhp.oa.dedup;
import static org.apache.spark.sql.functions.col;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.Objects;
import org.apache.commons.beanutils.BeanUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.ReduceFunction;
import org.apache.spark.sql.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -28,9 +33,9 @@ public class SparkPropagateRelation extends AbstractSparkAction {
private static final Logger log = LoggerFactory.getLogger(SparkPropagateRelation.class);
enum FieldType {
SOURCE, TARGET
}
private static Encoder<Relation> REL_BEAN_ENC = Encoders.bean(Relation.class);
private static Encoder<Relation> REL_KRYO_ENC = Encoders.kryo(Relation.class);
public SparkPropagateRelation(ArgumentApplicationParser parser, SparkSession spark) {
super(parser, spark);
@ -71,38 +76,62 @@ public class SparkPropagateRelation extends AbstractSparkAction {
Dataset<Relation> mergeRels = spark
.read()
.load(DedupUtility.createMergeRelPath(workingPath, "*", "*"))
.as(Encoders.bean(Relation.class));
.as(REL_BEAN_ENC);
// <mergedObjectID, dedupID>
Dataset<Tuple2<String, String>> mergedIds = mergeRels
Dataset<Row> mergedIds = mergeRels
.where(col("relClass").equalTo(ModelConstants.MERGES))
.select(col("source"), col("target"))
.select(col("source").as("dedupID"), col("target").as("mergedObjectID"))
.distinct()
.map(
(MapFunction<Row, Tuple2<String, String>>) r -> new Tuple2<>(r.getString(1), r.getString(0)),
Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
.cache();
final String relationPath = DedupUtility.createEntityPath(graphBasePath, "relation");
Dataset<Row> allRels = spark
.read()
.schema(REL_BEAN_ENC.schema())
.json(DedupUtility.createEntityPath(graphBasePath, "relation"));
Dataset<Relation> rels = spark.read().textFile(relationPath).map(patchRelFn(), Encoders.bean(Relation.class));
Dataset<Relation> dedupedRels = allRels
.joinWith(mergedIds, allRels.col("source").equalTo(mergedIds.col("mergedObjectID")), "left_outer")
.joinWith(mergedIds, col("_1.target").equalTo(mergedIds.col("mergedObjectID")), "left_outer")
.select("_1._1", "_1._2.dedupID", "_2.dedupID")
.as(Encoders.tuple(REL_BEAN_ENC, Encoders.STRING(), Encoders.STRING()))
.flatMap(SparkPropagateRelation::addInferredRelations, REL_KRYO_ENC);
Dataset<Relation> newRels = createNewRels(rels, mergedIds, getFixRelFn());
Dataset<Relation> processedRelations = distinctRelations(
dedupedRels.union(mergeRels.map((MapFunction<Relation, Relation>) r -> r, REL_KRYO_ENC)))
.filter((FilterFunction<Relation>) r -> !Objects.equals(r.getSource(), r.getTarget()));
Dataset<Relation> updated = processDataset(
processDataset(rels, mergedIds, FieldType.SOURCE, getDeletedFn()),
mergedIds,
FieldType.TARGET,
getDeletedFn());
save(processedRelations, outputRelationPath, SaveMode.Overwrite);
}
save(
distinctRelations(
newRels
.union(updated)
.union(mergeRels)
.map((MapFunction<Relation, Relation>) r -> r, Encoders.kryo(Relation.class)))
.filter((FilterFunction<Relation>) r -> !Objects.equals(r.getSource(), r.getTarget())),
outputRelationPath, SaveMode.Overwrite);
private static Iterator<Relation> addInferredRelations(Tuple3<Relation, String, String> t) throws Exception {
Relation existingRel = t._1();
String newSource = t._2();
String newTarget = t._3();
if (newSource == null && newTarget == null) {
return Collections.singleton(t._1()).iterator();
}
// update existing relation
if (existingRel.getDataInfo() == null) {
existingRel.setDataInfo(new DataInfo());
}
existingRel.getDataInfo().setDeletedbyinference(true);
// Create new relation inferred by dedupIDs
Relation inferredRel = (Relation) BeanUtils.cloneBean(existingRel);
inferredRel.setDataInfo((DataInfo) BeanUtils.cloneBean(existingRel.getDataInfo()));
inferredRel.getDataInfo().setDeletedbyinference(false);
if (newSource != null)
inferredRel.setSource(newSource);
if (newTarget != null)
inferredRel.setTarget(newTarget);
return Arrays.asList(existingRel, inferredRel).iterator();
}
private Dataset<Relation> distinctRelations(Dataset<Relation> rels) {
@ -110,54 +139,13 @@ public class SparkPropagateRelation extends AbstractSparkAction {
.filter(getRelationFilterFunction())
.groupByKey(
(MapFunction<Relation, String>) r -> String
.join(r.getSource(), r.getTarget(), r.getRelType(), r.getSubRelType(), r.getRelClass()),
.join(" ", r.getSource(), r.getTarget(), r.getRelType(), r.getSubRelType(), r.getRelClass()),
Encoders.STRING())
.agg(new RelationAggregator().toColumn())
.map((MapFunction<Tuple2<String, Relation>, Relation>) Tuple2::_2, Encoders.bean(Relation.class));
}
// redirect the relations to the dedupID
private static Dataset<Relation> createNewRels(
Dataset<Relation> rels, // all the relations to be redirected
Dataset<Tuple2<String, String>> mergedIds, // merge rels: <mergedObjectID, dedupID>
MapFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>, Relation> mapRel) {
// <sourceID, relation, targetID>
Dataset<Tuple3<String, Relation, String>> mapped = rels
.map(
(MapFunction<Relation, Tuple3<String, Relation, String>>) r -> new Tuple3<>(getId(r, FieldType.SOURCE),
r, getId(r, FieldType.TARGET)),
Encoders.tuple(Encoders.STRING(), Encoders.kryo(Relation.class), Encoders.STRING()));
// < <sourceID, relation, target>, <sourceID, dedupID> >
Dataset<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>> relSource = mapped
.joinWith(mergedIds, mapped.col("_1").equalTo(mergedIds.col("_1")), "left_outer");
// < <<sourceID, relation, targetID>, <sourceID, dedupID>>, <targetID, dedupID> >
Dataset<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>> relSourceTarget = relSource
.joinWith(mergedIds, relSource.col("_1._3").equalTo(mergedIds.col("_1")), "left_outer");
return relSourceTarget
.filter(
(FilterFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>>) r -> r
._1()
._1() != null || r._2() != null)
.map(mapRel, Encoders.bean(Relation.class))
.distinct();
}
private static Dataset<Relation> processDataset(
Dataset<Relation> rels,
Dataset<Tuple2<String, String>> mergedIds,
FieldType type,
MapFunction<Tuple2<Tuple2<String, Relation>, Tuple2<String, String>>, Relation> mapFn) {
final Dataset<Tuple2<String, Relation>> mapped = rels
.map(
(MapFunction<Relation, Tuple2<String, Relation>>) r -> new Tuple2<>(getId(r, type), r),
Encoders.tuple(Encoders.STRING(), Encoders.kryo(Relation.class)));
return mapped
.joinWith(mergedIds, mapped.col("_1").equalTo(mergedIds.col("_1")), "left_outer")
.map(mapFn, Encoders.bean(Relation.class));
.reduceGroups((ReduceFunction<Relation>) (b, a) -> {
b.mergeFrom(a);
return b;
})
.map((MapFunction<Tuple2<String, Relation>, Relation>) Tuple2::_2, REL_BEAN_ENC);
}
private FilterFunction<Relation> getRelationFilterFunction() {
@ -167,52 +155,4 @@ public class SparkPropagateRelation extends AbstractSparkAction {
StringUtils.isNotBlank(r.getSubRelType()) ||
StringUtils.isNotBlank(r.getRelClass());
}
private static String getId(Relation r, FieldType type) {
switch (type) {
case SOURCE:
return r.getSource();
case TARGET:
return r.getTarget();
default:
throw new IllegalArgumentException("");
}
}
private static MapFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>, Relation> getFixRelFn() {
return value -> {
Relation r = value._1()._1()._2();
String newSource = value._1()._2() != null ? value._1()._2()._2() : null;
String newTarget = value._2() != null ? value._2()._2() : null;
if (r.getDataInfo() == null) {
r.setDataInfo(new DataInfo());
}
r.getDataInfo().setDeletedbyinference(false);
if (newSource != null)
r.setSource(newSource);
if (newTarget != null)
r.setTarget(newTarget);
return r;
};
}
private static MapFunction<Tuple2<Tuple2<String, Relation>, Tuple2<String, String>>, Relation> getDeletedFn() {
return value -> {
if (value._2() != null) {
Relation r = value._1()._2();
if (r.getDataInfo() == null) {
r.setDataInfo(new DataInfo());
}
r.getDataInfo().setDeletedbyinference(true);
return r;
}
return value._1()._2();
};
}
}
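
One detail worth calling out in distinctRelations: java.lang.String.join treats its first argument as the delimiter, so the previous grouping key String.join(r.getSource(), r.getTarget(), ...) silently used the source id as the separator between the remaining fields. A two-line demonstration:

object StringJoinKeyCheck {
  def main(args: Array[String]): Unit = {
    val (source, target, relClass) = ("s1", "t1", "cites")
    println(String.join(source, target, relClass))      // "t1s1cites": source became the delimiter
    println(String.join(" ", source, target, relClass)) // "s1 t1 cites": the intended grouping key
  }
}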

View File

@ -1,118 +0,0 @@
package eu.dnetlib.dhp.oa.dedup;
import java.io.IOException;
import java.util.Optional;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.*;
import org.dom4j.DocumentException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.xml.sax.SAXException;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.application.dedup.log.DedupLogModel;
import eu.dnetlib.dhp.application.dedup.log.DedupLogWriter;
import eu.dnetlib.dhp.schema.oaf.Relation;
import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
import eu.dnetlib.pace.config.DedupConfig;
import eu.dnetlib.pace.model.SparkDedupConfig;
public class SparkSimRelsAnalytics extends AbstractSparkAction {
private static final Logger log = LoggerFactory.getLogger(SparkSimRelsAnalytics.class);
public SparkSimRelsAnalytics(ArgumentApplicationParser parser, SparkSession spark) {
super(parser, spark);
spark.sparkContext().setLogLevel("WARN");
}
public static void main(String[] args) throws Exception {
ArgumentApplicationParser parser = new ArgumentApplicationParser(
IOUtils
.toString(
SparkSimRelsAnalytics.class
.getResourceAsStream(
"/eu/dnetlib/dhp/oa/dedup/createSimRels_parameters.json")));
parser.parseArgument(args);
SparkConf conf = new SparkConf();
new SparkSimRelsAnalytics(parser, getSparkSession(conf))
.run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")));
}
@Override
public void run(ISLookUpService isLookUpService)
throws DocumentException, IOException, ISLookUpException, SAXException {
// read oozie parameters
final String graphBasePath = parser.get("graphBasePath");
final String isLookUpUrl = parser.get("isLookUpUrl");
final String actionSetId = parser.get("actionSetId");
final String workingPath = parser.get("workingPath");
final int numPartitions = Optional
.ofNullable(parser.get("numPartitions"))
.map(Integer::valueOf)
.orElse(NUM_PARTITIONS);
log.info("numPartitions: '{}'", numPartitions);
log.info("graphBasePath: '{}'", graphBasePath);
log.info("isLookUpUrl: '{}'", isLookUpUrl);
log.info("actionSetId: '{}'", actionSetId);
log.info("workingPath: '{}'", workingPath);
final String dfLogPath = parser.get("dataframeLog");
final String runTag = Optional.ofNullable(parser.get("runTAG")).orElse("UNKNOWN");
// for each dedup configuration
for (DedupConfig dedupConf : getConfigurations(isLookUpService, actionSetId)) {
final long start = System.currentTimeMillis();
final String entity = dedupConf.getWf().getEntityType();
final String subEntity = dedupConf.getWf().getSubEntityValue();
log.info("Creating simrels for: '{}'", subEntity);
final String outputPath = DedupUtility.createSimRelPath(workingPath, actionSetId, subEntity);
removeOutputDir(spark, outputPath);
JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);
spark.udf().register("collect_sort_slice", sparkConfig.collectSortSliceUDAF());
Dataset<?> simRels = spark
.read()
.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
.transform(sparkConfig.modelExtractor()) // Extract fields from input json column according to model
// definition
.transform(sparkConfig.generateClustersWithWindows()) // generate <key,block> pairs according to
// filters, clusters, and model
// definition
.transform(sparkConfig.processClusters()) // process blocks and emits <from,to> pairs of found
// similarities
.map(
(MapFunction<Row, Relation>) t -> DedupUtility
.createSimRel(t.getStruct(0).getString(0), t.getStruct(0).getString(1), entity),
Encoders.bean(Relation.class));
saveParquet(simRels, outputPath, SaveMode.Overwrite);
final long end = System.currentTimeMillis();
if (StringUtils.isNotBlank(dfLogPath)) {
final DedupLogModel model = new DedupLogModel(runTag, dedupConf.toString(), subEntity, start, end,
end - start);
new DedupLogWriter(dfLogPath).appendLog(model, spark);
}
}
}
}

View File

@ -104,18 +104,6 @@ public class SparkWhitelistSimRels extends AbstractSparkAction {
.join(entities, whiteListRels1.col("to").equalTo(entities.col("id")), "inner")
.select("from", "to");
// Dataset<Tuple2<String, String>> whiteListRels1 = whiteListRels
// .joinWith(entities, whiteListRels.col("_1").equalTo(entities.col("_1")), "inner")
// .map(
// (MapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, String>>, Tuple2<String, String>>) Tuple2::_1,
// Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
//
// Dataset<Tuple2<String, String>> whiteListRels2 = whiteListRels1
// .joinWith(entities, whiteListRels1.col("_2").equalTo(entities.col("_1")), "inner")
// .map(
// (MapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, String>>, Tuple2<String, String>>) Tuple2::_1,
// Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
Dataset<Relation> whiteListSimRels = whiteListRels2
.map(
(MapFunction<Row, Relation>) r -> DedupUtility

View File

@ -1,15 +0,0 @@
package eu.dnetlib.dhp.oa.dedup.dsl
case class Clustering(name: String = "",
fields: Seq[String] = Seq(),
params: Map[String,Int] = Map()) {
def withName(name: String) : Clustering =
copy(name = name)
def withFields(fields: String*): Clustering =
copy(fields = fields)
def withParams(params: Map[String,Int]): Clustering =
copy(params = params)
}
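
For reference, the builder methods of this removed DSL compose as follows (a usage sketch against the case class above; the clustering name and params echo those in DSLExample earlier in this diff):

object ClusteringDslCheck {
  def main(args: Array[String]): Unit = {
    val byLegalName = Clustering("suffixprefix")
      .withFields("legalname")
      .withParams(Map("max" -> 1, "len" -> 3))
    // equivalent to Clustering("suffixprefix", Seq("legalname"), Map("max" -> 1, "len" -> 3))
    println(byLegalName)
  }
}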

View File

@ -1,11 +0,0 @@
package eu.dnetlib.dhp.oa.dedup.dsl
import org.apache.spark.sql.types.StructType
case class Deduper (schema: StructType,
clusterings: Seq[Clustering] = Seq()) {
def withClustering(clusterings: Clustering*) =
copy(clusterings = clusterings)
}

View File

@ -0,0 +1,20 @@
[
{
"paramName": "i",
"paramLongName": "graphBasePath",
"paramDescription": "the base path of raw graph",
"paramRequired": true
},
{
"paramName": "w",
"paramLongName": "inputPath",
"paramDescription": "the path to the input relation to cleanup",
"paramRequired": true
},
{
"paramName": "o",
"paramLongName": "outputPath",
"paramDescription": "the path of the output relation cleaned",
"paramRequired": true
}
]

View File

@ -15,4 +15,8 @@
<name>oozie.action.sharelib.for.spark</name>
<value>spark2</value>
</property>
<property>
<name>sparkExecutorMemoryOverhead</name>
<value>1G</value>
</property>
</configuration>

View File

@ -12,19 +12,26 @@
<name>graphOutputPath</name>
<description>path of the output graph</description>
</property>
<property>
<name>filterInvisible</name>
<description>whether to filter out invisible entities after the merge</description>
</property>
<property>
<name>sparkDriverMemory</name>
<description>memory for driver process</description>
<description>heap memory for driver process</description>
</property>
<property>
<name>sparkExecutorMemory</name>
<description>memory for individual executor</description>
<description>heap memory for individual executor</description>
</property>
<property>
<name>sparkExecutorMemoryOverhead</name>
<description>off-heap memory for individual executor</description>
</property>
<property>
<name>sparkExecutorCores</name>
<description>number of cores used by a single executor</description>
</property>
<property>
<name>oozieActionShareLibForSpark2</name>
<description>oozie action sharelib for spark 2.*</description>
@ -83,6 +90,7 @@
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-memory=${sparkExecutorMemory}
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
@ -92,9 +100,35 @@
--conf spark.sql.shuffle.partitions=15000
</spark-opts>
<arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
<arg>--o</arg><arg>${graphOutputPath}</arg>
<arg>--graphOutputPath</arg><arg>${workingPath}/propagaterelation/</arg>
<arg>--workingPath</arg><arg>${workingPath}</arg>
</spark>
<ok to="CleanRelation"/>
<error to="Kill"/>
</action>
<action name="CleanRelation">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Clean Relations</name>
<class>eu.dnetlib.dhp.oa.dedup.SparkCleanRelation</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-memory=${sparkExecutorMemory}
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=15000
</spark-opts>
<arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
<arg>--inputPath</arg><arg>${workingPath}/propagaterelation/relation</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/relation</arg>
</spark>
<ok to="group_entities"/>
<error to="Kill"/>
</action>
@ -107,8 +141,9 @@
<class>eu.dnetlib.dhp.oa.merge.GroupEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
@ -119,30 +154,21 @@
<arg>--graphInputPath</arg><arg>${graphBasePath}</arg>
<arg>--outputPath</arg><arg>${workingPath}/grouped_entities</arg>
</spark>
<ok to="fork_dispatch_entities"/>
<ok to="dispatch_entities"/>
<error to="Kill"/>
</action>
<fork name="fork_dispatch_entities">
<path start="dispatch_datasource"/>
<path start="dispatch_project"/>
<path start="dispatch_organization"/>
<path start="dispatch_publication"/>
<path start="dispatch_dataset"/>
<path start="dispatch_software"/>
<path start="dispatch_otherresearchproduct"/>
</fork>
<action name="dispatch_datasource">
<action name="dispatch_entities">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch publications</name>
<name>Dispatch grouped entities</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
--executor-cores=${sparkExecutorCores}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
@ -151,164 +177,12 @@
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/datasource</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Datasource</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}</arg>
<arg>--filterInvisible</arg><arg>${filterInvisible}</arg>
</spark>
<ok to="wait_dispatch"/>
<ok to="End"/>
<error to="Kill"/>
</action>
<action name="dispatch_project">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch project</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/project</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Project</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<action name="dispatch_organization">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch organization</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/organization</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Organization</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<action name="dispatch_publication">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch publication</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/publication</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<action name="dispatch_dataset">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch dataset</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/dataset</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Dataset</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<action name="dispatch_software">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch software</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/software</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Software</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<action name="dispatch_otherresearchproduct">
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>Dispatch otherresearchproduct</name>
<class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
<spark-opts>
--executor-cores=${sparkExecutorCores}
--executor-memory=${sparkExecutorMemory}
--driver-memory=${sparkDriverMemory}
--conf spark.extraListeners=${spark2ExtraListeners}
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=7680
</spark-opts>
<arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
<arg>--outputPath</arg><arg>${graphOutputPath}/otherresearchproduct</arg>
<arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.OtherResearchProduct</arg>
</spark>
<ok to="wait_dispatch"/>
<error to="Kill"/>
</action>
<join name="wait_dispatch" to="End"/>
<end name="End"/>
</workflow-app>
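
The fork of seven per-type dispatch actions collapses into the single dispatch_entities action above, parameterized by filterInvisible. The sketch below only illustrates the shape of such a single pass; the discriminator column, the paths, and the internals of DispatchEntitiesSparkJob are assumptions, not the actual implementation:

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

object DispatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dispatch-sketch").getOrCreate()
    val entities = Seq("datasource", "project", "organization", "publication",
      "dataset", "software", "otherresearchproduct")
    val filterInvisible = true
    val grouped = spark.read.json("/tmp/grouped_entities") // placeholder input path
    entities.foreach { entity =>
      val perType = grouped.filter(col("entityType") === entity) // assumed discriminator column
      val out = if (filterInvisible) perType.filter(col("dataInfo.invisible") =!= true) else perType
      out.write.mode(SaveMode.Overwrite).json(s"/tmp/graph/$entity") // placeholder output path
    }
    spark.stop()
  }
}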

View File

@ -13,10 +13,6 @@
</property>
<property>
<name>oozie.action.sharelib.for.spark</name>
<value>spark342</value>
</property>
<property>
<name>oozie.launcher.mapreduce.user.classpath.first</name>
<value>true</value>
<value>spark2</value>
</property>
</configuration>

View File

@ -126,25 +126,15 @@
--conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
--conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
--conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
--conf spark.sql.shuffle.partitions=5000
--conf spark.driver.extraJavaOptions="-Xss256k"
--conf spark.executor.extraJavaOptions="-Dlog4j.configuration=spark-log4j.properties -Xss256k"
--conf spark.extraListeners=
--conf spark.sql.queryExecutionListeners=
--conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecutors=100 --conf spark.dynamicAllocation.shuffleTracking.enabled=true
--conf spark.network.io.preferDirectBufs=true --conf spark.memory.fraction=0.4 --conf spark.sql.adaptive.coalescePartitions.minPartitionNum=5000
--conf spark.shuffle.useOldFetchProtocol=true --conf spark.shuffle.service.enabled=true --conf spark.eventLog.enabled=true
--conf spark.executor.heartbeatInterval=60s
--conf spark.network.timeout=640s
--conf spark.sql.legacy.allowUntypedScalaUDF=true
--conf spark.sql.shuffle.partitions=15000
</spark-opts>
<arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
<arg>--isLookUpUrl</arg><arg>${isLookUpUrl}</arg>
<arg>--actionSetId</arg><arg>${actionSetId}</arg>
<arg>--workingPath</arg><arg>${workingPath}</arg>
<arg>--numPartitions</arg><arg>5000</arg>
<arg>--numPartitions</arg><arg>15000</arg>
</spark>
<ok to="End"/>
<ok to="WhitelistSimRels"/>
<error to="Kill"/>
</action>

View File

@ -9,7 +9,8 @@ import java.io.IOException;
import java.io.Serializable;
import java.lang.reflect.InvocationTargetException;
import java.nio.file.Paths;
import java.util.*;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.codehaus.jackson.map.ObjectMapper;
@ -17,7 +18,10 @@ import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import eu.dnetlib.dhp.oa.merge.AuthorMerger;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.DataInfo;
import eu.dnetlib.dhp.schema.oaf.Publication;
import eu.dnetlib.dhp.schema.oaf.Software;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.pace.util.MapDocumentUtil;
import scala.Tuple2;

View File

@ -1,125 +0,0 @@
package eu.dnetlib.dhp.oa.dedup;
import static java.nio.file.Files.createTempDirectory;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.lenient;
import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.net.URISyntaxException;
import java.nio.file.Paths;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.Mock;
import org.mockito.Mockito;
import org.mockito.junit.jupiter.MockitoExtension;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
@ExtendWith(MockitoExtension.class)
public class SparkDSLExampleTest implements Serializable {
@Mock(serializable = true)
ISLookUpService isLookUpService;
private static SparkSession spark;
private static JavaSparkContext jsc;
private static String testGraphBasePath;
private static String testOutputBasePath;
private static final String testActionSetId = "test-orchestrator";
@BeforeAll
public static void beforeAll() throws IOException, URISyntaxException {
testGraphBasePath = Paths
.get(SparkDedupTest.class.getResource("/eu/dnetlib/dhp/dedup/entities").toURI())
.toFile()
.getAbsolutePath();
testOutputBasePath = createTempDirectory(SparkDedupTest.class.getSimpleName() + "-")
.toAbsolutePath()
.toString();
FileUtils.deleteDirectory(new File(testOutputBasePath));
final SparkConf conf = new SparkConf();
conf.set("spark.sql.shuffle.partitions", "200");
spark = SparkSession
.builder()
.appName(SparkDedupTest.class.getSimpleName())
.master("local[*]")
.config(conf)
.getOrCreate();
jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
}
@BeforeEach
public void setUp() throws IOException, ISLookUpException {
lenient()
.when(isLookUpService.getResourceProfileByQuery(Mockito.contains(testActionSetId)))
.thenReturn(
IOUtils
.toString(
SparkDSLExampleTest.class
.getResourceAsStream(
"/eu/dnetlib/dhp/dedup/profiles/mock_orchestrator.xml")));
lenient()
.when(isLookUpService.getResourceProfileByQuery(Mockito.contains("organization")))
.thenReturn(
IOUtils
.toString(
SparkDSLExampleTest.class
.getResourceAsStream(
"/eu/dnetlib/dhp/dedup/conf/org.curr.conf.json")));
}
@Test
void createBlockStatsTest() throws Exception {
ArgumentApplicationParser parser = new ArgumentApplicationParser(
IOUtils
.toString(
SparkDSLExampleTest.class
.getResourceAsStream(
"/eu/dnetlib/dhp/oa/dedup/createBlockStats_parameters.json")));
parser
.parseArgument(
new String[] {
"-i", testGraphBasePath,
"-asi", testActionSetId,
"-la", "lookupurl",
"-w", testOutputBasePath
});
new DSLExample(parser, spark).run(isLookUpService);
long orgs_blocks = spark
.read()
.textFile(testOutputBasePath + "/" + testActionSetId + "/organization_blockstats")
.count();
assertEquals(480, orgs_blocks);
}
@AfterAll
public static void tearDown() {
spark.close();
}
}

Some files were not shown because too many files have changed in this diff.