minor fix

erge branch 'dedup-with-dataframe-spark34' of code-repo.d4science.org:D-Net/dnet-hadoop into dedup-with-dataframe-spark34
improved scholix generation using bean
2023-08-02 12:12:20 +02:00 · 2023-07-19 16:55:19 +02:00 · 2023-07-19 16:53:28 +02:00 · 2023-07-17 17:01:07 +02:00 · 2023-07-14 16:05:14 +02:00 · 2023-07-12 10:32:25 +02:00
162 changed files with 2555 additions and 7821 deletions
--- a/.gitignore
+++ b/.gitignore
@ -26,4 +26,3 @@ spark-warehouse
 /**/*.log
 /**/.factorypath
 /**/.scalafmt.conf
-/.java-version
--- a/README.md
+++ b/README.md
@ -1,128 +1,2 @@
 # dnet-hadoop
-
-Dnet-hadoop is the project that defined all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing, provisioning.
-
-How to build, package and run oozie workflows
-====================
-
-Oozie-installer is a utility allowing building, uploading and running oozie workflows. In practice, it creates a `*.tar.gz` 
-package that contains resources that define a workflow and some helper scripts.
-
-This module is automatically executed when running:
-
-`mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app`
-
-on module having set:
-
-```
-<parent>
-    <groupId>eu.dnetlib.dhp</groupId>
-    <artifactId>dhp-workflows</artifactId>
-</parent>
-```
-
-in `pom.xml` file. `oozie-package` profile initializes oozie workflow packaging, `workflow.source.dir` property points to 
-a workflow (notice: this is not a relative path but a classpath to directory usually holding `oozie_app` subdirectory).
-
-The outcome of this packaging is `oozie-package.tar.gz` file containing inside all the resources required to run Oozie workflow:
-
- jar packages
- workflow definitions
- job properties
- maintenance scripts
-
-Required properties
-====================
-
-In order to include proper workflow within package, `workflow.source.dir` property has to be set. It could be provided 
-by setting `-Dworkflow.source.dir=some/job/dir` maven parameter.
-
-In oder to define full set of cluster environment properties one should create `~/.dhp/application.properties` file with 
-the following properties:
-
- `dhp.hadoop.frontend.user.name` - your user name on hadoop cluster and frontend machine
- `dhp.hadoop.frontend.host.name` - frontend host name
- `dhp.hadoop.frontend.temp.dir` - frontend directory for temporary files
- `dhp.hadoop.frontend.port.ssh` - frontend machine ssh port
- `oozieServiceLoc` - oozie service location required by run_workflow.sh script executing oozie job
- `nameNode` - name node address
- `jobTracker` - job tracker address
- `oozie.execution.log.file.location` - location of file that will be created when executing oozie job, it contains output 
-produced by `run_workflow.sh` script (needed to obtain oozie job id)
- `maven.executable` - mvn command location, requires parameterization due to a different setup of CI cluster
- `sparkDriverMemory` - amount of memory assigned to spark jobs driver
- `sparkExecutorMemory` - amount of memory assigned to spark jobs executors
- `sparkExecutorCores` - number of cores assigned to spark jobs executors
-
-All values will be overriden with the ones from `job.properties` and eventually `job-override.properties` stored in module's 
-main folder.
-
-When overriding properties from `job.properties`, `job-override.properties` file can be created in main module directory 
-(the one containing `pom.xml` file) and define all new properties which will override existing properties. 
-One can provide those properties one by one as command line `-D` arguments.
-
-Properties overriding order is the following:
-
-1. `pom.xml` defined properties (located in the project root dir)
-2. `~/.dhp/application.properties` defined properties
-3. `${workflow.source.dir}/job.properties`
-4. `job-override.properties` (located in the project root dir)
-5. `maven -Dparam=value`
-
-where the maven `-Dparam` property is overriding all the other ones.
-
-Workflow definition requirements
-====================
-
-`workflow.source.dir` property should point to the following directory structure:
-
-	[${workflow.source.dir}]
-		|
-		|-job.properties (optional)
-		|
-		\-[oozie_app]
-			|
-			\-workflow.xml
-
-This property can be set using maven `-D` switch.
-
-`[oozie_app]` is the default directory name however it can be set to any value as soon as `oozieAppDir` property is 
-provided with directory name as value.
-
-Sub-workflows are supported as well and sub-workflow directories should be nested within `[oozie_app]` directory.
-
-Creating oozie installer step-by-step
-=====================================
-
-Automated oozie-installer steps are the following:
-
-1. creating jar packages:  `*.jar` and `*tests.jar` along with copying all dependencies in `target/dependencies`
-2. reading properties from maven, `~/.dhp/application.properties`, `job.properties`, `job-override.properties`
-3. invoking priming mechanism linking resources from import.txt file (currently resolving subworkflow resources)
-4. assembling shell scripts for preparing Hadoop filesystem, uploading Oozie application and starting workflow
-5. copying whole `${workflow.source.dir}` content to `target/${oozie.package.file.name}`
-6. generating updated `job.properties` file in `target/${oozie.package.file.name}` based on maven, 
-`~/.dhp/application.properties`, `job.properties` and `job-override.properties`
-7. creating `lib` directory (or multiple directories for sub-workflows for each nested directory) and copying jar packages 
-created at step (1) to each one of them
-8. bundling whole `${oozie.package.file.name}` directory into single tar.gz package
-
-Uploading oozie package and running workflow on cluster
-=======================================================
-
-In order to simplify deployment and execution process two dedicated profiles were introduced:
-
- `deploy`
- `run`
-
-to be used along with `oozie-package` profile e.g. by providing `-Poozie-package,deploy,run` maven parameters.
-
-The `deploy` profile supplements packaging process with:
-1) uploading oozie-package via scp to `/home/${user.name}/oozie-packages` directory on `${dhp.hadoop.frontend.host.name}` machine
-2) extracting uploaded package
-3) uploading oozie content to hadoop cluster HDFS location defined in `oozie.wf.application.path` property (generated dynamically by maven build process, based on `${dhp.hadoop.frontend.user.name}` and `workflow.source.dir` properties)
-
-The `run` profile introduces:
-1) executing oozie application uploaded to HDFS cluster using `deploy` command. Triggers `run_workflow.sh` script providing runtime properties defined in `job.properties` file.
-
-Notice: ssh access to frontend machine has to be configured on system level and it is preferable to set key-based authentication in order to simplify remote operations.
+Dnet-hadoop is the project that defined all the OOZIE workflows for the OpenAIRE Graph construction, processing, provisioning.
--- a/dhp-common/pom.xml
+++ b/dhp-common/pom.xml
@ -62,11 +62,6 @@
 	</build>

 	<dependencies>
-		<dependency>
-			<groupId>eu.dnetlib.dhp</groupId>
-			<artifactId>dhp-pace-core</artifactId>
-			<version>${project.version}</version>
-		</dependency>

 		<dependency>
 			<groupId>org.apache.hadoop</groupId>
@ -129,6 +124,12 @@
 		<dependency>
 			<groupId>eu.dnetlib</groupId>
 			<artifactId>cnr-rmi-api</artifactId>
+			<exclusions>
+				<exclusion>
+					<groupId>log4j</groupId>
+					<artifactId>log4j</artifactId>
+				</exclusion>
+			</exclusions>
 		</dependency>

 		<dependency>
@ -149,6 +150,11 @@
 			<artifactId>okhttp</artifactId>
 		</dependency>

+		<dependency>
+			<groupId>eu.dnetlib.dhp</groupId>
+			<artifactId>dhp-pace-core</artifactId>
+		</dependency>
+
 		<dependency>
 			<groupId>org.apache.httpcomponents</groupId>
 			<artifactId>httpclient</artifactId>
@ -161,7 +167,7 @@

 		<dependency>
 			<groupId>eu.dnetlib.dhp</groupId>
-			<artifactId>${dhp-schemas.artifact}</artifactId>
+			<artifactId>dhp-schemas_${scala.binary.version}</artifactId>
 		</dependency>

 		<dependency>
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/oa/merge/DispatchEntitiesSparkJob.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/oa/merge/DispatchEntitiesSparkJob.java
@ -11,18 +11,25 @@ import org.apache.commons.lang3.StringUtils;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.function.FilterFunction;
 import org.apache.spark.api.java.function.MapFunction;
-import org.apache.spark.sql.*;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.SparkSession;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

+import com.fasterxml.jackson.databind.ObjectMapper;
+
 import eu.dnetlib.dhp.application.ArgumentApplicationParser;
 import eu.dnetlib.dhp.common.HdfsSupport;
-import eu.dnetlib.dhp.schema.common.ModelSupport;
+import eu.dnetlib.dhp.schema.oaf.Oaf;
+import eu.dnetlib.dhp.schema.oaf.OafEntity;

 public class DispatchEntitiesSparkJob {

 	private static final Logger log = LoggerFactory.getLogger(DispatchEntitiesSparkJob.class);

+	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+
 	public static void main(String[] args) throws Exception {

 		String jsonConfiguration = IOUtils
@ -47,51 +54,44 @@ public class DispatchEntitiesSparkJob {
 		String outputPath = parser.get("outputPath");
 		log.info("outputPath: {}", outputPath);

-		boolean filterInvisible = Boolean.parseBoolean(parser.get("filterInvisible"));
-		log.info("filterInvisible: {}", filterInvisible);
+		String graphTableClassName = parser.get("graphTableClassName");
+		log.info("graphTableClassName: {}", graphTableClassName);
+
+		@SuppressWarnings("unchecked")
+		Class<? extends OafEntity> entityClazz = (Class<? extends OafEntity>) Class.forName(graphTableClassName);

 		SparkConf conf = new SparkConf();
 		runWithSparkSession(
 			conf,
 			isSparkSessionManaged,
-			spark -> dispatchEntities(spark, inputPath, outputPath, filterInvisible));
+			spark -> {
+				HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration());
+				dispatchEntities(spark, inputPath, entityClazz, outputPath);
+			});
 	}

-	private static void dispatchEntities(
+	private static <T extends Oaf> void dispatchEntities(
 		SparkSession spark,
 		String inputPath,
-		String outputPath,
-		boolean filterInvisible) {
+		Class<T> clazz,
+		String outputPath) {

-		Dataset<String> df = spark.read().textFile(inputPath);
-
-		ModelSupport.oafTypes.entrySet().parallelStream().forEach(entry -> {
-			String entityType = entry.getKey();
-			Class<?> clazz = entry.getValue();
-
-			final String entityPath = outputPath + "/" + entityType;
-			if (!entityType.equalsIgnoreCase("relation")) {
-				HdfsSupport.remove(entityPath, spark.sparkContext().hadoopConfiguration());
-				Dataset<Row> entityDF = spark
-					.read()
-					.schema(Encoders.bean(clazz).schema())
-					.json(
-						df
-							.filter((FilterFunction<String>) s -> s.startsWith(clazz.getName()))
-							.map(
-								(MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"),
-								Encoders.STRING()));
-
-				if (filterInvisible) {
-					entityDF = entityDF.filter("dataInfo.invisible != true");
-				}
-
-				entityDF
-					.write()
-					.mode(SaveMode.Overwrite)
-					.option("compression", "gzip")
-					.json(entityPath);
-			}
-		});
+		spark
+			.read()
+			.textFile(inputPath)
+			.filter((FilterFunction<String>) s -> isEntityType(s, clazz))
+			.map((MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"), Encoders.STRING())
+			.map(
+				(MapFunction<String, T>) value -> OBJECT_MAPPER.readValue(value, clazz),
+				Encoders.bean(clazz))
+			.write()
+			.mode(SaveMode.Overwrite)
+			.option("compression", "gzip")
+			.json(outputPath);
 	}
+
+	private static <T extends Oaf> boolean isEntityType(final String s, final Class<T> clazz) {
+		return StringUtils.substringBefore(s, "|").equals(clazz.getName());
+	}
+
 }
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/FundRefCleaningRule.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/FundRefCleaningRule.java
@ -6,16 +6,14 @@ import java.util.regex.Pattern;

 public class FundRefCleaningRule {

-	public static final Pattern PATTERN = Pattern.compile("\\d+");
+	public static String clean(final String fundrefId) {

-	public static String clean(final String fundRefId) {
-
-		String s = fundRefId
+		String s = fundrefId
 			.toLowerCase()
 			.replaceAll("\\s", "");

-		Matcher m = PATTERN.matcher(s);
-		if (m.find()) {
+		Matcher m = Pattern.compile("\\d+").matcher(s);
+		if (m.matches()) {
 			return m.group();
 		} else {
 			return "";
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/GraphCleaningFunctions.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/GraphCleaningFunctions.java
@ -13,7 +13,11 @@ import java.util.stream.Collectors;
 import java.util.stream.Stream;

 import org.apache.commons.lang3.StringUtils;
+import org.apache.spark.api.java.function.MapFunction;
+import org.apache.spark.sql.Encoders;

+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
 import com.github.sisyphsu.dateparser.DateParserUtils;
 import com.google.common.collect.Lists;
 import com.google.common.collect.Sets;
@ -35,7 +39,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 	public static final String TITLE_FILTER_REGEX = String.format("(%s)|\\W|\\d", TITLE_TEST);

 	public static final int TITLE_FILTER_RESIDUAL_LENGTH = 5;
-	private static final String NAME_CLEANING_REGEX = "[\\r\\n\\t\\s]+";

 	public static <T extends Oaf> T cleanContext(T value, String contextId, String verifyParam) {
 		if (ModelSupport.isSubClass(value, Result.class)) {
@ -225,7 +228,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 	}

 	public static <T extends Oaf> boolean filter(T value) {
-		if (!(value instanceof Relation) && (Boolean.TRUE
+		if (Boolean.TRUE
 			.equals(
 				Optional
 					.ofNullable(value)
@ -236,16 +239,15 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 								d -> Optional
 									.ofNullable(d.getInvisible())
 									.orElse(true))
-							.orElse(false))
-					.orElse(true)))) {
+							.orElse(true))
+					.orElse(true))) {
 			return true;
 		}

 		if (value instanceof Datasource) {
 			// nothing to evaluate here
 		} else if (value instanceof Project) {
-			final Project p = (Project) value;
-			return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
+			// nothing to evaluate here
 		} else if (value instanceof Organization) {
 			// nothing to evaluate here
 		} else if (value instanceof Relation) {
@ -292,13 +294,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 			} else if (value instanceof Result) {
 				Result r = (Result) value;

-				if (Objects.nonNull(r.getFulltext())
-					&& (ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
-						ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
-					r.setFulltext(null);
-
-				}
-
 				if (Objects.nonNull(r.getDateofacceptance())) {
 					Optional<String> date = cleanDateField(r.getDateofacceptance());
 					if (date.isPresent()) {
@ -323,18 +318,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 								.filter(sp -> StringUtils.isNotBlank(sp.getValue()))
 								.collect(Collectors.toList()));
 				}
-				if (Objects.nonNull(r.getPublisher())) {
-					if (StringUtils.isBlank(r.getPublisher().getValue())) {
-						r.setPublisher(null);
-					} else {
-						r
-							.getPublisher()
-							.setValue(
-								r
-									.getPublisher()
-									.getValue()
-									.replaceAll(NAME_CLEANING_REGEX, " "));
-					}
+				if (Objects.nonNull(r.getPublisher()) && StringUtils.isBlank(r.getPublisher().getValue())) {
+					r.setPublisher(null);
 				}
 				if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) {
 					r
@ -501,11 +486,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 								i.setDateofacceptance(null);
 							}
 						}
-						if (StringUtils.isNotBlank(i.getFulltext()) &&
-							(ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
-								ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
-							i.setFulltext(null);
-						}
 					}
 				}
 				if (Objects.isNull(r.getBestaccessright())
@ -530,7 +510,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 								.filter(Objects::nonNull)
 								.filter(a -> StringUtils.isNotBlank(a.getFullname()))
 								.filter(a -> StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")))
-								.map(GraphCleaningFunctions::cleanupAuthor)
 								.collect(Collectors.toList()));

 					boolean nullRank = r
@ -625,35 +604,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
 		return value;
 	}

-	private static Author cleanupAuthor(Author author) {
-		if (StringUtils.isNotBlank(author.getFullname())) {
-			author
-				.setFullname(
-					author
-						.getFullname()
-						.replaceAll(NAME_CLEANING_REGEX, " ")
-						.replace("\"", "\\\""));
-		}
-		if (StringUtils.isNotBlank(author.getName())) {
-			author
-				.setName(
-					author
-						.getName()
-						.replaceAll(NAME_CLEANING_REGEX, " ")
-						.replace("\"", "\\\""));
-		}
-		if (StringUtils.isNotBlank(author.getSurname())) {
-			author
-				.setSurname(
-					author
-						.getSurname()
-						.replaceAll(NAME_CLEANING_REGEX, " ")
-						.replace("\"", "\\\""));
-		}
-
-		return author;
-	}
-
 	private static Optional<String> cleanDateField(Field<String> dateofacceptance) {
 		return Optional
 			.ofNullable(dateofacceptance)
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/GridCleaningRule.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/GridCleaningRule.java
@ -6,19 +6,13 @@ import java.util.regex.Pattern;

 public class GridCleaningRule {

-	public static final Pattern PATTERN = Pattern.compile("(?<grid>\\d{4,6}\\.[0-9a-z]{1,2})");
-
 	public static String clean(String grid) {
 		String s = grid
 			.replaceAll("\\s", "")
 			.toLowerCase();

-		Matcher m = PATTERN.matcher(s);
-		if (m.find()) {
-			return "grid." + m.group("grid");
-		}
-
-		return "";
+		Matcher m = Pattern.compile("\\d{4,6}\\.[0-9a-z]{1,2}").matcher(s);
+		return m.matches() ? "grid." + m.group() : "";
 	}

 }
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/ISNICleaningRule.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/ISNICleaningRule.java
@ -7,12 +7,10 @@ import java.util.regex.Pattern;
 // https://www.wikidata.org/wiki/Property:P213
 public class ISNICleaningRule {

-	public static final Pattern PATTERN = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])");
-
 	public static String clean(final String isni) {

-		Matcher m = PATTERN.matcher(isni);
-		if (m.find()) {
+		Matcher m = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])").matcher(isni);
+		if (m.matches()) {
 			return String.join("", m.group(1), m.group(2), m.group(3), m.group(4));
 		} else {
 			return "";
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PICCleaningRule.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PICCleaningRule.java
@ -6,12 +6,10 @@ import java.util.regex.Pattern;

 public class PICCleaningRule {

-	public static final Pattern PATTERN = Pattern.compile("\\d{9}");
-
 	public static String clean(final String pic) {

-		Matcher m = PATTERN.matcher(pic);
-		if (m.find()) {
+		Matcher m = Pattern.compile("\\d{9}").matcher(pic);
+		if (m.matches()) {
 			return m.group();
 		} else {
 			return "";
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PmcCleaningRule.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PmcCleaningRule.java
@ -1,24 +1,13 @@

 package eu.dnetlib.dhp.schema.oaf.utils;

-import java.util.regex.Matcher;
-import java.util.regex.Pattern;
-
 public class PmcCleaningRule {

-	public static final Pattern PATTERN = Pattern.compile("PMC\\d{1,8}");
-
 	public static String clean(String pmc) {
 		String s = pmc
 			.replaceAll("\\s", "")
 			.toUpperCase();
-
-		final Matcher m = PATTERN.matcher(s);
-
-		if (m.find()) {
-			return m.group();
-		}
-		return "";
+		return s.matches("^PMC\\d{1,8}$") ? s : "";
 	}

 }
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PmidCleaningRule.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/PmidCleaningRule.java
@ -1,25 +1,16 @@

 package eu.dnetlib.dhp.schema.oaf.utils;

-import java.util.regex.Matcher;
-import java.util.regex.Pattern;
-
 // https://researchguides.stevens.edu/c.php?g=442331&p=6577176
 public class PmidCleaningRule {

-	public static final Pattern PATTERN = Pattern.compile("0*(\\d{1,8})");
-
 	public static String clean(String pmid) {
 		String s = pmid
 			.toLowerCase()
-			.replaceAll("\\s", "");
-
-		final Matcher m = PATTERN.matcher(s);
-
-		if (m.find()) {
-			return m.group(1);
-		}
-		return "";
+			.replaceAll("\\s", "")
+			.trim()
+			.replaceAll("^0+", "");
+		return s.matches("^\\d{1,8}$") ? s : "";
 	}

 }
--- a/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/RorCleaningRule.java
+++ b/dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/RorCleaningRule.java
@ -7,21 +7,12 @@ import java.util.regex.Pattern;
 // https://ror.readme.io/docs/ror-identifier-pattern
 public class RorCleaningRule {

-	public static final String ROR_PREFIX = "https://ror.org/";
-
-	private static final Pattern PATTERN = Pattern.compile("(?<ror>0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2})");
-
 	public static String clean(String ror) {
 		String s = ror
 			.replaceAll("\\s", "")
 			.toLowerCase();
-
-		Matcher m = PATTERN.matcher(s);
-
-		if (m.find()) {
-			return ROR_PREFIX + m.group("ror");
-		}
-		return "";
+		Matcher m = Pattern.compile("0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2}").matcher(s);
+		return m.matches() ? "https://ror.org/" + m.group() : "";
 	}

 }
--- a/dhp-common/src/main/resources/eu/dnetlib/dhp/oa/merge/dispatch_entities_parameters.json
+++ b/dhp-common/src/main/resources/eu/dnetlib/dhp/oa/merge/dispatch_entities_parameters.json
@ -18,9 +18,9 @@
    "paramRequired": true
  },
  {
-    "paramName": "fi",
-    "paramLongName": "filterInvisible",
-    "paramDescription": "if true filters out invisible entities",
+    "paramName": "c",
+    "paramLongName": "graphTableClassName",
+    "paramDescription": "the graph entity class name",
    "paramRequired": true
  }
 ]
--- a/dhp-common/src/main/scala/eu/dnetlib/dhp/sx/graph/scholix/ScholixUtils.scala
+++ b/dhp-common/src/main/scala/eu/dnetlib/dhp/sx/graph/scholix/ScholixUtils.scala
@ -50,13 +50,10 @@ object ScholixUtils extends Serializable {
    }
  }

-  def extractRelationDate(summary: ScholixSummary): String = {
+  def extractRelationDate(summary: ScholixResource): String = {
+
+    summary.getPublicationDate

-    if (summary.getDate == null || summary.getDate.isEmpty)
-      null
-    else {
-      summary.getDate.get(0)
-    }
  }

  def inverseRelationShip(rel: ScholixRelationship): ScholixRelationship = {
@ -144,11 +141,7 @@ object ScholixUtils extends Serializable {
    s.setRelationship(inverseRelationShip(scholix.getRelationship))
    s.setSource(scholix.getTarget)
    s.setTarget(scholix.getSource)
-    s.setIdentifier(
-      DHPUtils.md5(
-        s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
-      )
-    )
+    updateId(s)
    s

  }
@ -187,6 +180,21 @@ object ScholixUtils extends Serializable {
    } else List()
  }

+  def updateId(scholix: Scholix): Scholix = {
+    scholix.setIdentifier(
+      generateIdentifier(
+        scholix.getSource.getDnetIdentifier,
+        scholix.getTarget.getDnetIdentifier,
+        scholix.getRelationship.getName
+      )
+    )
+    scholix
+  }
+
+  def generateIdentifier(sourceId: String, targetId: String, relation: String): String = {
+    DHPUtils.md5(s"$sourceId::$relation::$targetId")
+  }
+
  def generateCompleteScholix(scholix: Scholix, target: ScholixSummary): Scholix = {
    val s = new Scholix
    s.setPublicationDate(scholix.getPublicationDate)
@ -195,11 +203,7 @@ object ScholixUtils extends Serializable {
    s.setRelationship(scholix.getRelationship)
    s.setSource(scholix.getSource)
    s.setTarget(generateScholixResourceFromSummary(target))
-    s.setIdentifier(
-      DHPUtils.md5(
-        s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
-      )
-    )
+    updateId(s)
    s
  }

@ -211,11 +215,7 @@ object ScholixUtils extends Serializable {
    s.setRelationship(scholix.getRelationship)
    s.setSource(scholix.getSource)
    s.setTarget(target)
-    s.setIdentifier(
-      DHPUtils.md5(
-        s"${s.getSource.getIdentifier}::${s.getRelationship.getName}::${s.getTarget.getIdentifier}"
-      )
-    )
+    updateId(s)
    s
  }

@ -232,7 +232,7 @@ object ScholixUtils extends Serializable {

    if (summaryObject.getAuthor != null && !summaryObject.getAuthor.isEmpty) {
      val l: List[ScholixEntityId] =
-        summaryObject.getAuthor.asScala.map(a => new ScholixEntityId(a, null)).toList
+        summaryObject.getAuthor.asScala.map(a => new ScholixEntityId(a, null)).take(100).toList
      if (l.nonEmpty)
        r.setCreator(l.asJava)
    }
@ -241,7 +241,7 @@ object ScholixUtils extends Serializable {
      r.setPublicationDate(summaryObject.getDate.get(0))
    if (summaryObject.getPublisher != null && !summaryObject.getPublisher.isEmpty) {
      val plist: List[ScholixEntityId] =
-        summaryObject.getPublisher.asScala.map(p => new ScholixEntityId(p, null)).toList
+        summaryObject.getPublisher.asScala.map(p => new ScholixEntityId(p, null)).take(100).toList

      if (plist.nonEmpty)
        r.setPublisher(plist.asJava)
@ -260,6 +260,7 @@ object ScholixUtils extends Serializable {
            "complete"
          )
        )
+        .take(100)
        .toList

      if (l.nonEmpty)
@ -269,38 +270,38 @@ object ScholixUtils extends Serializable {
    r
  }

+//  def scholixFromSource(relation: Relation, source: ScholixResource): Scholix = {
+//    if (relation == null || source == null)
+//      return null
+//    val s = new Scholix
+//    var l: List[ScholixEntityId] = extractCollectedFrom(relation)
+//    if (l.isEmpty)
+//      l = extractCollectedFrom(source)
+//    if (l.isEmpty)
+//      return null
+//    s.setLinkprovider(l.asJava)
+//    var d = extractRelationDate(relation)
+//    if (d == null)
+//      d = source.getPublicationDate
+//
+//    s.setPublicationDate(d)
+//
+//    if (source.getPublisher != null && !source.getPublisher.isEmpty) {
+//      s.setPublisher(source.getPublisher)
+//    }
+//
+//    val semanticRelation = relations.getOrElse(relation.getRelClass.toLowerCase, null)
+//    if (semanticRelation == null)
+//      return null
+//    s.setRelationship(
+//      new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
+//    )
+//    s.setSource(source)
+//
+//    s
+//  }
+
  def scholixFromSource(relation: Relation, source: ScholixResource): Scholix = {
-    if (relation == null || source == null)
-      return null
-    val s = new Scholix
-    var l: List[ScholixEntityId] = extractCollectedFrom(relation)
-    if (l.isEmpty)
-      l = extractCollectedFrom(source)
-    if (l.isEmpty)
-      return null
-    s.setLinkprovider(l.asJava)
-    var d = extractRelationDate(relation)
-    if (d == null)
-      d = source.getPublicationDate
-
-    s.setPublicationDate(d)
-
-    if (source.getPublisher != null && !source.getPublisher.isEmpty) {
-      s.setPublisher(source.getPublisher)
-    }
-
-    val semanticRelation = relations.getOrElse(relation.getRelClass.toLowerCase, null)
-    if (semanticRelation == null)
-      return null
-    s.setRelationship(
-      new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
-    )
-    s.setSource(source)
-
-    s
-  }
-
-  def scholixFromSource(relation: Relation, source: ScholixSummary): Scholix = {

    if (relation == null || source == null)
      return null
@ -322,11 +323,8 @@ object ScholixUtils extends Serializable {
    s.setPublicationDate(d)

    if (source.getPublisher != null && !source.getPublisher.isEmpty) {
-      val l: List[ScholixEntityId] = source.getPublisher.asScala
-        .map { p =>
-          new ScholixEntityId(p, null)
-        }(collection.breakOut)
-
+      source.getPublisher
+      val l: List[ScholixEntityId] = source.getPublisher.asScala.toList
      if (l.nonEmpty)
        s.setPublisher(l.asJava)
    }
@ -337,7 +335,7 @@ object ScholixUtils extends Serializable {
    s.setRelationship(
      new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse)
    )
-    s.setSource(generateScholixResourceFromSummary(source))
+    s.setSource(source)

    s
  }
--- a/dhp-common/src/test/java/eu/dnetlib/dhp/common/MdStoreClientTest.java
+++ b/dhp-common/src/test/java/eu/dnetlib/dhp/common/MdStoreClientTest.java
@ -15,7 +15,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;

 public class MdStoreClientTest {

-	// @Test
+	@Test
 	public void testMongoCollection() throws IOException {
 		final MdstoreClient client = new MdstoreClient("mongodb://localhost:27017", "mdstore");

--- a/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/GridCleaningRuleTest.java
+++ b/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/GridCleaningRuleTest.java
@ -1,18 +0,0 @@
-
-package eu.dnetlib.dhp.schema.oaf.utils;
-
-import static org.junit.jupiter.api.Assertions.assertEquals;
-
-import org.junit.jupiter.api.Test;
-
-class GridCleaningRuleTest {
-
-	@Test
-	void testCleaning() {
-		assertEquals("grid.493784.5", GridCleaningRule.clean("grid.493784.5"));
-		assertEquals("grid.493784.5x", GridCleaningRule.clean("grid.493784.5x"));
-		assertEquals("grid.493784.5x", GridCleaningRule.clean("493784.5x"));
-		assertEquals("", GridCleaningRule.clean("493x784.5x"));
-	}
-
-}
--- a/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/ISNICleaningRuleTest.java
+++ b/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/ISNICleaningRuleTest.java
@ -1,19 +0,0 @@
-
-package eu.dnetlib.dhp.schema.oaf.utils;
-
-import static org.junit.jupiter.api.Assertions.assertEquals;
-
-import org.junit.jupiter.api.Test;
-
-class ISNICleaningRuleTest {
-
-	@Test
-	void testCleaning() {
-		assertEquals("0000000463436020", ISNICleaningRule.clean("0000 0004 6343 6020"));
-		assertEquals("0000000463436020", ISNICleaningRule.clean("0000000463436020"));
-		assertEquals("", ISNICleaningRule.clean("Q30256598"));
-		assertEquals("0000000493403529", ISNICleaningRule.clean("ISNI:0000000493403529"));
-		assertEquals("000000008614884X", ISNICleaningRule.clean("0000 0000 8614 884X"));
-	}
-
-}
--- a/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/PICCleaningRuleTest.java
+++ b/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/PICCleaningRuleTest.java
@ -1,19 +0,0 @@
-
-package eu.dnetlib.dhp.schema.oaf.utils;
-
-import static org.junit.jupiter.api.Assertions.assertEquals;
-
-import org.junit.jupiter.api.Test;
-
-class PICCleaningRuleTest {
-
-	@Test
-	void testCleaning() {
-		assertEquals("887624982", PICCleaningRule.clean("887624982"));
-		assertEquals("", PICCleaningRule.clean("887 624982"));
-		assertEquals("887624982", PICCleaningRule.clean(" 887624982 "));
-		assertEquals("887624982", PICCleaningRule.clean(" 887624982x "));
-		assertEquals("887624982", PICCleaningRule.clean(" 88762498200 "));
-	}
-
-}
--- a/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/PmcCleaningRuleTest.java
+++ b/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/PmcCleaningRuleTest.java
@ -1,19 +0,0 @@
-
-package eu.dnetlib.dhp.schema.oaf.utils;
-
-import static org.junit.jupiter.api.Assertions.assertEquals;
-
-import org.junit.jupiter.api.Test;
-
-class PmcCleaningRuleTest {
-
-	@Test
-	void testCleaning() {
-		assertEquals("PMC1234", PmcCleaningRule.clean("PMC1234"));
-		assertEquals("PMC1234", PmcCleaningRule.clean(" PMC1234"));
-		assertEquals("PMC12345678", PmcCleaningRule.clean("PMC12345678"));
-		assertEquals("PMC12345678", PmcCleaningRule.clean("PMC123456789"));
-		assertEquals("PMC12345678", PmcCleaningRule.clean("PMC 12345678"));
-	}
-
-}
--- a/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/PmidCleaningRuleTest.java
+++ b/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/PmidCleaningRuleTest.java
@ -1,24 +0,0 @@
-
-package eu.dnetlib.dhp.schema.oaf.utils;
-
-import static org.junit.jupiter.api.Assertions.assertEquals;
-
-import org.junit.jupiter.api.Test;
-
-class PmidCleaningRuleTest {
-
-	@Test
-	void testCleaning() {
-		// leading zeros are removed
-		assertEquals("1234", PmidCleaningRule.clean("01234"));
-		// tolerant to spaces in the middle
-		assertEquals("1234567", PmidCleaningRule.clean("0123 4567"));
-		// stop parsing at first not numerical char
-		assertEquals("123", PmidCleaningRule.clean("0123x4567"));
-		// invalid id leading to empty result
-		assertEquals("", PmidCleaningRule.clean("abc"));
-		// valid id with zeroes in the number
-		assertEquals("20794075", PmidCleaningRule.clean("20794075"));
-	}
-
-}
--- a/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/RorCleaningRuleTest.java
+++ b/dhp-common/src/test/java/eu/dnetlib/dhp/schema/oaf/utils/RorCleaningRuleTest.java
@ -1,17 +0,0 @@
-
-package eu.dnetlib.dhp.schema.oaf.utils;
-
-import static org.junit.jupiter.api.Assertions.assertEquals;
-
-import org.junit.jupiter.api.Test;
-
-class RorCleaningRuleTest {
-
-	@Test
-	void testCleaning() {
-		assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("https://ror.org/05rpz9w55"));
-		assertEquals("https://ror.org/05rpz9w55", RorCleaningRule.clean("05rpz9w55"));
-		assertEquals("", RorCleaningRule.clean("05rpz9w_55"));
-	}
-
-}
--- a/dhp-pace-core/pom.xml
+++ b/dhp-pace-core/pom.xml
@ -7,7 +7,7 @@
 		<groupId>eu.dnetlib.dhp</groupId>
 		<artifactId>dhp</artifactId>
 		<version>1.2.5-SNAPSHOT</version>
-        <relativePath>../pom.xml</relativePath>
+		<relativePath>../pom.xml</relativePath>
 	</parent>

 	<groupId>eu.dnetlib.dhp</groupId>
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/NGramUtils.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/NGramUtils.java
@ -16,9 +16,8 @@ public class NGramUtils extends AbstractPaceFunctions {
 		.loadFromClasspath("/eu/dnetlib/pace/config/stopwords_en.txt");

 	public static String cleanupForOrdering(String s) {
-		return (NGRAMUTILS.filterStopWords(NGRAMUTILS.normalize(s), stopwords) + StringUtils.repeat(" ", SIZE))
-			.substring(0, SIZE)
-			.replaceAll(" ", "");
+		String result = NGRAMUTILS.filterStopWords(NGRAMUTILS.normalize(s), stopwords);
+		return result.isEmpty() ? result : result.replace(" ", "");
 	}

 }
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/NgramPairs.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/NgramPairs.java
@ -2,6 +2,7 @@
 package eu.dnetlib.pace.clustering;

 import java.util.Collection;
+import java.util.HashMap;
 import java.util.List;
 import java.util.Map;

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/SortedNgramPairs.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/clustering/SortedNgramPairs.java
@ -13,19 +13,7 @@ import eu.dnetlib.pace.config.Config;
 public class SortedNgramPairs extends NgramPairs {

 	public SortedNgramPairs(Map<String, Integer> params) {
-		super(params, false);
-	}
-
-	@Override
-	protected Collection<String> doApply(Config conf, String s) {
-
-		final List<String> tokens = Lists.newArrayList(Splitter.on(" ").omitEmptyStrings().trimResults().split(s));
-
-		Collections.sort(tokens);
-
-		return ngramPairs(
-			Lists.newArrayList(getNgrams(Joiner.on(" ").join(tokens), param("ngramLen"), param("max") * 2, 1, 2)),
-			param("max"));
+		super(params, true);
 	}

 }
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/common/AbstractPaceFunctions.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/common/AbstractPaceFunctions.java
@ -49,18 +49,18 @@ public abstract class AbstractPaceFunctions {
 	protected static Set<String> ngramBlacklist = loadFromClasspath("/eu/dnetlib/pace/config/ngram_blacklist.txt");

 	// html regex for normalization
-	public static final Pattern HTML_REGEX = Pattern.compile("<[^>]*>");
+	public final Pattern HTML_REGEX = Pattern.compile("<[^>]*>");

 	private static final String alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ";
 	private static final String aliases_from = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎àáâäæãåāèéêëēėęəîïíīįìôöòóœøōõûüùúūßśšłžźżçćčñń";
 	private static final String aliases_to = "0123456789+-=()n0123456789+-=()aaaaaaaaeeeeeeeeiiiiiioooooooouuuuussslzzzcccnn";

 	// doi prefix for normalization
-	public static final Pattern DOI_PREFIX = Pattern.compile("(https?:\\/\\/dx\\.doi\\.org\\/)|(doi:)");
+	public final Pattern DOI_PREFIX = Pattern.compile("(https?:\\/\\/dx\\.doi\\.org\\/)|(doi:)");

-	private static Pattern numberPattern = Pattern.compile("-?\\d+(\\.\\d+)?");
+	private Pattern numberPattern = Pattern.compile("-?\\d+(\\.\\d+)?");

-	private static Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
+	private Pattern hexUnicodePattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");

 	protected String concat(final List<String> l) {
 		return Joiner.on(" ").skipNulls().join(l);
@ -130,12 +130,10 @@ public abstract class AbstractPaceFunctions {

 	protected static String fixAliases(final String s) {
 		final StringBuilder sb = new StringBuilder();
-
-		s.chars().forEach(ch -> {
+		for (final char ch : Lists.charactersOf(s)) {
 			final int i = StringUtils.indexOf(aliases_from, ch);
-			sb.append(i >= 0 ? aliases_to.charAt(i) : (char) ch);
-		});
-
+			sb.append(i >= 0 ? aliases_to.charAt(i) : ch);
+		}
 		return sb.toString();
 	}

@ -150,10 +148,9 @@ public abstract class AbstractPaceFunctions {
 	protected String removeSymbols(final String s) {
 		final StringBuilder sb = new StringBuilder();

-		s.chars().forEach(ch -> {
-			sb.append(StringUtils.contains(alpha, ch) ? (char) ch : ' ');
-		});
-
+		for (final char ch : Lists.charactersOf(s)) {
+			sb.append(StringUtils.contains(alpha, ch) ? ch : " ");
+		}
 		return sb.toString().replaceAll("\\s+", " ");
 	}

@ -237,8 +234,7 @@ public abstract class AbstractPaceFunctions {

 		final Set<String> h = Sets.newHashSet();
 		try {
-			for (final String s : IOUtils
-				.readLines(NGramUtils.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
+			for (final String s : IOUtils.readLines(NGramUtils.class.getResourceAsStream(classpath))) {
 				h.add(fixAliases(transliterator.transliterate(s))); // transliteration of the stopwords
 			}
 		} catch (final Throwable e) {
@ -253,8 +249,7 @@ public abstract class AbstractPaceFunctions {

 		final Map<String, String> m = new HashMap<>();
 		try {
-			for (final String s : IOUtils
-				.readLines(AbstractPaceFunctions.class.getResourceAsStream(classpath), StandardCharsets.UTF_8)) {
+			for (final String s : IOUtils.readLines(AbstractPaceFunctions.class.getResourceAsStream(classpath))) {
 				// string is like this: code;word1;word2;word3
 				String[] line = s.split(";");
 				String value = line[0];
@ -347,7 +342,7 @@ public abstract class AbstractPaceFunctions {
 	public static <T> String readFromClasspath(final String filename, final Class<T> clazz) {
 		final StringWriter sw = new StringWriter();
 		try {
-			IOUtils.copy(clazz.getResourceAsStream(filename), sw, StandardCharsets.UTF_8);
+			IOUtils.copy(clazz.getResourceAsStream(filename), sw);
 			return sw.toString();
 		} catch (final IOException e) {
 			throw new RuntimeException("cannot load resource from classpath: " + filename);
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/config/Config.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/config/Config.java
@ -4,6 +4,7 @@ package eu.dnetlib.pace.config;
 import java.util.List;
 import java.util.Map;
 import java.util.function.Predicate;
+import java.util.regex.Pattern;

 import eu.dnetlib.pace.model.ClusteringDef;
 import eu.dnetlib.pace.model.FieldDef;
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/RowDataOrderingComparator.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/RowDataOrderingComparator.java
@ -13,8 +13,7 @@ import eu.dnetlib.pace.clustering.NGramUtils;
 public class RowDataOrderingComparator implements Comparator<Row> {

 	/** The comparator field. */
-	private final int comparatorField;
-	private final int identityFieldPosition;
+	private int comparatorField;

 	/**
 	 * Instantiates a new map document comparator.
@ -22,9 +21,8 @@ public class RowDataOrderingComparator implements Comparator<Row> {
 	 * @param comparatorField
 	 *            the comparator field
 	 */
-	public RowDataOrderingComparator(final int comparatorField, int identityFieldPosition) {
+	public RowDataOrderingComparator(final int comparatorField) {
 		this.comparatorField = comparatorField;
-		this.identityFieldPosition = identityFieldPosition;
 	}

 	/*
@ -53,10 +51,7 @@ public class RowDataOrderingComparator implements Comparator<Row> {

 		int res = to1.compareTo(to2);
 		if (res == 0) {
-			res = o1.compareTo(o2);
-			if (res == 0) {
-				return d1.getString(identityFieldPosition).compareTo(d2.getString(identityFieldPosition));
-			}
+			return o1.compareTo(o2);
 		}

 		return res;
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/SparkDedupConfig.scala
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/SparkDedupConfig.scala
@ -0,0 +1,644 @@
+package eu.dnetlib.pace.model
+
+import com.jayway.jsonpath.{Configuration, JsonPath, Option}
+import eu.dnetlib.pace.config.{DedupConfig, Type}
+import eu.dnetlib.pace.tree.support.TreeProcessor
+import eu.dnetlib.pace.util.MapDocumentUtil.truncateValue
+import eu.dnetlib.pace.util.{BlockProcessor, MapDocumentUtil, SparkReporter}
+import org.apache.spark.SparkContext
+import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
+import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
+import org.apache.spark.sql.{Column, Dataset, Encoder, Encoders, Row, functions}
+import org.apache.spark.sql.catalyst.expressions.{GenericRowWithSchema, Literal}
+import org.apache.spark.sql.expressions.{Aggregator, MutableAggregationBuffer, UserDefinedAggregateFunction, UserDefinedFunction, Window}
+import org.apache.spark.sql.types.{ArrayType, DataType, DataTypes, Metadata, StructField, StructType}
+
+import java.util
+import java.util.function.Predicate
+import java.util.regex.Pattern
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+import org.apache.spark.sql.functions.{col, lit, udf}
+
+import java.util.Collections
+import java.util.stream.Collectors
+
+case class SparkDedupConfig(conf: DedupConfig, numPartitions: Int) extends Serializable {
+
+  private val URL_REGEX: Pattern = Pattern.compile("^\\s*(http|https|ftp)\\://.*")
+
+  private val CONCAT_REGEX: Pattern = Pattern.compile("\\|\\|\\|")
+
+  private val urlFilter = (s: String) => URL_REGEX.matcher(s).matches
+
+  val modelExtractor: (Dataset[String] => Dataset[Row]) = df => {
+
+    df.withColumn("mapDocument", rowFromJsonUDF.apply(df.col(df.columns(0))))
+      .withColumn("identifier", new Column("mapDocument.identifier"))
+      //.repartition(new Column("identifier"))
+      .dropDuplicates("identifier")
+      .select("mapDocument.*")
+
+    df.map(r => rowFromJson(r))(RowEncoder(rowDataType))
+      .dropDuplicates("identifier")
+  }
+
+  val generateClusters: (Dataset[Row] => Dataset[Row]) = df => {
+
+    val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
+      if (conf.blacklists.containsKey(fdef.getName)) {
+        res.withColumn(
+          fdef.getName + "_filtered",
+          filterColumnUDF(fdef).apply(new Column(fdef.getName))
+        )
+      } else {
+        res
+      }
+    })
+
+    val df_with_keys = conf
+      .clusterings()
+      .asScala
+      .foldLeft(df_with_filters)((res, cd) => {
+        res.withColumn(
+          cd.getName + "_clustered",
+          functions.explode_outer(
+            clusterValuesUDF(cd).apply(
+              functions.array(
+                cd.getFields.asScala
+                  .map(f => res.col(if (conf.blacklists.containsKey(f)) f.concat("_filtered") else f)): _*
+              )
+            )
+          )
+        )
+      })
+
+    // filter blacklisted values// filter blacklisted values
+    // create one column per cluster prefix// create one column per cluster prefix
+
+    // GROUPING sets approach// GROUPING sets approach
+    val tempTable = this.getClass.getSimpleName + "__generateClusters";
+
+    df_with_keys.createOrReplaceTempView(this.getClass.getSimpleName + "__generateClusters")
+
+    val keys = conf.clusterings().asScala.map(_.getName + "_clustered").mkString(",")
+    val fields = rowDataType.fieldNames.mkString(",")
+
+    //  Using SQL because GROUPING SETS are not available through Scala/Java DSL
+    df_with_keys.sqlContext.sql(
+      ("SELECT coalesce(" + keys + ") as key, sort_array(collect_sort_slice(" + fields + ")) as block FROM " + tempTable + " WHERE coalesce(" + keys + ") IS NOT NULL GROUP BY GROUPING SETS (" + keys + ") HAVING size(block) > 1")
+    )
+
+  }
+
+  val generateClustersWithDFAPI: (Dataset[Row] => Dataset[Row]) = df => {
+
+   System.out.println(conf.getWf.getEntityType + "::" +conf.getWf.getSubEntityType)
+
+    val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
+      if (conf.blacklists.containsKey(fdef.getName)) {
+        res.withColumn(
+          fdef.getName + "_filtered",
+          filterColumnUDF(fdef).apply(new Column(fdef.getName))
+        )
+      } else {
+        res
+      }
+    })
+
+    var relBlocks: Dataset[Row] = null
+
+    import scala.collection.JavaConversions._
+
+    for (cd <- conf.clusterings()) {
+      val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
+
+      for (fName <- cd.getFields()) {
+        if (conf.blacklists.containsKey(fName))
+          columns.add(new Column(fName + "_filtered"))
+        else
+          columns.add(new Column(fName))
+      }
+
+      val tmp: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
+
+
+
+
+       /*.select((Seq(rowDataType.fieldNames: _*) ++ Seq("key")).map(col): _*)
+       .groupByKey(r => r.getAs[String]("key"))(Encoders.STRING)
+        .agg(collectSortSliceAggregator.toColumn)
+        .toDF("key", "block")
+        .select(col("block.block").as("block"))*/
+
+      System.out.println(cd.getName)
+
+      val  ds = tmp.groupBy("key")
+      // .agg(functions.sort_array(collectSortSliceUDAF(rowDataType.fieldNames.map(col): _*)).as("block"))
+       .agg(functions.collect_set(functions.struct(rowDataType.fieldNames.map(col): _*)).as("block"))
+        //.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
+
+
+      //df_with_filters.printSchema()
+      //ds.printSchema()
+
+      if (relBlocks == null) relBlocks = ds
+      else relBlocks = relBlocks.union(ds)
+    }
+
+   // System.out.println()
+
+    relBlocks
+  }
+
+  val generateClustersWithWindows: (Dataset[Row] => Dataset[Row]) = df => {
+
+    val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
+      if (conf.blacklists.containsKey(fdef.getName)) {
+        res.withColumn(
+          fdef.getName + "_filtered",
+          filterColumnUDF(fdef).apply(new Column(fdef.getName))
+        )
+      } else {
+        res
+      }
+    })
+
+    var relBlocks: Dataset[Row] = null
+
+    import scala.collection.JavaConversions._
+
+    for (cd <- conf.clusterings()) {
+      System.out.println(conf.getWf.getEntityType + "::" + conf.getWf.getSubEntityType+ ": " + cd.getName + " " + cd.toString)
+
+      val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
+
+      for (fName <- cd.getFields()) {
+        if (conf.blacklists.containsKey(fName))
+          columns.add(new Column(fName + "_filtered"))
+        else
+          columns.add(new Column(fName))
+      }
+
+      // Add 'key' column with the value generated by the given clustering definition
+      val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
+        // Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
+        .withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(conf.getWf.getOrderField))))
+        // filter out rows with position exceeding the maxqueuesize parameter
+        .filter(col("position").leq(conf.getWf.getQueueMaxSize))
+        .groupBy("key")
+        .agg(functions.collect_set(functions.struct(rowDataType.fieldNames.map(col): _*)).as("block"))
+        .filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
+
+      if (relBlocks == null) relBlocks = ds
+      else relBlocks = relBlocks.union(ds)
+    }
+
+    relBlocks
+  }
+
+  val generateClustersWithDFAPIMerged: (Dataset[Row] => Dataset[Row]) = df => {
+    val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
+      if (conf.blacklists.containsKey(fdef.getName)) {
+        res.withColumn(
+          fdef.getName + "_filtered",
+          filterColumnUDF(fdef).apply(new Column(fdef.getName))
+        )
+      } else {
+        res
+      }
+    })
+
+    import scala.collection.JavaConversions._
+
+   val keys = conf.clusterings().foldLeft(null : Column)((res, cd) => {
+     val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
+
+     for (fName <- cd.getFields()) {
+       if (conf.blacklists.containsKey(fName))
+         columns.add(new Column(fName + "_filtered"))
+       else
+         columns.add(new Column(fName))
+     }
+
+     if (res != null)
+     functions.array_union(res, clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*)))
+     else
+     clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))
+   })
+
+      val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(keys))
+        .select((Seq(rowDataType.fieldNames: _*) ++ Seq("key")).map(col): _*)
+        .groupByKey(r => r.getAs[String]("key"))(Encoders.STRING)
+        .agg(collectSortSliceAggregator.toColumn)
+        .toDF("key", "block")
+        .select(col("block.block").as("block"))
+
+        /*.groupBy("key")
+        .agg(collectSortSliceUDAF(rowDataType.fieldNames.map(col): _*).as("block"))*/
+        .filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
+
+    ds
+  }
+
+  val generateClustersWithRDDReduction: (Dataset[Row] => Dataset[Row]) = df => {
+    val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
+      if (conf.blacklists.containsKey(fdef.getName)) {
+        res.withColumn(
+          fdef.getName + "_filtered",
+          filterColumnUDF(fdef).apply(new Column(fdef.getName))
+        )
+      } else {
+        res
+      }
+    })
+
+    var relBlocks: Dataset[Row] = null
+
+    import scala.collection.JavaConversions._
+
+    for (cd <- conf.clusterings()) {
+      val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
+
+      for (fName <- cd.getFields()) {
+        if (conf.blacklists.containsKey(fName))
+          columns.add(new Column(fName + "_filtered"))
+        else
+          columns.add(new Column(fName))
+      }
+
+      val ds: Dataset[Row] = df.sparkSession.createDataFrame(df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
+        .select(col("key"), functions.array(functions.struct(rowDataType.fieldNames.map(col): _*).as("value")))
+        .rdd.keyBy(_.getString(0))
+        .reduceByKey((a, b) => {
+          val b1 = a.getSeq[Row](1)
+          val b2 = b.getSeq[Row](1)
+
+          if (b1.size + b2.size > conf.getWf.getQueueMaxSize)
+            Row(a.get(0), b1.union(b2).sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize))
+          else
+            Row(a.get(0), b1.union(b2))
+        })
+        .map(_._2)
+        .filter(k => k.getSeq(1).size > 1),
+        new StructType().add(StructField("key", DataTypes.StringType)).add(StructField("block", ArrayType(rowDataType)))
+      )
+
+      if (relBlocks == null) relBlocks = ds
+      else relBlocks = relBlocks.union(ds)
+    }
+
+    relBlocks
+  }
+
+  val printAnalytics: (Dataset[Row] => Dataset[Row]) = df => {
+
+    val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
+      if (conf.blacklists.containsKey(fdef.getName)) {
+        res.withColumn(
+          fdef.getName + "_filtered",
+          filterColumnUDF(fdef).apply(new Column(fdef.getName))
+        )
+      } else {
+        res
+      }
+    })
+
+    var relBlocks: Dataset[Row] = null
+
+    import scala.collection.JavaConversions._
+
+    for (cd <- conf.clusterings()) {
+      val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
+
+      for (fName <- cd.getFields()) {
+        if (conf.blacklists.containsKey(fName))
+          columns.add(new Column(fName + "_filtered"))
+        else
+          columns.add(new Column(fName))
+      }
+
+      // Add 'key' column with the value generated by the given clustering definition
+      val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
+        // Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
+        .withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(conf.getWf.getOrderField)))
+        // filter out rows with position exceeding the maxqueuesize parameter
+        .filter(col("position").lt(conf.getWf.getQueueMaxSize))
+
+
+      // inner join to compute all combination of rows to compare
+      // note the condition on position to obtain 'windowing': given a row this is compared at most with the next
+      // SlidingWindowSize rows following the sort order
+      val dsWithMatch = ds.as("l").join(ds.as("r"),
+        col("l.key").equalTo(col("r.key")),
+        "inner"
+      )
+        .filter((col("l.position").lt(col("r.position")))
+          && (col("r.position").lt(col("l.position").plus(lit(conf.getWf.getSlidingWindowSize)))))
+        // Add match column with the result of comparison
+      // dsWithMatch.show(false)
+
+      if (relBlocks == null)
+        relBlocks = dsWithMatch
+      else
+        relBlocks = relBlocks.union(dsWithMatch)
+    }
+
+    System.out.println(conf.getWf.getEntityType + "::" + conf.getWf.getSubEntityType)
+    System.out.println("Total number of comparations: " + relBlocks.count())
+
+
+    df
+  }
+
+
+  val generateAndProcessClustersWithJoins: (Dataset[Row] => Dataset[Row]) = df => {
+
+    val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
+      if (conf.blacklists.containsKey(fdef.getName)) {
+        res.withColumn(
+          fdef.getName + "_filtered",
+          filterColumnUDF(fdef).apply(new Column(fdef.getName))
+        )
+      } else {
+        res
+      }
+    })
+
+    var relBlocks: Dataset[Row] = null
+
+    import scala.collection.JavaConversions._
+
+    for (cd <- conf.clusterings()) {
+      val columns: util.List[Column] = new util.ArrayList[Column](cd.getFields().size)
+
+      for (fName <- cd.getFields()) {
+        if (conf.blacklists.containsKey(fName))
+          columns.add(new Column(fName + "_filtered"))
+        else
+          columns.add(new Column(fName))
+      }
+
+      // Add 'key' column with the value generated by the given clustering definition
+      val ds: Dataset[Row] = df_with_filters.withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(columns.asScala: _*))))
+        // Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
+        .withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(conf.getWf.getOrderField)))
+        // filter out rows with position exceeding the maxqueuesize parameter
+        .filter(col("position").lt(conf.getWf.getQueueMaxSize))
+
+
+      // inner join to compute all combination of rows to compare
+      // note the condition on position to obtain 'windowing': given a row this is compared at most with the next
+      // SlidingWindowSize rows following the sort order
+      val dsWithMatch = ds.as("l").join(ds.as("r"),
+        col("l.key").equalTo(col("r.key")),
+        "inner"
+        )
+        .filter((col("l.position").lt(col("r.position")))
+          && (col("r.position").lt(col("l.position").plus(lit(conf.getWf.getSlidingWindowSize)))))
+        // Add match column with the result of comparison
+        .withColumn("match", udf[Boolean, Row, Row]((a, b) => {
+          val treeProcessor = new TreeProcessor(conf)
+
+         treeProcessor.compare(a, b)
+        }).apply(functions.struct(rowDataType.fieldNames.map(s => col("l.".concat(s))): _*), functions.struct(rowDataType.fieldNames.map(s => col("r.".concat(s))): _*)))
+        .filter(col("match").equalTo(true))
+        .select(col("l.identifier").as("from"), col("r.identifier").as("to"))
+
+     // dsWithMatch.show(false)
+
+      if (relBlocks == null)
+        relBlocks = dsWithMatch
+      else
+        relBlocks = relBlocks.union(dsWithMatch)
+    }
+
+    val res = relBlocks
+      //.select(col("l.identifier").as("from"), col("r.identifier").as("to"))
+      //.repartition()
+      .distinct()
+
+   // res.show(false)
+    res.select(functions.struct("from", "to"))
+  }
+
+  val processClusters: (Dataset[Row] => Dataset[Row]) = df => {
+
+    val entity = conf.getWf.getEntityType
+
+    df.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
+      .withColumn("relations", processBlock(df.sqlContext.sparkContext).apply(new Column("block")))
+      .select(functions.explode(new Column("relations")).as("relation"))
+      //.repartition(new Column("relation"))
+      .dropDuplicates("relation")
+  }
+
+  val rowDataType: StructType = {
+//    val unordered = conf.getPace.getModel.asScala.foldLeft(
+//      new StructType()
+//    )((resType, fdef) => {
+//      resType.add(fdef.getType match {
+//        case Type.List | Type.JSON =>
+//          StructField(fdef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
+//        case Type.DoubleArray =>
+//          StructField(fdef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
+//        case _ =>
+//          StructField(fdef.getName, DataTypes.StringType, true, Metadata.empty)
+//      })
+//    })
+//
+//    conf.getPace.getModel.asScala.filterNot(_.getName.equals(conf.getWf.getOrderField)).foldLeft(
+//      new StructType()
+//        .add(unordered(conf.getWf.getOrderField))
+//        .add(StructField("identifier", DataTypes.StringType, false, Metadata.empty))
+//    )((resType, fdef) => resType.add(unordered(fdef.getName)))
+
+    val identifier = new FieldDef()
+    identifier.setName("identifier")
+    identifier.setType(Type.String)
+
+    (conf.getPace.getModel.asScala ++ Seq(identifier)).sortBy(_.getName)
+      .foldLeft(
+          new StructType()
+        )((resType, fdef) => {
+          resType.add(fdef.getType match {
+            case Type.List | Type.JSON =>
+              StructField(fdef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
+            case Type.DoubleArray =>
+              StructField(fdef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
+            case _ =>
+              StructField(fdef.getName, DataTypes.StringType, true, Metadata.empty)
+          })
+        })
+
+
+
+  }
+
+  val identityFieldPosition: Int = rowDataType.fieldIndex("identifier")
+
+  val orderingFieldPosition: Int = rowDataType.fieldIndex(conf.getWf.getOrderField)
+
+  def rowFromJson(json: String) : Row = {
+    val documentContext =
+      JsonPath.using(Configuration.defaultConfiguration.addOptions(Option.SUPPRESS_EXCEPTIONS)).parse(json)
+    val values = new Array[Any](rowDataType.size)
+
+    values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
+
+    rowDataType.fieldNames.zipWithIndex.foldLeft(values) {
+      case ((res, (fname, index))) => {
+        val fdef = conf.getPace.getModelMap.get(fname)
+
+        if (fdef != null) {
+          res(index) = fdef.getType match {
+            case Type.String | Type.Int =>
+              MapDocumentUtil.truncateValue(
+                MapDocumentUtil.getJPathString(fdef.getPath, documentContext),
+                fdef.getLength
+              )
+
+            case Type.URL =>
+              var uv = MapDocumentUtil.getJPathString(fdef.getPath, documentContext)
+              if (!urlFilter(uv)) uv = ""
+              uv
+
+            case Type.List | Type.JSON =>
+              MapDocumentUtil.truncateList(
+                MapDocumentUtil.getJPathList(fdef.getPath, documentContext, fdef.getType),
+                fdef.getSize
+              ).toArray
+
+            case Type.StringConcat =>
+              val jpaths = CONCAT_REGEX.split(fdef.getPath)
+
+              truncateValue(
+                jpaths
+                  .map(jpath => MapDocumentUtil.getJPathString(jpath, documentContext))
+                  .mkString(" "),
+                fdef.getLength
+              )
+
+            case Type.DoubleArray =>
+              MapDocumentUtil.getJPathArray(fdef.getPath, json)
+          }
+        }
+
+        res
+      }
+    }
+
+    new GenericRowWithSchema(values, rowDataType)
+  }
+
+  val rowFromJsonUDF = udf(rowFromJson(_), rowDataType)
+
+  def filterColumnUDF(fdef: FieldDef): UserDefinedFunction = {
+
+    val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
+
+    if (blacklist == null) {
+      throw new IllegalArgumentException("Column: " + fdef.getName + " does not have any filter")
+    } else {
+      fdef.getType match {
+        case Type.List | Type.JSON =>
+          udf[Array[String], Array[String]](values => {
+            values.filter((v: String) => !blacklist.test(v))
+          })
+
+        case _ =>
+          udf[String, String](v => {
+            if (blacklist.test(v)) ""
+            else v
+          })
+      }
+    }
+  }
+
+  def clusterValuesUDF(cd: ClusteringDef) = {
+    udf[mutable.WrappedArray[String], mutable.WrappedArray[Object]](values => {
+      values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala).map(cd.getName.concat(_))
+    })
+  }
+
+  def processBlock(implicit sc: SparkContext) = {
+    val accumulators = SparkReporter.constructAccumulator(conf, sc)
+
+    udf[Array[Tuple2[String, String]], mutable.WrappedArray[Row]](block => {
+      val reporter = new SparkReporter(accumulators)
+
+      val mapDocuments = block.asJava.stream
+        .sorted(new RowDataOrderingComparator(orderingFieldPosition))
+        .limit(conf.getWf.getQueueMaxSize)
+        .collect(Collectors.toList[Row]())
+
+      new BlockProcessor(conf, identityFieldPosition, orderingFieldPosition).processSortedRows(mapDocuments, reporter)
+
+      reporter.getRelations.asScala.toArray
+    }).asNondeterministic()
+  }
+
+  val collectSortSliceAggregator : Aggregator[Row,Seq[Row], Row] = new Aggregator[Row, Seq[Row], Row] () {
+    override def zero: Seq[Row] = Seq[Row]()
+
+
+    override def reduce(buffer: Seq[Row], input: Row): Seq[Row] = {
+      merge(buffer, Seq(input))
+    }
+
+    override def merge(buffer: Seq[Row], toMerge: Seq[Row]): Seq[Row] = {
+      val newBlock = buffer ++ toMerge
+
+      if (newBlock.size > conf.getWf.getQueueMaxSize)
+        newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
+      else
+        newBlock
+    }
+
+    override def finish(reduction: Seq[Row]): Row = {
+      Row(reduction.toArray)
+    }
+
+    override def bufferEncoder: Encoder[Seq[Row]] = Encoders.kryo[Seq[Row]]
+
+    override def outputEncoder: Encoder[Row] = RowEncoder.apply(new StructType().add("block", DataTypes.createArrayType(rowDataType), nullable = true))
+  }
+
+  val collectSortSliceUDAF : UserDefinedAggregateFunction = new UserDefinedAggregateFunction {
+    override def inputSchema: StructType = rowDataType
+
+    override def bufferSchema: StructType = {
+      new StructType().add("block", DataTypes.createArrayType(rowDataType), nullable = true)
+    }
+
+    override def dataType: DataType = DataTypes.createArrayType(rowDataType)
+
+    override def deterministic: Boolean = true
+
+    override def initialize(buffer: MutableAggregationBuffer): Unit = {
+      buffer(0) = Seq[Row]()
+    }
+
+    override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
+      val newBlock = buffer.getSeq[Row](0) ++ Seq(input)
+
+      if (newBlock.size > conf.getWf.getQueueMaxSize)
+        buffer(0) = newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
+      else
+        buffer(0) = newBlock
+    }
+
+    override def merge(buffer: MutableAggregationBuffer, row: Row): Unit = {
+      val newBlock = buffer.getSeq[Row](0) ++ row.getSeq[Row](0)
+
+      if (newBlock.size > conf.getWf.getQueueMaxSize)
+        buffer(0) = newBlock.sortBy(_.getString(orderingFieldPosition)).slice(0, conf.getWf.getQueueMaxSize)
+      else
+        buffer(0) = newBlock
+    }
+
+    override def evaluate(buffer: Row): Any = {
+      buffer.getSeq[Row](0)
+    }
+  }
+}
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/SparkDeduper.scala
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/SparkDeduper.scala
@ -1,131 +0,0 @@
-package eu.dnetlib.pace.model
-
-import eu.dnetlib.pace.config.{DedupConfig, Type}
-import eu.dnetlib.pace.util.{BlockProcessor, SparkReporter}
-import org.apache.spark.SparkContext
-import org.apache.spark.sql.catalyst.expressions.Literal
-import org.apache.spark.sql.expressions._
-import org.apache.spark.sql.functions.{col, lit, udf}
-import org.apache.spark.sql.types._
-import org.apache.spark.sql.{Column, Dataset, Row, functions}
-
-import java.util.function.Predicate
-import java.util.stream.Collectors
-import scala.collection.JavaConversions._
-import scala.collection.JavaConverters._
-import scala.collection.mutable
-case class SparkDeduper(conf: DedupConfig) extends Serializable {
-
-  val model: SparkModel = SparkModel(conf)
-
-  val dedup: (Dataset[Row] => Dataset[Row]) = df => {
-    df.transform(filterAndCleanup)
-      .transform(generateClustersWithCollect)
-      .transform(processBlocks)
-  }
-
-
-  val filterAndCleanup: (Dataset[Row] => Dataset[Row]) = df => {
-    val df_with_filters = conf.getPace.getModel.asScala.foldLeft(df)((res, fdef) => {
-      if (conf.blacklists.containsKey(fdef.getName)) {
-        res.withColumn(
-          fdef.getName + "_filtered",
-          filterColumnUDF(fdef).apply(new Column(fdef.getName))
-        )
-      } else {
-        res
-      }
-    })
-
-    df_with_filters
-  }
-
-  def filterColumnUDF(fdef: FieldDef): UserDefinedFunction = {
-    val blacklist: Predicate[String] = conf.blacklists().get(fdef.getName)
-
-    if (blacklist == null) {
-      throw new IllegalArgumentException("Column: " + fdef.getName + " does not have any filter")
-    } else {
-      fdef.getType match {
-        case Type.List | Type.JSON =>
-          udf[Array[String], Array[String]](values => {
-            values.filter((v: String) => !blacklist.test(v))
-          })
-
-        case _ =>
-          udf[String, String](v => {
-            if (blacklist.test(v)) ""
-            else v
-          })
-      }
-    }
-  }
-
-  val generateClustersWithCollect: (Dataset[Row] => Dataset[Row]) = df_with_filters => {
-    var df_with_clustering_keys: Dataset[Row] = null
-
-    for ((cd, idx) <- conf.clusterings().zipWithIndex) {
-      val inputColumns = cd.getFields().foldLeft(Seq[Column]())((acc, fName) => {
-        val column = if (conf.blacklists.containsKey(fName))
-          Seq(col(fName + "_filtered"))
-        else
-          Seq(col(fName))
-
-        acc ++ column
-      })
-
-      // Add 'key' column with the value generated by the given clustering definition
-      val ds: Dataset[Row] = df_with_filters
-        .withColumn("clustering", lit(cd.getName + "::" + idx))
-        .withColumn("key", functions.explode(clusterValuesUDF(cd).apply(functions.array(inputColumns: _*))))
-        // Add position column having the position of the row within the set of rows having the same key value ordered by the sorting value
-        .withColumn("position", functions.row_number().over(Window.partitionBy("key").orderBy(col(model.orderingFieldName), col(model.identifierFieldName))))
-
-      if (df_with_clustering_keys == null)
-        df_with_clustering_keys = ds
-      else
-        df_with_clustering_keys = df_with_clustering_keys.union(ds)
-    }
-
-    //TODO: analytics
-
-    val df_with_blocks = df_with_clustering_keys
-      // filter out rows with position exceeding the maxqueuesize parameter
-      .filter(col("position").leq(conf.getWf.getQueueMaxSize))
-      .groupBy("clustering", "key")
-      .agg(functions.collect_set(functions.struct(model.schema.fieldNames.map(col): _*)).as("block"))
-      .filter(functions.size(new Column("block")).gt(1))
-
-    df_with_blocks
-  }
-
-  def clusterValuesUDF(cd: ClusteringDef) = {
-    udf[mutable.WrappedArray[String], mutable.WrappedArray[Any]](values => {
-      values.flatMap(f => cd.clusteringFunction().apply(conf, Seq(f.toString).asJava).asScala)
-    })
-  }
-
-  val processBlocks: (Dataset[Row] => Dataset[Row]) = df => {
-    df.filter(functions.size(new Column("block")).geq(new Literal(2, DataTypes.IntegerType)))
-      .withColumn("relations", processBlock(df.sqlContext.sparkContext).apply(new Column("block")))
-      .select(functions.explode(new Column("relations")).as("relation"))
-  }
-
-  def processBlock(implicit sc: SparkContext) = {
-    val accumulators = SparkReporter.constructAccumulator(conf, sc)
-
-    udf[Array[(String, String)], mutable.WrappedArray[Row]](block => {
-      val reporter = new SparkReporter(accumulators)
-
-      val mapDocuments = block.asJava.stream()
-        .sorted(new RowDataOrderingComparator(model.orderingFieldPosition, model.identityFieldPosition))
-        .limit(conf.getWf.getQueueMaxSize)
-        .collect(Collectors.toList[Row]())
-
-      new BlockProcessor(conf, model.identityFieldPosition, model.orderingFieldPosition).processSortedRows(mapDocuments, reporter)
-
-      reporter.getRelations.asScala.toArray
-    }).asNondeterministic()
-  }
-
-}
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/SparkModel.scala
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/model/SparkModel.scala
@ -1,108 +0,0 @@
-package eu.dnetlib.pace.model
-
-import com.jayway.jsonpath.{Configuration, JsonPath}
-import eu.dnetlib.pace.config.{DedupConfig, Type}
-import eu.dnetlib.pace.util.MapDocumentUtil
-import org.apache.spark.sql.catalyst.encoders.RowEncoder
-import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
-import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
-import org.apache.spark.sql.{Dataset, Row}
-
-import java.util.regex.Pattern
-import scala.collection.JavaConverters._
-
-case class SparkModel(conf: DedupConfig) {
-  private val URL_REGEX: Pattern = Pattern.compile("^\\s*(http|https|ftp)\\://.*")
-
-  private val CONCAT_REGEX: Pattern = Pattern.compile("\\|\\|\\|")
-
-  val identifierFieldName = "identifier"
-
-  val orderingFieldName = if (!conf.getWf.getOrderField.isEmpty) conf.getWf.getOrderField else identifierFieldName
-
-  val schema: StructType = {
-    // create an implicit identifier field
-    val identifier = new FieldDef()
-    identifier.setName(identifierFieldName)
-    identifier.setType(Type.String)
-
-    // Construct a Spark StructType representing the schema of the model
-    (Seq(identifier) ++ conf.getPace.getModel.asScala)
-      .foldLeft(
-        new StructType()
-      )((resType, fieldDef) => {
-        resType.add(fieldDef.getType match {
-          case Type.List | Type.JSON =>
-            StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty)
-          case Type.DoubleArray =>
-            StructField(fieldDef.getName, DataTypes.createArrayType(DataTypes.DoubleType), true, Metadata.empty)
-          case _ =>
-            StructField(fieldDef.getName, DataTypes.StringType, true, Metadata.empty)
-        })
-      })
-
-
-  }
-
-  val identityFieldPosition: Int = schema.fieldIndex(identifierFieldName)
-
-  val orderingFieldPosition: Int = schema.fieldIndex(orderingFieldName)
-
-  val parseJsonDataset: (Dataset[String] => Dataset[Row]) = df => {
-    df.map(r => rowFromJson(r))(RowEncoder(schema))
-  }
-
-  def rowFromJson(json: String): Row = {
-    val documentContext =
-      JsonPath.using(Configuration.defaultConfiguration.addOptions(com.jayway.jsonpath.Option.SUPPRESS_EXCEPTIONS)).parse(json)
-    val values = new Array[Any](schema.size)
-
-    values(identityFieldPosition) = MapDocumentUtil.getJPathString(conf.getWf.getIdPath, documentContext)
-
-    schema.fieldNames.zipWithIndex.foldLeft(values) {
-      case ((res, (fname, index))) => {
-        val fdef = conf.getPace.getModelMap.get(fname)
-
-        if (fdef != null) {
-          res(index) = fdef.getType match {
-            case Type.String | Type.Int =>
-              MapDocumentUtil.truncateValue(
-                MapDocumentUtil.getJPathString(fdef.getPath, documentContext),
-                fdef.getLength
-              )
-
-            case Type.URL =>
-              var uv = MapDocumentUtil.getJPathString(fdef.getPath, documentContext)
-              if (!URL_REGEX.matcher(uv).matches)
-                uv = ""
-              uv
-
-            case Type.List | Type.JSON =>
-              MapDocumentUtil.truncateList(
-                MapDocumentUtil.getJPathList(fdef.getPath, documentContext, fdef.getType),
-                fdef.getSize
-              ).asScala
-
-            case Type.StringConcat =>
-              val jpaths = CONCAT_REGEX.split(fdef.getPath)
-
-              MapDocumentUtil.truncateValue(
-                jpaths
-                  .map(jpath => MapDocumentUtil.getJPathString(jpath, documentContext))
-                  .mkString(" "),
-                fdef.getLength
-              )
-
-            case Type.DoubleArray =>
-              MapDocumentUtil.getJPathArray(fdef.getPath, json)
-          }
-        }
-
-        res
-      }
-    }
-
-    new GenericRowWithSchema(values, schema)
-  }
-}
-
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/ExactMatchIgnoreCase.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/ExactMatchIgnoreCase.java
@ -1,8 +1,11 @@

 package eu.dnetlib.pace.tree;

+import java.util.List;
 import java.util.Map;

+import com.google.common.base.Joiner;
+
 import eu.dnetlib.pace.config.Config;
 import eu.dnetlib.pace.tree.support.AbstractStringComparator;
 import eu.dnetlib.pace.tree.support.ComparatorClass;
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/Level2JaroWinkler.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/Level2JaroWinkler.java
@ -5,6 +5,7 @@ import java.util.Map;

 import com.wcohen.ss.AbstractStringDistance;

+import eu.dnetlib.pace.tree.support.AbstractComparator;
 import eu.dnetlib.pace.tree.support.AbstractStringComparator;
 import eu.dnetlib.pace.tree.support.ComparatorClass;

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/NumbersMatch.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/NumbersMatch.java
@ -4,6 +4,7 @@ package eu.dnetlib.pace.tree;
 import java.util.Map;

 import eu.dnetlib.pace.config.Config;
+import eu.dnetlib.pace.tree.support.AbstractComparator;
 import eu.dnetlib.pace.tree.support.AbstractStringComparator;
 import eu.dnetlib.pace.tree.support.ComparatorClass;

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/RomansMatch.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/RomansMatch.java
@ -4,6 +4,7 @@ package eu.dnetlib.pace.tree;
 import java.util.Map;

 import eu.dnetlib.pace.config.Config;
+import eu.dnetlib.pace.tree.support.AbstractComparator;
 import eu.dnetlib.pace.tree.support.AbstractStringComparator;
 import eu.dnetlib.pace.tree.support.ComparatorClass;

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/SizeMatch.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/SizeMatch.java
@ -4,6 +4,8 @@ package eu.dnetlib.pace.tree;
 import java.util.List;
 import java.util.Map;

+import com.google.common.collect.Lists;
+
 import eu.dnetlib.pace.config.Config;
 import eu.dnetlib.pace.tree.support.AbstractListComparator;
 import eu.dnetlib.pace.tree.support.ComparatorClass;
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/StringContainsMatch.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/StringContainsMatch.java
@ -4,6 +4,7 @@ package eu.dnetlib.pace.tree;
 import java.util.Map;

 import eu.dnetlib.pace.config.Config;
+import eu.dnetlib.pace.tree.support.AbstractComparator;
 import eu.dnetlib.pace.tree.support.AbstractStringComparator;
 import eu.dnetlib.pace.tree.support.ComparatorClass;

@ -43,25 +44,22 @@ public class StringContainsMatch extends AbstractStringComparator {
 			STRING = STRING.toLowerCase();
 		}

-		if (AGGREGATOR != null) {
-			switch (AGGREGATOR) {
-				case "AND":
-					if (ca.contains(STRING) && cb.contains(STRING))
-						return 1.0;
-					break;
-				case "OR":
-					if (ca.contains(STRING) || cb.contains(STRING))
-						return 1.0;
-					break;
-				case "XOR":
-					if (ca.contains(STRING) ^ cb.contains(STRING))
-						return 1.0;
-					break;
-				default:
-					return 0.0;
-			}
+		switch (AGGREGATOR) {
+			case "AND":
+				if (ca.contains(STRING) && cb.contains(STRING))
+					return 1.0;
+				break;
+			case "OR":
+				if (ca.contains(STRING) || cb.contains(STRING))
+					return 1.0;
+				break;
+			case "XOR":
+				if (ca.contains(STRING) ^ cb.contains(STRING))
+					return 1.0;
+				break;
+			default:
+				return 0.0;
 		}
-
 		return 0.0;
 	}
 }
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/support/AbstractComparator.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/support/AbstractComparator.java
@ -1,6 +1,7 @@

 package eu.dnetlib.pace.tree.support;

+import java.util.Collections;
 import java.util.List;
 import java.util.Map;

--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/support/AbstractListComparator.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/support/AbstractListComparator.java
@ -4,9 +4,11 @@ package eu.dnetlib.pace.tree.support;
 import java.util.List;
 import java.util.Map;

+import com.google.common.collect.Lists;
 import com.wcohen.ss.AbstractStringDistance;

 import eu.dnetlib.pace.config.Config;
+import eu.dnetlib.pace.config.Type;

 abstract public class AbstractListComparator extends AbstractComparator<List<String>> {
 	protected AbstractListComparator(Map<String, String> params) {
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/support/AbstractSortedComparator.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/tree/support/AbstractSortedComparator.java
@ -1,6 +1,7 @@

 package eu.dnetlib.pace.tree.support;

+import java.util.AbstractList;
 import java.util.Collections;
 import java.util.List;
 import java.util.Map;
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/BlockProcessor.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/BlockProcessor.java
@ -2,6 +2,8 @@
 package eu.dnetlib.pace.util;

 import java.util.ArrayList;
+import java.util.Collection;
+import java.util.Iterator;
 import java.util.List;

 import org.apache.commons.lang3.StringUtils;
@ -11,6 +13,7 @@ import org.apache.spark.sql.Row;
 import org.apache.spark.sql.types.ArrayType;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.StringType;
+import org.apache.spark.sql.types.StructType;

 import eu.dnetlib.pace.config.DedupConfig;
 import eu.dnetlib.pace.config.WfConfig;
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/DiffPatchMatch.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/DiffPatchMatch.java
@ -18,7 +18,6 @@ package eu.dnetlib.pace.util;
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-
 /*
 * Diff Match and Patch
 * Copyright 2018 The diff-match-patch Authors.
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/MapDocumentUtil.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/MapDocumentUtil.java
@ -2,20 +2,20 @@
 package eu.dnetlib.pace.util;

 import java.math.BigDecimal;
-import java.util.ArrayList;
-import java.util.LinkedHashMap;
-import java.util.List;
-import java.util.concurrent.ConcurrentHashMap;
+import java.util.*;
 import java.util.function.Predicate;
+import java.util.stream.Collectors;

 import com.fasterxml.jackson.core.JsonProcessingException;
 import com.fasterxml.jackson.databind.ObjectMapper;
+import com.jayway.jsonpath.Configuration;
 import com.jayway.jsonpath.DocumentContext;
 import com.jayway.jsonpath.JsonPath;
-import com.jayway.jsonpath.spi.cache.Cache;
-import com.jayway.jsonpath.spi.cache.CacheProvider;
+import com.jayway.jsonpath.Option;

+import eu.dnetlib.pace.config.DedupConfig;
 import eu.dnetlib.pace.config.Type;
+import eu.dnetlib.pace.model.*;
 import net.minidev.json.JSONArray;

 public class MapDocumentUtil {
@ -23,20 +23,47 @@ public class MapDocumentUtil {
 	public static final String URL_REGEX = "^(http|https|ftp)\\://.*";
 	public static Predicate<String> urlFilter = s -> s.trim().matches(URL_REGEX);

-	static {
-		CacheProvider.setCache(new Cache() {
-			private final ConcurrentHashMap<String, JsonPath> jsonPathCache = new ConcurrentHashMap();
+	public static List<String> getJPathList(String path, String json, Type type) {
+		if (type == Type.List)
+			return JsonPath
+				.using(
+					Configuration
+						.defaultConfiguration()
+						.addOptions(Option.ALWAYS_RETURN_LIST, Option.SUPPRESS_EXCEPTIONS))
+				.parse(json)
+				.read(path);
+		Object jresult;
+		List<String> result = new ArrayList<>();
+		try {
+			jresult = JsonPath.read(json, path);
+		} catch (Throwable e) {
+			return result;
+		}
+		if (jresult instanceof JSONArray) {

-			@Override
-			public JsonPath get(String key) {
-				return jsonPathCache.get(key);
-			}
+			((JSONArray) jresult).forEach(it -> {
+
+				try {
+					result.add(new ObjectMapper().writeValueAsString(it));
+				} catch (JsonProcessingException e) {
+
+				}
+			});
+			return result;
+		}
+
+		if (jresult instanceof LinkedHashMap) {
+			try {
+				result.add(new ObjectMapper().writeValueAsString(jresult));
+			} catch (JsonProcessingException e) {

-			@Override
-			public void put(String key, JsonPath value) {
-				jsonPathCache.put(key, value);
 			}
-		});
+			return result;
+		}
+		if (jresult instanceof String) {
+			result.add((String) jresult);
+		}
+		return result;
 	}

 	public static String getJPathString(final String jsonPath, final String json) {
@ -117,11 +144,6 @@ public class MapDocumentUtil {
 			return result;
 		}

-		if (type == Type.List && jresult instanceof List) {
-			((List<?>) jresult).forEach(x -> result.add(x.toString()));
-			return result;
-		}
-
 		if (jresult instanceof JSONArray) {
 			((JSONArray) jresult).forEach(it -> {
 				try {
--- a/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/SparkReporter.java
+++ b/dhp-pace-core/src/main/java/eu/dnetlib/pace/util/SparkReporter.java
@ -10,6 +10,7 @@ import org.apache.spark.SparkContext;
 import org.apache.spark.util.LongAccumulator;

 import eu.dnetlib.pace.config.DedupConfig;
+import eu.dnetlib.pace.util.Reporter;
 import scala.Serializable;
 import scala.Tuple2;

--- a/dhp-pace-core/src/test/java/eu/dnetlib/pace/clustering/ClusteringFunctionTest.java
+++ b/dhp-pace-core/src/test/java/eu/dnetlib/pace/clustering/ClusteringFunctionTest.java
@ -2,12 +2,14 @@
 package eu.dnetlib.pace.clustering;

 import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;

-import org.junit.jupiter.api.BeforeAll;
-import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.*;

 import com.google.common.collect.Lists;
 import com.google.common.collect.Maps;
+import com.google.common.collect.Sets;

 import eu.dnetlib.pace.AbstractPaceTest;
 import eu.dnetlib.pace.common.AbstractPaceFunctions;
--- a/dhp-pace-core/src/test/java/eu/dnetlib/pace/comparators/ComparatorTest.java
+++ b/dhp-pace-core/src/test/java/eu/dnetlib/pace/comparators/ComparatorTest.java
@ -6,7 +6,6 @@ import static org.junit.jupiter.api.Assertions.assertEquals;
 import java.util.*;

 import org.junit.jupiter.api.BeforeAll;
-import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 import org.junit.jupiter.api.TestInstance;

@ -23,18 +22,14 @@ public class ComparatorTest extends AbstractPaceTest {

 	@BeforeAll
 	public void setup() {
-		conf = DedupConfig
-			.load(readFromClasspath("/eu/dnetlib/pace/config/organization.current.conf.json", ComparatorTest.class));
-	}
-
-	@BeforeEach
-	public void beforeEachTest() {
 		params = new HashMap<>();
 		params.put("weight", "1.0");
 		params.put("surname_th", "0.99");
 		params.put("name_th", "0.95");
 		params.put("jpath_value", "$.value");
 		params.put("jpath_classid", "$.qualifier.classid");
+		conf = DedupConfig
+			.load(readFromClasspath("/eu/dnetlib/pace/config/organization.current.conf.json", ComparatorTest.class));
 	}

 	@Test
@ -68,10 +63,7 @@ public class ComparatorTest extends AbstractPaceTest {
 				.distance(
 					"Politechniki Warszawskiej (Warsaw University of Technology)", "Warsaw University of Technology",
 					conf));
-
-		// failing becasuse 'Allen' is a transliterrated greek stopword
-		// assertEquals(-1.0, cityMatch.distance("Allen (United States)", "United States Military Academy", conf));
-		assertEquals(-1.0, cityMatch.distance("Washington (United States)", "United States Military Academy", conf));
+		assertEquals(-1.0, cityMatch.distance("Allen (United States)", "United States Military Academy", conf));
 	}

 	@Test
@ -86,7 +78,7 @@ public class ComparatorTest extends AbstractPaceTest {
 		assertEquals(1.0, keywordMatch.distance("Polytechnic University of Turin", "POLITECNICO DI TORINO", conf));
 		assertEquals(1.0, keywordMatch.distance("Istanbul Commerce University", "İstanbul Ticarət Universiteti", conf));
 		assertEquals(1.0, keywordMatch.distance("Franklin College", "Concordia College", conf));
-		assertEquals(2.0 / 3.0, keywordMatch.distance("University of Georgia", "Georgia State University", conf));
+		assertEquals(0.5, keywordMatch.distance("University of Georgia", "Georgia State University", conf));
 		assertEquals(0.5, keywordMatch.distance("University College London", "University of London", conf));
 		assertEquals(0.5, keywordMatch.distance("Washington State University", "University of Washington", conf));
 		assertEquals(-1.0, keywordMatch.distance("Allen (United States)", "United States Military Academy", conf));
@ -120,7 +112,7 @@ public class ComparatorTest extends AbstractPaceTest {
 	public void stringContainsMatchTest() {

 		params.put("string", "openorgs");
-		params.put("aggregator", "XOR");
+		params.put("bool", "XOR");
 		params.put("caseSensitive", "false");

 		StringContainsMatch stringContainsMatch = new StringContainsMatch(params);
@ -128,7 +120,7 @@ public class ComparatorTest extends AbstractPaceTest {
 		assertEquals(0.0, stringContainsMatch.distance("openorgs", "openorgs", conf));

 		params.put("string", "openorgs");
-		params.put("aggregator", "AND");
+		params.put("bool", "AND");
 		params.put("caseSensitive", "false");

 		stringContainsMatch = new StringContainsMatch(params);
--- a/dhp-pace-core/src/test/java/eu/dnetlib/pace/util/UtilTest.java
+++ b/dhp-pace-core/src/test/java/eu/dnetlib/pace/util/UtilTest.java
@ -6,8 +6,7 @@ import static org.junit.jupiter.api.Assertions.assertEquals;
 import java.util.HashMap;
 import java.util.Map;

-import org.junit.jupiter.api.BeforeAll;
-import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.*;

 import eu.dnetlib.pace.model.Person;
 import jdk.nashorn.internal.ir.annotations.Ignore;
--- a/dhp-workflows/dhp-actionmanager/README.md
+++ b/dhp-workflows/dhp-actionmanager/README.md
@ -1,72 +0,0 @@
-# Action Management Framework
-
-This module implements the oozie workflow for the integration of pre-built contents into the OpenAIRE Graph.
-
-Such contents can be 
-
-* brand new, non-existing records to be introduced as nodes of the graph
-* updates (or enrichment) for records that does exist in the graph (e.g. a new subject term for a publication)
-* relations among existing nodes
-
-The actionset contents are organised into logical containers, each of them can contain multiple versions contents and is characterised by
-
-* a name
-* an identifier
-* the paths on HDFS where each version of the contents is stored
-
-Each version is then characterised by 
-
-* the creation date
-* the last update date
-* the indication where it is the latest one or it is an expired version, candidate for garbage collection
-
-## ActionSet serialization
-
-Each actionset version contains records compliant to the graph internal data model, i.e. subclasses of `eu.dnetlib.dhp.schema.oaf.Oaf`,
-defined in the external schemas module
-
-```
-<dependency>
-    <groupId>eu.dnetlib.dhp</groupId>
-    <artifactId>${dhp-schemas.artifact}</artifactId>
-    <version>${dhp-schemas.version}</version>
-</dependency>
-```
-
-When the actionset contains a relationship, the model class to use is `eu.dnetlib.dhp.schema.oaf.Relation`, otherwise 
-when the actionset contains an entity, it is a `eu.dnetlib.dhp.schema.oaf.OafEntity` or one of its subclasses 
-`Datasource`, `Organization`, `Project`, `Result` (or one of its subclasses `Publication`, `Dataset`, etc...). 
-
-Then, each OpenAIRE Graph model class instance must be wrapped using the class `eu.dnetlib.dhp.schema.action.AtomicAction`, a generic 
-container that defines two attributes
-
-* `T payload` the OpenAIRE Graph class instance containing the data;
-* `Class<T> clazz` must contain the class whose instance is contained in the payload.
-
-Each AtomicAction can be then serialised in JSON format using `com.fasterxml.jackson.databind.ObjectMapper` from
-
-```
-<dependency>
-    <groupId>com.fasterxml.jackson.core</groupId>
-    <artifactId>jackson-databind</artifactId>
-    <version>${dhp.jackson.version}</version>
-</dependency>
-```
-
-Then, the JSON serialization must be stored as a GZip compressed sequence file (`org.apache.hadoop.mapred.SequenceFileOutputFormat`). 
-As such, it contains a set of tuples, a key and a value defined as `org.apache.hadoop.io.Text` where
-
-* the `key` must be set to the class canonical name contained in the `AtomicAction`;
-* the `value` must be set to the AtomicAction JSON serialization.
-
-The following snippet provides an example of how create an actionset version of Relation records:
-
-```
-  rels // JavaRDD<Relation>
-    .map(relation -> new AtomicAction<Relation>(Relation.class, relation))
-    .mapToPair(
-        aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
-            new Text(OBJECT_MAPPER.writeValueAsString(aa))))
-    .saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
-```
-
--- a/dhp-workflows/dhp-actionmanager/pom.xml
+++ b/dhp-workflows/dhp-actionmanager/pom.xml
@ -63,6 +63,10 @@
                    <groupId>eu.dnetlib</groupId>
                    <artifactId>dnet-openaireplus-mapping-utils</artifactId>
                </exclusion>
+                <exclusion>
+                    <groupId>eu.dnetlib</groupId>
+                    <artifactId>dnet-index-solr-common</artifactId>
+                </exclusion>
                <exclusion>
                    <groupId>saxonica</groupId>
                    <artifactId>saxon</artifactId>
--- a/dhp-workflows/dhp-actionmanager/src/main/java/eu/dnetlib/dhp/actionmanager/promote/PromoteActionPayloadForGraphTableJob.java
+++ b/dhp-workflows/dhp-actionmanager/src/main/java/eu/dnetlib/dhp/actionmanager/promote/PromoteActionPayloadForGraphTableJob.java
@ -20,7 +20,6 @@ import org.apache.spark.sql.SparkSession;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

-import com.fasterxml.jackson.databind.DeserializationFeature;
 import com.fasterxml.jackson.databind.ObjectMapper;
 import com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException;

@ -34,8 +33,7 @@ import eu.dnetlib.dhp.schema.oaf.*;
 public class PromoteActionPayloadForGraphTableJob {
 	private static final Logger logger = LoggerFactory.getLogger(PromoteActionPayloadForGraphTableJob.class);

-	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
-		.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
+	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

 	public static void main(String[] args) throws Exception {
 		String jsonConfiguration = IOUtils
--- a/dhp-workflows/dhp-actionmanager/src/test/java/eu/dnetlib/dhp/actionmanager/partition/PartitionActionSetsByPayloadTypeJobTest.java
+++ b/dhp-workflows/dhp-actionmanager/src/test/java/eu/dnetlib/dhp/actionmanager/partition/PartitionActionSetsByPayloadTypeJobTest.java
@ -31,7 +31,6 @@ import org.mockito.Mock;
 import org.mockito.Mockito;
 import org.mockito.junit.jupiter.MockitoExtension;

-import com.fasterxml.jackson.databind.DeserializationFeature;
 import com.fasterxml.jackson.databind.ObjectMapper;

 import eu.dnetlib.dhp.actionmanager.ISClient;
@ -47,8 +46,7 @@ public class PartitionActionSetsByPayloadTypeJobTest {
 	private static Configuration configuration;
 	private static SparkSession spark;

-	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
-		.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
+	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

 	private static final StructType ATOMIC_ACTION_SCHEMA = StructType$.MODULE$
 		.apply(
--- a/dhp-workflows/dhp-actionmanager/src/test/java/eu/dnetlib/dhp/actionmanager/promote/PromoteActionPayloadForGraphTableJobTest.java
+++ b/dhp-workflows/dhp-actionmanager/src/test/java/eu/dnetlib/dhp/actionmanager/promote/PromoteActionPayloadForGraphTableJobTest.java
@ -25,7 +25,6 @@ import org.junit.jupiter.params.ParameterizedTest;
 import org.junit.jupiter.params.provider.Arguments;
 import org.junit.jupiter.params.provider.MethodSource;

-import com.fasterxml.jackson.databind.DeserializationFeature;
 import com.fasterxml.jackson.databind.ObjectMapper;

 import eu.dnetlib.dhp.schema.common.ModelSupport;
@ -42,8 +41,7 @@ public class PromoteActionPayloadForGraphTableJobTest {
 	private Path inputActionPayloadRootDir;
 	private Path outputDir;

-	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
-		.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
+	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

 	@BeforeAll
 	public static void beforeAll() {
@ -156,10 +154,6 @@ public class PromoteActionPayloadForGraphTableJobTest {
 			List<? extends Oaf> actualOutputRows = readGraphTableFromJobOutput(outputGraphTableDir.toString(), rowClazz)
 				.collectAsList()
 				.stream()
-				.map(s -> {
-					s.setLastupdatetimestamp(0L);
-					return s;
-				})
 				.sorted(Comparator.comparingInt(Object::hashCode))
 				.collect(Collectors.toList());
 			String expectedOutputGraphTableJsonDumpPath = resultFileLocation(strategy, rowClazz, actionPayloadClazz);
@ -172,10 +166,6 @@ public class PromoteActionPayloadForGraphTableJobTest {
 				expectedOutputGraphTableJsonDumpFile.toString(), rowClazz)
 					.collectAsList()
 					.stream()
-					.map(s -> {
-						s.setLastupdatetimestamp(0L);
-						return s;
-					})
 					.sorted(Comparator.comparingInt(Object::hashCode))
 					.collect(Collectors.toList());
 			assertIterableEquals(expectedOutputRows, actualOutputRows);
--- a/dhp-workflows/dhp-aggregation/pom.xml
+++ b/dhp-workflows/dhp-aggregation/pom.xml
@ -79,8 +79,8 @@
        </dependency>
        <dependency>
            <groupId>org.scala-lang.modules</groupId>
-            <artifactId>scala-xml_${scala.binary.version}</artifactId>
-            <version>${scala-xml.version}</version>
+            <artifactId>scala-xml_2.12</artifactId>
+            <version>2.1.0</version>
        </dependency>

        <dependency>
--- a/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/Constants.java
+++ b/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/Constants.java
@ -11,7 +11,6 @@ import org.apache.spark.sql.SparkSession;
 import com.fasterxml.jackson.databind.ObjectMapper;

 import eu.dnetlib.dhp.application.ArgumentApplicationParser;
-import eu.dnetlib.dhp.common.HdfsSupport;
 import eu.dnetlib.dhp.schema.common.ModelConstants;
 import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
 import eu.dnetlib.dhp.schema.oaf.Subject;
@ -94,9 +93,4 @@ public class Constants {
 		return s;

 	}
-
-	public static void removeOutputDir(SparkSession spark, String path) {
-		HdfsSupport.remove(path, spark.sparkContext().hadoopConfiguration());
-	}
-
 }
--- a/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/bipaffiliations/PrepareAffiliationRelations.java
+++ b/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/bipaffiliations/PrepareAffiliationRelations.java
@ -1,162 +0,0 @@
-
-package eu.dnetlib.dhp.actionmanager.bipaffiliations;
-
-import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
-
-import java.io.Serializable;
-import java.util.Arrays;
-import java.util.List;
-
-import org.apache.commons.io.IOUtils;
-import org.apache.hadoop.io.Text;
-import org.apache.hadoop.io.compress.GzipCodec;
-import org.apache.hadoop.mapred.SequenceFileOutputFormat;
-import org.apache.spark.SparkConf;
-import org.apache.spark.api.java.function.FlatMapFunction;
-import org.apache.spark.sql.*;
-import org.apache.spark.sql.Dataset;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import com.fasterxml.jackson.databind.ObjectMapper;
-
-import eu.dnetlib.dhp.actionmanager.Constants;
-import eu.dnetlib.dhp.actionmanager.ror.GenerateRorActionSetJob;
-import eu.dnetlib.dhp.application.ArgumentApplicationParser;
-import eu.dnetlib.dhp.schema.action.AtomicAction;
-import eu.dnetlib.dhp.schema.common.ModelConstants;
-import eu.dnetlib.dhp.schema.oaf.*;
-import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
-import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
-import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
-import scala.Tuple2;
-
-/**
- * Creates action sets for Crossref affiliation relations inferred by BIP!
- */
-public class PrepareAffiliationRelations implements Serializable {
-
-	private static final Logger log = LoggerFactory.getLogger(PrepareAffiliationRelations.class);
-	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
-	private static final String ID_PREFIX = "50|doi_________::";
-	public static final String BIP_AFFILIATIONS_CLASSID = "result:organization:bipinference";
-	public static final String BIP_AFFILIATIONS_CLASSNAME = "Affiliation relation inferred by BIP!";
-	public static final String BIP_INFERENCE_PROVENANCE = "bip:affiliation:crossref";
-
-	public static <I extends Result> void main(String[] args) throws Exception {
-
-		String jsonConfiguration = IOUtils
-			.toString(
-				PrepareAffiliationRelations.class
-					.getResourceAsStream(
-						"/eu/dnetlib/dhp/actionmanager/bipaffiliations/input_actionset_parameter.json"));
-
-		final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);
-		parser.parseArgument(args);
-
-		Boolean isSparkSessionManaged = Constants.isSparkSessionManaged(parser);
-		log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
-
-		final String inputPath = parser.get("inputPath");
-		log.info("inputPath {}: ", inputPath);
-
-		final String outputPath = parser.get("outputPath");
-		log.info("outputPath {}: ", outputPath);
-
-		SparkConf conf = new SparkConf();
-
-		runWithSparkSession(
-			conf,
-			isSparkSessionManaged,
-			spark -> {
-				Constants.removeOutputDir(spark, outputPath);
-				prepareAffiliationRelations(spark, inputPath, outputPath);
-			});
-	}
-
-	private static <I extends Result> void prepareAffiliationRelations(SparkSession spark, String inputPath,
-		String outputPath) {
-
-		// load and parse affiliation relations from HDFS
-		Dataset<Row> df = spark
-			.read()
-			.schema("`DOI` STRING, `Matchings` ARRAY<STRUCT<`RORid`:STRING,`Confidence`:DOUBLE>>")
-			.json(inputPath);
-
-		// unroll nested arrays
-		df = df
-			.withColumn("matching", functions.explode(new Column("Matchings")))
-			.select(
-				new Column("DOI").as("doi"),
-				new Column("matching.RORid").as("rorid"),
-				new Column("matching.Confidence").as("confidence"));
-
-		// prepare action sets for affiliation relations
-		df
-			.toJavaRDD()
-			.flatMap((FlatMapFunction<Row, Relation>) row -> {
-
-				// DOI to OpenAIRE id
-				final String paperId = ID_PREFIX
-					+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", row.getAs("doi")));
-
-				// ROR id to OpenAIRE id
-				final String affId = GenerateRorActionSetJob.calculateOpenaireId(row.getAs("rorid"));
-
-				Qualifier qualifier = OafMapperUtils
-					.qualifier(
-						BIP_AFFILIATIONS_CLASSID,
-						BIP_AFFILIATIONS_CLASSNAME,
-						ModelConstants.DNET_PROVENANCE_ACTIONS,
-						ModelConstants.DNET_PROVENANCE_ACTIONS);
-
-				// format data info; setting `confidence` into relation's `trust`
-				DataInfo dataInfo = OafMapperUtils
-					.dataInfo(
-						false,
-						BIP_INFERENCE_PROVENANCE,
-						true,
-						false,
-						qualifier,
-						Double.toString(row.getAs("confidence")));
-
-				List<KeyValue> collectedfrom = OafMapperUtils.listKeyValues(ModelConstants.CROSSREF_ID, "Crossref");
-
-				// return bi-directional relations
-				return getAffiliationRelationPair(paperId, affId, collectedfrom, dataInfo).iterator();
-
-			})
-			.map(p -> new AtomicAction(Relation.class, p))
-			.mapToPair(
-				aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
-					new Text(OBJECT_MAPPER.writeValueAsString(aa))))
-			.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
-
-	}
-
-	private static List<Relation> getAffiliationRelationPair(String paperId, String affId, List<KeyValue> collectedfrom,
-		DataInfo dataInfo) {
-		return Arrays
-			.asList(
-				OafMapperUtils
-					.getRelation(
-						paperId,
-						affId,
-						ModelConstants.RESULT_ORGANIZATION,
-						ModelConstants.AFFILIATION,
-						ModelConstants.HAS_AUTHOR_INSTITUTION,
-						collectedfrom,
-						dataInfo,
-						null),
-				OafMapperUtils
-					.getRelation(
-						affId,
-						paperId,
-						ModelConstants.RESULT_ORGANIZATION,
-						ModelConstants.AFFILIATION,
-						ModelConstants.IS_AUTHOR_INSTITUTION_OF,
-						collectedfrom,
-						dataInfo,
-						null));
-	}
-}
--- a/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/bipfinder/SparkAtomicActionScoreJob.java
+++ b/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/bipfinder/SparkAtomicActionScoreJob.java
@ -6,14 +6,13 @@ import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;

 import java.io.Serializable;
 import java.util.List;
+import java.util.Optional;
 import java.util.stream.Collectors;

 import org.apache.commons.io.IOUtils;
 import org.apache.hadoop.io.Text;
-import org.apache.hadoop.io.compress.GzipCodec;
 import org.apache.hadoop.mapred.SequenceFileOutputFormat;
 import org.apache.spark.SparkConf;
-import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.api.java.function.MapFunction;
@ -25,9 +24,8 @@ import org.slf4j.LoggerFactory;

 import com.fasterxml.jackson.databind.ObjectMapper;

+import eu.dnetlib.dhp.actionmanager.bipmodel.BipDeserialize;
 import eu.dnetlib.dhp.actionmanager.bipmodel.BipScore;
-import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipProjectModel;
-import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipResultModel;
 import eu.dnetlib.dhp.application.ArgumentApplicationParser;
 import eu.dnetlib.dhp.common.HdfsSupport;
 import eu.dnetlib.dhp.schema.action.AtomicAction;
@ -42,6 +40,7 @@ import scala.Tuple2;
 */
 public class SparkAtomicActionScoreJob implements Serializable {

+	private static final String DOI = "doi";
 	private static final Logger log = LoggerFactory.getLogger(SparkAtomicActionScoreJob.class);
 	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

@ -57,17 +56,18 @@ public class SparkAtomicActionScoreJob implements Serializable {

 		parser.parseArgument(args);

-		Boolean isSparkSessionManaged = isSparkSessionManaged(parser);
+		Boolean isSparkSessionManaged = Optional
+			.ofNullable(parser.get("isSparkSessionManaged"))
+			.map(Boolean::valueOf)
+			.orElse(Boolean.TRUE);
+
 		log.info("isSparkSessionManaged: {}", isSparkSessionManaged);

-		final String resultsInputPath = parser.get("resultsInputPath");
-		log.info("resultsInputPath: {}", resultsInputPath);
-
-		final String projectsInputPath = parser.get("projectsInputPath");
-		log.info("projectsInputPath: {}", projectsInputPath);
+		final String inputPath = parser.get("inputPath");
+		log.info("inputPath {}: ", inputPath);

 		final String outputPath = parser.get("outputPath");
-		log.info("outputPath: {}", outputPath);
+		log.info("outputPath {}: ", outputPath);

 		SparkConf conf = new SparkConf();

@ -76,45 +76,17 @@ public class SparkAtomicActionScoreJob implements Serializable {
 			isSparkSessionManaged,
 			spark -> {
 				removeOutputDir(spark, outputPath);
-
-				JavaPairRDD<Text, Text> resultsRDD = prepareResults(spark, resultsInputPath, outputPath);
-				JavaPairRDD<Text, Text> projectsRDD = prepareProjects(spark, projectsInputPath, outputPath);
-
-				resultsRDD
-					.union(projectsRDD)
-					.saveAsHadoopFile(
-						outputPath, Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);
+				prepareResults(spark, inputPath, outputPath);
 			});
 	}

-	private static <I extends Project> JavaPairRDD<Text, Text> prepareProjects(SparkSession spark, String inputPath,
-		String outputPath) {
-
-		// read input bip project scores
-		Dataset<BipProjectModel> projectScores = readPath(spark, inputPath, BipProjectModel.class);
-
-		return projectScores.map((MapFunction<BipProjectModel, Project>) bipProjectScores -> {
-			Project project = new Project();
-			project.setId(bipProjectScores.getProjectId());
-			project.setMeasures(bipProjectScores.toMeasures());
-			return project;
-		}, Encoders.bean(Project.class))
-			.toJavaRDD()
-			.map(p -> new AtomicAction(Project.class, p))
-			.mapToPair(
-				aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
-					new Text(OBJECT_MAPPER.writeValueAsString(aa))));
-
-	}
-
-	private static <I extends Result> JavaPairRDD<Text, Text> prepareResults(SparkSession spark, String bipScorePath,
-		String outputPath) {
+	private static <I extends Result> void prepareResults(SparkSession spark, String bipScorePath, String outputPath) {

 		final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

-		JavaRDD<BipResultModel> bipDeserializeJavaRDD = sc
+		JavaRDD<BipDeserialize> bipDeserializeJavaRDD = sc
 			.textFile(bipScorePath)
-			.map(item -> OBJECT_MAPPER.readValue(item, BipResultModel.class));
+			.map(item -> OBJECT_MAPPER.readValue(item, BipDeserialize.class));

 		Dataset<BipScore> bipScores = spark
 			.createDataset(bipDeserializeJavaRDD.flatMap(entry -> entry.keySet().stream().map(key -> {
@ -124,20 +96,24 @@ public class SparkAtomicActionScoreJob implements Serializable {
 				return bs;
 			}).collect(Collectors.toList()).iterator()).rdd(), Encoders.bean(BipScore.class));

-		return bipScores.map((MapFunction<BipScore, Result>) bs -> {
-			Result ret = new Result();
+		bipScores

-			ret.setId(bs.getId());
+			.map((MapFunction<BipScore, Result>) bs -> {
+				Result ret = new Result();

-			ret.setMeasures(getMeasure(bs));
+				ret.setId(bs.getId());

-			return ret;
-		}, Encoders.bean(Result.class))
+				ret.setMeasures(getMeasure(bs));
+
+				return ret;
+			}, Encoders.bean(Result.class))
 			.toJavaRDD()
 			.map(p -> new AtomicAction(Result.class, p))
 			.mapToPair(
 				aa -> new Tuple2<>(new Text(aa.getClazz().getCanonicalName()),
-					new Text(OBJECT_MAPPER.writeValueAsString(aa))));
+					new Text(OBJECT_MAPPER.writeValueAsString(aa))))
+			.saveAsHadoopFile(outputPath, Text.class, Text.class, SequenceFileOutputFormat.class);
+
 	}

 	private static List<Measure> getMeasure(BipScore value) {
@ -183,4 +159,12 @@ public class SparkAtomicActionScoreJob implements Serializable {
 		HdfsSupport.remove(path, spark.sparkContext().hadoopConfiguration());
 	}

+	public static <R> Dataset<R> readPath(
+		SparkSession spark, String inputPath, Class<R> clazz) {
+		return spark
+			.read()
+			.textFile(inputPath)
+			.map((MapFunction<String, R>) value -> OBJECT_MAPPER.readValue(value, clazz), Encoders.bean(clazz));
+	}
+
 }
--- a/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/bipmodel/score/deserializers/BipResultModel.java
+++ b/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/bipmodel/score/deserializers/BipResultModel.java
@ -1,21 +1,19 @@

-package eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers;
+package eu.dnetlib.dhp.actionmanager.bipmodel;

 import java.io.Serializable;
 import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.List;

-import eu.dnetlib.dhp.actionmanager.bipmodel.Score;
-
 /**
 * Class that maps the model of the bipFinder! input data.
 * Only needed for deserialization purposes
 */

-public class BipResultModel extends HashMap<String, List<Score>> implements Serializable {
+public class BipDeserialize extends HashMap<String, List<Score>> implements Serializable {

-	public BipResultModel() {
+	public BipDeserialize() {
 		super();
 	}

--- a/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/bipmodel/score/deserializers/BipProjectModel.java
+++ b/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/bipmodel/score/deserializers/BipProjectModel.java
@ -1,74 +0,0 @@
-
-package eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers;
-
-import static eu.dnetlib.dhp.actionmanager.Constants.*;
-
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Collections;
-import java.util.List;
-
-import com.opencsv.bean.CsvBindByPosition;
-
-import eu.dnetlib.dhp.schema.common.ModelConstants;
-import eu.dnetlib.dhp.schema.oaf.KeyValue;
-import eu.dnetlib.dhp.schema.oaf.Measure;
-import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
-import lombok.AllArgsConstructor;
-import lombok.Getter;
-import lombok.NoArgsConstructor;
-import lombok.Setter;
-
-@NoArgsConstructor
-@AllArgsConstructor
-@Getter
-@Setter
-public class BipProjectModel {
-	String projectId;
-
-	String numOfInfluentialResults;
-
-	String numOfPopularResults;
-
-	String totalImpulse;
-
-	String totalCitationCount;
-
-	// each project bip measure has exactly one value, hence one key-value pair
-	private Measure createMeasure(String measureId, String measureValue) {
-
-		KeyValue kv = new KeyValue();
-		kv.setKey("score");
-		kv.setValue(measureValue);
-		kv
-			.setDataInfo(
-				OafMapperUtils
-					.dataInfo(
-						false,
-						UPDATE_DATA_INFO_TYPE,
-						true,
-						false,
-						OafMapperUtils
-							.qualifier(
-								UPDATE_MEASURE_BIP_CLASS_ID,
-								UPDATE_CLASS_NAME,
-								ModelConstants.DNET_PROVENANCE_ACTIONS,
-								ModelConstants.DNET_PROVENANCE_ACTIONS),
-						""));
-
-		Measure measure = new Measure();
-		measure.setId(measureId);
-		measure.setUnit(Collections.singletonList(kv));
-		return measure;
-	}
-
-	public List<Measure> toMeasures() {
-		return Arrays
-			.asList(
-				createMeasure("numOfInfluentialResults", numOfInfluentialResults),
-				createMeasure("numOfPopularResults", numOfPopularResults),
-				createMeasure("totalImpulse", totalImpulse),
-				createMeasure("totalCitationCount", totalCitationCount));
-	}
-
-}
--- a/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/createunresolvedentities/PrepareBipFinder.java
+++ b/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/createunresolvedentities/PrepareBipFinder.java
@ -24,8 +24,8 @@ import org.slf4j.LoggerFactory;

 import com.fasterxml.jackson.databind.ObjectMapper;

+import eu.dnetlib.dhp.actionmanager.bipmodel.BipDeserialize;
 import eu.dnetlib.dhp.actionmanager.bipmodel.BipScore;
-import eu.dnetlib.dhp.actionmanager.bipmodel.score.deserializers.BipResultModel;
 import eu.dnetlib.dhp.application.ArgumentApplicationParser;
 import eu.dnetlib.dhp.common.HdfsSupport;
 import eu.dnetlib.dhp.schema.common.ModelConstants;
@ -82,9 +82,9 @@ public class PrepareBipFinder implements Serializable {

 		final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

-		JavaRDD<BipResultModel> bipDeserializeJavaRDD = sc
+		JavaRDD<BipDeserialize> bipDeserializeJavaRDD = sc
 			.textFile(inputPath)
-			.map(item -> OBJECT_MAPPER.readValue(item, BipResultModel.class));
+			.map(item -> OBJECT_MAPPER.readValue(item, BipDeserialize.class));

 		spark
 			.createDataset(bipDeserializeJavaRDD.flatMap(entry -> entry.keySet().stream().map(key -> {
--- a/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/project/utils/ReadTopics.java
+++ b/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/project/utils/ReadTopics.java
@ -6,6 +6,7 @@ import java.io.IOException;
 import java.io.OutputStreamWriter;
 import java.io.Serializable;
 import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
 import java.util.List;

 import org.apache.commons.io.IOUtils;
@ -22,6 +23,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;

 import eu.dnetlib.dhp.actionmanager.project.PrepareProjects;
 import eu.dnetlib.dhp.actionmanager.project.utils.model.JsonTopic;
+import eu.dnetlib.dhp.actionmanager.project.utils.model.Project;
 import eu.dnetlib.dhp.application.ArgumentApplicationParser;

 /**
--- a/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/ror/GenerateRorActionSetJob.java
+++ b/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/ror/GenerateRorActionSetJob.java
@ -168,7 +168,7 @@ public class GenerateRorActionSetJob {

 	}

-	public static String calculateOpenaireId(final String rorId) {
+	private static String calculateOpenaireId(final String rorId) {
 		return String.format("20|%s::%s", Constants.ROR_NS_PREFIX, DHPUtils.md5(rorId));
 	}

--- a/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/usagestats/SparkAtomicActionUsageJob.java
+++ b/dhp-workflows/dhp-aggregation/src/main/java/eu/dnetlib/dhp/actionmanager/usagestats/SparkAtomicActionUsageJob.java
@ -75,7 +75,7 @@ public class SparkAtomicActionUsageJob implements Serializable {
 				removeOutputDir(spark, outputPath);
 				prepareData(dbname, spark, workingPath + "/usageDb", "usage_stats", "result_id");
 				prepareData(dbname, spark, workingPath + "/projectDb", "project_stats", "id");
-				prepareData(dbname, spark, workingPath + "/datasourceDb", "datasource_stats", "repository_id");
+				prepareData(dbname, spark, workingPath + "/datasourceDb", "datasource_stats", "repositor_id");
 				writeActionSet(spark, workingPath, outputPath);
 			});
 	}
--- a/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/actionmanager/bipaffiliations/input_actionset_parameter.json
+++ b/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/actionmanager/bipaffiliations/input_actionset_parameter.json
@ -1,20 +0,0 @@
-[
-  {
-    "paramName": "issm",
-    "paramLongName": "isSparkSessionManaged",
-    "paramDescription": "when true will stop SparkSession after job execution",
-    "paramRequired": false
-  },
-  {
-    "paramName": "ip",
-    "paramLongName": "inputPath",
-    "paramDescription": "the URL from where to get the programme file",
-    "paramRequired": true
-  },
-  {
-    "paramName": "o",
-    "paramLongName": "outputPath",
-    "paramDescription": "the path of the new ActionSet",
-    "paramRequired": true
-  }
-]
--- a/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/actionmanager/bipaffiliations/job.properties
+++ b/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/actionmanager/bipaffiliations/job.properties
@ -1,35 +0,0 @@
-# --- You can override the following properties (if needed) coming from your ~/.dhp/application.properties ---
-# dhp.hadoop.frontend.temp.dir=/home/ilias.kanellos
-# dhp.hadoop.frontend.user.name=ilias.kanellos
-# dhp.hadoop.frontend.host.name=iis-cdh5-test-gw.ocean.icm.edu.pl
-# dhp.hadoop.frontend.port.ssh=22
-# oozieServiceLoc=http://iis-cdh5-test-m3:11000/oozie
-# jobTracker=yarnRM
-# nameNode=hdfs://nameservice1
-# oozie.execution.log.file.location = target/extract-and-run-on-remote-host.log
-# maven.executable=mvn
-
-# Some memory and driver settings for more demanding tasks
-sparkDriverMemory=10G
-sparkExecutorMemory=10G
-sparkExecutorCores=4
-sparkShufflePartitions=7680
-
-# The above is given differently in an example I found online
-oozie.action.sharelib.for.spark=spark2
-oozieActionShareLibForSpark2=spark2
-spark2YarnHistoryServerAddress=http://iis-cdh5-test-gw.ocean.icm.edu.pl:18089
-spark2EventLogDir=/user/spark/spark2ApplicationHistory
-sparkSqlWarehouseDir=/user/hive/warehouse
-hiveMetastoreUris=thrift://iis-cdh5-test-m3.ocean.icm.edu.pl:9083
-# This MAY avoid the no library used error
-oozie.use.system.libpath=true
-# Some stuff copied from openaire's jobs
-spark2ExtraListeners=com.cloudera.spark.lineage.NavigatorAppListener
-spark2SqlQueryExecutionListeners=com.cloudera.spark.lineage.NavigatorQueryListener
-
-# The following is needed as a property of a workflow
-oozie.wf.application.path=${oozieTopWfApplicationPath}
-
-inputPath=/data/bip-affiliations/data.json
-outputPath=/tmp/crossref-affiliations-output-v5
--- a/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/actionmanager/bipaffiliations/oozie_app/config-default.xml
+++ b/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/actionmanager/bipaffiliations/oozie_app/config-default.xml
@ -1,30 +0,0 @@
-<configuration>
-    <property>
-        <name>jobTracker</name>
-        <value>yarnRM</value>
-    </property>
-    <property>
-        <name>nameNode</name>
-        <value>hdfs://nameservice1</value>
-    </property>
-    <property>
-        <name>oozie.use.system.libpath</name>
-        <value>true</value>
-    </property>
-    <property>
-        <name>hiveMetastoreUris</name>
-        <value>thrift://iis-cdh5-test-m3.ocean.icm.edu.pl:9083</value>
-    </property>
-    <property>
-        <name>hiveJdbcUrl</name>
-        <value>jdbc:hive2://iis-cdh5-test-m3.ocean.icm.edu.pl:10000</value>
-    </property>
-    <property>
-        <name>hiveDbName</name>
-        <value>openaire</value>
-    </property>
-    <property>
-        <name>oozie.launcher.mapreduce.user.classpath.first</name>
-        <value>true</value>
-    </property>
-</configuration>
--- a/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/actionmanager/bipaffiliations/oozie_app/workflow.xml
+++ b/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/actionmanager/bipaffiliations/oozie_app/workflow.xml
@ -1,107 +0,0 @@
-<workflow-app name="BipAffiliations" xmlns="uri:oozie:workflow:0.5">
-    <parameters>
-
-        <property>
-            <name>inputPath</name>
-            <description>the path where to find the inferred affiliation relations</description>
-        </property>
-        <property>
-            <name>outputPath</name>
-            <description>the path where to store the actionset</description>
-        </property>
-        <property>
-            <name>sparkDriverMemory</name>
-            <description>memory for driver process</description>
-        </property>
-        <property>
-            <name>sparkExecutorMemory</name>
-            <description>memory for individual executor</description>
-        </property>
-        <property>
-            <name>sparkExecutorCores</name>
-            <description>number of cores used by single executor</description>
-        </property>
-        <property>
-            <name>oozieActionShareLibForSpark2</name>
-            <description>oozie action sharelib for spark 2.*</description>
-        </property>
-        <property>
-            <name>spark2ExtraListeners</name>
-            <value>com.cloudera.spark.lineage.NavigatorAppListener</value>
-            <description>spark 2.* extra listeners classname</description>
-        </property>
-        <property>
-            <name>spark2SqlQueryExecutionListeners</name>
-            <value>com.cloudera.spark.lineage.NavigatorQueryListener</value>
-            <description>spark 2.* sql query execution listeners classname</description>
-        </property>
-        <property>
-            <name>spark2YarnHistoryServerAddress</name>
-            <description>spark 2.* yarn history server address</description>
-        </property>
-        <property>
-            <name>spark2EventLogDir</name>
-            <description>spark 2.* event log dir location</description>
-        </property>
-    </parameters>
-
-    <global>
-        <job-tracker>${jobTracker}</job-tracker>
-        <name-node>${nameNode}</name-node>
-        <configuration>
-            <property>
-                <name>mapreduce.job.queuename</name>
-                <value>${queueName}</value>
-            </property>
-            <property>
-                <name>oozie.launcher.mapred.job.queue.name</name>
-                <value>${oozieLauncherQueueName}</value>
-            </property>
-            <property>
-                <name>oozie.action.sharelib.for.spark</name>
-                <value>${oozieActionShareLibForSpark2}</value>
-            </property>
-
-        </configuration>
-    </global>
-    <start to="deleteoutputpath"/>
-    <kill name="Kill">
-        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
-    </kill>
-    <action name="deleteoutputpath">
-        <fs>
-            <delete path="${outputPath}"/>
-            <mkdir path="${outputPath}"/>
-            <delete path="${workingDir}"/>
-            <mkdir path="${workingDir}"/>
-        </fs>
-        <ok to="atomicactions"/>
-        <error to="Kill"/>
-    </action>
-
-    <action name="atomicactions">
-        <spark xmlns="uri:oozie:spark-action:0.2">
-            <master>yarn</master>
-            <mode>cluster</mode>
-            <name>Produces the atomic action with the inferred by BIP! affiliation relations from Crossref</name>
-            <class>eu.dnetlib.dhp.actionmanager.bipaffiliations.PrepareAffiliationRelations</class>
-            <jar>dhp-aggregation-${projectVersion}.jar</jar>
-            <spark-opts>
-                --executor-memory=${sparkExecutorMemory}
-                --executor-cores=${sparkExecutorCores}
-                --driver-memory=${sparkDriverMemory}
-                --conf spark.extraListeners=${spark2ExtraListeners}
-                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
-                --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
-                --conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
-                --conf spark.sql.warehouse.dir=${sparkSqlWarehouseDir}
-            </spark-opts>
-            <arg>--inputPath</arg><arg>${inputPath}</arg>
-            <arg>--outputPath</arg><arg>${outputPath}</arg>
-        </spark>
-        <ok to="End"/>
-        <error to="Kill"/>
-    </action>
-
-    <end name="End"/>
-</workflow-app>
--- a/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/actionmanager/bipfinder/input_actionset_parameter.json
+++ b/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/actionmanager/bipfinder/input_actionset_parameter.json
@ -6,15 +6,9 @@
    "paramRequired": false
  },
  {
-    "paramName": "rip",
-    "paramLongName": "resultsInputPath",
-    "paramDescription": "the URL from where to get the input file for results",
-    "paramRequired": true
-  },
-  {
-    "paramName": "pip",
-    "paramLongName": "projectsInputPath",
-    "paramDescription": "the URL from where to get the input file for projects",
+    "paramName": "ip",
+    "paramLongName": "inputPath",
+    "paramDescription": "the URL from where to get the programme file",
    "paramRequired": true
  },
  {
--- a/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/datacite/hostedBy_map.json
+++ b/dhp-workflows/dhp-aggregation/src/main/resources/eu/dnetlib/dhp/datacite/hostedBy_map.json
@ -1,9 +1,4 @@
 {
- "ETHZ.UNIGENF": {
-  "openaire_id": "opendoar____::1400",
-  "datacite_name": "Uni Genf",
-  "official_name": "Archive ouverte UNIGE"
- },
 "GESIS.RKI": {
  "openaire_id": "re3data_____::r3d100010436",
  "datacite_name": "Forschungsdatenzentrum  am Robert Koch Institut",
--- a/dhp-workflows/dhp-aggregation/src/main/scala/eu/dnetlib/dhp/sx/bio/BioDBToOAF.scala
+++ b/dhp-workflows/dhp-aggregation/src/main/scala/eu/dnetlib/dhp/sx/bio/BioDBToOAF.scala
@ -222,7 +222,7 @@ object BioDBToOAF {
  def uniprotToOAF(input: String): List[Oaf] = {
    implicit lazy val formats: DefaultFormats.type = org.json4s.DefaultFormats
    lazy val json = parse(input)
-    val pid = (json \ "pid").extract[String].trim()
+    val pid = (json \ "pid").extract[String]

    val d = new Dataset

--- a/dhp-workflows/dhp-aggregation/src/main/scala/eu/dnetlib/dhp/sx/bio/ebi/SparkCreateBaselineDataFrame.scala
+++ b/dhp-workflows/dhp-aggregation/src/main/scala/eu/dnetlib/dhp/sx/bio/ebi/SparkCreateBaselineDataFrame.scala
@ -18,9 +18,9 @@ import org.apache.spark.sql.expressions.Aggregator
 import org.apache.spark.sql._
 import org.slf4j.{Logger, LoggerFactory}

-import java.io.InputStream
+import java.io.{ByteArrayInputStream, InputStream}
 import scala.io.Source
-import scala.xml.pull.XMLEventReader
+//import scala.xml.pull.XMLEventReader

 object SparkCreateBaselineDataFrame {

@ -197,8 +197,8 @@ object SparkCreateBaselineDataFrame {
      val ds: Dataset[PMArticle] = spark.createDataset(
        k.filter(i => i._1.endsWith(".gz"))
          .flatMap(i => {
-            val xml = new XMLEventReader(Source.fromBytes(i._2.getBytes()))
-            new PMParser(xml)
+//            val xml = new XMLEventReader(Source.fromBytes(i._2.getBytes()))
+            new PMParser(new ByteArrayInputStream(i._2.getBytes()))
          })
      )
      ds.map(p => (p.getPmid, p))(Encoders.tuple(Encoders.STRING, PMEncoder))
--- a/dhp-workflows/dhp-aggregation/src/main/scala/eu/dnetlib/dhp/sx/bio/pubmed/PMParser.scala
+++ b/dhp-workflows/dhp-aggregation/src/main/scala/eu/dnetlib/dhp/sx/bio/pubmed/PMParser.scala
@ -1,11 +1,20 @@
 package eu.dnetlib.dhp.sx.bio.pubmed

+import javax.xml.stream.{XMLEventReader, XMLInputFactory, XMLStreamConstants}
+import scala.language.postfixOps
 import scala.xml.MetaData
-import scala.xml.pull.{EvElemEnd, EvElemStart, EvText, XMLEventReader}
+//import scala.xml.pull.{EvElemEnd, EvElemStart, EvText, XMLEventReader}

 /** @param xml
  */
-class PMParser(xml: XMLEventReader) extends Iterator[PMArticle] {
+class PMParser(stream: java.io.InputStream) extends Iterator[PMArticle] {
+
+  private val reader: XMLEventReader = {
+    println("INSTANTIATE READER")
+    val factory = XMLInputFactory.newInstance()
+    factory.createXMLEventReader(stream)
+
+  }

  var currentArticle: PMArticle = generateNextArticle()

@ -49,85 +58,142 @@ class PMParser(xml: XMLEventReader) extends Iterator[PMArticle] {
    var currentMonth = "01"
    var currentDay = "01"
    var currentArticleType: String = null
+    var sb = new StringBuilder()
+    var insideChar = false
+    var complete = false
+    while (reader.hasNext && !complete) {

-    while (xml.hasNext) {
-      xml.next match {
-        case EvElemStart(_, label, attrs, _) =>
-          currNode = label
+      val next = reader.nextEvent()

-          label match {
-            case "PubmedArticle" => currentArticle = new PMArticle
-            case "Author"        => currentAuthor = new PMAuthor
-            case "Journal"       => currentJournal = new PMJournal
-            case "Grant"         => currentGrant = new PMGrant
-            case "PublicationType" | "DescriptorName" =>
-              currentSubject = new PMSubject
-              currentSubject.setMeshId(extractAttributes(attrs, "UI"))
-            case "ArticleId" => currentArticleType = extractAttributes(attrs, "IdType")
-            case _           =>
-          }
-        case EvElemEnd(_, label) =>
-          label match {
-            case "PubmedArticle" => return currentArticle
-            case "Author"        => currentArticle.getAuthors.add(currentAuthor)
-            case "Journal"       => currentArticle.setJournal(currentJournal)
-            case "Grant"         => currentArticle.getGrants.add(currentGrant)
-            case "PubMedPubDate" =>
-              if (currentArticle.getDate == null)
-                currentArticle.setDate(validate_Date(currentYear, currentMonth, currentDay))
-            case "PubDate"         => currentJournal.setDate(s"$currentYear-$currentMonth-$currentDay")
-            case "DescriptorName"  => currentArticle.getSubjects.add(currentSubject)
-            case "PublicationType" => currentArticle.getPublicationTypes.add(currentSubject)
-            case _                 =>
-          }
-        case EvText(text) =>
-          if (currNode != null && text.trim.nonEmpty)
-            currNode match {
-              case "ArticleTitle" => {
-                if (currentArticle.getTitle == null)
-                  currentArticle.setTitle(text.trim)
-                else
-                  currentArticle.setTitle(currentArticle.getTitle + text.trim)
-              }
-              case "AbstractText" => {
-                if (currentArticle.getDescription == null)
-                  currentArticle.setDescription(text.trim)
-                else
-                  currentArticle.setDescription(currentArticle.getDescription + text.trim)
-              }
-              case "PMID" => currentArticle.setPmid(text.trim)
-              case "ArticleId" =>
-                if ("doi".equalsIgnoreCase(currentArticleType)) currentArticle.setDoi(text.trim)
-                if ("pmc".equalsIgnoreCase(currentArticleType)) currentArticle.setPmcId(text.trim)
-              case "Language"                           => currentArticle.setLanguage(text.trim)
-              case "ISSN"                               => currentJournal.setIssn(text.trim)
-              case "GrantID"                            => currentGrant.setGrantID(text.trim)
-              case "Agency"                             => currentGrant.setAgency(text.trim)
-              case "Country"                            => if (currentGrant != null) currentGrant.setCountry(text.trim)
-              case "Year"                               => currentYear = text.trim
-              case "Month"                              => currentMonth = text.trim
-              case "Day"                                => currentDay = text.trim
-              case "Volume"                             => currentJournal.setVolume(text.trim)
-              case "Issue"                              => currentJournal.setIssue(text.trim)
-              case "PublicationType" | "DescriptorName" => currentSubject.setValue(text.trim)
-              case "LastName" => {
-                if (currentAuthor != null)
-                  currentAuthor.setLastName(text.trim)
-              }
-              case "ForeName" =>
-                if (currentAuthor != null)
-                  currentAuthor.setForeName(text.trim)
-              case "Title" =>
-                if (currentJournal.getTitle == null)
-                  currentJournal.setTitle(text.trim)
-                else
-                  currentJournal.setTitle(currentJournal.getTitle + text.trim)
-              case _ =>
+      if (next.isStartElement) {
+        if (insideChar) {
+          if (sb.nonEmpty)
+            println(s"got data ${sb.toString.trim}")
+          insideChar = false
+        }
+        val name = next.asStartElement().getName.getLocalPart
+        println(s"Start Element $name")
+        next.asStartElement().getAttributes.forEachRemaining(e => print(e.toString))

-            }
-        case _ =>
+      } else if (next.isEndElement) {
+        if (insideChar) {
+          if (sb.nonEmpty)
+            println(s"got data ${sb.toString.trim}")
+          insideChar = false
+        }
+        val name = next.asEndElement().getName.getLocalPart
+        println(s"End Element $name")
+        if (name.equalsIgnoreCase("PubmedArticle")) {
+          complete = true
+          println("Condizione di uscita")
+        }
+
+      } else if (next.isCharacters) {
+        if (!insideChar) {
+          insideChar = true
+          sb.clear()
+        }
+        val d = next.asCharacters().getData
+        if (d.trim.nonEmpty)
+          sb.append(d.trim)
      }

+//      next match {
+//        case _ if (next.isStartElement) =>
+//          val name = next.asStartElement().getName.getLocalPart
+//          println(s"Start Element $name")
+//        case _ if (next.isEndElement) =>
+//          val name = next.asStartElement().getName.getLocalPart
+//          println(s"End Element $name")
+//        case _ if (next.isCharacters) =>
+//          val c = next.asCharacters()
+//          val data = c.getData
+//          println(s"Text value $data")
+//
+//      }
+
+      //
+//
+//      reader.next match {
+//
+//        case
+//
+//        case EvElemStart(_, label, attrs, _) =>
+//          currNode = label
+//
+//          label match {
+//            case "PubmedArticle" => currentArticle = new PMArticle
+//            case "Author"        => currentAuthor = new PMAuthor
+//            case "Journal"       => currentJournal = new PMJournal
+//            case "Grant"         => currentGrant = new PMGrant
+//            case "PublicationType" | "DescriptorName" =>
+//              currentSubject = new PMSubject
+//              currentSubject.setMeshId(extractAttributes(attrs, "UI"))
+//            case "ArticleId" => currentArticleType = extractAttributes(attrs, "IdType")
+//            case _           =>
+//          }
+//        case EvElemEnd(_, label) =>
+//          label match {
+//            case "PubmedArticle" => return currentArticle
+//            case "Author"        => currentArticle.getAuthors.add(currentAuthor)
+//            case "Journal"       => currentArticle.setJournal(currentJournal)
+//            case "Grant"         => currentArticle.getGrants.add(currentGrant)
+//            case "PubMedPubDate" =>
+//              if (currentArticle.getDate == null)
+//                currentArticle.setDate(validate_Date(currentYear, currentMonth, currentDay))
+//            case "PubDate"         => currentJournal.setDate(s"$currentYear-$currentMonth-$currentDay")
+//            case "DescriptorName"  => currentArticle.getSubjects.add(currentSubject)
+//            case "PublicationType" => currentArticle.getPublicationTypes.add(currentSubject)
+//            case _                 =>
+//          }
+//        case EvText(text) =>
+//          if (currNode != null && text.trim.nonEmpty)
+//            currNode match {
+//              case "ArticleTitle" => {
+//                if (currentArticle.getTitle == null)
+//                  currentArticle.setTitle(text.trim)
+//                else
+//                  currentArticle.setTitle(currentArticle.getTitle + text.trim)
+//              }
+//              case "AbstractText" => {
+//                if (currentArticle.getDescription == null)
+//                  currentArticle.setDescription(text.trim)
+//                else
+//                  currentArticle.setDescription(currentArticle.getDescription + text.trim)
+//              }
+//              case "PMID" => currentArticle.setPmid(text.trim)
+//              case "ArticleId" =>
+//                if ("doi".equalsIgnoreCase(currentArticleType)) currentArticle.setDoi(text.trim)
+//                if ("pmc".equalsIgnoreCase(currentArticleType)) currentArticle.setPmcId(text.trim)
+//              case "Language"                           => currentArticle.setLanguage(text.trim)
+//              case "ISSN"                               => currentJournal.setIssn(text.trim)
+//              case "GrantID"                            => currentGrant.setGrantID(text.trim)
+//              case "Agency"                             => currentGrant.setAgency(text.trim)
+//              case "Country"                            => if (currentGrant != null) currentGrant.setCountry(text.trim)
+//              case "Year"                               => currentYear = text.trim
+//              case "Month"                              => currentMonth = text.trim
+//              case "Day"                                => currentDay = text.trim
+//              case "Volume"                             => currentJournal.setVolume(text.trim)
+//              case "Issue"                              => currentJournal.setIssue(text.trim)
+//              case "PublicationType" | "DescriptorName" => currentSubject.setValue(text.trim)
+//              case "LastName" => {
+//                if (currentAuthor != null)
+//                  currentAuthor.setLastName(text.trim)
+//              }
+//              case "ForeName" =>
+//                if (currentAuthor != null)
+//                  currentAuthor.setForeName(text.trim)
+//              case "Title" =>
+//                if (currentJournal.getTitle == null)
+//                  currentJournal.setTitle(text.trim)
+//                else
+//                  currentJournal.setTitle(currentJournal.getTitle + text.trim)
+//              case _ =>
+//
+//            }
+//        case _ =>
+//      }
+
    }
    null
  }
--- a/dhp-workflows/dhp-aggregation/src/test/java/eu/dnetlib/dhp/actionmanager/bipaffiliations/PrepareAffiliationRelationsTest.java
+++ b/dhp-workflows/dhp-aggregation/src/test/java/eu/dnetlib/dhp/actionmanager/bipaffiliations/PrepareAffiliationRelationsTest.java
@ -1,145 +0,0 @@
-
-package eu.dnetlib.dhp.actionmanager.bipaffiliations;
-
-import static org.junit.jupiter.api.Assertions.*;
-
-import java.io.IOException;
-import java.nio.file.Files;
-import java.nio.file.Path;
-
-import org.apache.commons.io.FileUtils;
-import org.apache.hadoop.io.Text;
-import org.apache.spark.SparkConf;
-import org.apache.spark.api.java.JavaRDD;
-import org.apache.spark.api.java.JavaSparkContext;
-import org.apache.spark.sql.Dataset;
-import org.apache.spark.sql.Encoders;
-import org.apache.spark.sql.Row;
-import org.apache.spark.sql.SparkSession;
-import org.junit.jupiter.api.AfterAll;
-import org.junit.jupiter.api.Assertions;
-import org.junit.jupiter.api.BeforeAll;
-import org.junit.jupiter.api.Test;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import com.fasterxml.jackson.databind.ObjectMapper;
-
-import eu.dnetlib.dhp.schema.action.AtomicAction;
-import eu.dnetlib.dhp.schema.common.ModelConstants;
-import eu.dnetlib.dhp.schema.oaf.Relation;
-import eu.dnetlib.dhp.schema.oaf.utils.CleaningFunctions;
-import eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory;
-
-public class PrepareAffiliationRelationsTest {
-
-	private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
-
-	private static SparkSession spark;
-
-	private static Path workingDir;
-	private static final String ID_PREFIX = "50|doi_________::";
-	private static final Logger log = LoggerFactory
-		.getLogger(PrepareAffiliationRelationsTest.class);
-
-	@BeforeAll
-	public static void beforeAll() throws IOException {
-		workingDir = Files.createTempDirectory(PrepareAffiliationRelationsTest.class.getSimpleName());
-
-		log.info("Using work dir {}", workingDir);
-
-		SparkConf conf = new SparkConf();
-		conf.setAppName(PrepareAffiliationRelationsTest.class.getSimpleName());
-
-		conf.setMaster("local[*]");
-		conf.set("spark.driver.host", "localhost");
-		conf.set("hive.metastore.local", "true");
-		conf.set("spark.ui.enabled", "false");
-		conf.set("spark.sql.warehouse.dir", workingDir.toString());
-		conf.set("hive.metastore.warehouse.dir", workingDir.resolve("warehouse").toString());
-
-		spark = SparkSession
-			.builder()
-			.appName(PrepareAffiliationRelationsTest.class.getSimpleName())
-			.config(conf)
-			.getOrCreate();
-	}
-
-	@AfterAll
-	public static void afterAll() throws IOException {
-		FileUtils.deleteDirectory(workingDir.toFile());
-		spark.stop();
-	}
-
-	@Test
-	void testMatch() throws Exception {
-
-		String affiliationRelationsPath = getClass()
-			.getResource("/eu/dnetlib/dhp/actionmanager/bipaffiliations/doi_to_ror.json")
-			.getPath();
-
-		String outputPath = workingDir.toString() + "/actionSet";
-
-		PrepareAffiliationRelations
-			.main(
-				new String[] {
-					"-isSparkSessionManaged", Boolean.FALSE.toString(),
-					"-inputPath", affiliationRelationsPath,
-					"-outputPath", outputPath
-				});
-
-		final JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
-
-		JavaRDD<Relation> tmp = sc
-			.sequenceFile(outputPath, Text.class, Text.class)
-			.map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
-			.map(aa -> ((Relation) aa.getPayload()));
-
-//        for (Relation r : tmp.collect()) {
-//            System.out.println(
-//                    r.getSource() + "\t" + r.getTarget() + "\t" + r.getRelType() + "\t" + r.getRelClass() + "\t" + r.getSubRelType() + "\t" + r.getValidationDate() + "\t" + r.getDataInfo().getTrust() + "\t" + r.getDataInfo().getInferred()
-//            );
-//        }
-		// count the number of relations
-		assertEquals(20, tmp.count());
-
-		Dataset<Relation> dataset = spark.createDataset(tmp.rdd(), Encoders.bean(Relation.class));
-		dataset.createOrReplaceTempView("result");
-
-		Dataset<Row> execVerification = spark
-			.sql("select r.relType, r.relClass, r.source, r.target, r.dataInfo.trust from result r");
-
-		// verify that we have equal number of bi-directional relations
-		Assertions
-			.assertEquals(
-				10, execVerification
-					.filter(
-						"relClass='" + ModelConstants.HAS_AUTHOR_INSTITUTION + "'")
-					.collectAsList()
-					.size());
-
-		Assertions
-			.assertEquals(
-				10, execVerification
-					.filter(
-						"relClass='" + ModelConstants.IS_AUTHOR_INSTITUTION_OF + "'")
-					.collectAsList()
-					.size());
-
-		// check confidence value of a specific relation
-		String sourceDOI = "10.1061/(asce)0733-9399(2002)128:7(759)";
-
-		final String sourceOpenaireId = ID_PREFIX
-			+ IdentifierFactory.md5(CleaningFunctions.normalizePidValue("doi", sourceDOI));
-
-		Assertions
-			.assertEquals(
-				"0.7071067812", execVerification
-					.filter(
-						"source='" + sourceOpenaireId + "'")
-					.collectAsList()
-					.get(0)
-					.getString(4));
-
-	}
-}
--- a/dhp-workflows/dhp-aggregation/src/test/java/eu/dnetlib/dhp/actionmanager/bipfinder/SparkAtomicActionScoreJobTest.java
+++ b/dhp-workflows/dhp-aggregation/src/test/java/eu/dnetlib/dhp/actionmanager/bipfinder/SparkAtomicActionScoreJobTest.java
@ -6,8 +6,7 @@ import static org.junit.jupiter.api.Assertions.*;
 import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
-
-import javax.xml.crypto.Data;
+import java.util.List;

 import org.apache.commons.io.FileUtils;
 import org.apache.hadoop.io.Text;
@ -28,9 +27,7 @@ import org.slf4j.LoggerFactory;
 import com.fasterxml.jackson.databind.ObjectMapper;

 import eu.dnetlib.dhp.schema.action.AtomicAction;
-import eu.dnetlib.dhp.schema.oaf.KeyValue;
-import eu.dnetlib.dhp.schema.oaf.OafEntity;
-import eu.dnetlib.dhp.schema.oaf.Project;
+import eu.dnetlib.dhp.schema.oaf.Publication;
 import eu.dnetlib.dhp.schema.oaf.Result;

 public class SparkAtomicActionScoreJobTest {
@ -40,8 +37,8 @@ public class SparkAtomicActionScoreJobTest {
 	private static SparkSession spark;

 	private static Path workingDir;
-
-	private static final Logger log = LoggerFactory.getLogger(SparkAtomicActionScoreJobTest.class);
+	private static final Logger log = LoggerFactory
+		.getLogger(SparkAtomicActionScoreJobTest.class);

 	@BeforeAll
 	public static void beforeAll() throws IOException {
@ -72,64 +69,47 @@ public class SparkAtomicActionScoreJobTest {
 		spark.stop();
 	}

-	private void runJob(String resultsInputPath, String projectsInputPath, String outputPath) throws Exception {
+	@Test
+	void testMatch() throws Exception {
+		String bipScoresPath = getClass()
+			.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/bip_scores_oid.json")
+			.getPath();
+
 		SparkAtomicActionScoreJob
 			.main(
 				new String[] {
-					"-isSparkSessionManaged", Boolean.FALSE.toString(),
-					"-resultsInputPath", resultsInputPath,
-					"-projectsInputPath", projectsInputPath,
-					"-outputPath", outputPath,
+					"-isSparkSessionManaged",
+					Boolean.FALSE.toString(),
+					"-inputPath",
+
+					bipScoresPath,
+
+					"-outputPath",
+					workingDir.toString() + "/actionSet"
 				});
-	}
-
-	@Test
-	void testScores() throws Exception {
-
-		String resultsInputPath = getClass()
-			.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/result_bip_scores.json")
-			.getPath();
-
-		String projectsInputPath = getClass()
-			.getResource("/eu/dnetlib/dhp/actionmanager/bipfinder/project_bip_scores.json")
-			.getPath();
-
-		String outputPath = workingDir.toString() + "/actionSet";
-
-		// execute the job to generate the action sets for result scores
-		runJob(resultsInputPath, projectsInputPath, outputPath);

 		final JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

-		JavaRDD<OafEntity> tmp = sc
-			.sequenceFile(outputPath, Text.class, Text.class)
+		JavaRDD<Result> tmp = sc
+			.sequenceFile(workingDir.toString() + "/actionSet", Text.class, Text.class)
 			.map(value -> OBJECT_MAPPER.readValue(value._2().toString(), AtomicAction.class))
-			.map(aa -> ((OafEntity) aa.getPayload()));
+			.map(aa -> ((Result) aa.getPayload()));

-		assertEquals(8, tmp.count());
+		assertEquals(4, tmp.count());

-		Dataset<OafEntity> verificationDataset = spark.createDataset(tmp.rdd(), Encoders.bean(OafEntity.class));
+		Dataset<Result> verificationDataset = spark.createDataset(tmp.rdd(), Encoders.bean(Result.class));
 		verificationDataset.createOrReplaceTempView("result");

-		Dataset<Row> testDataset = spark
+		Dataset<Row> execVerification = spark
 			.sql(
 				"Select p.id oaid, mes.id, mUnit.value from result p " +
 					"lateral view explode(measures) m as mes " +
 					"lateral view explode(mes.unit) u as mUnit ");

-//		execVerification.show();
-
-		Assertions.assertEquals(28, testDataset.count());
-
-		assertResultImpactScores(testDataset);
-		assertProjectImpactScores(testDataset);
-
-	}
-
-	void assertResultImpactScores(Dataset<Row> testDataset) {
+		Assertions.assertEquals(12, execVerification.count());
 		Assertions
 			.assertEquals(
-				"6.63451994567e-09", testDataset
+				"6.63451994567e-09", execVerification
 					.filter(
 						"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
 							"and id = 'influence'")
@ -139,7 +119,7 @@ public class SparkAtomicActionScoreJobTest {
 					.getString(0));
 		Assertions
 			.assertEquals(
-				"0.348694533145", testDataset
+				"0.348694533145", execVerification
 					.filter(
 						"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
 							"and id = 'popularity_alt'")
@ -149,7 +129,7 @@ public class SparkAtomicActionScoreJobTest {
 					.getString(0));
 		Assertions
 			.assertEquals(
-				"2.16094680115e-09", testDataset
+				"2.16094680115e-09", execVerification
 					.filter(
 						"oaid='50|arXiv_dedup_::4a2d5fd8d71daec016c176ec71d957b1' " +
 							"and id = 'popularity'")
@ -157,49 +137,7 @@ public class SparkAtomicActionScoreJobTest {
 					.collectAsList()
 					.get(0)
 					.getString(0));
+
 	}

-	void assertProjectImpactScores(Dataset<Row> testDataset) throws Exception {
-
-		Assertions
-			.assertEquals(
-				"0", testDataset
-					.filter(
-						"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
-							"and id = 'numOfInfluentialResults'")
-					.select("value")
-					.collectAsList()
-					.get(0)
-					.getString(0));
-		Assertions
-			.assertEquals(
-				"1", testDataset
-					.filter(
-						"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
-							"and id = 'numOfPopularResults'")
-					.select("value")
-					.collectAsList()
-					.get(0)
-					.getString(0));
-		Assertions
-			.assertEquals(
-				"25", testDataset
-					.filter(
-						"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
-							"and id = 'totalImpulse'")
-					.select("value")
-					.collectAsList()
-					.get(0)
-					.getString(0));
-		Assertions
-			.assertEquals(
-				"43", testDataset
-					.filter(
-						"oaid='40|nih_________::c02a8233e9b60f05bb418f0c9b714833' " +
-							"and id = 'totalCitationCount'")
-					.select("value")
-					.collectAsList()
-					.get(0)
-					.getString(0));
-	}
 }
--- a/dhp-workflows/dhp-aggregation/src/test/resources/eu/dnetlib/dhp/actionmanager/bipaffiliations/doi_to_ror.json
+++ b/dhp-workflows/dhp-aggregation/src/test/resources/eu/dnetlib/dhp/actionmanager/bipaffiliations/doi_to_ror.json
@ -1,7 +0,0 @@
-{"DOI":"10.1061\/(asce)0733-9399(2002)128:7(759)","Matchings":[{"RORid":"https:\/\/ror.org\/03yxnpp24","Confidence":0.7071067812},{"RORid":"https:\/\/ror.org\/01teme464","Confidence":0.89}]}
-{"DOI":"10.1105\/tpc.8.3.343","Matchings":[{"RORid":"https:\/\/ror.org\/02k40bc56","Confidence":0.7071067812}]}
-{"DOI":"10.1161\/01.cir.0000013305.01850.37","Matchings":[{"RORid":"https:\/\/ror.org\/00qjgza05","Confidence":1}]}
-{"DOI":"10.1142\/s021821650200186x","Matchings":[{"RORid":"https:\/\/ror.org\/035xkbk20","Confidence":1},{"RORid":"https:\/\/ror.org\/05apxxy63","Confidence":1}]}
-{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(575)","Matchings":[{"RORid":"https:\/\/ror.org\/04j198w64","Confidence":0.82}]}
-{"DOI":"10.1061\/(asce)0733-9372(2002)128:7(588)","Matchings":[{"RORid":"https:\/\/ror.org\/03m8km719","Confidence":0.8660254038},{"RORid":"https:\/\/ror.org\/02aze4h65","Confidence":0.87}]}
-{"DOI":"10.1161\/hy0202.103001","Matchings":[{"RORid":"https:\/\/ror.org\/057xtrt18","Confidence":0.7071067812}]}
--- a/dhp-workflows/dhp-aggregation/src/test/resources/eu/dnetlib/dhp/actionmanager/bipfinder/result_bip_scores.json
+++ b/dhp-workflows/dhp-aggregation/src/test/resources/eu/dnetlib/dhp/actionmanager/bipfinder/result_bip_scores.json
--- a/dhp-workflows/dhp-aggregation/src/test/resources/eu/dnetlib/dhp/actionmanager/bipfinder/project_bip_scores.json
+++ b/dhp-workflows/dhp-aggregation/src/test/resources/eu/dnetlib/dhp/actionmanager/bipfinder/project_bip_scores.json
@ -1,4 +0,0 @@
-{"projectId":"40|nsf_________::d93e50d22374a1cf59f6a232413ea027","numOfInfluentialResults":0,"numOfPopularResults":10,"totalImpulse":181,"totalCitationCount":235}
-{"projectId":"40|nih_________::1c93debc7085e440f245fbe70b2e8b21","numOfInfluentialResults":14,"numOfPopularResults":17,"totalImpulse":1558,"totalCitationCount":4226}
-{"projectId":"40|nih_________::c02a8233e9b60f05bb418f0c9b714833","numOfInfluentialResults":0,"numOfPopularResults":1,"totalImpulse":25,"totalCitationCount":43}
-{"projectId":"40|corda_______::d91dcf3a87dd7f72248fab0b8a4ba273","numOfInfluentialResults":2,"numOfPopularResults":3,"totalImpulse":78,"totalCitationCount":178}
--- a/dhp-workflows/dhp-aggregation/src/test/resources/eu/dnetlib/dhp/sx/graph/bio/pdb_dump
+++ b/dhp-workflows/dhp-aggregation/src/test/resources/eu/dnetlib/dhp/sx/graph/bio/pdb_dump
@ -1,44 +1,15 @@
-{"classification": "Signaling protein", "pdb": "5NM4", "deposition_date": "2017-04-05", "title": "A2a adenosine receptor room-temperature structure determined by serial Femtosecond crystallography", "Keywords": ["Oom-temperature", " serial crystallography", " signaling protein"], "authors": ["T.weinert", "R.cheng", "D.james", "D.gashi", "P.nogly", "K.jaeger", "M.hennig", "", "J.standfuss"], "pmid": "28912485", "doi": "10.1038/S41467-017-00630-4"}
-{"classification": "Oxidoreductase/oxidoreductase inhibitor", "pdb": "4KN3", "deposition_date": "2013-05-08", "title": "Structure of the y34ns91g double mutant of dehaloperoxidase from Amphitrite ornata with 2,4,6-trichlorophenol", "Keywords": ["Lobin", " oxygen storage", " peroxidase", " oxidoreductase", " oxidoreductase-", "Oxidoreductase inhibitor complex"], "authors": ["C.wang", "L.lovelace", "L.lebioda"], "pmid": "23952341", "doi": "10.1021/BI400627W"}
-{"classification": "Transport protein", "pdb": "8HKM", "deposition_date": "2022-11-27", "title": "Ion channel", "Keywords": ["On channel", " transport protein"], "authors": ["D.h.jiang", "J.t.zhang"], "pmid": "37494189", "doi": "10.1016/J.CELREP.2023.112858"}
-{"classification": "Signaling protein", "pdb": "6JT1", "deposition_date": "2019-04-08", "title": "Structure of human soluble guanylate cyclase in the heme oxidised State", "Keywords": ["Oluble guanylate cyclase", " signaling protein"], "authors": ["L.chen", "Y.kang", "R.liu", "J.-x.wu"], "pmid": "31514202", "doi": "10.1038/S41586-019-1584-6"}
-{"classification": "Immune system", "pdb": "7OW6", "deposition_date": "2021-06-16", "title": "Crystal structure of a tcr in complex with hla-a*11:01 bound to kras G12d peptide (vvvgadgvgk)", "Keywords": ["La", " kras", " tcr", " immune system"], "authors": ["V.karuppiah", "R.a.robinson"], "doi": "10.1038/S41467-022-32811-1"}
-{"classification": "Biosynthetic protein", "pdb": "5EQ8", "deposition_date": "2015-11-12", "title": "Crystal structure of medicago truncatula histidinol-phosphate Phosphatase (mthpp) in complex with l-histidinol", "Keywords": ["Istidine biosynthesis", " metabolic pathways", " dimer", " plant", "", "Biosynthetic protein"], "authors": ["M.ruszkowski", "Z.dauter"], "pmid": "26994138", "doi": "10.1074/JBC.M115.708727"}
-{"classification": "De novo protein", "pdb": "8CWA", "deposition_date": "2022-05-18", "title": "Solution nmr structure of 8-residue rosetta-designed cyclic peptide D8.21 in cdcl3 with cis/trans switching (tc conformation, 53%)", "Keywords": ["Yclic peptide", " non natural amino acids", " cis/trans", " switch peptides", "", "De novo design", "Membrane permeability", "De novo protein"], "authors": ["T.a.ramelot", "R.tejero", "G.t.montelione"], "pmid": "36041435", "doi": "10.1016/J.CELL.2022.07.019"}
-{"classification": "Hydrolase", "pdb": "3R6M", "deposition_date": "2011-03-21", "title": "Crystal structure of vibrio parahaemolyticus yeaz", "Keywords": ["Ctin/hsp70 nucleotide-binding fold", " bacterial resuscitation", " viable", "But non-culturable state", "Resuscitation promoting factor", "Ygjd", "", "Yjee", "Vibrio parahaemolyticus", "Hydrolase"], "authors": ["A.roujeinikova", "I.aydin"], "pmid": "21858042", "doi": "10.1371/JOURNAL.PONE.0023245"}
-{"classification": "Hydrolase", "pdb": "2W5J", "deposition_date": "2008-12-10", "title": "Structure of the c14-rotor ring of the proton translocating Chloroplast atp synthase", "Keywords": ["Ydrolase", " chloroplast", " atp synthase", " lipid-binding", " cf(0)", " membrane", "", "Transport", "Formylation", "Energy transduction", "Hydrogen ion transport", "", "Ion transport", "Transmembrane", "Membrane protein"], "authors": ["M.vollmar", "D.schlieper", "M.winn", "C.buechner", "G.groth"], "pmid": "19423706", "doi": "10.1074/JBC.M109.006916"}
-{"classification": "De novo protein", "pdb": "4GLU", "deposition_date": "2012-08-14", "title": "Crystal structure of the mirror image form of vegf-a", "Keywords": ["-protein", " covalent dimer", " cysteine knot protein", " growth factor", " de", "Novo protein"], "authors": ["K.mandal", "M.uppalapati", "D.ault-riche", "J.kenney", "J.lowitz", "S.sidhu", "", "S.b.h.kent"], "pmid": "22927390", "doi": "10.1073/PNAS.1210483109"}
-{"classification": "Hydrolase/hydrolase inhibitor", "pdb": "3WYL", "deposition_date": "2014-09-01", "title": "Crystal structure of the catalytic domain of pde10a complexed with 5- Methoxy-3-(1-phenyl-1h-pyrazol-5-yl)-1-(3-(trifluoromethyl)phenyl) Pyridazin-4(1h)-one", "Keywords": ["Ydrolase-hydrolase inhibitor complex"], "authors": ["H.oki", "Y.hayano"], "pmid": "25384088", "doi": "10.1021/JM5013648"}
-{"classification": "Isomerase", "pdb": "5BOR", "deposition_date": "2015-05-27", "title": "Structure of acetobacter aceti pure-s57c, sulfonate form", "Keywords": ["Cidophile", " pure", " purine biosynthesis", " isomerase"], "authors": ["K.l.sullivan", "T.j.kappock"]}
-{"classification": "Hydrolase", "pdb": "1X0C", "deposition_date": "2005-03-17", "title": "Improved crystal structure of isopullulanase from aspergillus niger Atcc 9642", "Keywords": ["Ullulan", " glycoside hydrolase family 49", " glycoprotein", " hydrolase"], "authors": ["M.mizuno", "T.tonozuka", "A.yamamura", "Y.miyasaka", "H.akeboshi", "S.kamitori", "", "A.nishikawa", "Y.sakano"], "pmid": "18155243", "doi": "10.1016/J.JMB.2007.11.098"}
-{"classification": "Oxidoreductase", "pdb": "7CUP", "deposition_date": "2020-08-23", "title": "Structure of 2,5-dihydroxypridine dioxygenase from pseudomonas putida Kt2440", "Keywords": ["On-heme dioxygenase", " oxidoreductase"], "authors": ["G.q.liu", "H.z.tang"]}
-{"classification": "Ligase", "pdb": "1VCN", "deposition_date": "2004-03-10", "title": "Crystal structure of t.th. hb8 ctp synthetase complex with sulfate Anion", "Keywords": ["Etramer", " riken structural genomics/proteomics initiative", " rsgi", "", "Structural genomics", "Ligase"], "authors": ["M.goto", "Riken structural genomics/proteomics initiative (rsgi)"], "pmid": "15296735", "doi": "10.1016/J.STR.2004.05.013"}
-{"classification": "Transferase/transferase inhibitor", "pdb": "6C9V", "deposition_date": "2018-01-28", "title": "Mycobacterium tuberculosis adenosine kinase bound to (2r,3s,4r,5r)-2- (hydroxymethyl)-5-(6-(4-phenylpiperazin-1-yl)-9h-purin-9-yl) Tetrahydrofuran-3,4-diol", "Keywords": ["Ucleoside analog", " complex", " inhibitor", " structural genomics", " psi-2", "", "Protein structure initiative", "Tb structural genomics consortium", "", "Tbsgc", "Transferase-transferase inhibitor complex"], "authors": ["R.a.crespo", "Tb structural genomics consortium (tbsgc)"], "pmid": "31002508", "doi": "10.1021/ACS.JMEDCHEM.9B00020"}
-{"classification": "De novo protein", "pdb": "4LPY", "deposition_date": "2013-07-16", "title": "Crystal structure of tencon variant g10", "Keywords": ["Ibronectin type iii fold", " alternate scaffold", " de novo protein"], "authors": ["A.teplyakov", "G.obmolova", "G.l.gilliland"], "pmid": "24375666", "doi": "10.1002/PROT.24502"}
-{"classification": "Isomerase", "pdb": "2Y88", "deposition_date": "2011-02-03", "title": "Crystal structure of mycobacterium tuberculosis phosphoribosyl Isomerase (variant d11n) with bound prfar", "Keywords": ["Romatic amino acid biosynthesis", " isomerase", " tim-barrel", " histidine", "Biosynthesis", "Tryptophan biosynthesis"], "authors": ["J.kuper", "A.v.due", "A.geerlof", "M.wilmanns"], "pmid": "21321225", "doi": "10.1073/PNAS.1015996108"}
-{"classification": "Unknown function", "pdb": "1SR0", "deposition_date": "2004-03-22", "title": "Crystal structure of signalling protein from sheep(sps-40) at 3.0a Resolution using crystal grown in the presence of polysaccharides", "Keywords": ["Ignalling protein", " involution", " unknown function"], "authors": ["D.b.srivastava", "A.s.ethayathulla", "N.singh", "J.kumar", "S.sharma", "T.p.singh"]}
-{"classification": "Dna binding protein", "pdb": "3RH2", "deposition_date": "2011-04-11", "title": "Crystal structure of a tetr-like transcriptional regulator (sama_0099) From shewanella amazonensis sb2b at 2.42 a resolution", "Keywords": ["Na/rna-binding 3-helical bundle", " structural genomics", " joint center", "For structural genomics", "Jcsg", "Protein structure initiative", "Psi-", "Biology", "Dna binding protein"], "authors": ["Joint center for structural genomics (jcsg)"]}
-{"classification": "Transferase", "pdb": "2WK5", "deposition_date": "2009-06-05", "title": "Structural features of native human thymidine phosphorylase And in complex with 5-iodouracil", "Keywords": ["Lycosyltransferase", " developmental protein", " angiogenesis", "", "5-iodouracil", "Growth factor", "Enzyme kinetics", "", "Differentiation", "Disease mutation", "Thymidine", "Phosphorylase", "Chemotaxis", "Transferase", "Mutagenesis", "", "Polymorphism"], "authors": ["E.mitsiki", "A.c.papageorgiou", "S.iyer", "N.thiyagarajan", "S.h.prior", "", "D.sleep", "C.finnis", "K.r.acharya"], "pmid": "19555658", "doi": "10.1016/J.BBRC.2009.06.104"}
-{"classification": "Hydrolase", "pdb": "3P9Y", "deposition_date": "2010-10-18", "title": "Crystal structure of the drosophila melanogaster ssu72-pctd complex", "Keywords": ["Hosphatase", " cis proline", " lmw ptp-like fold", " rna polymerase ii ctd", "", "Hydrolase"], "authors": ["J.w.werner-allen", "P.zhou"], "pmid": "21159777", "doi": "10.1074/JBC.M110.197129"}
-{"classification": "Recombination/dna", "pdb": "6OEO", "deposition_date": "2019-03-27", "title": "Cryo-em structure of mouse rag1/2 nfc complex (dna1)", "Keywords": ["(d)j recombination", " dna transposition", " rag", " scid", " recombination", "", "Recombination-dna complex"], "authors": ["X.chen", "Y.cui", "Z.h.zhou", "W.yang", "M.gellert"], "pmid": "32015552", "doi": "10.1038/S41594-019-0363-2"}
-{"classification": "Hydrolase", "pdb": "4ECA", "deposition_date": "1997-02-21", "title": "Asparaginase from e. coli, mutant t89v with covalently bound aspartate", "Keywords": ["Ydrolase", " acyl-enzyme intermediate", " threonine amidohydrolase"], "authors": ["G.j.palm", "J.lubkowski", "A.wlodawer"], "pmid": "8706862", "doi": "10.1016/0014-5793(96)00660-6"}
-{"classification": "Transcription/protein binding", "pdb": "3UVX", "deposition_date": "2011-11-30", "title": "Crystal structure of the first bromodomain of human brd4 in complex With a diacetylated histone 4 peptide (h4k12ack16ac)", "Keywords": ["Romodomain", " bromodomain containing protein 4", " cap", " hunk1", " mcap", "", "Mitotic chromosome associated protein", "Peptide complex", "Structural", "Genomics consortium", "Sgc", "Transcription-protein binding complex"], "authors": ["P.filippakopoulos", "S.picaud", "T.keates", "E.ugochukwu", "F.von delft", "", "C.h.arrowsmith", "A.m.edwards", "J.weigelt", "C.bountra", "S.knapp", "Structural", "Genomics consortium (sgc)"], "pmid": "22464331", "doi": "10.1016/J.CELL.2012.02.013"}
-{"classification": "Membrane protein", "pdb": "1TLZ", "deposition_date": "2004-06-10", "title": "Tsx structure complexed with uridine", "Keywords": ["Ucleoside transporter", " beta barrel", " uridine", " membrane", "Protein"], "authors": ["J.ye", "B.van den berg"], "pmid": "15272310", "doi": "10.1038/SJ.EMBOJ.7600330"}
-{"classification": "Dna binding protein", "pdb": "7AZD", "deposition_date": "2020-11-16", "title": "Dna polymerase sliding clamp from escherichia coli with peptide 20 Bound", "Keywords": ["Ntibacterial drug", " dna binding protein"], "authors": ["C.monsarrat", "G.compain", "C.andre", "I.martiel", "S.engilberge", "V.olieric", "", "P.wolff", "K.brillet", "M.landolfo", "C.silva da veiga", "J.wagner", "G.guichard", "", "D.y.burnouf"], "pmid": "34806883", "doi": "10.1021/ACS.JMEDCHEM.1C00918"}
-{"classification": "Transferase", "pdb": "5N3K", "deposition_date": "2017-02-08", "title": "Camp-dependent protein kinase a from cricetulus griseus in complex With fragment like molecule o-guanidino-l-homoserine", "Keywords": ["Ragment", " complex", " transferase", " serine threonine kinase", " camp", "", "Kinase", "Pka"], "authors": ["C.siefker", "A.heine", "G.klebe"]}
-{"classification": "Biosynthetic protein", "pdb": "8H52", "deposition_date": "2022-10-11", "title": "Crystal structure of helicobacter pylori carboxyspermidine Dehydrogenase in complex with nadp", "Keywords": ["Arboxyspermidine dehydrogenase", " biosynthetic protein"], "authors": ["K.y.ko", "S.c.park", "S.y.cho", "S.i.yoon"], "pmid": "36283333", "doi": "10.1016/J.BBRC.2022.10.049"}
-{"classification": "Metal binding protein", "pdb": "6DYC", "deposition_date": "2018-07-01", "title": "Co(ii)-bound structure of the engineered cyt cb562 variant, ch3", "Keywords": ["Esigned protein", " 4-helix bundle", " electron transport", " metal binding", "Protein"], "authors": ["F.a.tezcan", "J.rittle"], "pmid": "30778140", "doi": "10.1038/S41557-019-0218-9"}
-{"classification": "Protein fibril", "pdb": "6A6B", "deposition_date": "2018-06-27", "title": "Cryo-em structure of alpha-synuclein fiber", "Keywords": ["Lpha-syn fiber", " parkinson disease", " protein fibril"], "authors": ["Y.w.li", "C.y.zhao", "F.luo", "Z.liu", "X.gui", "Z.luo", "X.zhang", "D.li", "C.liu", "X.li"], "pmid": "30065316", "doi": "10.1038/S41422-018-0075-X"}
-{"classification": "Dna", "pdb": "7D5E", "deposition_date": "2020-09-25", "title": "Left-handed g-quadruplex containing two bulges", "Keywords": ["-quadruplex", " bulge", " dna", " left-handed"], "authors": ["P.das", "A.maity", "K.h.ngo", "F.r.winnerdy", "B.bakalar", "Y.mechulam", "E.schmitt", "", "A.t.phan"], "pmid": "33503265", "doi": "10.1093/NAR/GKAA1259"}
-{"classification": "Transferase", "pdb": "3RSY", "deposition_date": "2011-05-02", "title": "Cellobiose phosphorylase from cellulomonas uda in complex with sulfate And glycerol", "Keywords": ["H94", " alpha barrel", " cellobiose phosphorylase", " disaccharide", "Phosphorylase", "Transferase"], "authors": ["A.van hoorebeke", "J.stout", "W.soetaert", "J.van beeumen", "T.desmet", "S.savvides"]}
-{"classification": "Oxidoreductase", "pdb": "7MCI", "deposition_date": "2021-04-02", "title": "Mofe protein from azotobacter vinelandii with a sulfur-replenished Cofactor", "Keywords": ["Zotobacter vinelandii", " mofe-protein", " nitrogenase", " oxidoreductase"], "authors": ["W.kang", "C.lee", "Y.hu", "M.w.ribbe"], "doi": "10.1038/S41929-022-00782-7"}
-{"classification": "Dna", "pdb": "1XUW", "deposition_date": "2004-10-26", "title": "Structural rationalization of a large difference in rna affinity Despite a small difference in chemistry between two 2'-o-modified Nucleic acid analogs", "Keywords": ["Na mimetic methylcarbamate amide analog", " dna"], "authors": ["R.pattanayek", "L.sethaphong", "C.pan", "M.prhavc", "T.p.prakash", "M.manoharan", "", "M.egli"], "pmid": "15547979", "doi": "10.1021/JA044637K"}
-{"classification": "Lyase", "pdb": "7C0D", "deposition_date": "2020-05-01", "title": "Crystal structure of azospirillum brasilense l-2-keto-3-deoxyarabonate Dehydratase (hydroxypyruvate-bound form)", "Keywords": ["-2-keto-3-deoxyarabonate dehydratase", " lyase"], "authors": ["Y.watanabe", "S.watanabe"], "pmid": "32697085", "doi": "10.1021/ACS.BIOCHEM.0C00515"}
-{"classification": "Signaling protein", "pdb": "5LYK", "deposition_date": "2016-09-28", "title": "Crystal structure of intracellular b30.2 domain of btn3a1 bound to Citrate", "Keywords": ["30.2", " butyrophilin", " signaling protein"], "authors": ["F.mohammed", "A.t.baker", "M.salim", "B.e.willcox"], "pmid": "28862425", "doi": "10.1021/ACSCHEMBIO.7B00694"}
-{"classification": "Toxin", "pdb": "4IZL", "deposition_date": "2013-01-30", "title": "Structure of the n248a mutant of the panton-valentine leucocidin s Component from staphylococcus aureus", "Keywords": ["I-component leucotoxin", " staphylococcus aureus", " s component", "Leucocidin", "Beta-barrel pore forming toxin", "Toxin"], "authors": ["L.maveyraud", "B.j.laventie", "G.prevost", "L.mourey"], "pmid": "24643034", "doi": "10.1371/JOURNAL.PONE.0092094"}
-{"classification": "Dna", "pdb": "6F3C", "deposition_date": "2017-11-28", "title": "The cytotoxic [pt(h2bapbpy)] platinum complex interacting with the Cgtacg hexamer", "Keywords": ["Rug-dna complex", " four-way junction", " dna"], "authors": ["M.ferraroni", "C.bazzicalupi", "P.gratteri", "F.papi"], "pmid": "31046177", "doi": "10.1002/ANIE.201814532"}
-{"classification": "Signaling protein/inhibitor", "pdb": "4L5M", "deposition_date": "2013-06-11", "title": "Complexe of arno sec7 domain with the protein-protein interaction Inhibitor n-(4-hydroxy-2,6-dimethylphenyl)benzenesulfonamide at ph6.5", "Keywords": ["Ec-7domain", " signaling protein-inhibitor complex"], "authors": ["F.hoh", "J.rouhana"], "pmid": "24112024", "doi": "10.1021/JM4009357"}
-{"classification": "Signaling protein", "pdb": "5I6J", "deposition_date": "2016-02-16", "title": "Crystal structure of srgap2 f-barx", "Keywords": ["Rgap2", " f-bar", " fx", " signaling protein"], "authors": ["M.sporny", "J.guez-haddad", "M.n.isupov", "Y.opatowsky"], "pmid": "28333212", "doi": "10.1093/MOLBEV/MSX094"}
-{"classification": "Metal binding protein", "pdb": "1Q80", "deposition_date": "2003-08-20", "title": "Solution structure and dynamics of nereis sarcoplasmic calcium binding Protein", "Keywords": ["Ll-alpha", " metal binding protein"], "authors": ["G.rabah", "R.popescu", "J.a.cox", "Y.engelborghs", "C.t.craescu"], "pmid": "15819893", "doi": "10.1111/J.1742-4658.2005.04629.X"}
-{"classification": "Transferase", "pdb": "1TW1", "deposition_date": "2004-06-30", "title": "Beta-1,4-galactosyltransferase mutant met344his (m344h-gal-t1) complex With udp-galactose and magnesium", "Keywords": ["Et344his mutation; closed conformation; mn binding", " transferase"], "authors": ["B.ramakrishnan", "E.boeggeman", "P.k.qasba"], "pmid": "15449940", "doi": "10.1021/BI049007+"}
-{"classification": "Rna", "pdb": "2PN4", "deposition_date": "2007-04-23", "title": "Crystal structure of hepatitis c virus ires subdomain iia", "Keywords": ["Cv", " ires", " subdoamin iia", " rna", " strontium", " hepatitis"], "authors": ["Q.zhao", "Q.han", "C.r.kissinger", "P.a.thompson"], "pmid": "18391410", "doi": "10.1107/S0907444908002011"}
+{"pdb": "1CW0", "title": "crystal structure analysis of very short patch repair (vsr) endonuclease in complex with a duplex dna", "authors": ["S.E.Tsutakawa", "H.Jingami", "K.Morikawa"], "doi": "10.1016/S0092-8674(00)81550-0", "pmid": "10612397"}
+{"pdb": "2CWW", "title": "crystal structure of thermus thermophilus ttha1280, a putative sam- dependent rna methyltransferase, in complex with s-adenosyl-l- homocysteine", "authors": ["A.A.Pioszak", "K.Murayama", "N.Nakagawa", "A.Ebihara", "S.Kuramitsu", "M.Shirouzu", "S.Yokoyama", "Riken Structural Genomics/proteomics Initiative (Rsgi)"], "doi": "10.1107/S1744309105029842", "pmid": "16511182"}
+{"pdb": "6CWE", "title": "structure of alpha-gsa[8,6p] bound by cd1d and in complex with the va14vb8.2 tcr", "authors": ["J.Wang", "D.Zajonc"], "doi": null, "pmid": null}
+{"pdb": "5CWS", "title": "crystal structure of the intact chaetomium thermophilum nsp1-nup49- nup57 channel nucleoporin heterotrimer bound to its nic96 nuclear pore complex attachment site", "authors": ["C.J.Bley", "S.Petrovic", "M.Paduch", "V.Lu", "A.A.Kossiakoff", "A.Hoelz"], "doi": "10.1126/SCIENCE.AAC9176", "pmid": "26316600"}
+{"pdb": "5CWE", "title": "structure of cyp107l2 from streptomyces avermitilis with lauric acid", "authors": ["T.-V.Pham", "S.-H.Han", "J.-H.Kim", "D.-H.Kim", "L.-W.Kang"], "doi": null, "pmid": null}
+{"pdb": "7CW4", "title": "acetyl-coa acetyltransferase from bacillus cereus atcc 14579", "authors": ["J.Hong", "K.J.Kim"], "doi": "10.1016/J.BBRC.2020.09.048", "pmid": "32972748"}
+{"pdb": "2CWP", "title": "crystal structure of metrs related protein from pyrococcus horikoshii", "authors": ["K.Murayama", "M.Kato-Murayama", "M.Shirouzu", "S.Yokoyama", "Riken StructuralGenomics/proteomics Initiative (Rsgi)"], "doi": null, "pmid": null}
+{"pdb": "2CW7", "title": "crystal structure of intein homing endonuclease ii", "authors": ["H.Matsumura", "H.Takahashi", "T.Inoue", "H.Hashimoto", "M.Nishioka", "S.Fujiwara", "M.Takagi", "T.Imanaka", "Y.Kai"], "doi": "10.1002/PROT.20858", "pmid": "16493661"}
+{"pdb": "1CWU", "title": "brassica napus enoyl acp reductase a138g mutant complexed with nad+ and thienodiazaborine", "authors": ["A.Roujeinikova", "J.B.Rafferty", "D.W.Rice"], "doi": "10.1074/JBC.274.43.30811", "pmid": "10521472"}
+{"pdb": "3CWN", "title": "escherichia coli transaldolase b mutant f178y", "authors": ["T.Sandalova", "G.Schneider", "A.Samland"], "doi": "10.1074/JBC.M803184200", "pmid": "18687684"}
+{"pdb": "1CWL", "title": "human cyclophilin a complexed with 4 4-hydroxy-meleu cyclosporin", "authors": ["V.Mikol", "J.Kallen", "P.Taylor", "M.D.Walkinshaw"], "doi": "10.1006/JMBI.1998.2108", "pmid": "9769216"}
+{"pdb": "3CW2", "title": "crystal structure of the intact archaeal translation initiation factor 2 from sulfolobus solfataricus .", "authors": ["E.A.Stolboushkina", "S.V.Nikonov", "A.D.Nikulin", "U.Blaesi", "D.J.Manstein", "R.V.Fedorov", "M.B.Garber", "O.S.Nikonov"], "doi": "10.1016/J.JMB.2008.07.039", "pmid": "18675278"}
+{"pdb": "3CW9", "title": "4-chlorobenzoyl-coa ligase/synthetase in the thioester-forming conformation, bound to 4-chlorophenacyl-coa", "authors": ["A.S.Reger", "J.Cao", "R.Wu", "D.Dunaway-Mariano", "A.M.Gulick"], "doi": "10.1021/BI800696Y", "pmid": "18620418"}
+{"pdb": "3CWU", "title": "crystal structure of an alka host/guest complex 2'-fluoro-2'-deoxy-1, n6-ethenoadenine:thymine base pair", "authors": ["B.R.Bowman", "S.Lee", "S.Wang", "G.L.Verdine"], "doi": "10.1016/J.STR.2008.04.012", "pmid": "18682218"}
+{"pdb": "5CWF", "title": "crystal structure of de novo designed helical repeat protein dhr8", "authors": ["G.Bhabha", "D.C.Ekiert"], "doi": "10.1038/NATURE16162", "pmid": "26675729"}
--- a/dhp-workflows/dhp-aggregation/src/test/resources/eu/dnetlib/dhp/sx/graph/bio/uniprot_dump
+++ b/dhp-workflows/dhp-aggregation/src/test/resources/eu/dnetlib/dhp/sx/graph/bio/uniprot_dump
@ -1,36 +1,6 @@
-{"pid": " Q6GZX4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 43"}], "title": "Putative transcription factor 001R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZX3", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 45"}], "title": "Uncharacterized protein 002L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q197F8", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-02-23", "date_info": "entry version 29"}], "title": "Uncharacterized protein 002R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
-{"pid": " Q197F7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2020-08-12", "date_info": "entry version 23"}], "title": "Uncharacterized protein 003L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
-{"pid": " Q6GZX2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 3R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZX1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 38"}], "title": "Uncharacterized protein 004R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q197F5", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-10-12", "date_info": "entry version 32"}], "title": "Uncharacterized protein 005L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
-{"pid": " Q6GZX0", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 47"}], "title": "Uncharacterized protein 005R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q91G88", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2023-06-28", "date_info": "entry version 53"}], "title": "Putative KilA-N domain-containing protein 006L", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
-{"pid": " Q6GZW9", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 34"}], "title": "Uncharacterized protein 006R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZW8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 32"}], "title": "Uncharacterized protein 007R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q197F3", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 28"}], "title": "Uncharacterized protein 007R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
-{"pid": " Q197F2", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-02-23", "date_info": "entry version 22"}], "title": "Uncharacterized protein 008L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
-{"pid": " Q6GZW6", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 67"}], "title": "Putative helicase 009L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q91G85", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 38"}], "title": "Uncharacterized protein 009R", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
-{"pid": " Q6GZW5", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 010R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q197E9", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 28"}], "title": "Uncharacterized protein 011L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
-{"pid": " Q6GZW4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 011R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZW3", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 012L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q197E7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 37"}], "title": "Uncharacterized protein IIV3-013L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
-{"pid": " Q6GZW2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 30"}], "title": "Uncharacterized protein 013R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZW1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 014R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZW0", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 50"}], "title": "Uncharacterized protein 015R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZV8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 017L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZV7", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 33"}], "title": "Uncharacterized protein 018L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZV6", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 87"}], "title": "Putative serine/threonine-protein kinase 019R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZV5", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 40"}], "title": "Uncharacterized protein 020R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZV4", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 35"}], "title": "Uncharacterized protein 021L", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q197D8", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-12-14", "date_info": "entry version 35"}], "title": "Transmembrane protein 022L", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
-{"pid": " Q6GZV2", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 33"}], "title": "Uncharacterized protein 023R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q197D7", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2023-02-22", "date_info": "entry version 25"}], "title": "Uncharacterized protein 023R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
-{"pid": " Q6GZV1", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 37"}], "title": "Uncharacterized protein 024R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q197D5", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2006-07-11", "date_info": "sequence version 1"}, {"date": "2022-10-12", "date_info": "entry version 24"}], "title": "Uncharacterized protein 025R", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus"], "references": [{"PubMed": "16912294"}, {"DOI": "10.1128/jvi.00464-06"}]}
-{"pid": " Q91G70", "dates": [{"date": "2009-06-16", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2001-12-01", "date_info": "sequence version 1"}, {"date": "2020-08-12", "date_info": "entry version 32"}], "title": "Uncharacterized protein 026R", "organism_species": "Invertebrate iridescent virus 6 (IIV-6) (Chilo iridescent virus)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Iridovirus"], "references": [{"PubMed": "17239238"}, {"DOI": "10.1186/1743-422x-4-11"}]}
-{"pid": " Q6GZU9", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 49"}], "title": "Uncharacterized protein 027R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
-{"pid": " Q6GZU8", "dates": [{"date": "2011-06-28", "date_info": "integrated into UniProtKB/Swiss-Prot"}, {"date": "2004-07-19", "date_info": "sequence version 1"}, {"date": "2023-09-13", "date_info": "entry version 55"}], "title": "Uncharacterized protein 028R", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3)", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus"], "references": [{"PubMed": "15165820"}, {"DOI": "10.1016/j.virol.2004.02.019"}]}
+{"pid": "Q6GZX4", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 41."}], "title": "Putative transcription factor 001R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
+{"pid": "Q6GZX3", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 42."}], "title": "Uncharacterized protein 002L;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
+{"pid": "Q197F8", "dates": [{"date": "16-JUN-2009", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "11-JUL-2006", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 27."}], "title": "Uncharacterized protein 002R;", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus."], "references": [{"PubMed": "16912294"}, {" DOI": "10.1128/jvi.00464-06"}]}
+{"pid": "Q197F7", "dates": [{"date": "16-JUN-2009", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "11-JUL-2006", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 23."}], "title": "Uncharacterized protein 003L;", "organism_species": "Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Betairidovirinae", "Chloriridovirus."], "references": [{"PubMed": "16912294"}, {" DOI": "10.1128/jvi.00464-06"}]}
+{"pid": "Q6GZX2", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 36."}], "title": "Uncharacterized protein 3R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
+{"pid": "Q6GZX1", "dates": [{"date": "28-JUN-2011", "date_info": " integrated into UniProtKB/Swiss-Prot."}, {"date": "19-JUL-2004", "date_info": " sequence version 1."}, {"date": "12-AUG-2020", "date_info": " entry version 34."}], "title": "Uncharacterized protein 004R;", "organism_species": "Frog virus 3 (isolate Goorha) (FV-3).", "subjects": ["Viruses", "Varidnaviria", "Bamfordvirae", "Nucleocytoviricota", "Megaviricetes", "Pimascovirales", "Iridoviridae", "Alphairidovirinae", "Ranavirus."], "references": [{"PubMed": "15165820"}, {" DOI": "10.1016/j.virol.2004.02.019"}]}
--- a/dhp-workflows/dhp-aggregation/src/test/scala/eu/dnetlib/dhp/sx/bio/BioScholixTest.scala
+++ b/dhp-workflows/dhp-aggregation/src/test/scala/eu/dnetlib/dhp/sx/bio/BioScholixTest.scala
@ -14,12 +14,10 @@ import org.junit.jupiter.api.extension.ExtendWith
 import org.junit.jupiter.api.{BeforeEach, Test}
 import org.mockito.junit.jupiter.MockitoExtension

-import java.io.{BufferedReader, InputStream, InputStreamReader}
+import java.io.{BufferedReader, FileInputStream, InputStream, InputStreamReader}
 import java.util.zip.GZIPInputStream
 import scala.collection.JavaConverters._
-import scala.collection.mutable.ListBuffer
 import scala.io.Source
-import scala.xml.pull.XMLEventReader

@ExtendWith(Array(classOf[MockitoExtension]))
 class BioScholixTest extends AbstractVocabularyTest {
@ -49,11 +47,11 @@ class BioScholixTest extends AbstractVocabularyTest {

  @Test
  def testEBIData() = {
-    val inputXML = Source
-      .fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
-      .mkString
-    val xml = new XMLEventReader(Source.fromBytes(inputXML.getBytes()))
-    new PMParser(xml).foreach(s => println(mapper.writeValueAsString(s)))
+    val inputXML = getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml")
+
+    // new PubmedParser(new GZIPInputStream(new FileInputStream("/Users/sandro/Downloads/pubmed23n1078.xml.gz")))
+    new PMParser(new GZIPInputStream(new FileInputStream("/Users/sandro/Downloads/pubmed23n1078.xml.gz")))
+    print("DONE")
  }

  @Test
@ -89,14 +87,14 @@ class BioScholixTest extends AbstractVocabularyTest {

  }

-  @Test
-  def testParsingPubmedXML(): Unit = {
-    val xml = new XMLEventReader(
-      Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
-    )
-    val parser = new PMParser(xml)
-    parser.foreach(checkPMArticle)
-  }
+//  @Test
+//  def testParsingPubmedXML(): Unit = {
+//    val xml = new XMLEventReader(
+//      Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
+//    )
+//    val parser = new PMParser(xml)
+//    parser.foreach(checkPMArticle)
+//  }

  private def checkPubmedPublication(o: Oaf): Unit = {
    assertTrue(o.isInstanceOf[Publication])
@ -153,19 +151,19 @@ class BioScholixTest extends AbstractVocabularyTest {
    assertTrue(hasOldOpenAIREID)
  }

-  @Test
-  def testPubmedMapping(): Unit = {
-
-    val xml = new XMLEventReader(
-      Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
-    )
-    val parser = new PMParser(xml)
-    val results = ListBuffer[Oaf]()
-    parser.foreach(x => results += PubMedToOaf.convert(x, vocabularies))
-
-    results.foreach(checkPubmedPublication)
-
-  }
+//  @Test
+//  def testPubmedMapping(): Unit = {
+//
+//    val xml = new XMLEventReader(
+//      Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/dhp/sx/graph/bio/pubmed.xml"))
+//    )
+//    val parser = new PMParser(xml)
+//    val results = ListBuffer[Oaf]()
+//    parser.foreach(x => results += PubMedToOaf.convert(x, vocabularies))
+//
+//    results.foreach(checkPubmedPublication)
+//
+//  }

  @Test
  def testPDBToOAF(): Unit = {
--- a/dhp-workflows/dhp-broker-events/src/main/java/eu/dnetlib/dhp/broker/oa/util/TrustUtils.java
+++ b/dhp-workflows/dhp-broker-events/src/main/java/eu/dnetlib/dhp/broker/oa/util/TrustUtils.java
@ -2,9 +2,7 @@
 package eu.dnetlib.dhp.broker.oa.util;

 import java.io.IOException;
-import java.nio.charset.StandardCharsets;

-import org.apache.commons.io.IOUtils;
 import org.apache.spark.sql.Row;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@ -13,7 +11,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;

 import eu.dnetlib.broker.objects.OaBrokerMainEntity;
 import eu.dnetlib.pace.config.DedupConfig;
-import eu.dnetlib.pace.model.SparkDeduper;
+import eu.dnetlib.pace.model.SparkDedupConfig;
 import eu.dnetlib.pace.tree.support.TreeProcessor;

 public class TrustUtils {
@ -22,22 +20,18 @@ public class TrustUtils {

 	private static DedupConfig dedupConfig;

-	private static SparkDeduper deduper;
+	private static SparkDedupConfig sparkDedupConfig;

 	private static final ObjectMapper mapper;

 	static {
 		mapper = new ObjectMapper();
 		try {
-			dedupConfig = DedupConfig
-				.load(
-					IOUtils
-						.toString(
-							DedupConfig.class
-								.getResourceAsStream("/eu/dnetlib/dhp/broker/oa/dedupConfig/dedupConfig.json"),
-							StandardCharsets.UTF_8));
-
-			deduper = new SparkDeduper(dedupConfig);
+			dedupConfig = mapper
+				.readValue(
+					DedupConfig.class.getResourceAsStream("/eu/dnetlib/dhp/broker/oa/dedupConfig/dedupConfig.json"),
+					DedupConfig.class);
+			sparkDedupConfig = new SparkDedupConfig(dedupConfig, 1);
 		} catch (final IOException e) {
 			log.error("Error loading dedupConfig, e");
 		}
@ -53,8 +47,8 @@ public class TrustUtils {
 		}

 		try {
-			final Row doc1 = deduper.model().rowFromJson(mapper.writeValueAsString(r1));
-			final Row doc2 = deduper.model().rowFromJson(mapper.writeValueAsString(r2));
+			final Row doc1 = sparkDedupConfig.rowFromJson(mapper.writeValueAsString(r2));
+			final Row doc2 = sparkDedupConfig.rowFromJson(mapper.writeValueAsString(r2));

 			final double score = new TreeProcessor(dedupConfig).computeScore(doc1, doc2);

@ -63,7 +57,7 @@ public class TrustUtils {
 			return TrustUtils.rescale(score, threshold);
 		} catch (final Exception e) {
 			log.error("Error computing score between results", e);
-			throw new RuntimeException(e);
+			return BrokerConstants.MIN_TRUST;
 		}
 	}

--- a/dhp-workflows/dhp-broker-events/src/test/java/eu/dnetlib/dhp/broker/oa/samples/SimpleVariableJobTest.java
+++ b/dhp-workflows/dhp-broker-events/src/test/java/eu/dnetlib/dhp/broker/oa/samples/SimpleVariableJobTest.java
@ -83,7 +83,7 @@ public class SimpleVariableJobTest {

 		final long n = spark
 			.createDataset(inputList, Encoders.STRING())
-			.filter((FilterFunction<String>) s -> filter(map.get(s)))
+			.filter((FilterFunction<String>)  s -> filter(map.get(s)))
 			.map((MapFunction<String, String>) String::toLowerCase, Encoders.STRING())
 			.count();

--- a/dhp-workflows/dhp-dedup-openaire/pom.xml
+++ b/dhp-workflows/dhp-dedup-openaire/pom.xml
@ -41,18 +41,54 @@
    
    </build>

+    <dependencyManagement>
+        <dependencies>
+            <dependency>
+                <groupId>io.opentelemetry</groupId>
+                <artifactId>opentelemetry-bom</artifactId>
+                <version>1.16.0</version>
+                <type>pom</type>
+                <scope>import</scope>
+            </dependency>
+        </dependencies>
+    </dependencyManagement>
+
    <dependencies>

        <dependency>
            <groupId>eu.dnetlib.dhp</groupId>
            <artifactId>dhp-common</artifactId>
            <version>${project.version}</version>
+            <exclusions>
+                <exclusion>
+                    <artifactId>log4j</artifactId>
+                    <groupId>log4j</groupId>
+                </exclusion>
+                <exclusion>
+                    <artifactId>annotations</artifactId>
+                    <groupId>org.jetbrains</groupId>
+                </exclusion>
+                <exclusion>
+                    <artifactId>slf4j-api</artifactId>
+                    <groupId>org.slf4j</groupId>
+                </exclusion>
+            </exclusions>
        </dependency>

        <dependency>
            <groupId>eu.dnetlib.dhp</groupId>
            <artifactId>dhp-pace-core</artifactId>
            <version>${project.version}</version>
+            <exclusions>
+                <exclusion>
+                    <artifactId>jsr305</artifactId>
+                    <groupId>com.google.code.findbugs</groupId>
+                </exclusion>
+                <exclusion>
+                    <artifactId>javassist</artifactId>
+                    <groupId>org.javassist</groupId>
+                </exclusion>
+            </exclusions>
        </dependency>

        <dependency>
@ -90,14 +126,17 @@
            <groupId>com.arakelian</groupId>
            <artifactId>java-jq</artifactId>
        </dependency>
+
        <dependency>
            <groupId>dom4j</groupId>
            <artifactId>dom4j</artifactId>
        </dependency>
+
        <dependency>
            <groupId>jaxen</groupId>
            <artifactId>jaxen</artifactId>
        </dependency>
+
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
@ -116,6 +155,7 @@
            <version>1.4.200</version>
            <scope>test</scope>
        </dependency>
+
    </dependencies>


--- a/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/DSLExample.scala
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/DSLExample.scala
@ -0,0 +1,103 @@
+package eu.dnetlib.dhp.oa.dedup
+
+import eu.dnetlib.dhp.application.ArgumentApplicationParser
+import eu.dnetlib.dhp.oa.dedup.dsl.{Clustering, Deduper}
+import eu.dnetlib.dhp.oa.dedup.model.BlockStats
+import eu.dnetlib.dhp.utils.ISLookupClientFactory
+import eu.dnetlib.enabling.is.lookup.rmi.{ISLookUpException, ISLookUpService}
+import eu.dnetlib.pace.model.{RowDataOrderingComparator, SparkDedupConfig}
+import org.apache.commons.io.IOUtils
+import org.apache.spark.SparkConf
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.expressions.Literal
+import org.apache.spark.sql.types.DataTypes
+import org.dom4j.DocumentException
+import org.slf4j.LoggerFactory
+import org.xml.sax.SAXException
+
+import java.io.IOException
+import java.util.stream.Collectors
+
+object DSLExample {
+  private val log = LoggerFactory.getLogger(classOf[DSLExample])
+
+  @throws[Exception]
+  def main(args: Array[String]): Unit = {
+    val parser = new ArgumentApplicationParser(
+      IOUtils
+        .toString(classOf[DSLExample].getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/createBlockStats_parameters.json"))
+    )
+    parser.parseArgument(args)
+    val conf = new SparkConf
+    new DSLExample(parser, AbstractSparkAction.getSparkSession(conf)).run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")))
+  }
+}
+
+class DSLExample(parser: ArgumentApplicationParser, spark: SparkSession) extends AbstractSparkAction(parser, spark) {
+
+  def computeComparisons(blockSize: Long, slidingWindowSize: Long): Long =
+    if (slidingWindowSize >= blockSize) (slidingWindowSize * (slidingWindowSize - 1)) / 2
+    else (blockSize - slidingWindowSize + 1) * (slidingWindowSize * (slidingWindowSize - 1)) / 2
+
+  @throws[DocumentException]
+  @throws[IOException]
+  @throws[ISLookUpException]
+  @throws[SAXException]
+  override def run(isLookUpService: ISLookUpService): Unit = {
+// read oozie parameters
+    val graphBasePath = parser.get("graphBasePath")
+    val isLookUpUrl = parser.get("isLookUpUrl")
+    val actionSetId = parser.get("actionSetId")
+    val workingPath = parser.get("workingPath")
+    val numPartitions : Int = Option(parser.get("numPartitions")).map(_.toInt).getOrElse(AbstractSparkAction.NUM_PARTITIONS)
+
+    DSLExample.log.info("graphBasePath: '{}'", graphBasePath)
+    DSLExample.log.info("isLookUpUrl:   '{}'", isLookUpUrl)
+    DSLExample.log.info("actionSetId:   '{}'", actionSetId)
+    DSLExample.log.info("workingPath:   '{}'", workingPath)
+    // for each dedup configuration
+    import scala.collection.JavaConversions._
+    for (dedupConf <- getConfigurations(isLookUpService, actionSetId).subList(0, 1)) {
+      val subEntity = dedupConf.getWf.getSubEntityValue
+      DSLExample.log.info("Creating blockstats for: '{}'", subEntity)
+      val outputPath = DedupUtility.createBlockStatsPath(workingPath, actionSetId, subEntity)
+      AbstractSparkAction.removeOutputDir(spark, outputPath)
+
+      val sparkConfig = SparkDedupConfig(dedupConf, numPartitions)
+
+      val inputDF = spark.read
+        .textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
+        .transform(sparkConfig.modelExtractor)
+      val simRels = inputDF
+        .transform(sparkConfig.generateClusters)
+        .filter(functions.size(new Column("block")).geq(new Literal(1, DataTypes.IntegerType)))
+
+      val deduper = Deduper(inputDF.schema)
+        .withClustering( Clustering("sortedngrampairs"),
+          Clustering("sortedngrampairs", Seq("legalname"), Map("max" -> 2, "ngramLen" -> 3)),
+          Clustering("suffixprefix", Seq("legalname"), Map("max" -> 1, "len" -> 3)),
+          Clustering("urlclustering", Seq("websiteurl")),
+          Clustering("keywordsclustering", Seq("fields"),  Map("max" -> 2, "windowSize" -> 4))
+        )
+
+      simRels
+        .map[BlockStats](
+          (b:Row) => {
+            val documents = b.getList(1)
+            val mapDocuments = documents.stream
+              .sorted(new RowDataOrderingComparator(sparkConfig.orderingFieldPosition))
+              .limit(dedupConf.getWf.getQueueMaxSize)
+              .collect(Collectors.toList)
+            new BlockStats(
+              b.getString(0),
+              mapDocuments.size.toLong,
+              computeComparisons(mapDocuments.size.toLong, dedupConf.getWf.getSlidingWindowSize.toLong)
+            )
+
+          })(Encoders.bean[BlockStats](classOf[BlockStats]))
+        .write
+        .mode(SaveMode.Overwrite)
+        .save(outputPath)
+    }
+  }
+}
--- a/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/DedupRecordFactory.java
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/DedupRecordFactory.java
@ -110,10 +110,6 @@ public class DedupRecordFactory {

 		// set authors and date
 		if (ModelSupport.isSubClass(entity, Result.class)) {
-			Optional
-				.ofNullable(((Result) entity).getAuthor())
-				.ifPresent(a -> authors.add(a));
-
 			((Result) entity).setAuthor(AuthorMerger.merge(authors));
 		}

--- a/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/DedupUtility.java
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/DedupUtility.java
@ -3,8 +3,12 @@ package eu.dnetlib.dhp.oa.dedup;

 import java.io.StringReader;
 import java.util.ArrayList;
+import java.util.HashMap;
 import java.util.List;
+import java.util.Map;

+import org.apache.spark.SparkContext;
+import org.apache.spark.util.LongAccumulator;
 import org.dom4j.Document;
 import org.dom4j.DocumentException;
 import org.dom4j.Element;
--- a/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkBlockStats.java
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkBlockStats.java
@ -3,13 +3,17 @@ package eu.dnetlib.dhp.oa.dedup;

 import java.io.IOException;
 import java.util.Collection;
+import java.util.List;
 import java.util.Optional;
+import java.util.stream.Collectors;

 import org.apache.commons.io.IOUtils;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.api.java.function.MapFunction;
 import org.apache.spark.sql.*;
+import org.apache.spark.sql.catalyst.expressions.Literal;
+import org.apache.spark.sql.types.DataTypes;
 import org.dom4j.DocumentException;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@ -21,7 +25,8 @@ import eu.dnetlib.dhp.utils.ISLookupClientFactory;
 import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
 import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
 import eu.dnetlib.pace.config.DedupConfig;
-import eu.dnetlib.pace.model.SparkDeduper;
+import eu.dnetlib.pace.model.RowDataOrderingComparator;
+import eu.dnetlib.pace.model.SparkDedupConfig;

 public class SparkBlockStats extends AbstractSparkAction {

@ -85,28 +90,27 @@ public class SparkBlockStats extends AbstractSparkAction {

 			JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

-			SparkDeduper deduper = new SparkDeduper(dedupConf);
+			SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);

-			Dataset<Row> simRels = spark
+			Dataset<Row> inputDF = spark
 				.read()
 				.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
-				.transform(deduper.model().parseJsonDataset())
-				.transform(deduper.filterAndCleanup())
-				.transform(deduper.generateClustersWithCollect())
-				.filter(functions.size(new Column("block")).geq(1));
+				.transform(sparkConfig.modelExtractor());
+			Dataset<Row> simRels = inputDF
+				.transform(sparkConfig.generateClusters())
+				.filter(functions.size(new Column("block")).geq(new Literal(1, DataTypes.IntegerType)));

-			simRels.map((MapFunction<Row, BlockStats>) row -> {
-				Collection<Row> mapDocuments = row.getList(row.fieldIndex("block"));
+			simRels.map((MapFunction<Row, BlockStats>) b -> {
+				Collection<Row> documents = b.getList(1);

-				/*
-				 * List<Row> mapDocuments = documents .stream() .sorted( new
-				 * RowDataOrderingComparator(deduper.model().orderingFieldPosition(),
-				 * deduper.model().identityFieldPosition())) .limit(dedupConf.getWf().getQueueMaxSize())
-				 * .collect(Collectors.toList());
-				 */
+				List<Row> mapDocuments = documents
+					.stream()
+					.sorted(new RowDataOrderingComparator(sparkConfig.orderingFieldPosition()))
+					.limit(dedupConf.getWf().getQueueMaxSize())
+					.collect(Collectors.toList());

 				return new BlockStats(
-					row.getString(row.fieldIndex("key")),
+					b.getString(0),
 					(long) mapDocuments.size(),
 					computeComparisons(
 						(long) mapDocuments.size(), (long) dedupConf.getWf().getSlidingWindowSize()));
--- a/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkCleanRelation.scala
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkCleanRelation.scala
@ -1,78 +0,0 @@
-package eu.dnetlib.dhp.oa.dedup
-
-import eu.dnetlib.dhp.application.ArgumentApplicationParser
-import eu.dnetlib.dhp.common.HdfsSupport
-import eu.dnetlib.dhp.schema.oaf.Relation
-import eu.dnetlib.dhp.utils.ISLookupClientFactory
-import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService
-import org.apache.commons.io.IOUtils
-import org.apache.spark.SparkConf
-import org.apache.spark.sql._
-import org.apache.spark.sql.functions.col
-import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
-import org.slf4j.LoggerFactory
-
-object SparkCleanRelation {
-  private val log = LoggerFactory.getLogger(classOf[SparkCleanRelation])
-
-  @throws[Exception]
-  def main(args: Array[String]): Unit = {
-    val parser = new ArgumentApplicationParser(
-      IOUtils.toString(
-        classOf[SparkCleanRelation].getResourceAsStream("/eu/dnetlib/dhp/oa/dedup/cleanRelation_parameters.json")
-      )
-    )
-    parser.parseArgument(args)
-    val conf = new SparkConf
-
-    new SparkCleanRelation(parser, AbstractSparkAction.getSparkSession(conf))
-      .run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")))
-  }
-}
-
-class SparkCleanRelation(parser: ArgumentApplicationParser, spark: SparkSession)
-    extends AbstractSparkAction(parser, spark) {
-  override def run(isLookUpService: ISLookUpService): Unit = {
-    val graphBasePath = parser.get("graphBasePath")
-    val inputPath = parser.get("inputPath")
-    val outputPath = parser.get("outputPath")
-
-    SparkCleanRelation.log.info("graphBasePath: '{}'", graphBasePath)
-    SparkCleanRelation.log.info("inputPath: '{}'", inputPath)
-    SparkCleanRelation.log.info("outputPath: '{}'", outputPath)
-
-    AbstractSparkAction.removeOutputDir(spark, outputPath)
-
-    val entities =
-      Seq("datasource", "project", "organization", "publication", "dataset", "software", "otherresearchproduct")
-
-    val idsSchema = StructType.fromDDL("`id` STRING, `dataInfo` STRUCT<`deletedbyinference`:BOOLEAN,`invisible`:BOOLEAN>")
-
-    val emptyIds = spark.createDataFrame(spark.sparkContext.emptyRDD[Row].setName("empty"),
-      idsSchema)
-
-    val ids = entities
-      .foldLeft(emptyIds)((ds, entity) => {
-        val entityPath = graphBasePath + '/' + entity
-        if (HdfsSupport.exists(entityPath, spark.sparkContext.hadoopConfiguration)) {
-          ds.union(spark.read.schema(idsSchema).json(entityPath))
-        } else {
-          ds
-        }
-      })
-      .filter("dataInfo.deletedbyinference != true AND dataInfo.invisible != true")
-      .select("id")
-      .distinct()
-
-    val relations = spark.read.schema(Encoders.bean(classOf[Relation]).schema).json(inputPath)
-      .filter("dataInfo.deletedbyinference != true AND dataInfo.invisible != true")
-
-    AbstractSparkAction.save(
-      relations
-        .join(ids, col("source") === ids("id"), "leftsemi")
-        .join(ids, col("target") === ids("id"), "leftsemi"),
-      outputPath,
-      SaveMode.Overwrite
-    )
-  }
-}
--- a/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkCreateSimRels.java
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkCreateSimRels.java
@ -23,7 +23,7 @@ import eu.dnetlib.dhp.utils.ISLookupClientFactory;
 import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
 import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
 import eu.dnetlib.pace.config.DedupConfig;
-import eu.dnetlib.pace.model.SparkDeduper;
+import eu.dnetlib.pace.model.SparkDedupConfig;

 public class SparkCreateSimRels extends AbstractSparkAction {

@ -84,14 +84,20 @@ public class SparkCreateSimRels extends AbstractSparkAction {

 			JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());

-			SparkDeduper deduper = new SparkDeduper(dedupConf);
+			SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);
+
+			spark.udf().register("collect_sort_slice", sparkConfig.collectSortSliceUDAF());

 			Dataset<?> simRels = spark
 				.read()
 				.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
-				.transform(deduper.model().parseJsonDataset())
-				.transform(deduper.dedup())
-				.distinct()
+				.transform(sparkConfig.modelExtractor()) // Extract fields from input json column according to model
+				// definition
+				.transform(sparkConfig.generateClustersWithWindows()) // generate <key,block> pairs according to
+				// filters, clusters, and model
+				// definition
+				.transform(sparkConfig.processClusters()) // process blocks and emits <from,to> pairs of found
+				// similarities
 				.map(
 					(MapFunction<Row, Relation>) t -> DedupUtility
 						.createSimRel(t.getStruct(0).getString(0), t.getStruct(0).getString(1), entity),
--- a/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkPropagateRelation.java
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkPropagateRelation.java
@ -3,18 +3,13 @@ package eu.dnetlib.dhp.oa.dedup;

 import static org.apache.spark.sql.functions.col;

-import java.util.Arrays;
-import java.util.Collections;
-import java.util.Iterator;
 import java.util.Objects;

-import org.apache.commons.beanutils.BeanUtils;
 import org.apache.commons.io.IOUtils;
 import org.apache.commons.lang3.StringUtils;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.function.FilterFunction;
 import org.apache.spark.api.java.function.MapFunction;
-import org.apache.spark.api.java.function.ReduceFunction;
 import org.apache.spark.sql.*;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@ -33,9 +28,9 @@ public class SparkPropagateRelation extends AbstractSparkAction {

 	private static final Logger log = LoggerFactory.getLogger(SparkPropagateRelation.class);

-	private static Encoder<Relation> REL_BEAN_ENC = Encoders.bean(Relation.class);
-
-	private static Encoder<Relation> REL_KRYO_ENC = Encoders.kryo(Relation.class);
+	enum FieldType {
+		SOURCE, TARGET
+	}

 	public SparkPropagateRelation(ArgumentApplicationParser parser, SparkSession spark) {
 		super(parser, spark);
@ -76,62 +71,38 @@ public class SparkPropagateRelation extends AbstractSparkAction {
 		Dataset<Relation> mergeRels = spark
 			.read()
 			.load(DedupUtility.createMergeRelPath(workingPath, "*", "*"))
-			.as(REL_BEAN_ENC);
+			.as(Encoders.bean(Relation.class));

 		// <mergedObjectID, dedupID>
-		Dataset<Row> mergedIds = mergeRels
+		Dataset<Tuple2<String, String>> mergedIds = mergeRels
 			.where(col("relClass").equalTo(ModelConstants.MERGES))
-			.select(col("source").as("dedupID"), col("target").as("mergedObjectID"))
+			.select(col("source"), col("target"))
 			.distinct()
+			.map(
+				(MapFunction<Row, Tuple2<String, String>>) r -> new Tuple2<>(r.getString(1), r.getString(0)),
+				Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
 			.cache();

-		Dataset<Row> allRels = spark
-			.read()
-			.schema(REL_BEAN_ENC.schema())
-			.json(DedupUtility.createEntityPath(graphBasePath, "relation"));
+		final String relationPath = DedupUtility.createEntityPath(graphBasePath, "relation");

-		Dataset<Relation> dedupedRels = allRels
-			.joinWith(mergedIds, allRels.col("source").equalTo(mergedIds.col("mergedObjectID")), "left_outer")
-			.joinWith(mergedIds, col("_1.target").equalTo(mergedIds.col("mergedObjectID")), "left_outer")
-			.select("_1._1", "_1._2.dedupID", "_2.dedupID")
-			.as(Encoders.tuple(REL_BEAN_ENC, Encoders.STRING(), Encoders.STRING()))
-			.flatMap(SparkPropagateRelation::addInferredRelations, REL_KRYO_ENC);
+		Dataset<Relation> rels = spark.read().textFile(relationPath).map(patchRelFn(), Encoders.bean(Relation.class));

-		Dataset<Relation> processedRelations = distinctRelations(
-			dedupedRels.union(mergeRels.map((MapFunction<Relation, Relation>) r -> r, REL_KRYO_ENC)))
-				.filter((FilterFunction<Relation>) r -> !Objects.equals(r.getSource(), r.getTarget()));
+		Dataset<Relation> newRels = createNewRels(rels, mergedIds, getFixRelFn());

-		save(processedRelations, outputRelationPath, SaveMode.Overwrite);
-	}
+		Dataset<Relation> updated = processDataset(
+			processDataset(rels, mergedIds, FieldType.SOURCE, getDeletedFn()),
+			mergedIds,
+			FieldType.TARGET,
+			getDeletedFn());

-	private static Iterator<Relation> addInferredRelations(Tuple3<Relation, String, String> t) throws Exception {
-		Relation existingRel = t._1();
-		String newSource = t._2();
-		String newTarget = t._3();
-
-		if (newSource == null && newTarget == null) {
-			return Collections.singleton(t._1()).iterator();
-		}
-
-		// update existing relation
-		if (existingRel.getDataInfo() == null) {
-			existingRel.setDataInfo(new DataInfo());
-		}
-		existingRel.getDataInfo().setDeletedbyinference(true);
-
-		// Create new relation inferred by dedupIDs
-		Relation inferredRel = (Relation) BeanUtils.cloneBean(existingRel);
-
-		inferredRel.setDataInfo((DataInfo) BeanUtils.cloneBean(existingRel.getDataInfo()));
-		inferredRel.getDataInfo().setDeletedbyinference(false);
-
-		if (newSource != null)
-			inferredRel.setSource(newSource);
-
-		if (newTarget != null)
-			inferredRel.setTarget(newTarget);
-
-		return Arrays.asList(existingRel, inferredRel).iterator();
+		save(
+			distinctRelations(
+				newRels
+					.union(updated)
+					.union(mergeRels)
+					.map((MapFunction<Relation, Relation>) r -> r, Encoders.kryo(Relation.class)))
+						.filter((FilterFunction<Relation>) r -> !Objects.equals(r.getSource(), r.getTarget())),
+			outputRelationPath, SaveMode.Overwrite);
 	}

 	private Dataset<Relation> distinctRelations(Dataset<Relation> rels) {
@ -139,13 +110,54 @@ public class SparkPropagateRelation extends AbstractSparkAction {
 			.filter(getRelationFilterFunction())
 			.groupByKey(
 				(MapFunction<Relation, String>) r -> String
-					.join(" ", r.getSource(), r.getTarget(), r.getRelType(), r.getSubRelType(), r.getRelClass()),
+					.join(r.getSource(), r.getTarget(), r.getRelType(), r.getSubRelType(), r.getRelClass()),
 				Encoders.STRING())
-			.reduceGroups((ReduceFunction<Relation>) (b, a) -> {
-				b.mergeFrom(a);
-				return b;
-			})
-			.map((MapFunction<Tuple2<String, Relation>, Relation>) Tuple2::_2, REL_BEAN_ENC);
+			.agg(new RelationAggregator().toColumn())
+			.map((MapFunction<Tuple2<String, Relation>, Relation>) Tuple2::_2, Encoders.bean(Relation.class));
+	}
+
+	// redirect the relations to the dedupID
+	private static Dataset<Relation> createNewRels(
+		Dataset<Relation> rels, // all the relations to be redirected
+		Dataset<Tuple2<String, String>> mergedIds, // merge rels: <mergedObjectID, dedupID>
+		MapFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>, Relation> mapRel) {
+
+		// <sourceID, relation, targetID>
+		Dataset<Tuple3<String, Relation, String>> mapped = rels
+			.map(
+				(MapFunction<Relation, Tuple3<String, Relation, String>>) r -> new Tuple3<>(getId(r, FieldType.SOURCE),
+					r, getId(r, FieldType.TARGET)),
+				Encoders.tuple(Encoders.STRING(), Encoders.kryo(Relation.class), Encoders.STRING()));
+
+		// < <sourceID, relation, target>, <sourceID, dedupID> >
+		Dataset<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>> relSource = mapped
+			.joinWith(mergedIds, mapped.col("_1").equalTo(mergedIds.col("_1")), "left_outer");
+
+		// < <<sourceID, relation, targetID>, <sourceID, dedupID>>, <targetID, dedupID> >
+		Dataset<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>> relSourceTarget = relSource
+			.joinWith(mergedIds, relSource.col("_1._3").equalTo(mergedIds.col("_1")), "left_outer");
+
+		return relSourceTarget
+			.filter(
+				(FilterFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>>) r -> r
+					._1()
+					._1() != null || r._2() != null)
+			.map(mapRel, Encoders.bean(Relation.class))
+			.distinct();
+	}
+
+	private static Dataset<Relation> processDataset(
+		Dataset<Relation> rels,
+		Dataset<Tuple2<String, String>> mergedIds,
+		FieldType type,
+		MapFunction<Tuple2<Tuple2<String, Relation>, Tuple2<String, String>>, Relation> mapFn) {
+		final Dataset<Tuple2<String, Relation>> mapped = rels
+			.map(
+				(MapFunction<Relation, Tuple2<String, Relation>>) r -> new Tuple2<>(getId(r, type), r),
+				Encoders.tuple(Encoders.STRING(), Encoders.kryo(Relation.class)));
+		return mapped
+			.joinWith(mergedIds, mapped.col("_1").equalTo(mergedIds.col("_1")), "left_outer")
+			.map(mapFn, Encoders.bean(Relation.class));
 	}

 	private FilterFunction<Relation> getRelationFilterFunction() {
@ -155,4 +167,52 @@ public class SparkPropagateRelation extends AbstractSparkAction {
 			StringUtils.isNotBlank(r.getSubRelType()) ||
 			StringUtils.isNotBlank(r.getRelClass());
 	}
+
+	private static String getId(Relation r, FieldType type) {
+		switch (type) {
+			case SOURCE:
+				return r.getSource();
+			case TARGET:
+				return r.getTarget();
+			default:
+				throw new IllegalArgumentException("");
+		}
+	}
+
+	private static MapFunction<Tuple2<Tuple2<Tuple3<String, Relation, String>, Tuple2<String, String>>, Tuple2<String, String>>, Relation> getFixRelFn() {
+		return value -> {
+
+			Relation r = value._1()._1()._2();
+			String newSource = value._1()._2() != null ? value._1()._2()._2() : null;
+			String newTarget = value._2() != null ? value._2()._2() : null;
+
+			if (r.getDataInfo() == null) {
+				r.setDataInfo(new DataInfo());
+			}
+			r.getDataInfo().setDeletedbyinference(false);
+
+			if (newSource != null)
+				r.setSource(newSource);
+
+			if (newTarget != null)
+				r.setTarget(newTarget);
+
+			return r;
+		};
+	}
+
+	private static MapFunction<Tuple2<Tuple2<String, Relation>, Tuple2<String, String>>, Relation> getDeletedFn() {
+		return value -> {
+			if (value._2() != null) {
+				Relation r = value._1()._2();
+				if (r.getDataInfo() == null) {
+					r.setDataInfo(new DataInfo());
+				}
+				r.getDataInfo().setDeletedbyinference(true);
+				return r;
+			}
+			return value._1()._2();
+		};
+	}
+
 }
--- a/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkSimRelsAnalytics.java
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkSimRelsAnalytics.java
@ -0,0 +1,118 @@
+
+package eu.dnetlib.dhp.oa.dedup;
+
+import java.io.IOException;
+import java.util.Optional;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.MapFunction;
+import org.apache.spark.sql.*;
+import org.dom4j.DocumentException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.xml.sax.SAXException;
+
+import eu.dnetlib.dhp.application.ArgumentApplicationParser;
+import eu.dnetlib.dhp.application.dedup.log.DedupLogModel;
+import eu.dnetlib.dhp.application.dedup.log.DedupLogWriter;
+import eu.dnetlib.dhp.schema.oaf.Relation;
+import eu.dnetlib.dhp.utils.ISLookupClientFactory;
+import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
+import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
+import eu.dnetlib.pace.config.DedupConfig;
+import eu.dnetlib.pace.model.SparkDedupConfig;
+
+public class SparkSimRelsAnalytics extends AbstractSparkAction {
+
+	private static final Logger log = LoggerFactory.getLogger(SparkSimRelsAnalytics.class);
+
+	public SparkSimRelsAnalytics(ArgumentApplicationParser parser, SparkSession spark) {
+		super(parser, spark);
+		spark.sparkContext().setLogLevel("WARN");
+	}
+
+	public static void main(String[] args) throws Exception {
+		ArgumentApplicationParser parser = new ArgumentApplicationParser(
+			IOUtils
+				.toString(
+					SparkSimRelsAnalytics.class
+						.getResourceAsStream(
+							"/eu/dnetlib/dhp/oa/dedup/createSimRels_parameters.json")));
+		parser.parseArgument(args);
+
+		SparkConf conf = new SparkConf();
+		new SparkSimRelsAnalytics(parser, getSparkSession(conf))
+			.run(ISLookupClientFactory.getLookUpService(parser.get("isLookUpUrl")));
+	}
+
+	@Override
+	public void run(ISLookUpService isLookUpService)
+		throws DocumentException, IOException, ISLookUpException, SAXException {
+
+		// read oozie parameters
+		final String graphBasePath = parser.get("graphBasePath");
+		final String isLookUpUrl = parser.get("isLookUpUrl");
+		final String actionSetId = parser.get("actionSetId");
+		final String workingPath = parser.get("workingPath");
+		final int numPartitions = Optional
+			.ofNullable(parser.get("numPartitions"))
+			.map(Integer::valueOf)
+			.orElse(NUM_PARTITIONS);
+
+		log.info("numPartitions: '{}'", numPartitions);
+		log.info("graphBasePath: '{}'", graphBasePath);
+		log.info("isLookUpUrl:   '{}'", isLookUpUrl);
+		log.info("actionSetId:   '{}'", actionSetId);
+		log.info("workingPath:   '{}'", workingPath);
+
+		final String dfLogPath = parser.get("dataframeLog");
+		final String runTag = Optional.ofNullable(parser.get("runTAG")).orElse("UNKNOWN");
+
+		// for each dedup configuration
+		for (DedupConfig dedupConf : getConfigurations(isLookUpService, actionSetId)) {
+			final long start = System.currentTimeMillis();
+
+			final String entity = dedupConf.getWf().getEntityType();
+			final String subEntity = dedupConf.getWf().getSubEntityValue();
+			log.info("Creating simrels for: '{}'", subEntity);
+
+			final String outputPath = DedupUtility.createSimRelPath(workingPath, actionSetId, subEntity);
+			removeOutputDir(spark, outputPath);
+
+			JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
+
+			SparkDedupConfig sparkConfig = new SparkDedupConfig(dedupConf, numPartitions);
+
+			spark.udf().register("collect_sort_slice", sparkConfig.collectSortSliceUDAF());
+
+			Dataset<?> simRels = spark
+				.read()
+				.textFile(DedupUtility.createEntityPath(graphBasePath, subEntity))
+				.transform(sparkConfig.modelExtractor()) // Extract fields from input json column according to model
+				// definition
+				.transform(sparkConfig.generateClustersWithWindows()) // generate <key,block> pairs according to
+				// filters, clusters, and model
+				// definition
+				.transform(sparkConfig.processClusters()) // process blocks and emits <from,to> pairs of found
+				// similarities
+				.map(
+					(MapFunction<Row, Relation>) t -> DedupUtility
+						.createSimRel(t.getStruct(0).getString(0), t.getStruct(0).getString(1), entity),
+					Encoders.bean(Relation.class));
+
+			saveParquet(simRels, outputPath, SaveMode.Overwrite);
+			final long end = System.currentTimeMillis();
+			if (StringUtils.isNotBlank(dfLogPath)) {
+				final DedupLogModel model = new DedupLogModel(runTag, dedupConf.toString(), subEntity, start, end,
+					end - start);
+				new DedupLogWriter(dfLogPath).appendLog(model, spark);
+
+			}
+
+		}
+	}
+
+}
--- a/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkWhitelistSimRels.java
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/SparkWhitelistSimRels.java
@ -104,6 +104,18 @@ public class SparkWhitelistSimRels extends AbstractSparkAction {
 				.join(entities, whiteListRels1.col("to").equalTo(entities.col("id")), "inner")
 				.select("from", "to");

+//			Dataset<Tuple2<String, String>> whiteListRels1 = whiteListRels
+//				.joinWith(entities, whiteListRels.col("_1").equalTo(entities.col("_1")), "inner")
+//				.map(
+//					(MapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, String>>, Tuple2<String, String>>) Tuple2::_1,
+//					Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
+//
+//			Dataset<Tuple2<String, String>> whiteListRels2 = whiteListRels1
+//				.joinWith(entities, whiteListRels1.col("_2").equalTo(entities.col("_1")), "inner")
+//				.map(
+//					(MapFunction<Tuple2<Tuple2<String, String>, Tuple2<String, String>>, Tuple2<String, String>>) Tuple2::_1,
+//					Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
+
 			Dataset<Relation> whiteListSimRels = whiteListRels2
 				.map(
 					(MapFunction<Row, Relation>) r -> DedupUtility
--- a/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/dsl/Clustering.scala
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/dsl/Clustering.scala
@ -0,0 +1,15 @@
+package eu.dnetlib.dhp.oa.dedup.dsl
+
+case class Clustering(name: String = "",
+                 fields: Seq[String] = Seq(),
+                 params: Map[String,Int] = Map()) {
+  def withName(name: String) : Clustering =
+    copy(name = name)
+
+  def withFields(fields: String*): Clustering =
+    copy(fields = fields)
+
+  def withParams(params: Map[String,Int]): Clustering =
+    copy(params = params)
+}
+
--- a/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/dsl/Deduper.scala
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/java/eu/dnetlib/dhp/oa/dedup/dsl/Deduper.scala
@ -0,0 +1,11 @@
+package eu.dnetlib.dhp.oa.dedup.dsl
+
+import org.apache.spark.sql.types.StructType
+
+case class Deduper (schema: StructType,
+               clusterings: Seq[Clustering] = Seq()) {
+
+  def withClustering(clusterings: Clustering*) =
+    copy(clusterings = clusterings)
+
+}
--- a/dhp-workflows/dhp-dedup-openaire/src/main/resources/eu/dnetlib/dhp/oa/dedup/cleanRelation_parameters.json
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/resources/eu/dnetlib/dhp/oa/dedup/cleanRelation_parameters.json
@ -1,20 +0,0 @@
-[
-  {
-    "paramName": "i",
-    "paramLongName": "graphBasePath",
-    "paramDescription": "the base path of raw graph",
-    "paramRequired": true
-  },
-  {
-    "paramName": "w",
-    "paramLongName": "inputPath",
-    "paramDescription": "the path to the input relation to cleanup",
-    "paramRequired": true
-  },
-  {
-    "paramName": "o",
-    "paramLongName": "outputPath",
-    "paramDescription": "the path of the output relation cleaned",
-    "paramRequired": true
-  }
-]
--- a/dhp-workflows/dhp-dedup-openaire/src/main/resources/eu/dnetlib/dhp/oa/dedup/consistency/oozie_app/config-default.xml
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/resources/eu/dnetlib/dhp/oa/dedup/consistency/oozie_app/config-default.xml
@ -15,8 +15,4 @@
        <name>oozie.action.sharelib.for.spark</name>
        <value>spark2</value>
    </property>
-    <property>
-        <name>sparkExecutorMemoryOverhead</name>
-        <value>1G</value>
-    </property>
 </configuration>
--- a/dhp-workflows/dhp-dedup-openaire/src/main/resources/eu/dnetlib/dhp/oa/dedup/consistency/oozie_app/workflow.xml
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/resources/eu/dnetlib/dhp/oa/dedup/consistency/oozie_app/workflow.xml
@ -12,26 +12,19 @@
            <name>graphOutputPath</name>
            <description>path of the output graph</description>
        </property>
-        <property>
-            <name>filterInvisible</name>
-            <description>whether filter out invisible entities after merge</description>
-        </property>
        <property>
            <name>sparkDriverMemory</name>
-            <description>heap memory for driver process</description>
+            <description>memory for driver process</description>
        </property>
        <property>
            <name>sparkExecutorMemory</name>
-            <description>heap memory for individual executor</description>
-        </property>
-        <property>
-            <name>sparkExecutorMemoryOverhead</name>
-            <description>off-heap memory for individual executor</description>
+            <description>memory for individual executor</description>
        </property>
        <property>
            <name>sparkExecutorCores</name>
            <description>number of cores used by single executor</description>
        </property>
+
        <property>
            <name>oozieActionShareLibForSpark2</name>
            <description>oozie action sharelib for spark 2.*</description>
@ -90,7 +83,6 @@
            <jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
            <spark-opts>
                --executor-memory=${sparkExecutorMemory}
-                --conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
                --executor-cores=${sparkExecutorCores}
                --driver-memory=${sparkDriverMemory}
                --conf spark.extraListeners=${spark2ExtraListeners}
@ -100,35 +92,9 @@
                --conf spark.sql.shuffle.partitions=15000
            </spark-opts>
            <arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
-            <arg>--graphOutputPath</arg><arg>${workingPath}/propagaterelation/</arg>
+            <arg>--o</arg><arg>${graphOutputPath}</arg>
            <arg>--workingPath</arg><arg>${workingPath}</arg>
        </spark>
-        <ok to="CleanRelation"/>
-        <error to="Kill"/>
-    </action>
-
-    <action name="CleanRelation">
-        <spark xmlns="uri:oozie:spark-action:0.2">
-            <master>yarn</master>
-            <mode>cluster</mode>
-            <name>Clean Relations</name>
-            <class>eu.dnetlib.dhp.oa.dedup.SparkCleanRelation</class>
-            <jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
-            <spark-opts>
-                --executor-memory=${sparkExecutorMemory}
-                --conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
-                --executor-cores=${sparkExecutorCores}
-                --driver-memory=${sparkDriverMemory}
-                --conf spark.extraListeners=${spark2ExtraListeners}
-                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
-                --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
-                --conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
-                --conf spark.sql.shuffle.partitions=15000
-            </spark-opts>
-            <arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
-            <arg>--inputPath</arg><arg>${workingPath}/propagaterelation/relation</arg>
-            <arg>--outputPath</arg><arg>${graphOutputPath}/relation</arg>
-        </spark>
        <ok to="group_entities"/>
        <error to="Kill"/>
    </action>
@ -141,9 +107,8 @@
            <class>eu.dnetlib.dhp.oa.merge.GroupEntitiesSparkJob</class>
            <jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
            <spark-opts>
-                --executor-memory=${sparkExecutorMemory}
-                --conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
                --executor-cores=${sparkExecutorCores}
+                --executor-memory=${sparkExecutorMemory}
                --driver-memory=${sparkDriverMemory}
                --conf spark.extraListeners=${spark2ExtraListeners}
                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
@ -154,21 +119,30 @@
            <arg>--graphInputPath</arg><arg>${graphBasePath}</arg>
            <arg>--outputPath</arg><arg>${workingPath}/grouped_entities</arg>
        </spark>
-        <ok to="dispatch_entities"/>
+        <ok to="fork_dispatch_entities"/>
        <error to="Kill"/>
    </action>

-    <action name="dispatch_entities">
+    <fork name="fork_dispatch_entities">
+        <path start="dispatch_datasource"/>
+        <path start="dispatch_project"/>
+        <path start="dispatch_organization"/>
+        <path start="dispatch_publication"/>
+        <path start="dispatch_dataset"/>
+        <path start="dispatch_software"/>
+        <path start="dispatch_otherresearchproduct"/>
+    </fork>
+
+    <action name="dispatch_datasource">
        <spark xmlns="uri:oozie:spark-action:0.2">
            <master>yarn</master>
            <mode>cluster</mode>
-            <name>Dispatch grouped entitities</name>
+            <name>Dispatch publications</name>
            <class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
            <jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
            <spark-opts>
-                --executor-memory=${sparkExecutorMemory}
-                --conf spark.executor.memoryOverhead=${sparkExecutorMemoryOverhead}
                --executor-cores=${sparkExecutorCores}
+                --executor-memory=${sparkExecutorMemory}
                --driver-memory=${sparkDriverMemory}
                --conf spark.extraListeners=${spark2ExtraListeners}
                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
@ -177,12 +151,164 @@
                --conf spark.sql.shuffle.partitions=7680
            </spark-opts>
            <arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
-            <arg>--outputPath</arg><arg>${graphOutputPath}</arg>
-            <arg>--filterInvisible</arg><arg>${filterInvisible}</arg>
+            <arg>--outputPath</arg><arg>${graphOutputPath}/datasource</arg>
+            <arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Datasource</arg>
        </spark>
-        <ok to="End"/>
+        <ok to="wait_dispatch"/>
        <error to="Kill"/>
    </action>

+    <action name="dispatch_project">
+        <spark xmlns="uri:oozie:spark-action:0.2">
+            <master>yarn</master>
+            <mode>cluster</mode>
+            <name>Dispatch project</name>
+            <class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
+            <jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
+            <spark-opts>
+                --executor-cores=${sparkExecutorCores}
+                --executor-memory=${sparkExecutorMemory}
+                --driver-memory=${sparkDriverMemory}
+                --conf spark.extraListeners=${spark2ExtraListeners}
+                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
+                --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
+                --conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
+                --conf spark.sql.shuffle.partitions=7680
+            </spark-opts>
+            <arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
+            <arg>--outputPath</arg><arg>${graphOutputPath}/project</arg>
+            <arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Project</arg>
+        </spark>
+        <ok to="wait_dispatch"/>
+        <error to="Kill"/>
+    </action>
+
+    <action name="dispatch_organization">
+        <spark xmlns="uri:oozie:spark-action:0.2">
+            <master>yarn</master>
+            <mode>cluster</mode>
+            <name>Dispatch organization</name>
+            <class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
+            <jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
+            <spark-opts>
+                --executor-cores=${sparkExecutorCores}
+                --executor-memory=${sparkExecutorMemory}
+                --driver-memory=${sparkDriverMemory}
+                --conf spark.extraListeners=${spark2ExtraListeners}
+                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
+                --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
+                --conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
+                --conf spark.sql.shuffle.partitions=7680
+            </spark-opts>
+            <arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
+            <arg>--outputPath</arg><arg>${graphOutputPath}/organization</arg>
+            <arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Organization</arg>
+        </spark>
+        <ok to="wait_dispatch"/>
+        <error to="Kill"/>
+    </action>
+
+    <action name="dispatch_publication">
+        <spark xmlns="uri:oozie:spark-action:0.2">
+            <master>yarn</master>
+            <mode>cluster</mode>
+            <name>Dispatch publication</name>
+            <class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
+            <jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
+            <spark-opts>
+                --executor-cores=${sparkExecutorCores}
+                --executor-memory=${sparkExecutorMemory}
+                --driver-memory=${sparkDriverMemory}
+                --conf spark.extraListeners=${spark2ExtraListeners}
+                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
+                --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
+                --conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
+                --conf spark.sql.shuffle.partitions=7680
+            </spark-opts>
+            <arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
+            <arg>--outputPath</arg><arg>${graphOutputPath}/publication</arg>
+            <arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>
+        </spark>
+        <ok to="wait_dispatch"/>
+        <error to="Kill"/>
+    </action>
+
+    <action name="dispatch_dataset">
+        <spark xmlns="uri:oozie:spark-action:0.2">
+            <master>yarn</master>
+            <mode>cluster</mode>
+            <name>Dispatch dataset</name>
+            <class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
+            <jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
+            <spark-opts>
+                --executor-cores=${sparkExecutorCores}
+                --executor-memory=${sparkExecutorMemory}
+                --driver-memory=${sparkDriverMemory}
+                --conf spark.extraListeners=${spark2ExtraListeners}
+                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
+                --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
+                --conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
+                --conf spark.sql.shuffle.partitions=7680
+            </spark-opts>
+            <arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
+            <arg>--outputPath</arg><arg>${graphOutputPath}/dataset</arg>
+            <arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Dataset</arg>
+        </spark>
+        <ok to="wait_dispatch"/>
+        <error to="Kill"/>
+    </action>
+
+    <action name="dispatch_software">
+        <spark xmlns="uri:oozie:spark-action:0.2">
+            <master>yarn</master>
+            <mode>cluster</mode>
+            <name>Dispatch software</name>
+            <class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
+            <jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
+            <spark-opts>
+                --executor-cores=${sparkExecutorCores}
+                --executor-memory=${sparkExecutorMemory}
+                --driver-memory=${sparkDriverMemory}
+                --conf spark.extraListeners=${spark2ExtraListeners}
+                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
+                --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
+                --conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
+                --conf spark.sql.shuffle.partitions=7680
+            </spark-opts>
+            <arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
+            <arg>--outputPath</arg><arg>${graphOutputPath}/software</arg>
+            <arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Software</arg>
+        </spark>
+        <ok to="wait_dispatch"/>
+        <error to="Kill"/>
+    </action>
+
+    <action name="dispatch_otherresearchproduct">
+        <spark xmlns="uri:oozie:spark-action:0.2">
+            <master>yarn</master>
+            <mode>cluster</mode>
+            <name>Dispatch otherresearchproduct</name>
+            <class>eu.dnetlib.dhp.oa.merge.DispatchEntitiesSparkJob</class>
+            <jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
+            <spark-opts>
+                --executor-cores=${sparkExecutorCores}
+                --executor-memory=${sparkExecutorMemory}
+                --driver-memory=${sparkDriverMemory}
+                --conf spark.extraListeners=${spark2ExtraListeners}
+                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
+                --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
+                --conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
+                --conf spark.sql.shuffle.partitions=7680
+            </spark-opts>
+            <arg>--inputPath</arg><arg>${workingPath}/grouped_entities</arg>
+            <arg>--outputPath</arg><arg>${graphOutputPath}/otherresearchproduct</arg>
+            <arg>--graphTableClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.OtherResearchProduct</arg>
+        </spark>
+        <ok to="wait_dispatch"/>
+        <error to="Kill"/>
+    </action>
+
+    <join name="wait_dispatch" to="End"/>
+
    <end name="End"/>
 </workflow-app>
--- a/dhp-workflows/dhp-dedup-openaire/src/main/resources/eu/dnetlib/dhp/oa/dedup/scan/oozie_app/config-default.xml
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/resources/eu/dnetlib/dhp/oa/dedup/scan/oozie_app/config-default.xml
@ -13,6 +13,10 @@
    </property>
    <property>
        <name>oozie.action.sharelib.for.spark</name>
-        <value>spark2</value>
+        <value>spark342</value>
+    </property>
+    <property>
+        <name>oozie.launcher.mapreduce.user.classpath.first</name>
+        <value>true</value>
    </property>
 </configuration>
--- a/dhp-workflows/dhp-dedup-openaire/src/main/resources/eu/dnetlib/dhp/oa/dedup/scan/oozie_app/workflow.xml
+++ b/dhp-workflows/dhp-dedup-openaire/src/main/resources/eu/dnetlib/dhp/oa/dedup/scan/oozie_app/workflow.xml
@ -126,15 +126,25 @@
                --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
                --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
                --conf spark.eventLog.dir=${nameNode}${spark2EventLogDir}
-                --conf spark.sql.shuffle.partitions=15000
+                --conf spark.sql.shuffle.partitions=5000
+                --conf spark.driver.extraJavaOptions="-Xss256k"
+                --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=spark-log4j.properties -Xss256k"
+                --conf spark.extraListeners=
+                --conf spark.sql.queryExecutionListeners=
+                --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecutors=100 --conf spark.dynamicAllocation.shuffleTracking.enabled=true
+                --conf spark.network.io.preferDirectBufs=true --conf spark.memory.fraction=0.4 --conf spark.sql.adaptive.coalescePartitions.minPartitionNum=5000
+                --conf spark.shuffle.useOldFetchProtocol=true --conf spark.shuffle.service.enabled=true  --conf spark.eventLog.enabled=true
+                --conf spark.executor.heartbeatInterval=60s
+                --conf spark.network.timeout=640s
+                --conf spark.sql.legacy.allowUntypedScalaUDF=true
            </spark-opts>
            <arg>--graphBasePath</arg><arg>${graphBasePath}</arg>
            <arg>--isLookUpUrl</arg><arg>${isLookUpUrl}</arg>
            <arg>--actionSetId</arg><arg>${actionSetId}</arg>
            <arg>--workingPath</arg><arg>${workingPath}</arg>
-            <arg>--numPartitions</arg><arg>15000</arg>
+            <arg>--numPartitions</arg><arg>5000</arg>
        </spark>
-        <ok to="WhitelistSimRels"/>
+        <ok to="End"/>
        <error to="Kill"/>
    </action>

--- a/dhp-workflows/dhp-dedup-openaire/src/test/java/eu/dnetlib/dhp/oa/dedup/EntityMergerTest.java
+++ b/dhp-workflows/dhp-dedup-openaire/src/test/java/eu/dnetlib/dhp/oa/dedup/EntityMergerTest.java
@ -9,8 +9,7 @@ import java.io.IOException;
 import java.io.Serializable;
 import java.lang.reflect.InvocationTargetException;
 import java.nio.file.Paths;
-import java.util.ArrayList;
-import java.util.List;
+import java.util.*;
 import java.util.stream.Collectors;

 import org.codehaus.jackson.map.ObjectMapper;
@ -18,10 +17,7 @@ import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;

 import eu.dnetlib.dhp.oa.merge.AuthorMerger;
-import eu.dnetlib.dhp.schema.oaf.DataInfo;
-import eu.dnetlib.dhp.schema.oaf.Publication;
-import eu.dnetlib.dhp.schema.oaf.Software;
-import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
+import eu.dnetlib.dhp.schema.oaf.*;
 import eu.dnetlib.pace.util.MapDocumentUtil;
 import scala.Tuple2;

--- a/dhp-workflows/dhp-dedup-openaire/src/test/java/eu/dnetlib/dhp/oa/dedup/SparkDSLExampleTest.java
+++ b/dhp-workflows/dhp-dedup-openaire/src/test/java/eu/dnetlib/dhp/oa/dedup/SparkDSLExampleTest.java
@ -0,0 +1,125 @@
+
+package eu.dnetlib.dhp.oa.dedup;
+
+import static java.nio.file.Files.createTempDirectory;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.mockito.Mockito.lenient;
+
+import java.io.File;
+import java.io.IOException;
+import java.io.Serializable;
+import java.net.URISyntaxException;
+import java.nio.file.Paths;
+
+import org.apache.commons.io.FileUtils;
+import org.apache.commons.io.IOUtils;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.extension.ExtendWith;
+import org.mockito.Mock;
+import org.mockito.Mockito;
+import org.mockito.junit.jupiter.MockitoExtension;
+
+import eu.dnetlib.dhp.application.ArgumentApplicationParser;
+import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
+import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
+
+@ExtendWith(MockitoExtension.class)
+public class SparkDSLExampleTest implements Serializable {
+
+	@Mock(serializable = true)
+	ISLookUpService isLookUpService;
+
+	private static SparkSession spark;
+	private static JavaSparkContext jsc;
+
+	private static String testGraphBasePath;
+	private static String testOutputBasePath;
+	private static final String testActionSetId = "test-orchestrator";
+
+	@BeforeAll
+	public static void beforeAll() throws IOException, URISyntaxException {
+
+		testGraphBasePath = Paths
+			.get(SparkDedupTest.class.getResource("/eu/dnetlib/dhp/dedup/entities").toURI())
+			.toFile()
+			.getAbsolutePath();
+		testOutputBasePath = createTempDirectory(SparkDedupTest.class.getSimpleName() + "-")
+			.toAbsolutePath()
+			.toString();
+
+		FileUtils.deleteDirectory(new File(testOutputBasePath));
+
+		final SparkConf conf = new SparkConf();
+		conf.set("spark.sql.shuffle.partitions", "200");
+		spark = SparkSession
+			.builder()
+			.appName(SparkDedupTest.class.getSimpleName())
+			.master("local[*]")
+			.config(conf)
+			.getOrCreate();
+
+		jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
+	}
+
+	@BeforeEach
+	public void setUp() throws IOException, ISLookUpException {
+
+		lenient()
+			.when(isLookUpService.getResourceProfileByQuery(Mockito.contains(testActionSetId)))
+			.thenReturn(
+				IOUtils
+					.toString(
+						SparkDSLExampleTest.class
+							.getResourceAsStream(
+								"/eu/dnetlib/dhp/dedup/profiles/mock_orchestrator.xml")));
+
+		lenient()
+			.when(isLookUpService.getResourceProfileByQuery(Mockito.contains("organization")))
+			.thenReturn(
+				IOUtils
+					.toString(
+						SparkDSLExampleTest.class
+							.getResourceAsStream(
+								"/eu/dnetlib/dhp/dedup/conf/org.curr.conf.json")));
+	}
+
+	@Test
+	void createBlockStatsTest() throws Exception {
+
+		ArgumentApplicationParser parser = new ArgumentApplicationParser(
+			IOUtils
+				.toString(
+					SparkDSLExampleTest.class
+						.getResourceAsStream(
+							"/eu/dnetlib/dhp/oa/dedup/createBlockStats_parameters.json")));
+		parser
+			.parseArgument(
+				new String[] {
+					"-i", testGraphBasePath,
+					"-asi", testActionSetId,
+					"-la", "lookupurl",
+					"-w", testOutputBasePath
+				});
+
+		new DSLExample(parser, spark).run(isLookUpService);
+
+		long orgs_blocks = spark
+			.read()
+			.textFile(testOutputBasePath + "/" + testActionSetId + "/organization_blockstats")
+			.count();
+
+		assertEquals(480, orgs_blocks);
+	}
+
+	@AfterAll
+	public static void tearDown() {
+		spark.close();
+	}
+}
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
Sandro La Bruzzo	edf5a780b8	minor fix	2023-08-02 12:12:20 +02:00
Sandro La Bruzzo	74fcea66e6	erge branch 'dedup-with-dataframe-spark34' of code-repo.d4science.org:D-Net/dnet-hadoop into dedup-with-dataframe-spark34	2023-07-19 16:55:19 +02:00
Sandro La Bruzzo	e4feedd67e	improved scholix generation using bean	2023-07-19 16:53:28 +02:00
Giambattista Bloisi	617ef05e15	Update commons.lang.version to 3.12.0 to match spark 3.4 version and fix an incompatibility when running with Java 11	2023-07-17 17:01:07 +02:00
Giambattista Bloisi	b6a8be813b	oozie.launcher.mapreduce.user.classpath.first property is required to avoid launch problems	2023-07-14 16:05:14 +02:00
Sandro La Bruzzo	f1ae28fe42	implemented new version of pubmed parser	2023-07-12 10:32:25 +02:00
Sandro La Bruzzo	acf947442a	made the project compilable	2023-07-11 11:37:32 +02:00
Giambattista Bloisi	d80f12da06	Build with spark 3.4 (dedup and dependencies only tested)	2023-07-10 15:54:48 +02:00
Giambattista Bloisi	861c368e65	Code for testing other grouping strategies	2023-07-10 15:52:35 +02:00
Giambattista Bloisi	745e70e0d7	When generating similarities put as 'from' component the one with smaller lexicographic id	2023-07-10 15:45:49 +02:00
Giambattista Bloisi	dcc08cc512	Use UDAF and Aggregation class for testing	2023-07-07 12:35:30 +02:00
Giambattista Bloisi	df19548c56	small changes	2023-07-04 18:36:58 +02:00
Sandro La Bruzzo	890b49fb5d	optimized some dedup functions	2023-06-29 14:08:58 +02:00
Giambattista Bloisi	3129c1c48b	Allow processing of immutable sorted blocks in dedup	2023-06-28 14:01:04 +02:00
Giambattista Bloisi	cb7ad9889c	Fix maven dependencies warning while building	2023-06-28 14:01:04 +02:00
Claudio Atzori	75ff902f9d	WIP: various refactors	2023-06-28 14:00:54 +02:00
Claudio Atzori	326367eccc	WIP: various refactors	2023-06-28 14:00:22 +02:00
Claudio Atzori	521dd7f167	WIP: various refactors	2023-06-28 14:00:18 +02:00
Claudio Atzori	649679de8d	WIP: various refactors	2023-06-28 13:59:11 +02:00
Sandro La Bruzzo	4c2dfcbdf7	Added first implementation using UDF function	2023-06-28 13:58:01 +02:00