Alessia Bardi af1f8af788 Updated officialnmae of pangaea in hostedbymap for Datacite to avoid duplicate entries in the source filter of the portal 2023-06-07 10:29:47 +02:00
Sandro La Bruzzo 883dcd910e Make the tests compilable 2023-05-10 09:05:23 +02:00
Sandro La Bruzzo 706631586b Now it compiles 2023-05-09 13:54:57 +02:00
Sandro La Bruzzo ba380f5826 Makes the code compilable 2023-05-09 12:23:42 +02:00
Sandro La Bruzzo 88fffa6dbd - Removed ModelConstants and update Relation enum 2023-05-03 14:03:01 +02:00
Sandro La Bruzzo f8f4b9a018 -Renamed RelationInverse into RelationLabel.
-Removed findRelation from ModelSupport
-code formatted
2023-04-28 11:44:28 +02:00
Claudio Atzori 8c9a77d7eb WIP merged from graph_cleaning_refactoring, applying model simplification 2023-04-26 16:02:06 +02:00
Claudio Atzori cdb5aad82b merge from beta 2023-04-21 08:47:47 +02:00
Claudio Atzori c3dff7e199 Merge branch 'beta' into ticket_8369 2023-03-14 15:24:03 +01:00
Claudio Atzori 2a914e4463 merged from beta. It compiles 2023-03-10 16:00:48 +01:00
Claudio Atzori 74c40fd4e2 Merge branch 'ticket_8369' of into ticket_8369 2023-02-17 10:59:40 +01:00
Miriam Baglioni eca7ebab6d [enrichment] adding relations in one side only 2023-02-15 16:41:28 +01:00
Miriam Baglioni 624c62f62d [enrichment] changed to make it compile against the new model 2023-02-15 16:20:24 +01:00
Miriam Baglioni 8ddcf10075 resolved conflicts 2023-02-15 15:30:17 +01:00
Claudio Atzori 56495cc727 Merge branch 'ticket_8369' of into ticket_8369 2023-02-15 11:39:36 +01:00
Sandro La Bruzzo 8af8b2ea27 Now Crossref mapping and dhp-aggregation compile 2023-02-14 17:20:13 +01:00
Claudio Atzori 6b0a08e29c Merge branch 'ticket_8369' of into ticket_8369 2023-02-14 15:59:18 +01:00
Sandro La Bruzzo 990e3e2f60 code refactor 2023-02-14 10:32:17 +01:00
Sandro La Bruzzo 8f777af827 increased version number of dhp-code-style 2023-02-14 10:30:34 +01:00
Claudio Atzori 63c5c5848d wip: large refactoring 2023-02-13 16:15:19 +01:00
Miriam Baglioni 125657ed4c - 2023-02-13 12:40:14 +01:00
Claudio Atzori 508648e1d8 Merge branch 'ticket_8369' of into ticket_8369 2023-02-09 16:27:46 +01:00
Claudio Atzori d04610480a wip: large refactoring 2023-02-09 16:27:41 +01:00
Sandro La Bruzzo 6bd5a792dd Added vocabolary mocks 2023-02-09 16:13:34 +01:00
Sandro La Bruzzo 04b12a35cd Moved cressref Utility to dhp-aggregation 2023-02-09 16:11:59 +01:00
Claudio Atzori 934c1846f8 wip: large refactoring 2023-02-09 12:32:28 +01:00
Sandro La Bruzzo 606cada7a4 trying to adapt to the new data model 2023-02-07 11:11:05 +01:00
Claudio Atzori d9c9482a5b WIP: refactoring the internal graph data model and its utilities 2023-02-06 13:45:21 +01:00
Claudio Atzori 67735f7e9d WIP: refactoring model utilities 2023-02-02 17:02:23 +01:00
Claudio Atzori 1845dcfedf WIP: refactoring the internal graph data model and its utilities 2023-02-01 16:24:35 +01:00
@ -26,4 +26,3 @@ spark-warehouse
/**/*.log /**/*.log
/**/.factorypath /**/.factorypath
/**/.scalafmt.conf /**/.scalafmt.conf

@ -1,43 +0,0 @@
# Contributor Code of Conduct
Openness, transparency and our community-driven participatory approach guide us in our day-to-day interactions and decision-making. Our open source projects are no exception. Trust, respect, collaboration and transparency are core values we believe should live and breathe within our projects. Our community welcomes participants from around the world with different experiences, unique perspectives, and great ideas to share.
## Our Pledge
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment include:
- Using welcoming and inclusive language
- Being respectful of differing viewpoints and experiences
- Gracefully accepting constructive criticism
- Attempting collaboration before conflict
- Focusing on what is best for the community
- Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
- Violence, threats of violence, or inciting others to commit self-harm
- The use of sexualized language or imagery and unwelcome sexual attention or advances
- Trolling, intentionally spreading misinformation, insulting/derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or electronic address, without explicit permission
- Abuse of the reporting process to intentionally harass or exclude others
- Advocating for, or encouraging, any of the above behavior
- Other conduct which could reasonably be considered inappropriate in a professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
## Attribution
This Code of Conduct is adapted from the Contributor Covenant, version 1.4.

@ -1,10 +0,0 @@
# Contributing to D-Net Hadoop
:+1::tada: First off, thanks for taking the time to contribute! :tada::+1:
This project and everyone participating in it is governed by our [Code of Conduct]( By participating, you are expected to uphold this code. Please report unacceptable behavior to [](
The following is a set of guidelines for contributing to this project and its packages. These are mostly guidelines, not rules, and they apply to this project as a whole, including all its sub-modules.
Use your best judgment, and feel free to propose changes to this document in a pull request.
All contributions are welcome; all contributions will be considered to be contributed under the project license.


@ -1,133 +1,2 @@
# dnet-hadoop # dnet-hadoop
Dnet-hadoop is the project that defined all the OOZIE workflows for the OpenAIRE Graph construction, processing, provisioning.
Dnet-hadoop is the project that defined all the [OOZIE workflows]( for the OpenAIRE Graph construction, processing, provisioning.
This project adheres to the Contributor Covenant [code of conduct](
By participating, you are expected to uphold this code. Please report unacceptable behavior to [](
This project is licensed under the [AGPL v3 or later version](
How to build, package and run oozie workflows
Oozie-installer is a utility allowing building, uploading and running oozie workflows. In practice, it creates a `*.tar.gz`
package that contains resources that define a workflow and some helper scripts.
This module is automatically executed when running:
`mvn package -Poozie-package -Dworkflow.source.dir=classpath/to/parent/directory/of/oozie_app`
on module having set:
in `pom.xml` file. `oozie-package` profile initializes oozie workflow packaging, `workflow.source.dir` property points to
a workflow (notice: this is not a relative path but a classpath to directory usually holding `oozie_app` subdirectory).
The outcome of this packaging is `oozie-package.tar.gz` file containing inside all the resources required to run Oozie workflow:
- jar packages
- workflow definitions
- job properties
- maintenance scripts
Required properties
In order to include proper workflow within package, `workflow.source.dir` property has to be set. It could be provided
by setting `-Dworkflow.source.dir=some/job/dir` maven parameter.
In order to define the full set of cluster environment properties, one should create a `~/.dhp/` file with
the following properties:
- `` - your user name on hadoop cluster and frontend machine
- `` - frontend host name
- `dhp.hadoop.frontend.temp.dir` - frontend directory for temporary files
- `dhp.hadoop.frontend.port.ssh` - frontend machine ssh port
- `oozieServiceLoc` - oozie service location required by script executing oozie job
- `nameNode` - name node address
- `jobTracker` - job tracker address
- `oozie.execution.log.file.location` - location of file that will be created when executing oozie job, it contains output
produced by `` script (needed to obtain oozie job id)
- `maven.executable` - mvn command location, requires parameterization due to a different setup of CI cluster
- `sparkDriverMemory` - amount of memory assigned to spark jobs driver
- `sparkExecutorMemory` - amount of memory assigned to spark jobs executors
- `sparkExecutorCores` - number of cores assigned to spark jobs executors
All values will be overridden by the ones from `` and, if present, `` stored in the module's
main folder.
When overriding properties from ``, `` file can be created in main module directory
(the one containing `pom.xml` file) and define all new properties which will override existing properties.
One can provide those properties one by one as command line `-D` arguments.
Properties overriding order is the following:
1. `pom.xml` defined properties (located in the project root dir)
2. `~/.dhp/` defined properties
3. `${workflow.source.dir}/`
4. `` (located in the project root dir)
5. `maven -Dparam=value`
where the maven `-Dparam` property is overriding all the other ones.
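The overriding order above can be sketched in Java as layered `java.util.Properties` objects merged lowest-precedence first, so that each later layer replaces keys set by an earlier one. This is only a minimal illustration under stated assumptions, not the oozie-installer's actual implementation: the class `PropertyLayers`, its methods `merge` and `layer`, and the sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: not part of dnet-hadoop, it only illustrates the precedence rule.
class PropertyLayers {

    // Merges the layers in the given order; a later layer overrides keys from earlier ones.
    static Properties merge(List<Properties> layersLowestFirst) {
        Properties merged = new Properties();
        for (Properties layer : layersLowestFirst) {
            merged.putAll(layer);
        }
        return merged;
    }

    // Builds one layer from alternating key/value arguments.
    static Properties layer(String... keyValues) {
        Properties p = new Properties();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            p.setProperty(keyValues[i], keyValues[i + 1]);
        }
        return p;
    }

    public static void main(String[] args) {
        // Sample values are invented for the example.
        Properties pom = layer("nameNode", "hdfs://from-pom", "sparkExecutorCores", "1");
        Properties userFile = layer("nameNode", "hdfs://from-user-file");
        Properties commandLine = layer("sparkExecutorCores", "4");

        Properties merged = merge(Arrays.asList(pom, userFile, commandLine));
        System.out.println(merged.getProperty("nameNode"));           // prints hdfs://from-user-file
        System.out.println(merged.getProperty("sparkExecutorCores")); // prints 4
    }
}
```

Passing the layers in the documented order (1 to 5) gives the command-line `-D` values the final say, which matches the rule stated above.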
Workflow definition requirements
`workflow.source.dir` property should point to the following directory structure:
| (optional)
This property can be set using maven `-D` switch.
`[oozie_app]` is the default directory name; however, it can be set to any value as long as the `oozieAppDir` property is
provided with the directory name as its value.
Sub-workflows are supported as well and sub-workflow directories should be nested within `[oozie_app]` directory.
Creating oozie installer step-by-step
Automated oozie-installer steps are the following:
1. creating jar packages: `*.jar` and `*tests.jar` along with copying all dependencies in `target/dependencies`
2. reading properties from maven, `~/.dhp/`, ``, ``
3. invoking priming mechanism linking resources from import.txt file (currently resolving subworkflow resources)
4. assembling shell scripts for preparing Hadoop filesystem, uploading Oozie application and starting workflow
5. copying whole `${workflow.source.dir}` content to `target/${}`
6. generating updated `` file in `target/${}` based on maven,
`~/.dhp/`, `` and ``
7. creating `lib` directory (or multiple directories for sub-workflows for each nested directory) and copying jar packages
created at step (1) to each one of them
8. bundling whole `${}` directory into single tar.gz package
Uploading oozie package and running workflow on cluster
In order to simplify deployment and execution process two dedicated profiles were introduced:
- `deploy`
- `run`
to be used along with `oozie-package` profile e.g. by providing `-Poozie-package,deploy,run` maven parameters.
The `deploy` profile supplements packaging process with:
1) uploading oozie-package via scp to `/home/${}/oozie-packages` directory on `${}` machine
2) extracting uploaded package
3) uploading oozie content to hadoop cluster HDFS location defined in `` property (generated dynamically by maven build process, based on `${}` and `workflow.source.dir` properties)
The `run` profile introduces:
1) executing oozie application uploaded to HDFS cluster using `deploy` command. Triggers `` script providing runtime properties defined in `` file.
Notice: ssh access to frontend machine has to be configured on system level and it is preferable to set key-based authentication in order to simplify remote operations.

@ -6,7 +6,7 @@
<parent> <parent>
<groupId>eu.dnetlib.dhp</groupId> <groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-build</artifactId> <artifactId>dhp-build</artifactId>
<version>1.2.5-SNAPSHOT</version> <version>2.0.0-SNAPSHOT</version>
</parent> </parent>

@ -6,7 +6,7 @@
<parent> <parent>
<groupId>eu.dnetlib.dhp</groupId> <groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-build</artifactId> <artifactId>dhp-build</artifactId>
<version>1.2.5-SNAPSHOT</version> <version>2.0.0-SNAPSHOT</version>
</parent> </parent>
<artifactId>dhp-build-properties-maven-plugin</artifactId> <artifactId>dhp-build-properties-maven-plugin</artifactId>


@ -5,7 +5,7 @@
<groupId>eu.dnetlib.dhp</groupId> <groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp-code-style</artifactId> <artifactId>dhp-code-style</artifactId>
<version>1.2.5-SNAPSHOT</version> <version>2.0.0-SNAPSHOT</version>
<packaging>jar</packaging> <packaging>jar</packaging>


@ -4,7 +4,7 @@
<parent> <parent>
<groupId>eu.dnetlib.dhp</groupId> <groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp</artifactId> <artifactId>dhp</artifactId>
<version>1.2.5-SNAPSHOT</version> <version>2.0.0-SNAPSHOT</version>
</parent> </parent>
<artifactId>dhp-build</artifactId> <artifactId>dhp-build</artifactId>
<packaging>pom</packaging> <packaging>pom</packaging>


@ -5,7 +5,7 @@
<parent> <parent>
<groupId>eu.dnetlib.dhp</groupId> <groupId>eu.dnetlib.dhp</groupId>
<artifactId>dhp</artifactId> <artifactId>dhp</artifactId>
<version>1.2.5-SNAPSHOT</version> <version>2.0.0-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath> <relativePath>../pom.xml</relativePath>
</parent> </parent>
@ -52,8 +52,6 @@
</execution> </execution>
</executions> </executions>
<configuration> <configuration>
<scalaVersion>${scala.version}</scalaVersion> <scalaVersion>${scala.version}</scalaVersion>
</configuration> </configuration>
</plugin> </plugin>
@ -62,11 +60,6 @@
</build> </build>
<dependencies> <dependencies>
<dependency> <dependency>
<groupId>org.apache.hadoop</groupId> <groupId>org.apache.hadoop</groupId>
@ -83,11 +76,11 @@
<dependency> <dependency>
<groupId>org.apache.spark</groupId> <groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId> <artifactId>spark-core_2.11</artifactId>
</dependency> </dependency>
<dependency> <dependency>
<groupId>org.apache.spark</groupId> <groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId> <artifactId>spark-sql_2.11</artifactId>
</dependency> </dependency>
<dependency> <dependency>
@ -149,6 +142,11 @@
<artifactId>okhttp</artifactId> <artifactId>okhttp</artifactId>
</dependency> </dependency>
<dependency> <dependency>
<groupId>org.apache.httpcomponents</groupId> <groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId> <artifactId>httpclient</artifactId>
@ -161,7 +159,7 @@
<dependency> <dependency>
<groupId>eu.dnetlib.dhp</groupId> <groupId>eu.dnetlib.dhp</groupId>
<artifactId>${dhp-schemas.artifact}</artifactId> <artifactId>dhp-schemas</artifactId>
</dependency> </dependency>

@ -10,12 +10,6 @@ public class Constants {
public static final Map<String, String> accessRightsCoarMap = Maps.newHashMap(); public static final Map<String, String> accessRightsCoarMap = Maps.newHashMap();
public static final Map<String, String> coarCodeLabelMap = Maps.newHashMap(); public static final Map<String, String> coarCodeLabelMap = Maps.newHashMap();
public static final String ROR_NS_PREFIX = "ror_________";
public static final String ROR_OPENAIRE_ID = "10|openaire____::993a7ae7a863813cf95028b50708e222";
public static final String ROR_DATASOURCE_NAME = "Research Organization Registry (ROR)";
public static String COAR_ACCESS_RIGHT_SCHEMA = ""; public static String COAR_ACCESS_RIGHT_SCHEMA = "";
private Constants() { private Constants() {
@ -51,7 +45,6 @@ public class Constants {
public static final String RETRY_DELAY = "retryDelay"; public static final String RETRY_DELAY = "retryDelay";
public static final String CONNECT_TIMEOUT = "connectTimeOut"; public static final String CONNECT_TIMEOUT = "connectTimeOut";
public static final String READ_TIMEOUT = "readTimeOut"; public static final String READ_TIMEOUT = "readTimeOut";
public static final String REQUEST_METHOD = "requestMethod";
public static final String FROM_DATE_OVERRIDE = "fromDateOverride"; public static final String FROM_DATE_OVERRIDE = "fromDateOverride";
public static final String UNTIL_DATE_OVERRIDE = "untilDateOverride"; public static final String UNTIL_DATE_OVERRIDE = "untilDateOverride";

@ -1,6 +1,8 @@
package eu.dnetlib.dhp.common.action; package eu.dnetlib.dhp.common.action;
import static eu.dnetlib.dhp.schema.oaf.utils.IdentifierFactory.*;
import; import;
import; import;
import; import;
@ -19,7 +21,6 @@ import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.common.DbClient; import eu.dnetlib.dhp.common.DbClient;
import eu.dnetlib.dhp.common.action.model.MasterDuplicate; import eu.dnetlib.dhp.common.action.model.MasterDuplicate;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
public class ReadDatasourceMasterDuplicateFromDB { public class ReadDatasourceMasterDuplicateFromDB {
@ -59,8 +60,8 @@ public class ReadDatasourceMasterDuplicateFromDB {
final String masterId = rs.getString("masterId"); final String masterId = rs.getString("masterId");
final String masterName = rs.getString("masterName"); final String masterName = rs.getString("masterName");
md.setDuplicateId(OafMapperUtils.createOpenaireId(10, duplicateId, true)); md.setDuplicateId(createOpenaireId(10, duplicateId, true));
md.setMasterId(OafMapperUtils.createOpenaireId(10, masterId, true)); md.setMasterId(createOpenaireId(10, masterId, true));
md.setMasterName(masterName); md.setMasterName(masterName);
return md; return md;

@ -0,0 +1,53 @@
package eu.dnetlib.dhp.common.api;
import okhttp3.MediaType;
import okhttp3.RequestBody;
import okhttp3.internal.Util;
import okio.BufferedSink;
import okio.Okio;
import okio.Source;
public class InputStreamRequestBody extends RequestBody {
private final InputStream inputStream;
private final MediaType mediaType;
private final long lenght;
public static RequestBody create(final MediaType mediaType, final InputStream inputStream, final long len) {
return new InputStreamRequestBody(inputStream, mediaType, len);
private InputStreamRequestBody(InputStream inputStream, MediaType mediaType, long len) {
this.inputStream = inputStream;
this.mediaType = mediaType;
this.lenght = len;
public MediaType contentType() {
return mediaType;
public long contentLength() {
return lenght;
public void writeTo(BufferedSink sink) throws IOException {
Source source = null;
try {
source = Okio.source(inputStream);
} finally {


@ -0,0 +1,8 @@
package eu.dnetlib.dhp.common.api;
public class MissingConceptDoiException extends Throwable {
public MissingConceptDoiException(String message) {

@ -0,0 +1,365 @@
package eu.dnetlib.dhp.common.api;
import java.util.concurrent.TimeUnit;
import org.apache.http.HttpHeaders;
import org.apache.http.entity.ContentType;
import org.jetbrains.annotations.NotNull;
import eu.dnetlib.dhp.common.api.zenodo.ZenodoModel;
import eu.dnetlib.dhp.common.api.zenodo.ZenodoModelList;
import okhttp3.*;
public class ZenodoAPIClient implements Serializable {
String urlString;
String bucket;
String deposition_id;
String access_token;
public static final MediaType MEDIA_TYPE_JSON = MediaType.parse("application/json; charset=utf-8");
private static final MediaType MEDIA_TYPE_ZIP = MediaType.parse("application/zip");
public String getUrlString() {
return urlString;
public void setUrlString(String urlString) {
this.urlString = urlString;
public String getBucket() {
return bucket;
public void setBucket(String bucket) {
this.bucket = bucket;
public void setDeposition_id(String deposition_id) {
this.deposition_id = deposition_id;
public ZenodoAPIClient(String urlString, String access_token) {
this.urlString = urlString;
this.access_token = access_token;
* Brand new deposition in Zenodo. It sets the deposition_id and the bucket where to store the files to upload
* @return response code
* @throws IOException
public int newDeposition() throws IOException {
String json = "{}";
URL url = new URL(urlString);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString());
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
try (OutputStream os = conn.getOutputStream()) {
byte[] input = json.getBytes("utf-8");
os.write(input, 0, input.length);
String body = getBody(conn);
int responseCode = conn.getResponseCode();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + body);
ZenodoModel newSubmission = new Gson().fromJson(body, ZenodoModel.class);
this.bucket = newSubmission.getLinks().getBucket();
this.deposition_id = newSubmission.getId();
return responseCode;
* Upload files in Zenodo.
* @param is the inputStream for the file to upload
* @param file_name the name of the file as it will appear on Zenodo
* @return the response code
public int uploadIS(InputStream is, String file_name) throws IOException {
URL url = new URL(bucket + "/" + file_name);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.CONTENT_TYPE, "application/zip");
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
byte[] buf = new byte[8192];
int length;
try (OutputStream os = conn.getOutputStream()) {
while ((length = is.read(buf)) != -1) {
os.write(buf, 0, length);
int responseCode = conn.getResponseCode();
if (!checkOKStatus(responseCode)) {
throw new IOException("Unexpected code " + responseCode + getBody(conn));
return responseCode;
private String getBody(HttpURLConnection conn) throws IOException {
String body = "{}";
try (BufferedReader br = new BufferedReader(
new InputStreamReader(conn.getInputStream(), "utf-8"))) {
StringBuilder response = new StringBuilder();
String responseLine = null;
while ((responseLine = br.readLine()) != null) {
body = response.toString();
return body;
* Associates metadata information to the current deposition
* @param metadata the metadata
* @return response code
* @throws IOException
public int sendMretadata(String metadata) throws IOException {
URL url = new URL(urlString + "/" + deposition_id);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString());
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
try (OutputStream os = conn.getOutputStream()) {
byte[] input = metadata.getBytes("utf-8");
os.write(input, 0, input.length);
final int responseCode = conn.getResponseCode();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + getBody(conn));
return responseCode;
private boolean checkOKStatus(int responseCode) {
// Success is 200 (OK) or 201 (Created).
return HttpURLConnection.HTTP_OK == responseCode
|| HttpURLConnection.HTTP_CREATED == responseCode;
* To publish the current deposition. It works for both new deposition or new version of an old deposition
* @return response code
* @throws IOException
public int publish() throws IOException {
String json = "{}";
OkHttpClient httpClient = new OkHttpClient.Builder().connectTimeout(600, TimeUnit.SECONDS).build();
RequestBody body = RequestBody.create(json, MEDIA_TYPE_JSON);
Request request = new Request.Builder()
.url(urlString + "/" + deposition_id + "/actions/publish")
.addHeader("Authorization", "Bearer " + access_token)
try (Response response = httpClient.newCall(request).execute()) {
if (!response.isSuccessful())
throw new IOException("Unexpected code " + response + response.body().string());
return response.code();
* To create a new version of an already published deposition. It sets the deposition_id and the bucket to be used
* for the new version.
* @param concept_rec_id the concept record id of the deposition for which to create a new version. It is the last
* part of the url for the DOI Zenodo suggests to use to cite all versions: DOI:
* concept_rec_id = 656930
* @return response code
* @throws IOException
* @throws MissingConceptDoiException
public int newVersion(String concept_rec_id) throws IOException, MissingConceptDoiException {
setDepositionId(concept_rec_id, 1);
String json = "{}";
URL url = new URL(urlString + "/" + deposition_id + "/actions/newversion");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
try (OutputStream os = conn.getOutputStream()) {
byte[] input = json.getBytes("utf-8");
os.write(input, 0, input.length);
String body = getBody(conn);
int responseCode = conn.getResponseCode();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + body);
ZenodoModel zenodoModel = new Gson().fromJson(body, ZenodoModel.class);
String latest_draft = zenodoModel.getLinks().getLatest_draft();
deposition_id = latest_draft.substring(latest_draft.lastIndexOf("/") + 1);
bucket = getBucket(latest_draft);
return responseCode;
* To finish uploading a version or new deposition not published
* It sets the deposition_id and the bucket to be used
* @param deposition_id the deposition id of the not yet published upload
* concept_rec_id = 656930
* @return response code
* @throws IOException
* @throws MissingConceptDoiException
public int uploadOpenDeposition(String deposition_id) throws IOException, MissingConceptDoiException {
this.deposition_id = deposition_id;
String json = "{}";
URL url = new URL(urlString + "/" + deposition_id);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
try (OutputStream os = conn.getOutputStream()) {
byte[] input = json.getBytes("utf-8");
os.write(input, 0, input.length);
String body = getBody(conn);
int responseCode = conn.getResponseCode();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + body);
ZenodoModel zenodoModel = new Gson().fromJson(body, ZenodoModel.class);
bucket = zenodoModel.getLinks().getBucket();
return responseCode;
private void setDepositionId(String concept_rec_id, Integer page) throws IOException, MissingConceptDoiException {
ZenodoModelList zenodoModelList = new Gson()
.fromJson(getPrevDepositions(String.valueOf(page)), ZenodoModelList.class);
for (ZenodoModel zm : zenodoModelList) {
if (zm.getConceptrecid().equals(concept_rec_id)) {
deposition_id = zm.getId();
if (zenodoModelList.size() == 0)
throw new MissingConceptDoiException(
"The concept record id specified was missing in the list of depositions");
setDepositionId(concept_rec_id, page + 1);
private String getPrevDepositions(String page) throws IOException {
HttpUrl.Builder urlBuilder = HttpUrl.parse(urlString).newBuilder();
urlBuilder.addQueryParameter("page", page);
URL url = new URL(urlBuilder.build().toString());
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString());
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
String body = getBody(conn);
int responseCode = conn.getResponseCode();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + body);
return body;
private String getBucket(String inputUurl) throws IOException {
URL url = new URL(inputUurl);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty(HttpHeaders.CONTENT_TYPE, ContentType.APPLICATION_JSON.toString());
conn.setRequestProperty(HttpHeaders.AUTHORIZATION, "Bearer " + access_token);
String body = getBody(conn);
int responseCode = conn.getResponseCode();
if (!checkOKStatus(responseCode))
throw new IOException("Unexpected code " + responseCode + body);
ZenodoModel zenodoModel = new Gson().fromJson(body, ZenodoModel.class);
return zenodoModel.getLinks().getBucket();
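The page-by-page lookup that `setDepositionId` performs can be sketched as a loop that requests successive pages until it finds the matching concept record id or reaches an empty page. This is a hedged illustration, not the class's actual code: `DepositionLookup`, `Deposition`, `findDepositionId`, and the `fetchPage` function are hypothetical stand-ins for `ZenodoModel` and `getPrevDepositions`, no HTTP call is made, and the sketch throws `IllegalStateException` where the listing uses `MissingConceptDoiException`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical stand-in for the paged deposition lookup; names are assumptions.
class DepositionLookup {

    // Minimal substitute for the two ZenodoModel fields the lookup reads.
    static final class Deposition {
        final String id;
        final String conceptRecId;

        Deposition(String id, String conceptRecId) {
            this.id = id;
            this.conceptRecId = conceptRecId;
        }
    }

    // Walks pages from 1 upwards; stops at the first match, fails on the first empty page.
    static String findDepositionId(String conceptRecId, IntFunction<List<Deposition>> fetchPage) {
        for (int page = 1;; page++) {
            List<Deposition> depositions = fetchPage.apply(page);
            if (depositions.isEmpty()) {
                throw new IllegalStateException(
                    "The concept record id specified was missing in the list of depositions");
            }
            for (Deposition d : depositions) {
                if (d.conceptRecId.equals(conceptRecId)) {
                    return d.id;
                }
            }
        }
    }

    public static void main(String[] args) {
        // In-memory pages replace the remote listing; the ids are invented.
        IntFunction<List<Deposition>> fetchPage = page -> {
            if (page == 1) return Arrays.asList(new Deposition("101", "900"));
            if (page == 2) return Arrays.asList(new Deposition("102", "656930"));
            return Collections.<Deposition>emptyList();
        };
        System.out.println(findDepositionId("656930", fetchPage)); // prints 102
    }
}
```

An iterative loop is used here instead of recursion so that a long deposition history cannot grow the call stack; the stopping conditions are the same.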

@ -1,39 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
public class CategorySummary {
private String id;
private String label;
private boolean hasConcept;
public String getId() {
return id;
public String getLabel() {
return label;
public boolean isHasConcept() {
return hasConcept;
public CategorySummary setId(final String id) {
this.id = id;
return this;
public CategorySummary setLabel(final String label) {
this.label = label;
return this;
public CategorySummary setHasConcept(final boolean hasConcept) {
this.hasConcept = hasConcept;
return this;

@ -1,7 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class CategorySummaryList extends ArrayList<CategorySummary> {

@ -1,52 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.List;
public class ConceptSummary {
private String id;
private String label;
public boolean hasSubConcept;
private List<ConceptSummary> concepts;
public String getId() {
return id;
public String getLabel() {
return label;
public List<ConceptSummary> getConcepts() {
return concepts;
public ConceptSummary setId(final String id) {
this.id = id;
return this;
public ConceptSummary setLabel(final String label) {
this.label = label;
return this;
public boolean isHasSubConcept() {
return hasSubConcept;
public ConceptSummary setHasSubConcept(final boolean hasSubConcept) {
this.hasSubConcept = hasSubConcept;
return this;
public ConceptSummary setConcept(final List<ConceptSummary> concepts) {
this.concepts = concepts;
return this;

@ -1,7 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class ConceptSummaryList extends ArrayList<ConceptSummary> {

@ -1,50 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
public class ContextSummary {
private String id;
private String label;
private String type;
private String status;
public String getId() {
return id;
public String getLabel() {
return label;
public String getType() {
return type;
public String getStatus() {
return status;
public ContextSummary setId(final String id) {
this.id = id;
return this;
public ContextSummary setLabel(final String label) {
this.label = label;
return this;
public ContextSummary setType(final String type) {
this.type = type;
return this;
public ContextSummary setStatus(final String status) {
this.status = status;
return this;

@ -1,7 +0,0 @@
package eu.dnetlib.dhp.common.api.context;
import java.util.ArrayList;
public class ContextSummaryList extends ArrayList<ContextSummary> {

@ -0,0 +1,14 @@
package eu.dnetlib.dhp.common.api.zenodo;
public class Community {
private String identifier;
public String getIdentifier() {
return identifier;
public void setIdentifier(String identifier) {
this.identifier = identifier;


@ -0,0 +1,47 @@
package eu.dnetlib.dhp.common.api.zenodo;
public class Creator {
private String affiliation;
private String name;
private String orcid;
public String getAffiliation() {
return affiliation;
public void setAffiliation(String affiliation) {
this.affiliation = affiliation;
public String getName() {
return name;
public void setName(String name) {
this.name = name;
public String getOrcid() {
return orcid;
public void setOrcid(String orcid) {
this.orcid = orcid;
public static Creator newInstance(String name, String affiliation, String orcid) {
Creator c = new Creator();
if (name != null) {
c.name = name;
if (affiliation != null) {
c.affiliation = affiliation;
if (orcid != null) {
c.orcid = orcid;
return c;

@ -0,0 +1,44 @@
package eu.dnetlib.dhp.common.api.zenodo;
public class File implements Serializable {
private String checksum;
private String filename;
private long filesize;
private String id;
public String getChecksum() {
return checksum;
public void setChecksum(String checksum) {
this.checksum = checksum;
public String getFilename() {
return filename;
public void setFilename(String filename) {
this.filename = filename;
public long getFilesize() {
return filesize;
public void setFilesize(long filesize) {
this.filesize = filesize;
public String getId() {
return id;
public void setId(String id) {
this.id = id;

@ -0,0 +1,23 @@
package eu.dnetlib.dhp.common.api.zenodo;
public class Grant implements Serializable {
private String id;
public String getId() {
return id;
public void setId(String id) {
this.id = id;
public static Grant newInstance(String id) {
Grant g = new Grant();
g.id = id;
return g;

@ -0,0 +1,92 @@
package eu.dnetlib.dhp.common.api.zenodo;
public class Links implements Serializable {
private String bucket;
private String discard;
private String edit;
private String files;
private String html;
private String latest_draft;
private String latest_draft_html;
private String publish;
private String self;
public String getBucket() {
return bucket;
public void setBucket(String bucket) {
this.bucket = bucket;
public String getDiscard() {
return discard;
public void setDiscard(String discard) {
this.discard = discard;
public String getEdit() {
return edit;
public void setEdit(String edit) {
this.edit = edit;
public String getFiles() {
return files;
public void setFiles(String files) {
this.files = files;
public String getHtml() {
return html;
public void setHtml(String html) {
this.html = html;
public String getLatest_draft() {
return latest_draft;
public void setLatest_draft(String latest_draft) {
this.latest_draft = latest_draft;
public String getLatest_draft_html() {
return latest_draft_html;
public void setLatest_draft_html(String latest_draft_html) {
this.latest_draft_html = latest_draft_html;
public String getPublish() {
return publish;
public void setPublish(String publish) {
this.publish = publish;
public String getSelf() {
return self;
public void setSelf(String self) {
this.self = self;

@ -0,0 +1,153 @@
package eu.dnetlib.dhp.common.api.zenodo;
import java.util.List;
public class Metadata implements Serializable {
private String access_right;
private List<Community> communities;
private List<Creator> creators;
private String description;
private String doi;
private List<Grant> grants;
private List<String> keywords;
private String language;
private String license;
private PrereserveDoi prereserve_doi;
private String publication_date;
private List<String> references;
private List<RelatedIdentifier> related_identifiers;
private String title;
private String upload_type;
private String version;
public String getUpload_type() {
return upload_type;
public void setUpload_type(String upload_type) {
this.upload_type = upload_type;
public String getVersion() {
return version;
public void setVersion(String version) {
this.version = version;
public String getAccess_right() {
return access_right;
public void setAccess_right(String access_right) {
this.access_right = access_right;
public List<Community> getCommunities() {
return communities;
public void setCommunities(List<Community> communities) {
this.communities = communities;
public List<Creator> getCreators() {
return creators;
public void setCreators(List<Creator> creators) {
this.creators = creators;
public String getDescription() {
return description;
public void setDescription(String description) {
this.description = description;
public String getDoi() {
return doi;
public void setDoi(String doi) {
this.doi = doi;
public List<Grant> getGrants() {
return grants;
public void setGrants(List<Grant> grants) {
this.grants = grants;
public List<String> getKeywords() {
return keywords;
public void setKeywords(List<String> keywords) {
this.keywords = keywords;
public String getLanguage() {
return language;
public void setLanguage(String language) {
this.language = language;
public String getLicense() {
return license;
public void setLicense(String license) {
this.license = license;
public PrereserveDoi getPrereserve_doi() {
return prereserve_doi;
public void setPrereserve_doi(PrereserveDoi prereserve_doi) {
this.prereserve_doi = prereserve_doi;
public String getPublication_date() {
return publication_date;
public void setPublication_date(String publication_date) {
this.publication_date = publication_date;
public List<String> getReferences() {
return references;
public void setReferences(List<String> references) {
this.references = references;
public List<RelatedIdentifier> getRelated_identifiers() {
return related_identifiers;
public void setRelated_identifiers(List<RelatedIdentifier> related_identifiers) {
this.related_identifiers = related_identifiers;
public String getTitle() {
return title;
public void setTitle(String title) {
this.title = title;

@ -0,0 +1,25 @@
package eu.dnetlib.dhp.common.api.zenodo;
public class PrereserveDoi implements Serializable {
private String doi;
private String recid;
public String getDoi() {
return doi;
public void setDoi(String doi) {
this.doi = doi;
public String getRecid() {
return recid;
public void setRecid(String recid) {
this.recid = recid;

@ -0,0 +1,43 @@
package eu.dnetlib.dhp.common.api.zenodo;
public class RelatedIdentifier implements Serializable {
private String identifier;
private String relation;
private String resource_type;
private String scheme;
public String getIdentifier() {
return identifier;
public void setIdentifier(String identifier) {
this.identifier = identifier;
public String getRelation() {
return relation;
public void setRelation(String relation) {
this.relation = relation;
public String getResource_type() {
return resource_type;
public void setResource_type(String resource_type) {
this.resource_type = resource_type;
public String getScheme() {
return scheme;
public void setScheme(String scheme) {
this.scheme = scheme;

@ -0,0 +1,118 @@
package eu.dnetlib.dhp.common.api.zenodo;
import java.util.List;
public class ZenodoModel implements Serializable {
private String conceptrecid;
private String created;
private List<File> files;
private String id;
private Links links;
private Metadata metadata;
private String modified;
private String owner;
private String record_id;
private String state;
private boolean submitted;
private String title;
public String getConceptrecid() {
return conceptrecid;
public void setConceptrecid(String conceptrecid) {
this.conceptrecid = conceptrecid;
public String getCreated() {
return created;
public void setCreated(String created) {
this.created = created;
public List<File> getFiles() {
return files;
public void setFiles(List<File> files) {
this.files = files;
public String getId() {
return id;
public void setId(String id) {
this.id = id;
public Links getLinks() {
return links;
public void setLinks(Links links) {
this.links = links;
public Metadata getMetadata() {
return metadata;
public void setMetadata(Metadata metadata) {
this.metadata = metadata;
public String getModified() {
return modified;
public void setModified(String modified) {
this.modified = modified;
public String getOwner() {
return owner;
public void setOwner(String owner) {
this.owner = owner;
public String getRecord_id() {
return record_id;
public void setRecord_id(String record_id) {
this.record_id = record_id;
public String getState() {
return state;
public void setState(String state) {
this.state = state;
public boolean isSubmitted() {
return submitted;
public void setSubmitted(boolean submitted) {
this.submitted = submitted;
public String getTitle() {
return title;
public void setTitle(String title) {
this.title = title;

@ -0,0 +1,7 @@
package eu.dnetlib.dhp.common.api.zenodo;
import java.util.ArrayList;
public class ZenodoModelList extends ArrayList<ZenodoModel> {

@ -1,9 +1,6 @@
package eu.dnetlib.dhp.common.collection; package eu.dnetlib.dhp.common.collection;
import java.util.HashMap;
import java.util.Map;
/** /**
* Bundles the http connection parameters driving the client behaviour. * Bundles the http connection parameters driving the client behaviour.
*/ */
@ -16,8 +13,6 @@ public class HttpClientParams {
public static int _connectTimeOut = 10; // seconds public static int _connectTimeOut = 10; // seconds
public static int _readTimeOut = 30; // seconds public static int _readTimeOut = 30; // seconds
public static String _requestMethod = "GET";
/** /**
* Maximum number of allowed retires before failing * Maximum number of allowed retires before failing
*/ */
@ -43,30 +38,17 @@ public class HttpClientParams {
*/ */
private int readTimeOut; private int readTimeOut;
* Custom http headers
private Map<String, String> headers;
* Request method (i.e., GET, POST etc)
private String requestMethod;
public HttpClientParams() { public HttpClientParams() {
this(_maxNumberOfRetry, _requestDelay, _retryDelay, _connectTimeOut, _readTimeOut, new HashMap<>(), this(_maxNumberOfRetry, _requestDelay, _retryDelay, _connectTimeOut, _readTimeOut);
} }
public HttpClientParams(int maxNumberOfRetry, int requestDelay, int retryDelay, int connectTimeOut, public HttpClientParams(int maxNumberOfRetry, int requestDelay, int retryDelay, int connectTimeOut,
int readTimeOut, Map<String, String> headers, String requestMethod) { int readTimeOut) {
this.maxNumberOfRetry = maxNumberOfRetry; this.maxNumberOfRetry = maxNumberOfRetry;
this.requestDelay = requestDelay; this.requestDelay = requestDelay;
this.retryDelay = retryDelay; this.retryDelay = retryDelay;
this.connectTimeOut = connectTimeOut; this.connectTimeOut = connectTimeOut;
this.readTimeOut = readTimeOut; this.readTimeOut = readTimeOut;
this.headers = headers;
this.requestMethod = requestMethod;
} }
public int getMaxNumberOfRetry() { public int getMaxNumberOfRetry() {
@ -109,19 +91,4 @@ public class HttpClientParams {
this.readTimeOut = readTimeOut; this.readTimeOut = readTimeOut;
} }
public Map<String, String> getHeaders() {
return headers;
public void setHeaders(Map<String, String> headers) {
this.headers = headers;
public String getRequestMethod() {
return requestMethod;
public void setRequestMethod(String requestMethod) {
this.requestMethod = requestMethod;
} }

@ -8,13 +8,10 @@ import;
import*; import*;
import java.util.List; import java.util.List;
import java.util.Map; import java.util.Map;
import java.util.concurrent.TimeUnit;
import; import;
import org.apache.commons.lang3.math.NumberUtils; import org.apache.commons.lang3.math.NumberUtils;
import org.apache.commons.lang3.time.DateUtils;
import org.apache.http.HttpHeaders; import org.apache.http.HttpHeaders;
import org.joda.time.Instant;
import org.slf4j.Logger; import org.slf4j.Logger;
import org.slf4j.LoggerFactory; import org.slf4j.LoggerFactory;
@ -97,31 +94,23 @@ public class HttpConnector2 {
throw new CollectorException(msg); throw new CollectorException(msg);
} }
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
InputStream input = null; InputStream input = null;
long start = System.currentTimeMillis();
try { try {
if (getClientParams().getRequestDelay() > 0) { if (getClientParams().getRequestDelay() > 0) {
backoffAndSleep(getClientParams().getRequestDelay()); backoffAndSleep(getClientParams().getRequestDelay());
} }
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
final HttpURLConnection urlConn = (HttpURLConnection) new URL(requestUrl).openConnection(); final HttpURLConnection urlConn = (HttpURLConnection) new URL(requestUrl).openConnection();
urlConn.setInstanceFollowRedirects(false); urlConn.setInstanceFollowRedirects(false);
urlConn.setReadTimeout(getClientParams().getReadTimeOut() * 1000); urlConn.setReadTimeout(getClientParams().getReadTimeOut() * 1000);
urlConn.setConnectTimeout(getClientParams().getConnectTimeOut() * 1000); urlConn.setConnectTimeout(getClientParams().getConnectTimeOut() * 1000);
urlConn.addRequestProperty(HttpHeaders.USER_AGENT, userAgent); urlConn.addRequestProperty(HttpHeaders.USER_AGENT, userAgent);
// if provided, add custom headers
if (!getClientParams().getHeaders().isEmpty()) {
for (Map.Entry<String, String> headerEntry : getClientParams().getHeaders().entrySet()) {
urlConn.addRequestProperty(headerEntry.getKey(), headerEntry.getValue());
if (log.isDebugEnabled()) {
logHeaderFields(urlConn); logHeaderFields(urlConn);
int retryAfter = obtainRetryAfter(urlConn.getHeaderFields()); int retryAfter = obtainRetryAfter(urlConn.getHeaderFields());
String rateLimit = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT); String rateLimit = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT);
@ -136,7 +125,9 @@ public class HttpConnector2 {
} }
if (is2xx(urlConn.getResponseCode())) { if (is2xx(urlConn.getResponseCode())) {
return getInputStream(urlConn, start); input = urlConn.getInputStream();
responseType = urlConn.getContentType();
return input;
} }
if (is3xx(urlConn.getResponseCode())) { if (is3xx(urlConn.getResponseCode())) {
@ -146,7 +137,6 @@ public class HttpConnector2 {
.put( .put(
REPORT_PREFIX + urlConn.getResponseCode(), REPORT_PREFIX + urlConn.getResponseCode(),
String.format("Moved to: %s", newUrl)); String.format("Moved to: %s", newUrl));
urlConn.disconnect(); urlConn.disconnect();
if (retryAfter > 0) { if (retryAfter > 0) {
backoffAndSleep(retryAfter); backoffAndSleep(retryAfter);
@ -162,50 +152,26 @@ public class HttpConnector2 {
if (retryAfter > 0) { if (retryAfter > 0) {
log log
.warn( .warn(
"waiting and repeating request after suggested retry-after {} sec for URL {}", "{} - waiting and repeating request after suggested retry-after {} sec.",
retryAfter, requestUrl); requestUrl, retryAfter);
backoffAndSleep(retryAfter * 1000); backoffAndSleep(retryAfter * 1000);
} else { } else {
log log
.warn( .warn(
"waiting and repeating request after default delay of {} sec for URL {}", "{} - waiting and repeating request after default delay of {} sec.",
getClientParams().getRetryDelay(), requestUrl); requestUrl, getClientParams().getRetryDelay());
backoffAndSleep(retryNumber * getClientParams().getRetryDelay()); backoffAndSleep(retryNumber * getClientParams().getRetryDelay() * 1000);
} }
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl); report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
urlConn.disconnect(); urlConn.disconnect();
return attemptDownload(requestUrl, retryNumber + 1, report); return attemptDownload(requestUrl, retryNumber + 1, report);
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
log.warn("waiting and repeating request after 10 sec for URL {}", requestUrl);
try {
return getInputStream(urlConn, start);
} catch (IOException e) {
"server returned 422 and got IOException accessing the response body from URL {}",
log.error("IOException:", e);
return attemptDownload(requestUrl, retryNumber + 1, report);
default: default:
log.error("gor error {} from URL: {}", urlConn.getResponseCode(), urlConn.getURL());
log.error("response message: {}", urlConn.getResponseMessage());
report report
.put( .put(
REPORT_PREFIX + urlConn.getResponseCode(), REPORT_PREFIX + urlConn.getResponseCode(),
String String
.format( .format(
"%s Error: %s", requestUrl, urlConn.getResponseMessage())); "%s Error: %s", requestUrl, urlConn.getResponseMessage()));
throw new CollectorException(urlConn.getResponseCode() + " error " + report); throw new CollectorException(urlConn.getResponseCode() + " error " + report);
} }
} }
@ -226,27 +192,13 @@ public class HttpConnector2 {
} }
} }
private InputStream getInputStream(HttpURLConnection urlConn, long start) throws IOException {
InputStream input = urlConn.getInputStream();
responseType = urlConn.getContentType();
return input;
private static void logRequestTime(long start) {
"request time elapsed: {}sec",
TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis() - start));
private void logHeaderFields(final HttpURLConnection urlConn) throws IOException { private void logHeaderFields(final HttpURLConnection urlConn) throws IOException {
log.info("Response: {} - {}", urlConn.getResponseCode(), urlConn.getResponseMessage()); log.debug("StatusCode: {}", urlConn.getResponseMessage());
for (Map.Entry<String, List<String>> e : urlConn.getHeaderFields().entrySet()) { for (Map.Entry<String, List<String>> e : urlConn.getHeaderFields().entrySet()) {
if (e.getKey() != null) { if (e.getKey() != null) {
for (String v : e.getValue()) { for (String v : e.getValue()) {
log.info(" key: {} - value: {}", e.getKey(), v); log.debug(" key: {} - value: {}", e.getKey(), v);
} }
} }
} }
@ -266,7 +218,7 @@ public class HttpConnector2 {
for (String key : headerMap.keySet()) { for (String key : headerMap.keySet()) {
if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (!headerMap.get(key).isEmpty()) if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (!headerMap.get(key).isEmpty())
&& NumberUtils.isCreatable(headerMap.get(key).get(0))) { && NumberUtils.isCreatable(headerMap.get(key).get(0))) {
return Integer.parseInt(headerMap.get(key).get(0)); return Integer.parseInt(headerMap.get(key).get(0)) + 10;
} }
} }
return -1; return -1;
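
The hunks above change how `HttpConnector2` reads the `Retry-After` header and how long it sleeps before retrying. The following is a minimal, standalone sketch of that idea, not the project's implementation: the class name `RetryAfterSketch`, the helper `delayMillis`, and the choice to convert seconds to milliseconds are assumptions made for illustration. A `Retry-After` value given as an HTTP date is treated as absent here.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and units are assumptions, not the dnet-hadoop API.
class RetryAfterSketch {

	// Stand-in for org.apache.http.HttpHeaders.RETRY_AFTER
	static final String RETRY_AFTER = "Retry-After";

	/** Returns the Retry-After value in seconds, or -1 when absent or not a plain integer. */
	static int obtainRetryAfter(final Map<String, List<String>> headerMap) {
		for (final Map.Entry<String, List<String>> e : headerMap.entrySet()) {
			final String key = e.getKey(); // HttpURLConnection stores the status line under a null key
			final List<String> values = e.getValue();
			if (key != null && key.equalsIgnoreCase(RETRY_AFTER) && values != null && !values.isEmpty()) {
				final String first = values.get(0);
				// at most 9 digits, so Integer.parseInt cannot overflow
				if (first != null && first.trim().matches("\\d{1,9}")) {
					return Integer.parseInt(first.trim());
				}
			}
		}
		return -1;
	}

	/** Prefers the server hint; otherwise backs off linearly with the attempt number. */
	static long delayMillis(final int retryAfterSec, final int retryNumber, final int retryDelaySec) {
		if (retryAfterSec > 0) {
			return retryAfterSec * 1000L;
		}
		return (long) retryNumber * retryDelaySec * 1000L;
	}
}
```

Keeping the unit conversion in one place avoids the mismatch visible in the diff, where only some call sites multiply the delay by 1000.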

@ -63,10 +63,7 @@ public class Vocabulary implements Serializable {
} }
public VocabularyTerm getTermBySynonym(final String syn) { public VocabularyTerm getTermBySynonym(final String syn) {
return Optional return getTerm(synonyms.get(syn.toLowerCase()));
.map(s -> getTerm(synonyms.get(s.toLowerCase())))
} }
public Qualifier getTermAsQualifier(final String termId) { public Qualifier getTermAsQualifier(final String termId) {
@ -76,11 +73,11 @@ public class Vocabulary implements Serializable {
public Qualifier getTermAsQualifier(final String termId, boolean strict) { public Qualifier getTermAsQualifier(final String termId, boolean strict) {
final VocabularyTerm term = getTerm(termId); final VocabularyTerm term = getTerm(termId);
if (Objects.nonNull(term)) { if (Objects.nonNull(term)) {
return OafMapperUtils.qualifier(term.getId(), term.getName(), getId(), getName()); return OafMapperUtils.qualifier(term.getId(), term.getName(), getId());
} else if (Objects.isNull(term) && strict) { } else if (Objects.isNull(term) && strict) {
return OafMapperUtils.unknown(getId(), getName()); return OafMapperUtils.unknown(getId());
} else { } else {
return OafMapperUtils.qualifier(termId, termId, getId(), getName()); return OafMapperUtils.qualifier(termId, termId, getId());
} }
} }
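
One side of the `getTermBySynonym` hunk above wraps the lookup in `Optional`, while the other calls `syn.toLowerCase()` directly and would throw on a `null` synonym. A minimal sketch of the null-safe variant follows; the class `SynonymLookupSketch`, its two maps, and `addTerm` are hypothetical simplifications of `Vocabulary`, used only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical simplification of Vocabulary; terms are reduced to plain names.
class SynonymLookupSketch {

	private final Map<String, String> synonyms = new HashMap<>(); // lower-cased synonym -> term id
	private final Map<String, String> terms = new HashMap<>(); // term id -> term name

	void addTerm(final String id, final String name, final String... syns) {
		terms.put(id, name);
		for (final String s : syns) {
			synonyms.put(s.toLowerCase(), id);
		}
	}

	/** Returns the term name for a synonym, or null when the synonym is null or unknown. */
	String getTermBySynonym(final String syn) {
		return Optional
			.ofNullable(syn)
			.map(s -> synonyms.get(s.toLowerCase()))
			.map(terms::get)
			.orElse(null);
	}
}
```

Whether callers can actually pass `null` depends on code outside this diff, so the guard may or may not be needed in practice.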

@ -125,34 +125,16 @@ public class VocabularyGroup implements Serializable {
if (vocabularyExists(vocId)) { if (vocabularyExists(vocId)) {
return vocs.get(vocId.toLowerCase()).getTermAsQualifier(id); return vocs.get(vocId.toLowerCase()).getTermAsQualifier(id);
} }
return OafMapperUtils.qualifier(id, id, "", ""); return OafMapperUtils.qualifier(id, id, "");
} }
public Qualifier getSynonymAsQualifier(final String vocId, final String syn) { public Qualifier getSynonymAsQualifier(final String vocId, final String syn) {
if (StringUtils.isBlank(vocId)) { if (StringUtils.isBlank(vocId)) {
return OafMapperUtils.unknown("", ""); return OafMapperUtils.unknown("");
} }
return vocs.get(vocId.toLowerCase()).getSynonymAsQualifier(syn); return vocs.get(vocId.toLowerCase()).getSynonymAsQualifier(syn);
} }
public Qualifier lookupTermBySynonym(final String vocId, final String syn) {
return find(vocId)
vocabulary -> Optional
term -> OafMapperUtils
.qualifier(term.getId(), term.getName(), vocabulary.getId(), vocabulary.getName()))
term -> OafMapperUtils
.qualifier(term.getId(), term.getName(), vocabulary.getId(), vocabulary.getName()))
/** /**
* getSynonymAsQualifierCaseSensitive * getSynonymAsQualifierCaseSensitive
* *
@ -160,7 +142,7 @@ public class VocabularyGroup implements Serializable {
*/ */
public Qualifier getSynonymAsQualifierCaseSensitive(final String vocId, final String syn) { public Qualifier getSynonymAsQualifierCaseSensitive(final String vocId, final String syn) {
if (StringUtils.isBlank(vocId)) { if (StringUtils.isBlank(vocId)) {
return OafMapperUtils.unknown("", ""); return OafMapperUtils.unknown("");
} }
return vocs.get(vocId).getSynonymAsQualifier(syn); return vocs.get(vocId).getSynonymAsQualifier(syn);
} }

@ -10,6 +10,7 @@ import org.apache.commons.lang3.StringUtils;
import; import;
import eu.dnetlib.dhp.schema.oaf.Author; import eu.dnetlib.dhp.schema.oaf.Author;
import eu.dnetlib.dhp.schema.oaf.AuthorPid;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty; import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.pace.model.Person; import eu.dnetlib.pace.model.Person;
import scala.Tuple2; import scala.Tuple2;
@ -75,7 +76,7 @@ public class AuthorMerger {
.collect(Collectors.toMap(Tuple2::_1, Tuple2::_2, (x1, x2) -> x1)); .collect(Collectors.toMap(Tuple2::_1, Tuple2::_2, (x1, x2) -> x1));
// <pid, Author> (list of pid that are missing in the other list) // <pid, Author> (list of pid that are missing in the other list)
final List<Tuple2<StructuredProperty, Author>> pidToEnrich = enrich final List<Tuple2<AuthorPid, Author>> pidToEnrich = enrich
.stream() .stream()
.filter(a -> a.getPid() != null && !a.getPid().isEmpty()) .filter(a -> a.getPid() != null && !a.getPid().isEmpty())
.flatMap( .flatMap(
@ -111,7 +112,7 @@ public class AuthorMerger {
// TERRIBLE HACK but for some reason when we create and Array with Arrays.asList, // TERRIBLE HACK but for some reason when we create and Array with Arrays.asList,
// it creates of fixed size, and the add method raise UnsupportedOperationException at // it creates of fixed size, and the add method raise UnsupportedOperationException at
// java.util.AbstractList.add // java.util.AbstractList.add
final List<StructuredProperty> tmp = new ArrayList<>(r.getPid()); final List<AuthorPid> tmp = new ArrayList<>(r.getPid());
tmp.add(a._1()); tmp.add(a._1());
r.setPid(tmp); r.setPid(tmp);
} }
@ -119,136 +120,13 @@ public class AuthorMerger {
}); });
} }
public static String normalizeFullName(final String fullname) {
return nfd(fullname)
// do not compact the regexes in a single expression, would cause StackOverflowError
// in case
// of large input strings
.replaceAll("(\\W)+", " ")
.replaceAll("(\\p{InCombiningDiacriticalMarks})+", " ")
.replaceAll("(\\p{Punct})+", " ")
.replaceAll("(\\d)+", " ")
.replaceAll("(\\n)+", " ")
private static String authorFieldToBeCompared(Author author) {
if (StringUtils.isNotBlank(author.getSurname())) {
return author.getSurname();
if (StringUtils.isNotBlank(author.getFullname())) {
return author.getFullname();
return null;
* This method tries to figure out when two author are the same in the contest
* of ORCID enrichment
* @param left Author in the OAF entity
* @param right Author ORCID
* @return based on a heuristic on the names of the authors if they are the same.
public static boolean checkORCIDSimilarity(final Author left, final Author right) {
final Person pl = parse(left);
final Person pr = parse(right);
// If one of them didn't have a surname we verify if they have the fullName not empty
// and verify if the normalized version is equal
if (!(pl.getSurname() != null && pl.getSurname().stream().anyMatch(StringUtils::isNotBlank) &&
pr.getSurname() != null && pr.getSurname().stream().anyMatch(StringUtils::isNotBlank))) {
if (pl.getFullname() != null && !pl.getFullname().isEmpty() && pr.getFullname() != null
&& !pr.getFullname().isEmpty()) {
return pl
fl -> pr.getFullname().stream().anyMatch(fr -> normalize(fl).equalsIgnoreCase(normalize(fr))));
} else {
return false;
// The Authors have one surname in common
if (pl.getSurname().stream().anyMatch(sl -> pr.getSurname().stream().anyMatch(sr -> sr.equalsIgnoreCase(sl)))) {
// If one of them has only a surname and is the same we can say that they are the same author
if ((pl.getName() == null || pl.getName().stream().allMatch(StringUtils::isBlank)) ||
(pr.getName() == null || pr.getName().stream().allMatch(StringUtils::isBlank)))
return true;
// The authors have the same initials of Name in common
if (pl
nl -> pr
.anyMatch(nr -> nr.equalsIgnoreCase(nl))))
return true;
// Sometimes we noticed that publication have author wrote in inverse order Surname, Name
// We verify if we have an exact match between name and surname
if (pl.getSurname().stream().anyMatch(sl -> pr.getName().stream().anyMatch(nr -> nr.equalsIgnoreCase(sl))) &&
pl.getName().stream().anyMatch(nl -> pr.getSurname().stream().anyMatch(sr -> sr.equalsIgnoreCase(nl))))
return true;
return false;
* Method to enrich ORCID information in one list of authors based on another list
* @param baseAuthor the Author List in the OAF Entity
* @param orcidAuthor The list of ORCID Author intersected
* @return The Author List of the OAF Entity enriched with the orcid Author
public static List<Author> enrichOrcid(List<Author> baseAuthor, List<Author> orcidAuthor) {
if (baseAuthor == null || baseAuthor.isEmpty())
return orcidAuthor;
if (orcidAuthor == null || orcidAuthor.isEmpty())
return baseAuthor;
if (baseAuthor.size() == 1 && orcidAuthor.size() > 10)
return baseAuthor;
final List<Author> oAuthor = new ArrayList<>();
baseAuthor.forEach(ba -> {
Optional<Author> aMatch = orcidAuthor.stream().filter(oa -> checkORCIDSimilarity(ba, oa)).findFirst();
if (aMatch.isPresent()) {
final Author sameAuthor = aMatch.get();
addPid(ba, sameAuthor.getPid());
return baseAuthor;
private static void addPid(final Author a, final List<StructuredProperty> pids) {
if (a.getPid() == null) {
a.setPid(new ArrayList<>());
public static String pidToComparableString(StructuredProperty pid) { public static String pidToComparableString(StructuredProperty pid) {
final String classid = pid.getQualifier().getClassid() != null ? pid.getQualifier().getClassid().toLowerCase() return pid.toComparableString();
: ""; /*
return (pid.getQualifier() != null ? classid : "") * final String classid = pid.getQualifier().getClassid() != null ?
+ (pid.getValue() != null ? pid.getValue().toLowerCase() : ""); * pid.getQualifier().getClassid().toLowerCase() : ""; return (pid.getQualifier() != null ? classid : "") +
* (pid.getValue() != null ? pid.getValue().toLowerCase() : "");
} }
public static int countAuthorsPids(List<Author> authors) { public static int countAuthorsPids(List<Author> authors) {
@ -296,7 +174,7 @@ public class AuthorMerger {
} }
} }
public static String normalize(final String s) { private static String normalize(final String s) {
String[] normalized = nfd(s) String[] normalized = nfd(s)
.toLowerCase() .toLowerCase()
// do not compact the regexes in a single expression, would cause StackOverflowError // do not compact the regexes in a single expression, would cause StackOverflowError

@ -0,0 +1,97 @@
package eu.dnetlib.dhp.oa.merge;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import java.util.Objects;
import java.util.Optional;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.schema.oaf.Entity;
import eu.dnetlib.dhp.schema.oaf.Oaf;
public class DispatchEntitiesSparkJob {
private static final Logger log = LoggerFactory.getLogger(DispatchEntitiesSparkJob.class);
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
public static void main(String[] args) throws Exception {
String jsonConfiguration = IOUtils
final ArgumentApplicationParser parser = new ArgumentApplicationParser(jsonConfiguration);
Boolean isSparkSessionManaged = Optional
.orElse(Boolean.TRUE);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
String inputPath = parser.get("inputPath");
log.info("inputPath: {}", inputPath);
String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath);
String graphTableClassName = parser.get("graphTableClassName");
log.info("graphTableClassName: {}", graphTableClassName);
Class<? extends Entity> entityClazz = (Class<? extends Entity>) Class.forName(graphTableClassName);
SparkConf conf = new SparkConf();
spark -> {
HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration());
dispatchEntities(spark, inputPath, entityClazz, outputPath);
private static <T extends Oaf> void dispatchEntities(
SparkSession spark,
String inputPath,
Class<T> clazz,
String outputPath) {
.filter((FilterFunction<String>) s -> isEntityType(s, clazz))
.map((MapFunction<String, String>) s -> StringUtils.substringAfter(s, "|"), Encoders.STRING())
(MapFunction<String, T>) value -> OBJECT_MAPPER.readValue(value, clazz),
.option("compression", "gzip")
private static <T extends Oaf> boolean isEntityType(final String s, final Class<T> clazz) {
return StringUtils.substringBefore(s, "|").equals(clazz.getName());
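
`DispatchEntitiesSparkJob` above appears to read text lines of the form `<className>|<json>`, keep those whose prefix matches the requested class, and strip the prefix before deserializing. The sketch below reproduces only that string handling on a plain list, without Spark or Jackson; `DispatchSketch` and its method names are hypothetical, and the `indexOf` logic is an assumption meant to mirror `StringUtils.substringBefore` and `substringAfter`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the prefix-based dispatch, using only the JDK.
class DispatchSketch {

	/** True when the text before the first '|' equals the class name (whole line if no '|'). */
	static boolean isEntityType(final String line, final String className) {
		final int sep = line.indexOf('|');
		final String prefix = sep >= 0 ? line.substring(0, sep) : line;
		return prefix.equals(className);
	}

	/** Text after the first '|', or the empty string when there is no separator. */
	static String payload(final String line) {
		final int sep = line.indexOf('|');
		return sep >= 0 ? line.substring(sep + 1) : "";
	}

	/** Keeps the JSON payloads of the lines tagged with the given class name. */
	static List<String> dispatch(final List<String> lines, final String className) {
		return lines
			.stream()
			.filter(l -> isEntityType(l, className))
			.map(DispatchSketch::payload)
			.collect(Collectors.toList());
	}
}
```

Splitting on the first separator only matters because the JSON payload may itself contain a `|` character.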

@ -2,49 +2,50 @@
package eu.dnetlib.dhp.oa.merge; package eu.dnetlib.dhp.oa.merge;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession; import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkSession;
import static org.apache.spark.sql.functions.col; import static eu.dnetlib.dhp.utils.DHPUtils.toSeq;
import static org.apache.spark.sql.functions.when;
import java.util.Map; import;
import java.util.List;
import java.util.Objects;
import java.util.Optional; import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import; import;
import; import;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf; import org.apache.spark.SparkConf;
import; import;
import org.apache.spark.sql.*; import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Aggregator;
import org.slf4j.Logger; import org.slf4j.Logger;
import org.slf4j.LoggerFactory; import org.slf4j.LoggerFactory;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.jayway.jsonpath.Configuration;
import com.jayway.jsonpath.DocumentContext;
import com.jayway.jsonpath.JsonPath;
import com.jayway.jsonpath.Option;
import eu.dnetlib.dhp.application.ArgumentApplicationParser; import eu.dnetlib.dhp.application.ArgumentApplicationParser;
import eu.dnetlib.dhp.common.HdfsSupport; import eu.dnetlib.dhp.common.HdfsSupport;
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup; import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.common.EntityType; import eu.dnetlib.dhp.schema.oaf.common.ModelSupport;
import eu.dnetlib.dhp.schema.common.ModelSupport; import eu.dnetlib.dhp.schema.oaf.utils.MergeUtils;
import eu.dnetlib.dhp.schema.oaf.OafEntity;
import eu.dnetlib.dhp.schema.oaf.utils.GraphCleaningFunctions;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
import eu.dnetlib.dhp.utils.ISLookupClientFactory;
import scala.Tuple2; import scala.Tuple2;
* Groups the graph content by entity identifier to ensure ID uniqueness * Groups the graph content by entity identifier to ensure ID uniqueness
*/ */
public class GroupEntitiesSparkJob { public class GroupEntitiesSparkJob {
private static final Logger log = LoggerFactory.getLogger(GroupEntitiesSparkJob.class); private static final Logger log = LoggerFactory.getLogger(GroupEntitiesSparkJob.class);
private static final Encoder<OafEntity> OAFENTITY_KRYO_ENC = Encoders.kryo(OafEntity.class); private static final String ID_JPATH = "$.id";
private ArgumentApplicationParser parser; private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
public GroupEntitiesSparkJob(ArgumentApplicationParser parser) {
this.parser = parser;
public static void main(String[] args) throws Exception { public static void main(String[] args) throws Exception {
@ -62,133 +63,141 @@ public class GroupEntitiesSparkJob {
.orElse(Boolean.TRUE); .orElse(Boolean.TRUE);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged); log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
final String isLookupUrl = parser.get("isLookupUrl");
log.info("isLookupUrl: {}", isLookupUrl);
final ISLookUpService isLookupService = ISLookupClientFactory.getLookUpService(isLookupUrl);
new GroupEntitiesSparkJob(parser).run(isSparkSessionManaged, isLookupService);
public void run(Boolean isSparkSessionManaged, ISLookUpService isLookUpService)
throws ISLookUpException {
String graphInputPath = parser.get("graphInputPath"); String graphInputPath = parser.get("graphInputPath");
log.info("graphInputPath: {}", graphInputPath); log.info("graphInputPath: {}", graphInputPath);
String checkpointPath = parser.get("checkpointPath");
log.info("checkpointPath: {}", checkpointPath);
String outputPath = parser.get("outputPath"); String outputPath = parser.get("outputPath");
log.info("outputPath: {}", outputPath); log.info("outputPath: {}", outputPath);
boolean filterInvisible = Boolean.parseBoolean(parser.get("filterInvisible"));
log.info("filterInvisible: {}", filterInvisible);
SparkConf conf = new SparkConf(); SparkConf conf = new SparkConf();
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.registerKryoClasses(ModelSupport.getOafModelClasses()); conf.registerKryoClasses(ModelSupport.getOafModelClasses());
final VocabularyGroup vocs = VocabularyGroup.loadVocsFromIS(isLookUpService);
runWithSparkSession( runWithSparkSession(
conf, conf,
isSparkSessionManaged, isSparkSessionManaged,
spark -> { spark -> {
HdfsSupport.remove(checkpointPath, spark.sparkContext().hadoopConfiguration()); HdfsSupport.remove(outputPath, spark.sparkContext().hadoopConfiguration());
groupEntities(spark, graphInputPath, checkpointPath, outputPath, filterInvisible, vocs); groupEntities(spark, graphInputPath, outputPath);
}); });
} }
private static void groupEntities( private static void groupEntities(
SparkSession spark, SparkSession spark,
String inputPath, String inputPath,
String checkpointPath, String outputPath) {
String outputPath,
boolean filterInvisible, VocabularyGroup vocs) {
Dataset<OafEntity> allEntities = spark.emptyDataset(OAFENTITY_KRYO_ENC);
for (Map.Entry<EntityType, Class> e : ModelSupport.entityTypes.entrySet()) {
String entity = e.getKey().name();
Class<? extends OafEntity> entityClass = e.getValue();
String entityInputPath = inputPath + "/" + entity;
if (!HdfsSupport.exists(entityInputPath, spark.sparkContext().hadoopConfiguration())) {
allEntities = allEntities
((Dataset<OafEntity>) spark
.filter("length(id) > 0")
.map((MapFunction<OafEntity, OafEntity>) r -> r, OAFENTITY_KRYO_ENC));
Dataset<?> groupedEntities = allEntities
(MapFunction<OafEntity, OafEntity>) entity -> GraphCleaningFunctions
.applyCoarVocabularies(entity, vocs),
.groupByKey((MapFunction<OafEntity, String>) OafEntity::getId, Encoders.STRING())
.reduceGroups((ReduceFunction<OafEntity>) OafMapperUtils::mergeEntities)
(MapFunction<Tuple2<String, OafEntity>, Tuple2<String, OafEntity>>) t -> new Tuple2<>(
t._2().getClass().getName(), t._2()),
Encoders.tuple(Encoders.STRING(), OAFENTITY_KRYO_ENC));
// pivot on "_1" (classname of the entity)
// created columns containing only entities of the same class
for (Map.Entry<EntityType, Class> e : ModelSupport.entityTypes.entrySet()) {
String entity = e.getKey().name();
Class<? extends OafEntity> entityClass = e.getValue();
groupedEntities = groupedEntities
when(col("_1").equalTo(entityClass.getName()), col("_2")));
.drop("_1", "_2")
.option("compression", "gzip")
ForkJoinPool parPool = new ForkJoinPool(ModelSupport.entityTypes.size());
.map(e -> parPool.submit(() -> {
String entity = e.getKey().name();
Class<? extends OafEntity> entityClass = e.getValue();
final TypedColumn<Entity, Entity> aggregator = new GroupingAggregator().toColumn();
final JavaSparkContext sc = JavaSparkContext.fromSparkContext(spark.sparkContext());
spark spark
.read() .read()
.load(checkpointPath) .textFile(toSeq(listEntityPaths(inputPath, sc)))
.select(col(entity).as("value")) .map((MapFunction<String, Entity>) GroupEntitiesSparkJob::parseOaf, Encoders.kryo(Entity.class))
.filter("value IS NOT NULL") .filter((FilterFunction<Entity>) e -> StringUtils.isNotBlank(ModelSupport.idFn().apply(e)))
.as(OAFENTITY_KRYO_ENC) .groupByKey((MapFunction<Entity, String>) oaf -> ModelSupport.idFn().apply(oaf), Encoders.STRING())
.map((MapFunction<OafEntity, OafEntity>) r -> r, (Encoder<OafEntity>) Encoders.bean(entityClass)) .agg(aggregator)
.filter(filterInvisible ? "dataInfo.invisible != TRUE" : "TRUE") .map(
(MapFunction<Tuple2<String, Entity>, String>) t -> t._2().getClass().getName() +
"|" + OBJECT_MAPPER.writeValueAsString(t._2()),
.write() .write()
.option("compression", "gzip") .option("compression", "gzip")
.json(outputPath + "/" + entity); .mode(SaveMode.Overwrite)
})) .text(outputPath);
.collect(Collectors.toList()) }
.forEach(t -> {
public static class GroupingAggregator extends Aggregator<Entity, Entity, Entity> {
public Entity zero() {
return null;
public Entity reduce(Entity b, Entity a) {
return mergeAndGet(b, a);
private Entity mergeAndGet(Entity b, Entity a) {
if (Objects.nonNull(a) && Objects.nonNull(b)) {
return MergeUtils.merge(b, a, true);
return Objects.isNull(a) ? b : a;
public Entity merge(Entity b, Entity a) {
return mergeAndGet(b, a);
public Entity finish(Entity j) {
return j;
public Encoder<Entity> bufferEncoder() {
return Encoders.kryo(Entity.class);
public Encoder<Entity> outputEncoder() {
return Encoders.kryo(Entity.class);
private static Entity parseOaf(String s) {
DocumentContext dc = JsonPath
.parse(s, Configuration.defaultConfiguration().addOptions(Option.SUPPRESS_EXCEPTIONS));
final String id = dc.read(ID_JPATH);
if (StringUtils.isNotBlank(id)) {
String prefix = StringUtils.substringBefore(id, "|");
switch (prefix) {
case "10":
return parse(s, Datasource.class);
case "20":
return parse(s, Organization.class);
case "40":
return parse(s, Project.class);
case "50":
String resultType = dc.read("$.resulttype.classid");
switch (resultType) {
case "publication":
return parse(s, Publication.class);
case "dataset":
return parse(s, eu.dnetlib.dhp.schema.oaf.Dataset.class);
case "software":
return parse(s, Software.class);
case "other":
return parse(s, OtherResearchProduct.class);
throw new IllegalArgumentException(String.format("invalid resultType: '%s'", resultType));
throw new IllegalArgumentException(String.format("invalid id prefix: '%s'", prefix));
} else {
throw new IllegalArgumentException(String.format("invalid oaf: '%s'", s));
private static <T extends Entity> Entity parse(String s, Class<T> clazz) {
try { try {
t.get(); return OBJECT_MAPPER.readValue(s, clazz);
} catch (InterruptedException | ExecutionException e) { } catch (IOException e) {
throw new RuntimeException(e); throw new IllegalArgumentException(e);
} }
} }
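The dispatch in parseOaf above can be read as a two-level lookup: the prefix before `|` selects the entity family, and for prefix `50` the `resulttype.classid` value selects the concrete result class. The following is a minimal, self-contained sketch of that lookup, not the job's actual code. `OafTypeResolver` and `resolve` are hypothetical names, the inputs are assumed to be already extracted from the JSON, and simple class names are returned as strings instead of deserialized objects. The sketch also rejects a missing result type explicitly, because switching on a null String throws a NullPointerException in Java.

```java
final class OafTypeResolver {

    private OafTypeResolver() {
    }

    // Hypothetical helper: maps an identifier (and, for results, the result
    // type) to the simple name of the model class that parseOaf would pick.
    static String resolve(String id, String resultType) {
        if (id == null || id.trim().isEmpty()) {
            throw new IllegalArgumentException("invalid oaf: blank id");
        }
        // Same semantics as StringUtils.substringBefore(id, "|"):
        // the whole string is used when no separator is present.
        int sep = id.indexOf('|');
        String prefix = sep < 0 ? id : id.substring(0, sep);
        switch (prefix) {
            case "10":
                return "Datasource";
            case "20":
                return "Organization";
            case "40":
                return "Project";
            case "50":
                return resolveResult(resultType);
            default:
                throw new IllegalArgumentException(String.format("invalid id prefix: '%s'", prefix));
        }
    }

    private static String resolveResult(String resultType) {
        if (resultType == null) {
            // Guard added in this sketch only: switch on a null String would throw an NPE.
            throw new IllegalArgumentException("invalid resultType: null");
        }
        switch (resultType) {
            case "publication":
                return "Publication";
            case "dataset":
                return "Dataset";
            case "software":
                return "Software";
            case "other":
                return "OtherResearchProduct";
            default:
                throw new IllegalArgumentException(String.format("invalid resultType: '%s'", resultType));
        }
    }
}
```

For instance, `OafTypeResolver.resolve("50|x", "software")` selects `Software`, while an unknown prefix such as `30` is rejected with an IllegalArgumentException.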
private static List<String> listEntityPaths(String inputPath, JavaSparkContext sc) {
return HdfsSupport
.listFiles(inputPath, sc.hadoopConfiguration())
.filter(f -> !f.toLowerCase().contains("relation"))

@ -1,77 +0,0 @@
package eu.dnetlib.dhp.oozie;
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkHiveSession;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import org.apache.commons.lang3.time.DurationFormatUtils;
import org.apache.commons.text.StringSubstitutor;
import org.apache.spark.SparkConf;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
public class RunSQLSparkJob {
private static final Logger log = LoggerFactory.getLogger(RunSQLSparkJob.class);
private final ArgumentApplicationParser parser;
public RunSQLSparkJob(ArgumentApplicationParser parser) {
this.parser = parser;
public static void main(String[] args) throws Exception {
Map<String, String> params = new HashMap<>();
for (int i = 0; i < args.length - 1; i++) {
if (args[i].startsWith("--")) {
params.put(args[i].substring(2), args[++i]);
* String jsonConfiguration = IOUtils .toString( Objects .requireNonNull( RunSQLSparkJob.class
* .getResourceAsStream( "/eu/dnetlib/dhp/oozie/run_sql_parameters.json"))); final ArgumentApplicationParser
* parser = new ArgumentApplicationParser(jsonConfiguration); parser.parseArgument(args);
Boolean isSparkSessionManaged = Optional
.orElse(Boolean.TRUE);
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
URL url ="sql"));
String raw_sql = Resources.toString(url, StandardCharsets.UTF_8);
String sql = StringSubstitutor.replace(raw_sql, params);
log.info("sql: {}", sql);
SparkConf conf = new SparkConf();
conf.set("hive.metastore.uris", params.get("hiveMetastoreUris"));
spark -> {
for (String statement : sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*")) {
log.info("executing: {}", statement);
long startTime = System.currentTimeMillis();
"executed in {}",
DurationFormatUtils.formatDuration(System.currentTimeMillis() - startTime, "HH:mm:ss.S"));
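The removed RunSQLSparkJob relies on two small conventions that are easy to miss in the diff: arguments arrive as `--key value` pairs, and statements in the SQL script are separated by a semicolon followed by an `/* EOS */` comment rather than by a bare semicolon. The sketch below reproduces only those two steps using the standard library. `SqlScriptArgs`, `parseArgs` and `splitStatements` are hypothetical names chosen for illustration; the loop and the regular expression are copied from the lines above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SqlScriptArgs {

    private SqlScriptArgs() {
    }

    // Collects "--key value" pairs; a trailing "--key" without a value is ignored,
    // because the loop stops one element before the end of the array.
    static Map<String, String> parseArgs(String[] args) {
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].startsWith("--")) {
                params.put(args[i].substring(2), args[++i]);
            }
        }
        return params;
    }

    // Splits a script on ";" followed by an "/* EOS */" marker (whitespace tolerant).
    // A semicolon that is not followed by the marker stays inside its statement.
    static List<String> splitStatements(String sql) {
        return Arrays.asList(sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*"));
    }
}
```

One consequence of this convention is that a semicolon inside a string literal does not end a statement unless the marker follows it.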

@ -0,0 +1,172 @@
package eu.dnetlib.dhp.schema.common;
import eu.dnetlib.dhp.schema.oaf.AccessRight;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Qualifier;
import eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils;
public class ModelConstants {
private ModelConstants() {
public static final String DOI = "doi";
public static final String ORCID = "orcid";
public static final String ORCID_PENDING = "orcid_pending";
public static final String ORCID_CLASSNAME = "Open Researcher and Contributor ID";
public static final String ORCID_DS = ORCID.toUpperCase();
public static final String CROSSREF_ID = "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2";
public static final String CROSSREF_NAME = "Crossref";
public static final String DATACITE_ID = "10|openaire____::9e3be59865b2c1c335d32dae2fe7b254";
public static final String ZENODO_OD_ID = "10|opendoar____::358aee4cc897452c00244351e4d91f69";
public static final String ZENODO_R3_ID = "10|re3data_____::7b0ad08687b2c960d5aeef06f811d5e6";
public static final String EUROPE_PUBMED_CENTRAL_ID = "10|opendoar____::8b6dd7db9af49e67306feb59a8bdc52c";
public static final String PUBMED_CENTRAL_ID = "10|opendoar____::eda80a3d5b344bc40f3bc04f65b7a357";
public static final String ARXIV_ID = "10|opendoar____::6f4922f45568161a8cdf4ad2299f6d23";
public static final String ROHUB_ID = "10|fairsharing_::1b69ebedb522700034547abc5652ffac";
public static final String OPENORGS_NAME = "OpenOrgs Database";
public static final String OPENOCITATIONS_NAME = "OpenCitations";
public static final String OPENOCITATIONS_ID = "10|openaire____::c06df618c5de1c786535ccf3f8b7b059";
public static final String OPEN_APC_NAME = "OpenAPC Global Initiative";
public static final String OPEN_APC_ID = "10|apc_________::e2b1600b229fc30663c8a1f662debddf";
public static final String ACCESS_RIGHT_OPEN = "OPEN";
public static final String ACCESS_RIGHT_EMBARGO = "EMBARGO";
public static final String ACCESS_RIGHT_CLOSED = "CLOSED";
public static final String DNET_SUBJECT_KEYWORD = "keyword";
public static final String DNET_SUBJECT_FOS_CLASSID = "FOS";
public static final String DNET_SUBJECT_FOS_CLASSNAME = "Fields of Science and Technology classification";
public static final String DNET_SUBJECT_TYPOLOGIES = "dnet:subject_classification_typologies";
public static final String DNET_RESULT_TYPOLOGIES = "dnet:result_typologies";
public static final String DNET_PUBLICATION_RESOURCE = "dnet:publication_resource";
public static final String DNET_ACCESS_MODES = "dnet:access_modes";
public static final String DNET_LANGUAGES = "dnet:languages";
public static final String DNET_PID_TYPES = "dnet:pid_types";
public static final String DNET_DATACITE_DATE = "dnet:dataCite_date";
public static final String DNET_DATACITE_TITLE = "dnet:dataCite_title";
public static final String DNET_DATA_CITE_RESOURCE = "dnet:dataCite_resource";
public static final String DNET_PROVENANCE_ACTIONS = "dnet:provenanceActions";
public static final String DNET_COUNTRY_TYPE = "dnet:countries";
public static final String DNET_REVIEW_LEVELS = "dnet:review_levels";
public static final String DNET_PROGRAMMING_LANGUAGES = "dnet:programming_languages";
public static final String DNET_EXTERNAL_REFERENCE_TYPE = "dnet:externalReference_typologies";
public static final String DNET_RELATION_RELTYPE = "dnet:relation_relType";
public static final String DNET_RELATION_SUBRELTYPE = "dnet:relation_subRelType";
public static final String DNET_RELATION_RELCLASS = "dnet:relation_relClass";
public static final String PEER_REVIEWED_CLASSNAME = "peerReviewed";
public static final String NON_PEER_REVIEWED_CLASSNAME = "nonPeerReviewed";
public static final String PEER_REVIEWED_CLASSID = "0001";
public static final String NON_PEER_REVIEWED_CLASSID = "0002";
public static final String SYSIMPORT_CROSSWALK_REPOSITORY = "sysimport:crosswalk:repository";
public static final String SYSIMPORT_CROSSWALK_ENTITYREGISTRY = "sysimport:crosswalk:entityregistry";
public static final String SYSIMPORT_ACTIONSET = "sysimport:actionset";
public static final String SYSIMPORT_ORCID_NO_DOI = "sysimport:actionset:orcidworks-no-doi";
public static final String USER_CLAIM = "user:claim";
public static final String HARVESTED = "Harvested";
public static final String PROVENANCE_DEDUP = "sysimport:dedup";
public static final String PROVENANCE_ENRICH = "sysimport:enrich";
public static final Qualifier PROVENANCE_ACTION_SET_QUALIFIER = qualifier(
public static final String UNKNOWN = "UNKNOWN";
public static final String NOT_AVAILABLE = "not available";
public static final Qualifier REPOSITORY_PROVENANCE_ACTIONS = qualifier(
public static final Qualifier ENTITYREGISTRY_PROVENANCE_ACTION = qualifier(
public static final String UNKNOWN_REPOSITORY_ORIGINALID = "openaire____::1256f046-bf1f-4afc-8b47-d0b147148b18";
public static final KeyValue UNKNOWN_REPOSITORY = keyValue(
"10|openaire____::55045bd2a65019fd8e6741a755395c8c", "Unknown Repository");
public static final Qualifier UNKNOWN_COUNTRY = qualifier(UNKNOWN, "Unknown", DNET_COUNTRY_TYPE);
public static final Qualifier MAIN_TITLE_QUALIFIER = qualifier(
"main title", "main title", DNET_DATACITE_TITLE);
public static final Qualifier ALTERNATIVE_TITLE_QUALIFIER = qualifier(
"alternative title", "alternative title", DNET_DATACITE_TITLE);
public static final Qualifier SUBTITLE_QUALIFIER = qualifier("subtitle", "subtitle", DNET_DATACITE_TITLE);
public static final AccessRight OPEN_ACCESS_RIGHT() {
final AccessRight result = new AccessRight();
return result;
public static final AccessRight RESTRICTED_ACCESS_RIGHT() {
final AccessRight result = new AccessRight();
return result;
public static final AccessRight UNKNOWN_ACCESS_RIGHT() {
return OafMapperUtils
public static final AccessRight EMBARGOED_ACCESS_RIGHT() {
return OafMapperUtils
public static final AccessRight CLOSED_ACCESS_RIGHT() {
return OafMapperUtils
"Closed Access",
private static Qualifier qualifier(
final String classid,
final String classname,
final String schemeid) {
final Qualifier q = new Qualifier();
return q;
private static KeyValue keyValue(final String key, final String value) {
final KeyValue kv = new KeyValue();
return kv;

@ -0,0 +1,69 @@
package eu.dnetlib.dhp.schema.oaf.common;
import java.util.Comparator;
import eu.dnetlib.dhp.schema.oaf.Qualifier;
public class AccessRightComparator<T extends Qualifier> implements Comparator<T> {
public int compare(T left, T right) {
if (left == null && right == null)
return 0;
if (left == null)
return 1;
if (right == null)
return -1;
String lClass = left.getClassid();
String rClass = right.getClassid();
if (lClass.equals(rClass))
return 0;
if (lClass.equals("OPEN SOURCE"))
return -1;
if (rClass.equals("OPEN SOURCE"))
return 1;
if (lClass.equals("OPEN"))
return -1;
if (rClass.equals("OPEN"))
return 1;
if (lClass.equals("6MONTHS"))
return -1;
if (rClass.equals("6MONTHS"))
return 1;
if (lClass.equals("12MONTHS"))
return -1;
if (rClass.equals("12MONTHS"))
return 1;
if (lClass.equals("EMBARGO"))
return -1;
if (rClass.equals("EMBARGO"))
return 1;
if (lClass.equals("RESTRICTED"))
return -1;
if (rClass.equals("RESTRICTED"))
return 1;
if (lClass.equals("CLOSED"))
return -1;
if (rClass.equals("CLOSED"))
return 1;
if (lClass.equals("UNKNOWN"))
return -1;
if (rClass.equals("UNKNOWN"))
return 1;
// Else (but unlikely), lexicographical ordering will do.
return lClass.compareTo(rClass);
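The chain of `equals` checks in AccessRightComparator above encodes a fixed precedence list, with nulls sorted last and a lexicographic fallback for values outside the list. As an illustration only, the same ordering can be expressed with a rank table. This sketch compares plain class id strings instead of `Qualifier` objects so that it stays self-contained, and `AccessRightRank` is a hypothetical name that does not appear in the commit.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class AccessRightRank implements Comparator<String> {

    // Precedence copied from the comparator above: earlier entries sort first.
    private static final List<String> ORDER = Arrays.asList(
        "OPEN SOURCE", "OPEN", "6MONTHS", "12MONTHS", "EMBARGO", "RESTRICTED", "CLOSED", "UNKNOWN");

    // Values outside the list share the lowest rank.
    private static int rank(String classid) {
        int i = ORDER.indexOf(classid);
        return i < 0 ? ORDER.size() : i;
    }

    @Override
    public int compare(String left, String right) {
        if (left == null && right == null)
            return 0;
        if (left == null)
            return 1;
        if (right == null)
            return -1;
        int byRank = Integer.compare(rank(left), rank(right));
        // Equal rank means either identical values or two unlisted values,
        // for which lexicographic ordering is the fallback.
        return byRank != 0 ? byRank : left.compareTo(right);
    }
}
```

Sorting a list with `new AccessRightRank()` places the most open value first, which is the property a caller needs when picking the best access right among several instances.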

@ -0,0 +1,21 @@
package eu.dnetlib.dhp.schema.oaf.common;
import eu.dnetlib.dhp.schema.oaf.Entity;
/** Actual entity types in the Graph */
public enum EntityType {
publication, dataset, otherresearchproduct, software, datasource, organization, project;
* Resolves the EntityType for the given class.
* @param clazz the entity class
* @param <T> actual Entity subclass
* @return the EntityType associated with the given class
public static <T extends Entity> EntityType fromClass(Class<T> clazz) {
return EntityType.valueOf(clazz.getSimpleName().toLowerCase());

@ -0,0 +1,7 @@
package eu.dnetlib.dhp.schema.oaf.common;
/** Main entity types in the Graph */
public enum MainEntityType {
result, datasource, organization, project

@ -0,0 +1,352 @@
package eu.dnetlib.dhp.schema.oaf.common;
import static;
import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.util.*;
import java.util.function.Function;
import org.apache.commons.codec.binary.Hex;
import org.apache.commons.lang3.StringUtils;
import com.github.sisyphsu.dateparser.DateParserUtils;
import eu.dnetlib.dhp.schema.oaf.*;
/** Oaf model utility methods. */
public class ModelSupport {
/** Defines the mapping between the actual entity type and the main entity type */
private static final Map<EntityType, MainEntityType> entityMapping = Maps.newHashMap();
static {
entityMapping.put(EntityType.publication, MainEntityType.result);
entityMapping.put(EntityType.dataset, MainEntityType.result);
entityMapping.put(EntityType.otherresearchproduct, MainEntityType.result);
entityMapping.put(EntityType.software, MainEntityType.result);
entityMapping.put(EntityType.datasource, MainEntityType.datasource);
entityMapping.put(EntityType.organization, MainEntityType.organization);
entityMapping.put(EntityType.project, MainEntityType.project);
* Defines the mapping between the actual entity types and the relative classes implementing them
public static final Map<EntityType, Class> entityTypes = Maps.newHashMap();
static {
entityTypes.put(EntityType.datasource, Datasource.class);
entityTypes.put(EntityType.organization, Organization.class);
entityTypes.put(EntityType.project, Project.class);
entityTypes.put(EntityType.dataset, Dataset.class);
entityTypes.put(EntityType.otherresearchproduct, OtherResearchProduct.class);
entityTypes.put(EntityType.software, Software.class);
entityTypes.put(EntityType.publication, Publication.class);
public static final Map<String, Class> oafTypes = Maps.newHashMap();
static {
oafTypes.put("datasource", Datasource.class);
oafTypes.put("organization", Organization.class);
oafTypes.put("project", Project.class);
oafTypes.put("dataset", Dataset.class);
oafTypes.put("otherresearchproduct", OtherResearchProduct.class);
oafTypes.put("software", Software.class);
oafTypes.put("publication", Publication.class);
oafTypes.put("relation", Relation.class);
public static final Map<Class, String> idPrefixMap = Maps.newHashMap();
static {
idPrefixMap.put(Datasource.class, "10");
idPrefixMap.put(Organization.class, "20");
idPrefixMap.put(Project.class, "40");
idPrefixMap.put(Dataset.class, "50");
idPrefixMap.put(OtherResearchProduct.class, "50");
idPrefixMap.put(Software.class, "50");
idPrefixMap.put(Publication.class, "50");
public static final Map<String, String> entityIdPrefix = Maps.newHashMap();
static {
entityIdPrefix.put("datasource", "10");
entityIdPrefix.put("organization", "20");
entityIdPrefix.put("project", "40");
entityIdPrefix.put("result", "50");
public static final Map<String, String> idPrefixEntity = Maps.newHashMap();
static {
idPrefixEntity.put("10", "datasource");
idPrefixEntity.put("20", "organization");
idPrefixEntity.put("40", "project");
idPrefixEntity.put("50", "result");
public static String getEntityTypeFromId(final String id) {
//TODO We should create a class which define the identifier and parse it
if (StringUtils.isBlank(id))
return null;
return idPrefixEntity.get(id.substring(0,2));
* Helper method: combines the relation attributes
* @param relType
* @param subRelType
* @param relClass
* @return
public static String rel(String relType, String subRelType, String relClass) {
return String.format("%s_%s_%s", relType, subRelType, relClass);
* Helper method: deserialize the relation attributes serialized with rel
* @param deserialization
* @return
public static RelationLabel unRel(String deserialization) {
final String[] s = deserialization.split("_");
if (s!= null && s.length==3) {
final Relation.RELTYPE currentRelType = Relation.RELTYPE.valueOf(s[0]);
final Relation.SUBRELTYPE currentSubRelType = Relation.SUBRELTYPE.valueOf(s[1]);
final Relation.RELCLASS currentRelClass = Relation.RELCLASS.valueOf(s[2]);
return new RelationLabel(currentRelClass, currentRelType, currentSubRelType);
throw new IllegalArgumentException("Invalid relationship format for "+ deserialization);
private static final String schemeTemplate = "dnet:%s_%s_relations";
public static final String DATE_FORMAT = "yyyy-MM-dd";
private ModelSupport() {
public static <E extends Entity> String getIdPrefix(Class<E> clazz) {
return idPrefixMap.get(clazz);
public static <X extends Oaf, Y extends Oaf, Z extends Oaf> Boolean sameClass(X left, Y right,
Class<Z> superClazz) {
return isSubClass(left, superClazz) && isSubClass(right, superClazz);
* Checks subclass-superclass relationship.
* @param subClazzObject Subclass object instance
* @param superClazzObject Superclass object instance
* @param <X> Subclass type
* @param <Y> Superclass type
* @return True if X is a subclass of Y
public static <X extends Oaf, Y extends Oaf> Boolean isSubClass(
X subClazzObject, Y superClazzObject) {
return isSubClass(subClazzObject.getClass(), superClazzObject.getClass());
* Checks subclass-superclass relationship.
* @param subClazzObject Subclass object instance
* @param superClazz Superclass class
* @param <X> Subclass type
* @param <Y> Superclass type
* @return True if X is a subclass of Y
public static <X extends Oaf, Y extends Oaf> Boolean isSubClass(
X subClazzObject, Class<Y> superClazz) {
return isSubClass(subClazzObject.getClass(), superClazz);
* Checks subclass-superclass relationship.
* @param subClazz Subclass class
* @param superClazz Superclass class
* @param <X> Subclass type
* @param <Y> Superclass type
* @return True if X is a subclass of Y
public static <X extends Oaf, Y extends Oaf> Boolean isSubClass(
Class<X> subClazz, Class<Y> superClazz) {
return superClazz.isAssignableFrom(subClazz);
* Lists all the OAF model classes
* @param <T>
* @return
public static <T extends Entity> Class<T>[] getOafModelClasses() {
return new Class[] {
public static String getMainType(final EntityType type) {
return entityMapping.get(type).name();
public static boolean isResult(EntityType type) {
public static String getScheme(final String sourceType, final String targetType) {
return String
public static String tableIdentifier(String dbName, String tableName) {
checkArgument(StringUtils.isNotBlank(dbName), "DB name cannot be empty");
checkArgument(StringUtils.isNotBlank(tableName), "table name cannot be empty");
return String.format("%s.%s", dbName, tableName);
public static <T extends Oaf> String tableIdentifier(String dbName, Class<T> clazz) {
checkArgument(Objects.nonNull(clazz), "clazz is needed to derive the table name, thus cannot be null");
return tableIdentifier(dbName, clazz.getSimpleName().toLowerCase());
public static <T extends Oaf> Function<T, String> idFn() {
return x -> {
if (isSubClass(x, Relation.class)) {
return idFnForRelation(x);
return idFnForOafEntity(x);
private static <T extends Oaf> String idFnForRelation(T t) {
Relation r = (Relation) t;
return Optional
source -> Optional
target -> Optional
relType -> Optional
subRelType -> Optional
relClass -> String
.orElse(String.join(source, target, relType.toString())))
.orElse(String.join(source, target)))
private static <T extends Oaf> String idFnForOafEntity(T t) {
return ((Entity) t).getId();
public static String md5(final String s) {
try {
final MessageDigest md = MessageDigest.getInstance("MD5");
return new String(Hex.encodeHex(md.digest()));
} catch (final NoSuchAlgorithmException e) {
throw new IllegalStateException(e);
public static String generateIdentifier(final String originalId, final String nsPrefix) {
return String.format("%s::%s", nsPrefix, md5(originalId));
public static String oldest(String dateA, String dateB) throws ParseException {
if (StringUtils.isBlank(dateA)) {
return dateB;
if (StringUtils.isBlank(dateB)) {
return dateA;
if (StringUtils.isNotBlank(dateA) && StringUtils.isNotBlank(dateB)) {
final Date a = DateParserUtils.parseDate(dateA);
final Date b = DateParserUtils.parseDate(dateB);
if (Objects.nonNull(a) && Objects.nonNull(b)) {
return a.before(b) ? dateA : dateB;
} else {
return null;
} else {
return null;
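The rel and unRel helpers defined earlier in this file are meant to be inverses: rel joins the three relation attributes with underscores, and unRel splits on the underscore and expects exactly three parts. The sketch below shows that round trip with plain strings in place of the `Relation` enums. `RelKey` is a hypothetical name and the attribute values in the usage note are placeholders. It also illustrates a limitation of the encoding: a component that itself contains an underscore can no longer be split back into three parts. Whether any constant of `Relation.RELTYPE`, `Relation.SUBRELTYPE` or `Relation.RELCLASS` contains an underscore is not visible in this diff.

```java
final class RelKey {

    private RelKey() {
    }

    // Same format string as ModelSupport.rel.
    static String rel(String relType, String subRelType, String relClass) {
        return String.format("%s_%s_%s", relType, subRelType, relClass);
    }

    // String-based counterpart of ModelSupport.unRel: returns the three parts
    // or fails when the key does not split into exactly three components.
    static String[] unRel(String key) {
        String[] parts = key.split("_");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Invalid relationship format for " + key);
        }
        return parts;
    }
}
```

With placeholder values, `RelKey.unRel(RelKey.rel("typeA", "subB", "classC"))` yields the original three strings, whereas `RelKey.rel("type_A", "subB", "classC")` produces a key that unRel rejects.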

@ -0,0 +1,45 @@
package eu.dnetlib.dhp.schema.oaf.common;
import java.util.Comparator;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.Qualifier;
public class RefereedComparator implements Comparator<Qualifier> {
public int compare(Qualifier left, Qualifier right) {
if (left == null && right == null)
return 0;
if (left == null)
return 1;
if (right == null)
return -1;
String lClass = left.getClassid();
String rClass = right.getClassid();
if (lClass.equals(rClass))
return 0;
if (lClass.equals(ModelConstants.PEER_REVIEWED_CLASSID))
return -1;
if (rClass.equals(ModelConstants.PEER_REVIEWED_CLASSID))
return 1;
if (lClass.equals(ModelConstants.NON_PEER_REVIEWED_CLASSID))
return -1;
if (rClass.equals(ModelConstants.NON_PEER_REVIEWED_CLASSID))
return 1;
if (lClass.equals(ModelConstants.UNKNOWN))
return -1;
if (rClass.equals(ModelConstants.UNKNOWN))
return 1;
// Else (but unlikely), lexicographical ordering will do.
return lClass.compareTo(rClass);

@ -0,0 +1,33 @@
package eu.dnetlib.dhp.schema.oaf.common;
import eu.dnetlib.dhp.schema.oaf.Relation;
public class RelationLabel {
private final Relation.RELCLASS relClass;
private final Relation.RELTYPE relType;
private final Relation.SUBRELTYPE subReltype;
public RelationLabel(Relation.RELCLASS relClass, Relation.RELTYPE relType, Relation.SUBRELTYPE subReltype) {
this.relClass = relClass;
this.relType = relType;
this.subReltype = subReltype;
public RelationLabel inverse() {
return new RelationLabel(relClass.getInverse(), relType, subReltype);
public Relation.RELTYPE getRelType() {
return relType;
public Relation.SUBRELTYPE getSubReltype() {
return subReltype;
public Relation.RELCLASS getRelClass() {
return relClass;

@ -0,0 +1,101 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.HashSet;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;
import org.apache.commons.lang3.StringUtils;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import lombok.val;
public class CleaningFunctions {
public static final String DOI_PREFIX_REGEX = "(^10\\.|\\/10\\.)";
private static final String ALL_SPACES_REGEX = "(?:\\n|\\r|\\t|\\s)";
public static final String DOI_PREFIX = "10.";
public static final Set<String> PID_BLACKLIST = new HashSet<>();
static {
public CleaningFunctions() {
* Utility method that filters PID values on a per-type basis.
* @param s the PID whose value will be checked.
* @return false if the pid matches the filter criteria, true otherwise.
public static boolean pidFilter(StructuredProperty s) {
final String pidValue = s.getValue();
if (Objects.isNull(s.getQualifier()) ||
StringUtils.isBlank(pidValue) ||
StringUtils.isBlank(pidValue.replaceAll("(?:\\n|\\r|\\t|\\s)", ""))) {
return false;
if (CleaningFunctions.PID_BLACKLIST.contains(pidValue)) {
return false;
return !PidBlacklistProvider.getBlacklist(s.getQualifier().getClassid()).contains(pidValue);
* Utility method that normalises PID values on a per-type basis.
* @param pid the PID whose value will be normalised.
* @return the PID containing the normalised value.
public static StructuredProperty normalizePidValue(StructuredProperty pid) {
return pid;
* This utility was moved from DOIBoost;
* it implements a more thorough cleaning of DOIs.
* If the DOI is invalid, it throws an IllegalArgumentException.
* @param input DOI
* @return normalized DOI
private static String normalizeDOI(final String input) {
if (input == null)
throw new IllegalArgumentException("PID value cannot be empty");
final String replaced = input
.replaceAll(ALL_SPACES_REGEX, "")
if (StringUtils.isEmpty(replaced.trim()))
throw new IllegalArgumentException("PID value normalized return empty string");
if (!replaced.contains("10."))
throw new IllegalArgumentException("DOI must start with 10.");
return replaced.substring(replaced.indexOf("10."));
public static String normalizePidValue(String pidType, String pidValue) {
String value = Optional
.orElseThrow(() -> new IllegalArgumentException("PID value cannot be empty"));
switch (pidType) {
// TODO add cleaning for more PID types as needed
case ModelConstants.DOI:
return normalizeDOI(value.toLowerCase());
return value;
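The DOI branch of normalizePidValue above lower-cases the value, strips whitespace, and keeps everything from the first `10.` onwards, rejecting values that are null, blank, or lack that prefix. The following is a simplified sketch of those visible steps only. `DoiNormalizer` is a hypothetical name, and any replacement step that is not shown in the diff lines above is deliberately left out, so the result may differ from the real method for unusual inputs.

```java
import java.util.Locale;

final class DoiNormalizer {

    private DoiNormalizer() {
    }

    // Simplified stand-in for the DOI cleaning shown above.
    static String normalize(String input) {
        if (input == null) {
            throw new IllegalArgumentException("PID value cannot be empty");
        }
        // Lower-case and drop all whitespace, including newlines and tabs.
        String replaced = input.toLowerCase(Locale.ROOT).replaceAll("\\s", "");
        if (replaced.isEmpty()) {
            throw new IllegalArgumentException("PID value is empty after normalization");
        }
        int start = replaced.indexOf("10.");
        if (start < 0) {
            throw new IllegalArgumentException("DOI must contain the prefix 10.");
        }
        // Discard resolver URLs or "doi:" labels that precede the prefix.
        return replaced.substring(start);
    }
}
```

Because the cut is made at the first occurrence of `10.`, a leading resolver URL or `doi:` label is removed as a side effect.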

@ -1,14 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
public class DoiCleaningRule {
public static String clean(final String doi) {
return doi
.replaceAll("\\s", "")
.replaceAll("^doi:", "")
.replaceFirst(CleaningFunctions.DOI_PREFIX_REGEX, CleaningFunctions.DOI_PREFIX);

@ -1,25 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FundRefCleaningRule {
public static final Pattern PATTERN = Pattern.compile("\\d+");
public static String clean(final String fundRefId) {
String s = fundRefId
.replaceAll("\\s", "");
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return m.group();
} else {
return "";

@ -1,12 +1,8 @@
package eu.dnetlib.dhp.schema.oaf.utils; package eu.dnetlib.dhp.schema.oaf.utils;
import static eu.dnetlib.dhp.schema.common.ModelConstants.*;
import static eu.dnetlib.dhp.schema.common.ModelConstants.OPENAIRE_META_RESOURCE_TYPE;
import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.getProvenance; import static eu.dnetlib.dhp.schema.oaf.utils.OafMapperUtils.getProvenance;
import java.time.LocalDate; import java.time.LocalDate;
import java.time.ZoneId; import java.time.ZoneId;
import java.time.format.DateTimeFormatter; import java.time.format.DateTimeFormatter;
@ -17,24 +13,23 @@ import;
import; import;
import org.apache.commons.lang3.StringUtils; import org.apache.commons.lang3.StringUtils;
import org.apache.spark.sql.Encoders;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.github.sisyphsu.dateparser.DateParserUtils; import com.github.sisyphsu.dateparser.DateParserUtils;
import; import;
import; import;
import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup; import eu.dnetlib.dhp.common.vocabulary.VocabularyGroup;
import eu.dnetlib.dhp.common.vocabulary.VocabularyTerm;
import eu.dnetlib.dhp.schema.common.ModelConstants; import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*; import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.common.ModelSupport;
import me.xuender.unidecode.Unidecode; import me.xuender.unidecode.Unidecode;
public class GraphCleaningFunctions extends CleaningFunctions { public class GraphCleaningFunctions extends CleaningFunctions {
public static final String DNET_PUBLISHERS = "dnet:publishers";
public static final String DNET_LICENSES = "dnet:licenses";
public static final String ORCID_CLEANING_REGEX = ".*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9x]{4})"; public static final String ORCID_CLEANING_REGEX = ".*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9]{4}).*[-–—−=].*([0-9x]{4})";
public static final int ORCID_LEN = 19; public static final int ORCID_LEN = 19;
public static final String CLEANING_REGEX = "(?:\\n|\\r|\\t)"; public static final String CLEANING_REGEX = "(?:\\n|\\r|\\t)";
@ -44,68 +39,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
public static final String TITLE_FILTER_REGEX = String.format("(%s)|\\W|\\d", TITLE_TEST); public static final String TITLE_FILTER_REGEX = String.format("(%s)|\\W|\\d", TITLE_TEST);
public static final int TITLE_FILTER_RESIDUAL_LENGTH = 5; public static final int TITLE_FILTER_RESIDUAL_LENGTH = 5;
private static final String NAME_CLEANING_REGEX = "[\\r\\n\\t\\s]+";
private static final Set<String> INVALID_AUTHOR_NAMES = new HashSet<>();
private static final Set<String> INVALID_URLS = new HashSet<>();
private static final Set<String> INVALID_URL_HOSTS = new HashSet<>();
private static final HashSet<String> PEER_REVIEWED_TYPES = new HashSet<>();
static {
PEER_REVIEWED_TYPES.add("Part of book or chapter of book");
PEER_REVIEWED_TYPES.add("Doctoral thesis");
PEER_REVIEWED_TYPES.add("Master thesis");
PEER_REVIEWED_TYPES.add("Data Paper");
PEER_REVIEWED_TYPES.add("Bachelor thesis");
PEER_REVIEWED_TYPES.add("Conference object");
INVALID_AUTHOR_NAMES.add("(:unkn) unknown");
INVALID_AUTHOR_NAMES.add("[autor desconocido]");
INVALID_AUTHOR_NAMES.add("[s. n.]");
INVALID_AUTHOR_NAMES.add("no name supplied");
INVALID_AUTHOR_NAMES.add("none available");
INVALID_AUTHOR_NAMES.add("not available not available");
INVALID_AUTHOR_NAMES.add("null &na;");
INVALID_AUTHOR_NAMES.add("null anonymous");
public static <T extends Oaf> T cleanContext(T value, String contextId, String verifyParam) { public static <T extends Oaf> T cleanContext(T value, String contextId, String verifyParam) {
if (ModelSupport.isSubClass(value, Result.class)) { if (ModelSupport.isSubClass(value, Result.class)) {
@ -229,15 +162,19 @@ public class GraphCleaningFunctions extends CleaningFunctions {
} }
public static <T extends Oaf> T fixVocabularyNames(T value) { public static <T extends Oaf> T fixVocabularyNames(T value) {
if (value instanceof OafEntity) { if (value instanceof Datasource) {
// nothing to clean here
} else if (value instanceof Project) {
// nothing to clean here
} else if (value instanceof Organization) {
Organization o = (Organization) value;
if (Objects.nonNull(o.getCountry())) {
fixVocabName(o.getCountry(), ModelConstants.DNET_COUNTRY_TYPE);
} else if (value instanceof Relation) {
// nothing to clean here
} else if (value instanceof Result) {
OafEntity e = (OafEntity) value;
.ifPresent(pid -> pid.forEach(p -> fixVocabName(p.getQualifier(), ModelConstants.DNET_PID_TYPES)));
if (value instanceof Result) {
Result r = (Result) value; Result r = (Result) value;
fixVocabName(r.getLanguage(), ModelConstants.DNET_LANGUAGES); fixVocabName(r.getLanguage(), ModelConstants.DNET_LANGUAGES);
@ -251,11 +188,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
for (Instance i : r.getInstance()) { for (Instance i : r.getInstance()) {
fixVocabName(i.getAccessright(), ModelConstants.DNET_ACCESS_MODES); fixVocabName(i.getAccessright(), ModelConstants.DNET_ACCESS_MODES);
fixVocabName(i.getRefereed(), ModelConstants.DNET_REVIEW_LEVELS); fixVocabName(i.getRefereed(), ModelConstants.DNET_REVIEW_LEVELS);
pid -> pid.forEach(p -> fixVocabName(p.getQualifier(), ModelConstants.DNET_PID_TYPES)));
} }
} }
if (Objects.nonNull(r.getAuthor())) { if (Objects.nonNull(r.getAuthor())) {
@ -276,29 +208,18 @@ public class GraphCleaningFunctions extends CleaningFunctions {
} else if (value instanceof Software) { } else if (value instanceof Software) {
} }
} else if (value instanceof Datasource) {
// nothing to clean here
} else if (value instanceof Project) {
// nothing to clean here
} else if (value instanceof Organization) {
Organization o = (Organization) value;
if (Objects.nonNull(o.getCountry())) {
fixVocabName(o.getCountry(), ModelConstants.DNET_COUNTRY_TYPE);
} else if (value instanceof Relation) {
// nothing to clean here
} }
return value; return value;
} }
public static <T extends Oaf> boolean filter(T value) { public static <T extends Oaf> boolean filter(T value) {
if (!(value instanceof Relation) && (Boolean.TRUE if (value instanceof Entity) {
Entity entity = (Entity) value;
if (Boolean.TRUE
.equals( .equals(
Optional Optional
.ofNullable(value) .ofNullable(entity)
.map( .map(
o -> Optional o -> Optional
.ofNullable(o.getDataInfo()) .ofNullable(o.getDataInfo())
@ -306,57 +227,22 @@ public class GraphCleaningFunctions extends CleaningFunctions {
d -> Optional d -> Optional
.ofNullable(d.getInvisible()) .ofNullable(d.getInvisible())
.orElse(true)) .orElse(true))
.orElse(false)) .orElse(true))
.orElse(true)))) { .orElse(true))) {
return true; return true;
if (value instanceof Datasource) {
final Datasource d = (Datasource) value;
return Objects.nonNull(d.getOfficialname()) && StringUtils.isNotBlank(d.getOfficialname().getValue());
} else if (value instanceof Project) {
final Project p = (Project) value;
return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
} else if (value instanceof Organization) {
// nothing to evaluate here
} else if (value instanceof Relation) {
// nothing to clean here
} else if (value instanceof Result) { } else if (value instanceof Result) {
Result r = (Result) value; Result r = (Result) value;
if (Objects.isNull(r.getTitle()) || r.getTitle().isEmpty()) { if (Objects.isNull(r.getTitle()) || r.getTitle().isEmpty()) {
return false; return false;
} }
if (value instanceof Publication) {
} else if (value instanceof Dataset) {
} else if (value instanceof OtherResearchProduct) {
} else if (value instanceof Software) {
} }
} }
return true; return true;
} }
public static <T extends Oaf> T cleanup(T value, VocabularyGroup vocs) { public static <T extends Oaf> T cleanup(T value, VocabularyGroup vocs) {
if (Objects.isNull(value.getDataInfo())) {
final DataInfo d = new DataInfo();
if (value instanceof OafEntity) {
OafEntity e = (OafEntity) value;
if (Objects.nonNull(e.getPid())) {
if (value instanceof Datasource) { if (value instanceof Datasource) {
// nothing to clean here // nothing to clean here
} else if (value instanceof Project) { } else if (value instanceof Project) {
@ -366,24 +252,25 @@ public class GraphCleaningFunctions extends CleaningFunctions {
if (Objects.isNull(o.getCountry()) || StringUtils.isBlank(o.getCountry().getClassid())) { if (Objects.isNull(o.getCountry()) || StringUtils.isBlank(o.getCountry().getClassid())) {
o.setCountry(ModelConstants.UNKNOWN_COUNTRY); o.setCountry(ModelConstants.UNKNOWN_COUNTRY);
} }
} else if (value instanceof Relation) {
Relation r = (Relation) value;
Optional<String> validationDate = doCleanDate(r.getValidationDate());
if (validationDate.isPresent()) {
} else {
} else if (value instanceof Result) { } else if (value instanceof Result) {
Result r = (Result) value; Result r = (Result) value;
if (Objects.isNull(r.getContext())) {
r.setContext(new ArrayList<>());
if (Objects.nonNull(r.getFulltext())
&& (ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
if (Objects.nonNull(r.getDateofacceptance())) { if (Objects.nonNull(r.getDateofacceptance())) {
Optional<String> date = cleanDateField(r.getDateofacceptance()); Optional<String> date = cleanDateField(r.getDateofacceptance());
if (date.isPresent()) { if (date.isPresent()) {
r.getDateofacceptance().setValue(date.get()); r.setDateofacceptance(date.get());
} else { } else {
r.setDateofacceptance(null); r.setDateofacceptance(null);
} }
@ -404,26 +291,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(sp -> StringUtils.isNotBlank(sp.getValue())) .filter(sp -> StringUtils.isNotBlank(sp.getValue()))
.collect(Collectors.toList())); .collect(Collectors.toList()));
} }
if (Objects.nonNull(r.getPublisher())) { if (Objects.nonNull(r.getPublisher()) && StringUtils.isBlank(r.getPublisher().getName())) {
if (StringUtils.isBlank(r.getPublisher().getValue())) {
r.setPublisher(null); r.setPublisher(null);
} else {
.replaceAll(NAME_CLEANING_REGEX, " "));
if (vocs.vocabularyExists(DNET_PUBLISHERS)) {
.map(voc -> voc.getTermBySynonym(r.getPublisher().getValue()))
.ifPresent(publisher -> r.getPublisher().setValue(publisher));
} }
if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) { if (Objects.isNull(r.getLanguage()) || StringUtils.isBlank(r.getLanguage().getClassid())) {
r r
@ -504,10 +373,13 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.getDescription() .getDescription()
.stream() .stream()
.filter(Objects::nonNull) .filter(Objects::nonNull)
.filter(sp -> StringUtils.isNotBlank(sp.getValue())) .filter(s -> StringUtils.isNotBlank(s))
.map(GraphCleaningFunctions::cleanValue) .map(GraphCleaningFunctions::cleanValue)
.collect(Collectors.toList())); .collect(Collectors.toList()));
} }
if (Objects.nonNull(r.getPid())) {
if (Objects.isNull(r.getResourcetype()) || StringUtils.isBlank(r.getResourcetype().getClassid())) { if (Objects.isNull(r.getResourcetype()) || StringUtils.isBlank(r.getResourcetype().getClassid())) {
r r
.setResourcetype( .setResourcetype(
@ -516,37 +388,31 @@ public class GraphCleaningFunctions extends CleaningFunctions {
if (Objects.nonNull(r.getInstance())) { if (Objects.nonNull(r.getInstance())) {
for (Instance i : r.getInstance()) { for (Instance i : r.getInstance()) {
if (!vocs if (!vocs.termExists(ModelConstants.DNET_PUBLICATION_RESOURCE, i.getInstancetype().getClassid())) {
.termExists(ModelConstants.DNET_PUBLICATION_RESOURCE, i.getInstancetype().getClassid())) {
if (r instanceof Publication) { if (r instanceof Publication) {
i i
.setInstancetype( .setInstancetype(
OafMapperUtils OafMapperUtils
.qualifier( .qualifier(
"0038", "Other literature type", "0038", "Other literature type", ModelConstants.DNET_PUBLICATION_RESOURCE));
} else if (r instanceof Dataset) { } else if (r instanceof Dataset) {
i i
.setInstancetype( .setInstancetype(
OafMapperUtils OafMapperUtils
.qualifier( .qualifier(
"0039", "Other dataset type", ModelConstants.DNET_PUBLICATION_RESOURCE, "0039", "Other dataset type", ModelConstants.DNET_PUBLICATION_RESOURCE));
} else if (r instanceof Software) { } else if (r instanceof Software) {
i i
.setInstancetype( .setInstancetype(
OafMapperUtils OafMapperUtils
.qualifier( .qualifier(
"0040", "Other software type", ModelConstants.DNET_PUBLICATION_RESOURCE, "0040", "Other software type", ModelConstants.DNET_PUBLICATION_RESOURCE));
} else if (r instanceof OtherResearchProduct) { } else if (r instanceof OtherResearchProduct) {
i i
.setInstancetype( .setInstancetype(
OafMapperUtils OafMapperUtils
.qualifier( .qualifier(
"0020", "Other ORP type", ModelConstants.DNET_PUBLICATION_RESOURCE, "0020", "Other ORP type", ModelConstants.DNET_PUBLICATION_RESOURCE));
} }
} }
@ -568,8 +434,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
}); });
}); });
if (Objects.isNull(i.getAccessright()) if (Objects.isNull(i.getAccessright()) || StringUtils.isBlank(i.getAccessright().getClassid())) {
|| StringUtils.isBlank(i.getAccessright().getClassid())) {
i i
.setAccessright( .setAccessright(
accessRight( accessRight(
@ -582,69 +447,17 @@ public class GraphCleaningFunctions extends CleaningFunctions {
if (Objects.isNull(i.getRefereed()) || StringUtils.isBlank(i.getRefereed().getClassid())) { if (Objects.isNull(i.getRefereed()) || StringUtils.isBlank(i.getRefereed().getClassid())) {
i.setRefereed(qualifier("0000", "Unknown", ModelConstants.DNET_REVIEW_LEVELS)); i.setRefereed(qualifier("0000", "Unknown", ModelConstants.DNET_REVIEW_LEVELS));
} }
if (Objects.nonNull(i.getLicense()) && Objects.nonNull(i.getLicense().getValue())) {
.map(voc -> voc.getTermBySynonym(i.getLicense().getValue()))
.ifPresent(license -> i.getLicense().setValue(license));
// from the script from Dimitris
if ("0000".equals(i.getRefereed().getClassid())) {
final boolean isFromCrossref = Optional
.map(id -> id.equals(ModelConstants.CROSSREF_ID))
final boolean hasDoi = Optional
pid -> pid
p -> PidType.doi.toString().equals(p.getQualifier().getClassid())))
final boolean isPeerReviewedType = PEER_REVIEWED_TYPES
final boolean noOtherLitType = r
.noneMatch(ii -> "Other literature type".equals(ii.getInstancetype().getClassname()));
if (isFromCrossref && hasDoi && isPeerReviewedType && noOtherLitType) {
i.setRefereed(qualifier("0001", "peerReviewed", ModelConstants.DNET_REVIEW_LEVELS));
} else {
i.setRefereed(qualifier("0002", "nonPeerReviewed", ModelConstants.DNET_REVIEW_LEVELS));
if (Objects.nonNull(i.getDateofacceptance())) { if (Objects.nonNull(i.getDateofacceptance())) {
Optional<String> date = cleanDateField(i.getDateofacceptance()); Optional<String> date = cleanDateField(i.getDateofacceptance());
if (date.isPresent()) { if (date.isPresent()) {
i.getDateofacceptance().setValue(date.get()); i.setDateofacceptance(date.get());
} else { } else {
i.setDateofacceptance(null); i.setDateofacceptance(null);
} }
} }
if (StringUtils.isNotBlank(i.getFulltext()) &&
(ModelConstants.SOFTWARE_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()) ||
ModelConstants.DATASET_RESULTTYPE_CLASSID.equals(r.getResulttype().getClassid()))) {
if (Objects.nonNull(i.getUrl())) {
} }
} }
} if (Objects.isNull(r.getBestaccessright()) || StringUtils.isBlank(r.getBestaccessright().getClassid())) {
if (Objects.isNull(r.getBestaccessright())
|| StringUtils.isBlank(r.getBestaccessright().getClassid())) {
Qualifier bestaccessrights = OafMapperUtils.createBestAccessRights(r.getInstance()); Qualifier bestaccessrights = OafMapperUtils.createBestAccessRights(r.getInstance());
if (Objects.isNull(bestaccessrights)) { if (Objects.isNull(bestaccessrights)) {
r r
@ -663,8 +476,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.getAuthor() .getAuthor()
.stream() .stream()
.filter(Objects::nonNull) .filter(Objects::nonNull)
.filter(GraphCleaningFunctions::isValidAuthorName) .filter(a -> StringUtils.isNotBlank(a.getFullname()))
.map(GraphCleaningFunctions::cleanupAuthor) .filter(a -> StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")))
.collect(Collectors.toList())); .collect(Collectors.toList()));
boolean nullRank = r boolean nullRank = r
@ -690,9 +503,6 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(Objects::nonNull) .filter(Objects::nonNull)
.filter(p -> Objects.nonNull(p.getQualifier())) .filter(p -> Objects.nonNull(p.getQualifier()))
.filter(p -> StringUtils.isNotBlank(p.getValue())) .filter(p -> StringUtils.isNotBlank(p.getValue()))
p -> StringUtils
.contains(StringUtils.lowerCase(p.getQualifier().getClassid()), ORCID))
.map(p -> { .map(p -> {
// hack to distinguish orcid from orcid_pending // hack to distinguish orcid from orcid_pending
String pidProvenance = getProvenance(p.getDataInfo()); String pidProvenance = getProvenance(p.getDataInfo());
@ -702,8 +512,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.toLowerCase() .toLowerCase()
.contains(ModelConstants.ORCID)) { .contains(ModelConstants.ORCID)) {
if (pidProvenance if (pidProvenance
pidProvenance.equals("ORCID_ENRICHMENT")) {
p.getQualifier().setClassid(ModelConstants.ORCID); p.getQualifier().setClassid(ModelConstants.ORCID);
} else { } else {
p.getQualifier().setClassid(ModelConstants.ORCID_PENDING); p.getQualifier().setClassid(ModelConstants.ORCID_PENDING);
@ -744,58 +553,14 @@ public class GraphCleaningFunctions extends CleaningFunctions {
} else if (value instanceof Software) { } else if (value instanceof Software) {
} }
} else if (value instanceof Relation) {
Relation r = (Relation) value;
Optional<String> validationDate = doCleanDate(r.getValidationDate());
if (validationDate.isPresent()) {
} else {
} }
return value; return value;
} }
private static Author cleanupAuthor(Author author) { private static Optional<String> cleanDateField(String dateofacceptance) {
if (StringUtils.isNotBlank(author.getFullname())) {
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
if (StringUtils.isNotBlank(author.getName())) {
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
if (StringUtils.isNotBlank(author.getSurname())) {
.replaceAll(NAME_CLEANING_REGEX, " ")
.replace("\"", "\\\""));
return author;
private static Optional<String> cleanDateField(Field<String> dateofacceptance) {
return Optional return Optional
.ofNullable(dateofacceptance) .ofNullable(dateofacceptance)
.map(GraphCleaningFunctions::cleanDate) .map(GraphCleaningFunctions::cleanDate)
.filter(Objects::nonNull); .filter(Objects::nonNull);
} }
@ -825,32 +590,14 @@ public class GraphCleaningFunctions extends CleaningFunctions {
private static boolean isValidAuthorName(Author a) { private static boolean isValidAuthorName(Author a) {
return StringUtils.isNotBlank(a.getFullname()) && return !Stream
StringUtils.isNotBlank(a.getFullname().replaceAll("[\\W]", "")) &&
!INVALID_AUTHOR_NAMES.contains(StringUtils.lowerCase(a.getFullname()).trim()) &&
.of(a.getFullname(), a.getName(), a.getSurname()) .of(a.getFullname(), a.getName(), a.getSurname())
.filter(StringUtils::isNotBlank) .filter(s -> s != null && !s.isEmpty())
.collect(Collectors.joining("")) .collect(Collectors.joining(""))
.toLowerCase() .toLowerCase()
} }
private static boolean urlFilter(String u) {
try {
final URL url = new URL(u);
if (StringUtils.isBlank(url.getPath()) || "/".equals(url.getPath())) {
return false;
if (INVALID_URL_HOSTS.contains(url.getHost())) {
return false;
return !INVALID_URLS.contains(url.toString());
} catch (MalformedURLException ex) {
return false;
private static List<StructuredProperty> processPidCleaning(List<StructuredProperty> pids) { private static List<StructuredProperty> processPidCleaning(List<StructuredProperty> pids) {
return pids return pids
.stream() .stream()
@ -859,7 +606,7 @@ public class GraphCleaningFunctions extends CleaningFunctions {
.filter(sp -> !PID_BLACKLIST.contains(sp.getValue().trim().toLowerCase())) .filter(sp -> !PID_BLACKLIST.contains(sp.getValue().trim().toLowerCase()))
.filter(sp -> Objects.nonNull(sp.getQualifier())) .filter(sp -> Objects.nonNull(sp.getQualifier()))
.filter(sp -> StringUtils.isNotBlank(sp.getQualifier().getClassid())) .filter(sp -> StringUtils.isNotBlank(sp.getQualifier().getClassid()))
.map(PidCleaner::normalizePidValue) .map(CleaningFunctions::normalizePidValue)
.filter(CleaningFunctions::pidFilter) .filter(CleaningFunctions::pidFilter)
.collect(Collectors.toList()); .collect(Collectors.toList());
} }
@ -867,20 +614,17 @@ public class GraphCleaningFunctions extends CleaningFunctions {
private static void fixVocabName(Qualifier q, String vocabularyName) { private static void fixVocabName(Qualifier q, String vocabularyName) {
if (Objects.nonNull(q) && StringUtils.isBlank(q.getSchemeid())) { if (Objects.nonNull(q) && StringUtils.isBlank(q.getSchemeid())) {
q.setSchemeid(vocabularyName); q.setSchemeid(vocabularyName);
} }
} }
private static AccessRight accessRight(String classid, String classname, String scheme) { private static AccessRight accessRight(String classid, String classname, String scheme) {
return OafMapperUtils return OafMapperUtils
.accessRight( .accessRight(
classid, classname, scheme, scheme); classid, classname, scheme);
} }
private static Qualifier qualifier(String classid, String classname, String scheme) { private static Qualifier qualifier(String classid, String classname, String scheme) {
return OafMapperUtils return OafMapperUtils.qualifier(classid, classname, scheme);
classid, classname, scheme, scheme);
} }
protected static StructuredProperty cleanValue(StructuredProperty s) { protected static StructuredProperty cleanValue(StructuredProperty s) {
@ -893,110 +637,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
return s; return s;
} }
protected static Field<String> cleanValue(Field<String> s) { protected static String cleanValue(String s) {
s.setValue(s.getValue().replaceAll(CLEANING_REGEX, " ")); return s.replaceAll(CLEANING_REGEX, " ");
return s;
public static OafEntity applyCoarVocabularies(OafEntity entity, VocabularyGroup vocs) {
if (entity instanceof Result) {
final Result result = (Result) entity;
instances -> instances
instance -> {
if (Objects.isNull(instance.getInstanceTypeMapping())) {
List<InstanceTypeMapping> mapping = Lists.newArrayList();
Optional<InstanceTypeMapping> optionalItm = instance
if (optionalItm.isPresent()) {
InstanceTypeMapping coarItm = optionalItm.get();
OPENAIRE_COAR_RESOURCE_TYPES_3_1, coarItm.getOriginalType()))
.ifPresent(type -> {
final List<InstanceTypeMapping> mappings = Lists.newArrayList();
if (vocs.vocabularyExists(OPENAIRE_USER_RESOURCE_TYPES)) {
type -> mappings
.instanceTypeMapping(coarItm.getTypeCode(), type)));
if (!mappings.isEmpty()) {
result.setMetaResourceType(getMetaResourceType(result.getInstance(), vocs));
return entity;
private static boolean originalResourceType(InstanceTypeMapping itm) {
return StringUtils.isNotBlank(itm.getOriginalType()) &&
OPENAIRE_COAR_RESOURCE_TYPES_3_1.equals(itm.getVocabularyName()) &&
StringUtils.isBlank(itm.getTypeCode()) &&
private static Qualifier getMetaResourceType(final List<Instance> instances, final VocabularyGroup vocs) {
return Optional
.map(ii -> {
if (vocs.vocabularyExists(OPENAIRE_META_RESOURCE_TYPE)) {
Optional<InstanceTypeMapping> itm = ii
i -> Optional
.filter(t -> OPENAIRE_COAR_RESOURCE_TYPES_3_1.equals(t.getVocabularyName()))
if (!itm.isPresent() || Objects.isNull(itm.get().getTypeCode())) {
return null;
} else {
final String typeCode = itm.get().getTypeCode();
return Optional
.ofNullable(vocs.lookupTermBySynonym(OPENAIRE_META_RESOURCE_TYPE, typeCode))
() -> new IllegalStateException("unable to find a synonym for '" + typeCode + "' in " +
} else {
throw new IllegalStateException("vocabulary '" + OPENAIRE_META_RESOURCE_TYPE + "' not available");
} }
} }
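A note on ORCID_CLEANING_REGEX, declared near the top of GraphCleaningFunctions: the expression captures four blocks of four characters separated by a hyphen-like character or '='. The snippet below is only a minimal sketch of how such a pattern could be used to normalise an ORCID. The class OrcidCleaningSketch and its normalize method are hypothetical names, and lower-casing the input is an assumption made here because the last group only accepts a lowercase 'x'. It is not the project's actual cleaning code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the diff above.
class OrcidCleaningSketch {

    // Same expression as ORCID_CLEANING_REGEX; the escapes stand for the
    // en dash (U+2013), em dash (U+2014) and minus sign (U+2212).
    static final Pattern ORCID = Pattern
        .compile(
            ".*([0-9]{4}).*[-\u2013\u2014\u2212=].*([0-9]{4}).*[-\u2013\u2014\u2212=]"
                + ".*([0-9]{4}).*[-\u2013\u2014\u2212=].*([0-9x]{4})");

    // Returns the four captured blocks joined by '-', or empty when the input does not match.
    static Optional<String> normalize(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        // Assumption: lower-case first, since the last group only accepts 'x'.
        Matcher m = ORCID.matcher(raw.trim().toLowerCase());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(String.join("-", m.group(1), m.group(2), m.group(3), m.group(4)));
    }
}
```

Any value produced this way has 4 * 4 + 3 = 19 characters, which agrees with ORCID_LEN.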

@ -1,24 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class GridCleaningRule {
public static final Pattern PATTERN = Pattern.compile("(?<grid>\\d{4,6}\\.[0-9a-z]{1,2})");
public static String clean(String grid) {
String s = grid
.replaceAll("\\s", "")
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return "grid." + m.group("grid");
return "";
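For reference, the behaviour of the removed GridCleaningRule can be approximated as follows. This is a sketch under stated assumptions: the class name GridCleaningSketch is hypothetical, and the toLowerCase() step is an assumption added because the pattern only accepts lowercase letters; the removed class may have applied other replacements that are not visible in this diff.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the removed GridCleaningRule.
class GridCleaningSketch {

    // Same pattern as in the diff: 4 to 6 digits, a dot, then 1 or 2 alphanumerics.
    static final Pattern PATTERN = Pattern.compile("(?<grid>\\d{4,6}\\.[0-9a-z]{1,2})");

    static String clean(String grid) {
        // Strip whitespace; lower-casing is an assumption of this sketch.
        String s = grid.replaceAll("\\s", "").toLowerCase();
        Matcher m = PATTERN.matcher(s);
        if (m.find()) {
            // Re-attach the canonical "grid." prefix to the captured identifier.
            return "grid." + m.group("grid");
        }
        return "";
    }
}
```

With this sketch, an input such as "GRID. 1234.5B" would be reduced to "grid.1234.5b", and a value with no recognisable identifier yields the empty string.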

@ -1,21 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class ISNICleaningRule {
public static final Pattern PATTERN = Pattern.compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])");
public static String clean(final String isni) {
Matcher m = PATTERN.matcher(isni);
if (m.find()) {
return String.join("", m.group(1), m.group(2), m.group(3), m.group(4));
} else {
return "";
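The removed ISNICleaningRule follows the same find-and-rebuild idea: four capture groups are located and concatenated without separators. The following is a small self-contained sketch of that idea; the class name IsniCleaningSketch is hypothetical and the sample values are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the removed ISNICleaningRule.
class IsniCleaningSketch {

    // Same pattern as in the diff: "0000", then three blocks of four characters,
    // optionally separated by single spaces; the last character may be 'X'.
    static final Pattern PATTERN = Pattern
        .compile("([0]{4}) ?([0-9]{4}) ?([0-9]{4}) ?([0-9]{3}[0-9X])");

    static String clean(final String isni) {
        Matcher m = PATTERN.matcher(isni);
        if (m.find()) {
            // Join the four blocks into the compact 16-character form.
            return String.join("", m.group(1), m.group(2), m.group(3), m.group(4));
        }
        return "";
    }
}
```

Because find() is used rather than matches(), a value with a leading label (for example "isni:" followed by the digits) is still recognised.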

@ -0,0 +1,317 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static;
import static eu.dnetlib.dhp.schema.common.ModelConstants.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.function.Function;
import org.apache.commons.codec.binary.Hex;
import org.apache.commons.lang3.StringUtils;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.common.ModelSupport;
* Factory class for OpenAIRE identifiers in the Graph
public class IdentifierFactory implements Serializable {
public static final String ID_SEPARATOR = "::";
public static final String ID_PREFIX_SEPARATOR = "|";
public static final int ID_PREFIX_LEN = 12;
* Declares the associations PID_TYPE -> [DATASOURCE ID, NAME] considered authoritative for that PID_TYPE.
* The id of the record (source_::id) will be rewritten as (pidType_::id).
public static final Map<PidType, HashBiMap<String, String>> PID_AUTHORITY = Maps.newHashMap();
static {
PID_AUTHORITY.put(PidType.doi, HashBiMap.create());
PID_AUTHORITY.get(PidType.doi).put(CROSSREF_ID, "Crossref");
PID_AUTHORITY.get(PidType.doi).put(DATACITE_ID, "Datacite");
PID_AUTHORITY.get(PidType.doi).put(ZENODO_R3_ID, "Zenodo");
PID_AUTHORITY.put(PidType.pmc, HashBiMap.create());
PID_AUTHORITY.get(PidType.pmc).put(EUROPE_PUBMED_CENTRAL_ID, "Europe PubMed Central");
PID_AUTHORITY.get(PidType.pmc).put(PUBMED_CENTRAL_ID, "PubMed Central");
PID_AUTHORITY.put(PidType.pmid, HashBiMap.create());
PID_AUTHORITY.get(PidType.pmid).put(EUROPE_PUBMED_CENTRAL_ID, "Europe PubMed Central");
PID_AUTHORITY.get(PidType.pmid).put(PUBMED_CENTRAL_ID, "PubMed Central");
PID_AUTHORITY.put(PidType.arXiv, HashBiMap.create());
PID_AUTHORITY.get(PidType.arXiv).put(ARXIV_ID, " e-Print Archive");
PID_AUTHORITY.put(PidType.w3id, HashBiMap.create());
PID_AUTHORITY.get(PidType.w3id).put(ROHUB_ID, "ROHub");
* Declares the associations PID_TYPE -> [DATASOURCE ID, PID SUBSTRING] considered as delegated authority for that
* PID_TYPE. For example, Zenodo is delegated to forge DOIs that contain the word 'zenodo'.
* If a record with the same id (same pid) comes from 2 data sources, the one coming from a delegated source wins. E.g. Zenodo records win over those from Datacite.
* See also and the class dhp-common/src/main/java/eu/dnetlib/dhp/schema/oaf/utils/
public static final Map<PidType, Map<String, String>> DELEGATED_PID_AUTHORITY = Maps.newHashMap();
static {
DELEGATED_PID_AUTHORITY.put(PidType.doi, new HashMap<>());
DELEGATED_PID_AUTHORITY.get(PidType.doi).put(ZENODO_OD_ID, "zenodo");
DELEGATED_PID_AUTHORITY.get(PidType.doi).put(ZENODO_R3_ID, "zenodo");
DELEGATED_PID_AUTHORITY.put(PidType.w3id, new HashMap<>());
DELEGATED_PID_AUTHORITY.get(PidType.w3id).put(ROHUB_ID, "ro-id");
* Declares the associations PID_TYPE -> [DATASOURCE ID, NAME] whose records are considered enrichment for the graph.
* Their OpenAIRE ID is built from the declared PID type. They are merged with their corresponding record, identified by
* the same OpenAIRE id.
public static final Map<PidType, HashBiMap<String, String>> ENRICHMENT_PROVIDER = Maps.newHashMap();
static {
ENRICHMENT_PROVIDER.put(PidType.doi, HashBiMap.create());
public static Set<String> delegatedAuthorityDatasourceIds() {
.flatMap(m -> m.keySet().stream())
public static List<StructuredProperty> getPids(List<StructuredProperty> pid, KeyValue collectedFrom) {
return pidFromInstance(pid, collectedFrom, true).distinct().collect(Collectors.toList());
public static <T extends Result> String createDOIBoostIdentifier(T entity) {
if (entity == null)
return null;
StructuredProperty pid = null;
if (entity.getPid() != null) {
pid = entity
.filter(s -> s.getQualifier() != null && "doi".equalsIgnoreCase(s.getQualifier().getClassid()))
} else {
if (entity.getInstance() != null) {
pid = entity
.filter(i -> i.getPid() != null)
.flatMap(i -> i.getPid().stream())
if (pid != null)
return idFromPid(entity, pid, true);
return null;
* Creates an identifier from the most relevant PID (if available) provided by a known PID authority in the given
* entity T. Returns the entity's default ID when none of the PIDs meets the selection criteria.
* @param entity the entity providing PIDs and a default ID.
* @param <T> the specific entity type. Currently Organization and Result subclasses are supported.
* @param md5 indicates whether the PID value should be hashed or not.
* @return an identifier from the most relevant PID, otherwise the entity's default ID
public static <T extends Entity> String createIdentifier(T entity, boolean md5) {
checkArgument(StringUtils.isNoneBlank(entity.getId()), "missing entity identifier");
final Map<String, Set<StructuredProperty>> pids = extractPids(entity);
return pids
.min(new PidComparator<>(entity))
min -> Optional
p -> p
.sorted(new PidValueComparator())
.map(s -> idFromPid(entity, s, md5))
private static <T extends Entity> Map<String, Set<StructuredProperty>> extractPids(T entity) {
if (entity instanceof Result) {
return Optional
.ofNullable(((Result) entity).getInstance())
.orElse(new HashMap<>());
} else {
return entity
p -> p.getQualifier().getClassid(),
Collectors.mapping(p -> p, Collectors.toCollection(HashSet::new))));
private static Map<String, Set<StructuredProperty>> mapPids(List<Instance> instance) {
return instance
.map(i -> pidFromInstance(i.getPid(), i.getCollectedfrom(), false))
p -> p.getQualifier().getClassid(),
Collectors.mapping(p -> p, Collectors.toCollection(HashSet::new))));
private static Stream<StructuredProperty> pidFromInstance(List<StructuredProperty> pid, KeyValue collectedFrom,
boolean mapHandles) {
return Optional
pp -> pp
// filter away PIDs provided by a DS that is not considered an authority for the
// given PID Type
.filter(p -> shouldFilterPidByCriteria(collectedFrom, p, mapHandles))
.filter(p -> isNotFromDelegatedAuthority(collectedFrom, p))
private static boolean shouldFilterPidByCriteria(KeyValue collectedFrom, StructuredProperty p, boolean mapHandles) {
final PidType pType = PidType.tryValueOf(p.getQualifier().getClassid());
if (Objects.isNull(collectedFrom)) {
return false;
boolean isEnrich = Optional
enrich -> enrich.containsKey(collectedFrom.getKey())
|| enrich.containsValue(collectedFrom.getValue()))
boolean isAuthority = Optional
authorities -> authorities.containsKey(collectedFrom.getKey())
|| authorities.containsValue(collectedFrom.getValue()))
return (mapHandles && pType.equals(PidType.handle)) || isEnrich || isAuthority;
private static boolean isNotFromDelegatedAuthority(KeyValue collectedFrom, StructuredProperty p) {
final PidType pType = PidType.tryValueOf(p.getQualifier().getClassid());
final Map<String, String> da = DELEGATED_PID_AUTHORITY.get(pType);
if (Objects.isNull(da)) {
return true;
if (!da.containsKey(collectedFrom.getKey())) {
return true;
return StringUtils.contains(p.getValue(), da.get(collectedFrom.getKey()));
* @see {@link IdentifierFactory#createIdentifier(Entity, boolean)}
public static <T extends Entity> String createIdentifier(T entity) {
return createIdentifier(entity, true);
private static <T extends Entity> String idFromPid(T entity, StructuredProperty s, boolean md5) {
return idFromPid(ModelSupport.getIdPrefix(entity.getClass()), s.getQualifier().getClassid(), s.getValue(), md5);
public static String idFromPid(String numericPrefix, String pidType, String pidValue, boolean md5) {
return new StringBuilder()
.append(md5 ? ModelSupport.md5(pidValue) : pidValue)
// create the prefix (length = 12)
private static String createPrefix(String pidType) {
StringBuilder prefix = new StringBuilder(StringUtils.left(pidType, ID_PREFIX_LEN));
while (prefix.length() < ID_PREFIX_LEN) {
return prefix.substring(0, ID_PREFIX_LEN);
public static String createOpenaireId(
final int prefix,
final String originalId,
final boolean to_md5) {
if (StringUtils.isBlank(originalId)) {
return null;
} else if (to_md5) {
final String nsPrefix = StringUtils.substringBefore(originalId, "::");
final String rest = StringUtils.substringAfter(originalId, "::");
return String.format("%s|%s::%s", prefix, nsPrefix, ModelSupport.md5(rest));
} else {
return String.format("%s|%s", prefix, originalId);
public static String createOpenaireId(
final String type,
final String originalId,
final boolean to_md5) {
switch (type) {
case "datasource":
return createOpenaireId(10, originalId, to_md5);
case "organization":
return createOpenaireId(20, originalId, to_md5);
case "person":
return createOpenaireId(30, originalId, to_md5);
case "project":
return createOpenaireId(40, originalId, to_md5);
default:
return createOpenaireId(50, originalId, to_md5);
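The identifier layout used by idFromPid and createOpenaireId can be summarised as: numeric prefix, '|', a namespace prefix, '::', then the (optionally hashed) value. The following is a rough standalone sketch of that layout, not the class above. IdentifierSketch is a hypothetical name, the md5 helper is assumed to behave like ModelSupport.md5 (hex-encoded MD5 of the string), and padding the PID type with '_' is an assumption suggested by the "pidType_::id" wording in the Javadoc, since the loop body is not visible in this diff.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of the identifier layout; not the IdentifierFactory class itself.
class IdentifierSketch {

    static final String ID_SEPARATOR = "::";
    static final String ID_PREFIX_SEPARATOR = "|";
    static final int ID_PREFIX_LEN = 12;

    // Assumed equivalent of ModelSupport.md5: lowercase hex MD5 of the UTF-8 bytes.
    static String md5(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Truncates or pads the PID type to exactly 12 characters ('_' padding is an assumption).
    static String createPrefix(String pidType) {
        StringBuilder prefix = new StringBuilder(
            pidType.length() > ID_PREFIX_LEN ? pidType.substring(0, ID_PREFIX_LEN) : pidType);
        while (prefix.length() < ID_PREFIX_LEN) {
            prefix.append('_');
        }
        return prefix.substring(0, ID_PREFIX_LEN);
    }

    static String idFromPid(String numericPrefix, String pidType, String pidValue, boolean hash) {
        return numericPrefix + ID_PREFIX_SEPARATOR + createPrefix(pidType) + ID_SEPARATOR
            + (hash ? md5(pidValue) : pidValue);
    }

    static String createOpenaireId(int prefix, String originalId, boolean hash) {
        if (originalId == null || originalId.trim().isEmpty()) {
            return null;
        }
        if (!hash) {
            return String.format("%s|%s", prefix, originalId);
        }
        // Split on the first "::": the namespace is kept, the rest is hashed.
        int idx = originalId.indexOf(ID_SEPARATOR);
        String nsPrefix = idx < 0 ? originalId : originalId.substring(0, idx);
        String rest = idx < 0 ? "" : originalId.substring(idx + ID_SEPARATOR.length());
        return String.format("%s|%s::%s", prefix, nsPrefix, md5(rest));
    }
}
```

Under these assumptions, a DOI-based identifier starts with "50|", followed by "doi" padded to 12 characters, "::", and then either the raw DOI or its MD5 hash.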

@ -0,0 +1,104 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.beanutils.BeanUtilsBean;
public class MergeBeanUtils {
* Copies all properties from the sources to the destination. Null values are not copied, and nested objects are
* either cloned or copied into the existing object. The copy is recursive, but is guarded against infinite recursion.
* @param dest object to copy props into (will mutate)
* @param sources
* @param <T> dest
* @return
* @throws IllegalAccessException
* @throws InvocationTargetException
public static <T> T mergeIn(T dest, T... sources) {
// to avoid any chance of infinite recursion, limit each object to one instance at a time in the stack
final List<Object> lookingAt = new ArrayList<>();
BeanUtilsBean recursiveBeanUtils = new BeanUtilsBean() {
* Check if the class name is an internal one
* @param name
* @return
private boolean isInternal(String name) {
return name.startsWith("java.") || name.startsWith("javax.")
|| name.startsWith("com.sun.") || name.startsWith("oracle.");
* Override to ensure that we don't end up in infinite recursion
* @param dest
* @param orig
* @throws IllegalAccessException
* @throws InvocationTargetException
public void copyProperties(Object dest, Object orig)
throws IllegalAccessException, InvocationTargetException {
try {
// if we have an object in our list, that means we hit some sort of recursion, stop here.
if (lookingAt.stream().anyMatch(o -> o == dest)) {
return; // recursion detected
super.copyProperties(dest, orig);
} finally {
public void copyProperty(Object dest, String name, Object value)
throws IllegalAccessException, InvocationTargetException {
if ("resulttype".equals(name)) {
} else if (value != null) {
// don't copy over null values
// check whether the value is a POJO that we can clone using nested calls
if (!value.getClass().isPrimitive() && !value.getClass().isSynthetic()
&& !isInternal(value.getClass().getName())) {
try {
Object prop = super.getPropertyUtils().getProperty(dest, name);
// get the current value; if it's null, clone the incoming value and set the clone on dest
if (prop == null) {
super.setProperty(dest, name, super.cloneBean(value));
} else {
// get the destination value and then recursively call
copyProperties(prop, value);
} catch (NoSuchMethodException e) {
} catch (InstantiationException e) {
throw new RuntimeException("Nested property could not be cloned.", e);
} else {
super.copyProperty(dest, name, value);
for (Object source : sources) {
try {
recursiveBeanUtils.copyProperties(dest, source);
} catch (IllegalAccessException | InvocationTargetException e) {
throw new RuntimeException(e);
return dest;

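The recursion guard used by `MergeBeanUtils` (track the objects currently being copied, compare by identity, release in `finally`) can be sketched without Commons BeanUtils. This is a simplified illustration under stated assumptions: `MergeGuardSketch` and its `Node` bean are hypothetical, fields are accessed directly instead of through property descriptors, and the shallow copy stands in for `cloneBean`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class, for illustration only.
class MergeGuardSketch {

    // Hypothetical bean with one simple property and one nested property.
    static class Node {
        String label;
        Node next;

        Node(String label, Node next) {
            this.label = label;
            this.next = next;
        }
    }

    static Node mergeIn(Node dest, Node source) {
        copy(dest, source, new ArrayList<>());
        return dest;
    }

    private static void copy(Node dest, Node source, List<Object> lookingAt) {
        if (dest == null || source == null) {
            return;
        }
        // Identity comparison: the same destination already on the stack means recursion.
        for (Object o : lookingAt) {
            if (o == dest) {
                return;
            }
        }
        lookingAt.add(dest);
        try {
            // Null values in the source never overwrite the destination.
            if (source.label != null) {
                dest.label = source.label;
            }
            if (source.next != null) {
                if (dest.next == null) {
                    // Shallow copy, standing in for cloneBean.
                    dest.next = new Node(source.next.label, source.next.next);
                } else {
                    copy(dest.next, source.next, lookingAt);
                }
            }
        } finally {
            // dest is the last element here: nested calls have already removed theirs.
            lookingAt.remove(lookingAt.size() - 1);
        }
    }
}
```

With a self-referencing destination (`c.next = c`), the nested call finds `c` already in `lookingAt` and returns instead of recursing forever.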
@ -0,0 +1,903 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static com.google.common.base.Preconditions.checkArgument;
import static eu.dnetlib.dhp.schema.oaf.common.ModelSupport.isSubClass;
import static eu.dnetlib.dhp.schema.oaf.common.ModelSupport.sameClass;
import java.text.ParseException;
import java.util.*;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.tuple.ImmutablePair;
import org.apache.commons.lang3.tuple.Pair;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.common.AccessRightComparator;
import eu.dnetlib.dhp.schema.oaf.common.ModelSupport;
public class MergeUtils {
public static <T extends Oaf> T merge(final T left, final T right) {
return merge(left, right, false);
public static <T extends Oaf> T merge(final T left, final T right, boolean checkDelegatedAuthority) {
if (sameClass(left, right, Entity.class)) {
return mergeEntities(left, right, checkDelegatedAuthority);
} else if (sameClass(left, right, Relation.class)) {
return mergeRelation(left, right);
} else {
throw new RuntimeException(
String.format(
"MERGE_FROM_AND_GET incompatible types: %s, %s",
left.getClass().getCanonicalName(), right.getClass().getCanonicalName()));
private static <T extends Oaf> T mergeEntities(T left, T right, boolean checkDelegatedAuthority) {
if (sameClass(left, right, Result.class)) {
if (!left.getClass().equals(right.getClass()) || checkDelegatedAuthority) {
return mergeResultsOfDifferentTypes(left, right);
return mergeResult(left, right);
} else if (sameClass(left, right, Datasource.class)) {
return left;
} else if (sameClass(left, right, Organization.class)) {
return mergeOrganization(left, right);
} else if (sameClass(left, right, Project.class)) {
return mergeProject(left, right);
} else {
throw new RuntimeException(
String.format(
"MERGE_FROM_AND_GET incompatible types: %s, %s",
left.getClass().getCanonicalName(), right.getClass().getCanonicalName()));
* This method is used in the global result grouping phase. It checks whether one of the two results comes from a
* delegated authority and, in that case, prefers that version.
* Otherwise, it considers the resulttype priority order implemented in {@link ResultTypeComparator}
* and proceeds with the canonical property merging.
* @param left
* @param right
* @return
private static <T extends Oaf> T mergeResultsOfDifferentTypes(T left, T right) {
final boolean leftFromDelegatedAuthority = isFromDelegatedAuthority((Result) left);
final boolean rightFromDelegatedAuthority = isFromDelegatedAuthority((Result) right);
if (leftFromDelegatedAuthority && !rightFromDelegatedAuthority) {
return left;
if (!leftFromDelegatedAuthority && rightFromDelegatedAuthority) {
return right;
if (new ResultTypeComparator().compare((Result) left, (Result) right) < 0) {
return mergeResult(left, right);
} else {
return mergeResult(right, left);
* Internal utility that merges the common entity fields
* @param left
* @param right
* @param <T>
private static <T extends Oaf> T mergeEntityFields(T left, T right) {
final Entity enrich = (Entity) right;
final Entity mergedEntity = (Entity) left;
mergedEntity.setOriginalId(mergeLists(mergedEntity.getOriginalId(), enrich.getOriginalId()));
mergedEntity.setCollectedfrom(mergeLists(mergedEntity.getCollectedfrom(), enrich.getCollectedfrom()));
if (mergedEntity.getLastupdatetimestamp() == null && enrich.getLastupdatetimestamp() != null) {
} else if (mergedEntity.getLastupdatetimestamp() != null && enrich.getLastupdatetimestamp() != null) {
Long.max(mergedEntity.getLastupdatetimestamp(), enrich.getLastupdatetimestamp()));
mergedEntity.setPid(mergeLists(mergedEntity.getPid(), enrich.getPid()));
final int trustCompareResult = compareTrust(mergedEntity, enrich);
if (enrich.getDateofcollection() != null && trustCompareResult < 0)
if (enrich.getDateoftransformation() != null && trustCompareResult < 0)
mergedEntity.setMeasures(mergeLists(mergedEntity.getMeasures(), enrich.getMeasures()));
mergedEntity.setExtraInfo(mergeLists(mergedEntity.getExtraInfo(), enrich.getExtraInfo()));
return (T) mergedEntity;
public static <T extends Oaf> T mergeRelation(T left, T right) {
Relation original = (Relation) left;
Relation enrich = (Relation) right;
checkArgument(Objects.equals(original.getSource(), enrich.getSource()), "source ids must be equal");
checkArgument(Objects.equals(original.getTarget(), enrich.getTarget()), "target ids must be equal");
checkArgument(Objects.equals(original.getRelType(), enrich.getRelType()), "relType(s) must be equal");
Objects.equals(original.getSubRelType(), enrich.getSubRelType()), "subRelType(s) must be equal");
checkArgument(Objects.equals(original.getRelClass(), enrich.getRelClass()), "relClass(es) must be equal");
original.setProvenance(mergeLists(original.getProvenance(), enrich.getProvenance()));
original.setValidated(original.getValidated() || enrich.getValidated());
try {
original.setValidationDate(ModelSupport.oldest(original.getValidationDate(), enrich.getValidationDate()));
} catch (ParseException e) {
throw new IllegalArgumentException(String
"invalid validation date format in relation [s:%s, t:%s]: %s", original.getSource(),
return (T) original;
private static <T extends Oaf> T mergeResult(T left, T right) {
Result original = (Result) left;
Result enrich = (Result) right;
final Result mergedResult = mergeEntityFields(original, enrich);
if (StringUtils.isBlank(mergedResult.getProcessingchargeamount())) {
if (!isAnEnrichment(mergedResult) && !isAnEnrichment(enrich))
mergedResult.setInstance(mergeLists(mergedResult.getInstance(), enrich.getInstance()));
else {
final List<Instance> enrichmentInstances = isAnEnrichment(mergedResult) ? mergedResult.getInstance()
: enrich.getInstance();
final List<Instance> enrichedInstances = isAnEnrichment(mergedResult) ? enrich.getInstance()
: mergedResult.getInstance();
if (isAnEnrichment(mergedResult))
mergedResult.setInstance(enrichInstances(enrichedInstances, enrichmentInstances));
if (enrich.getBestaccessright() != null
&& new AccessRightComparator<>()
.compare(enrich.getBestaccessright(), mergedResult.getBestaccessright()) < 0)
final int trustCompareResult = compareTrust(mergedResult, enrich);
if (enrich.getResulttype() != null && trustCompareResult < 0)
if (enrich.getLanguage() != null && trustCompareResult < 0)
if (Objects.nonNull(enrich.getDateofacceptance())) {
if (Objects.isNull(mergedResult.getDateofacceptance()) || trustCompareResult < 0) {
mergedResult.setCountry(mergeLists(mergedResult.getCountry(), enrich.getCountry()));
mergedResult.setSubject(mergeLists(mergedResult.getSubject(), enrich.getSubject()));
if (enrich.getJournal() != null && trustCompareResult < 0)
// merge title lists: keep the main title with the higher trust, plus the distinct union of the other titles
StructuredProperty baseMainTitle = null;
if (mergedResult.getTitle() != null) {
baseMainTitle = getMainTitle(mergedResult.getTitle());
if (baseMainTitle != null) {
final StructuredProperty p = baseMainTitle;
.setTitle(mergedResult.getTitle().stream().filter(t -> t != p).collect(Collectors.toList()));
StructuredProperty newMainTitle = null;
if (enrich.getTitle() != null) {
newMainTitle = getMainTitle(enrich.getTitle());
if (newMainTitle != null) {
final StructuredProperty p = newMainTitle;
enrich.setTitle(enrich.getTitle().stream().filter(t -> t != p).collect(Collectors.toList()));
if (newMainTitle != null && trustCompareResult < 0) {
baseMainTitle = newMainTitle;
mergedResult.setTitle(mergeLists(mergedResult.getTitle(), enrich.getTitle()));
if (mergedResult.getTitle() != null && baseMainTitle != null) {
mergedResult.setRelevantdate(mergeLists(mergedResult.getRelevantdate(), enrich.getRelevantdate()));
mergedResult.setDescription(longestLists(mergedResult.getDescription(), enrich.getDescription()));
if (enrich.getPublisher() != null && trustCompareResult < 0)
if (enrich.getEmbargoenddate() != null && trustCompareResult < 0)
mergedResult.setSource(mergeLists(mergedResult.getSource(), enrich.getSource()));
mergedResult.setFulltext(mergeLists(mergedResult.getFulltext(), enrich.getFulltext()));
mergedResult.setFormat(mergeLists(mergedResult.getFormat(), enrich.getFormat()));
mergedResult.setContributor(mergeLists(mergedResult.getContributor(), enrich.getContributor()));
if (enrich.getResourcetype() != null)
mergedResult.setCoverage(mergeLists(mergedResult.getCoverage(), enrich.getCoverage()));
mergedResult.setContext(mergeLists(mergedResult.getContext(), enrich.getContext()));
.setExternalReference(mergeLists(mergedResult.getExternalReference(), enrich.getExternalReference()));
if (enrich.getOaiprovenance() != null && trustCompareResult < 0)
if (sameClass(mergedResult, enrich, Publication.class)) {
return (T) mergePublication(mergedResult, enrich);
if (sameClass(mergedResult, enrich, Dataset.class)) {
return (T) mergeDataset(mergedResult, enrich);
if (sameClass(mergedResult, enrich, OtherResearchProduct.class)) {
return (T) mergeORP(mergedResult, enrich);
if (sameClass(mergedResult, enrich, Software.class)) {
return (T) mergeSoftware(mergedResult, enrich);
mergeEntityDataInfo(original, enrich);
return (T) mergedResult;
private static <T extends Oaf> T mergeORP(T left, T right) {
final OtherResearchProduct original = (OtherResearchProduct) left;
final OtherResearchProduct enrich = (OtherResearchProduct) right;
original.setContactperson(mergeLists(original.getContactperson(), enrich.getContactperson()));
original.setContactgroup(mergeLists(original.getContactgroup(), enrich.getContactgroup()));
original.setTool(mergeLists(original.getTool(), enrich.getTool()));
mergeEntityDataInfo(original, enrich);
return (T) original;
private static <T extends Oaf> T mergeSoftware(T left, T right) {
final Software original = (Software) left;
final Software enrich = (Software) right;
.setDocumentationUrl(mergeLists(original.getDocumentationUrl(), enrich.getDocumentationUrl()));
enrich.getCodeRepositoryUrl() != null && compareTrust(original, enrich) < 0
? enrich.getCodeRepositoryUrl()
: original.getCodeRepositoryUrl());
enrich.getProgrammingLanguage() != null && compareTrust(original, enrich) < 0
? enrich.getProgrammingLanguage()
: original.getProgrammingLanguage());
return (T) original;
private static <T extends Oaf> T mergeDataset(T left, T right) {
Dataset original = (Dataset) left;
Dataset enrich = (Dataset) right;
enrich.getStoragedate() != null && compareTrust(original, enrich) < 0 ? enrich.getStoragedate()
: original.getStoragedate());
enrich.getDevice() != null && compareTrust(original, enrich) < 0 ? enrich.getDevice()
: original.getDevice());
enrich.getSize() != null && compareTrust(original, enrich) < 0 ? enrich.getSize()
: original.getSize());
enrich.getVersion() != null && compareTrust(original, enrich) < 0 ? enrich.getVersion()
: original.getVersion());
enrich.getLastmetadataupdate() != null && compareTrust(original, enrich) < 0
? enrich.getLastmetadataupdate()
: original.getLastmetadataupdate());
enrich.getMetadataversionnumber() != null && compareTrust(original, enrich) < 0
? enrich.getMetadataversionnumber()
: original.getMetadataversionnumber());
original.setGeolocation(mergeLists(original.getGeolocation(), enrich.getGeolocation()));
mergeEntityDataInfo(original, enrich);
private static <T extends Oaf> T mergePublication(T original, T enrich) {
// add publication specific fields.
mergeEntityDataInfo(original, enrich);
return original;
private static <T extends Oaf> T mergeOrganization(T left, T right) {
Organization original = (Organization) left;
Organization enrich = (Organization) right;
final Organization mergedOrganization = mergeEntityFields(original, enrich);
int ct = compareTrust(mergedOrganization, enrich);
enrich.getLegalshortname() != null && ct < 0
? enrich.getLegalshortname()
: mergedOrganization.getLegalshortname());
enrich.getLegalname() != null && ct < 0 ? enrich.getLegalname()
: mergedOrganization.getLegalname());
.setAlternativeNames(mergeLists(enrich.getAlternativeNames(), mergedOrganization.getAlternativeNames()));
enrich.getWebsiteurl() != null && ct < 0
? enrich.getWebsiteurl()
: mergedOrganization.getWebsiteurl());
enrich.getLogourl() != null && ct < 0
? enrich.getLogourl()
: mergedOrganization.getLogourl());
enrich.getEclegalbody() != null && ct < 0
? enrich.getEclegalbody()
: mergedOrganization.getEclegalbody());
enrich.getEclegalperson() != null && ct < 0
? enrich.getEclegalperson()
: mergedOrganization.getEclegalperson());
enrich.getEcnonprofit() != null && ct < 0
? enrich.getEcnonprofit()
: mergedOrganization.getEcnonprofit());
enrich.getEcresearchorganization() != null && ct < 0
? enrich.getEcresearchorganization()
: mergedOrganization.getEcresearchorganization());
enrich.getEchighereducation() != null && ct < 0
? enrich.getEchighereducation()
: mergedOrganization.getEchighereducation());
enrich.getEcinternationalorganizationeurinterests() != null && ct < 0
? enrich.getEcinternationalorganizationeurinterests()
: mergedOrganization.getEcinternationalorganizationeurinterests());
enrich.getEcinternationalorganization() != null && ct < 0
? enrich.getEcinternationalorganization()
: mergedOrganization.getEcinternationalorganization());
enrich.getEcenterprise() != null && ct < 0
? enrich.getEcenterprise()
: mergedOrganization.getEcenterprise());
enrich.getEcsmevalidated() != null && ct < 0
? enrich.getEcsmevalidated()
: mergedOrganization.getEcsmevalidated());
enrich.getEcnutscode() != null && ct < 0
? enrich.getEcnutscode()
: mergedOrganization.getEcnutscode());
enrich.getCountry() != null && ct < 0 ? enrich.getCountry()
: mergedOrganization.getCountry());
mergeEntityDataInfo(mergedOrganization, enrich);
public static <T extends Oaf> T mergeProject(T left, T right) {
Project original = (Project) left;
Project enrich = (Project) right;
final Project mergedProject = mergeEntityFields(original, enrich);
int ct = compareTrust(mergedProject, enrich);
enrich.getWebsiteurl() != null && ct < 0
? enrich.getWebsiteurl()
: mergedProject.getWebsiteurl());
mergedProject.setCode(enrich.getCode() != null && ct < 0 ? enrich.getCode() : mergedProject.getCode());
enrich.getAcronym() != null && ct < 0
? enrich.getAcronym()
: mergedProject.getAcronym());
enrich.getTitle() != null && ct < 0
? enrich.getTitle()
: mergedProject.getTitle());
enrich.getStartdate() != null && ct < 0
? enrich.getStartdate()
: mergedProject.getStartdate());
enrich.getEnddate() != null && ct < 0
? enrich.getEnddate()
: mergedProject.getEnddate());
enrich.getCallidentifier() != null && ct < 0
? enrich.getCallidentifier()
: mergedProject.getCallidentifier());
enrich.getKeywords() != null && ct < 0
? enrich.getKeywords()
: mergedProject.getKeywords());
enrich.getDuration() != null && ct < 0
? enrich.getDuration()
: mergedProject.getDuration());
enrich.getEcsc39() != null && ct < 0
? enrich.getEcsc39()
: mergedProject.getEcsc39());
enrich.getOamandatepublications() != null && ct < 0
? enrich.getOamandatepublications()
: mergedProject.getOamandatepublications());
enrich.getEcarticle29_3() != null && ct < 0
? enrich.getEcarticle29_3()
: mergedProject.getEcarticle29_3());
mergedProject.setSubjects(mergeLists(mergedProject.getSubjects(), enrich.getSubjects()));
mergedProject.setFundingtree(mergeLists(mergedProject.getFundingtree(), enrich.getFundingtree()));
enrich.getContracttype() != null && ct < 0
? enrich.getContracttype()
: mergedProject.getContracttype());
enrich.getOptional1() != null && ct < 0
? enrich.getOptional1()
: mergedProject.getOptional1());
enrich.getOptional2() != null && ct < 0
? enrich.getOptional2()
: mergedProject.getOptional2());
enrich.getJsonextrainfo() != null && ct < 0
? enrich.getJsonextrainfo()
: mergedProject.getJsonextrainfo());
enrich.getContactfullname() != null && ct < 0
? enrich.getContactfullname()
: mergedProject.getContactfullname());
enrich.getContactfax() != null && ct < 0
? enrich.getContactfax()
: mergedProject.getContactfax());
enrich.getContactphone() != null && ct < 0
? enrich.getContactphone()
: mergedProject.getContactphone());
enrich.getContactemail() != null && ct < 0
? enrich.getContactemail()
: mergedProject.getContactemail());
enrich.getSummary() != null && ct < 0
? enrich.getSummary()
: mergedProject.getSummary());
enrich.getCurrency() != null && ct < 0
? enrich.getCurrency()
: mergedProject.getCurrency());
if (enrich.getH2020topiccode() != null && StringUtils.isEmpty(mergedProject.getH2020topiccode())) {
mergeLists(mergedProject.getH2020classification(), enrich.getH2020classification()));
mergeEntityDataInfo(mergedProject, enrich);
private static <T extends Oaf> void mergeEntityDataInfo(T left, T right) {
Entity l = (Entity) left;
Entity r = (Entity) right;
other -> Optional
otherDataInfo -> Optional
.ifPresent(thisDataInfo -> {
if (compareTrust(r, other) < 0 || thisDataInfo.getInvisible()) {
* Gets main title.
* @param titles the titles
* @return the main title
private static StructuredProperty getMainTitle(List<StructuredProperty> titles) {
// TODO: check whether the list of titles contains more than one main title; in that case we should choose
// which main title to select from the list
for (StructuredProperty t : titles) {
if (t.getQualifier() != null && t.getQualifier().getClassid() != null)
if (t.getQualifier().getClassid().equals("main title"))
return t;
* Longest lists list.
* @param a the a
* @param b the b
* @return the list
public static List<String> longestLists(List<String> a, List<String> b) {
if (a == null || b == null)
return a == null ? b : a;
if (a.size() == b.size()) {
int msa = a
.filter(i -> i != null)
.map(i -> i.length())
int msb = b
.filter(i -> i != null)
.map(i -> i.length())
return msa > msb ? a : b;
* This main method applies the enrichment to the instances
* @param toEnrichInstances the instances that could be enriched
* @param enrichmentInstances the enrichment instances
* @return list of instances possibly enriched
private static List<Instance> enrichInstances(final List<Instance> toEnrichInstances,
final List<Instance> enrichmentInstances) {
final List<Instance> enrichmentResult = new ArrayList<>();
if (toEnrichInstances == null) {
return enrichmentResult;
if (enrichmentInstances == null) {
return enrichmentResult;
Map<String, Instance> ri = toInstanceMap(enrichmentInstances);
toEnrichInstances.forEach(i -> {
final List<Instance> e = findEnrichmentsByPID(i.getPid(), ri);
if (e != null && e.size() > 0) {
e.forEach(enr -> applyEnrichment(i, enr));
} else {
final List<Instance> a = findEnrichmentsByPID(i.getAlternateIdentifier(), ri);
if (a != null && a.size() > 0) {
a.forEach(enr -> applyEnrichment(i, enr));
return enrichmentResult;
* This method converts the list of instance enrichments
* into a Map where the key is the normalized identifier
* and the value is the instance itself
* @param ri the list of enrichment instances
* @return the result map
private static Map<String, Instance> toInstanceMap(final List<Instance> ri) {
.filter(i -> i.getPid() != null || i.getAlternateIdentifier() != null)
.flatMap(i -> {
final List<Pair<String, Instance>> result = new ArrayList<>();
if (i.getPid() != null)
.forEach(p -> result.add(new ImmutablePair<>(extractKeyFromPid(p), i)));
if (i.getAlternateIdentifier() != null)
.forEach(p -> result.add(new ImmutablePair<>(extractKeyFromPid(p), i)));
(a, b) -> a));
private static boolean isFromDelegatedAuthority(Result r) {
return Optional
instance -> instance
.filter(i -> Objects.nonNull(i.getCollectedfrom()))
.map(i -> i.getCollectedfrom().getKey())
.anyMatch(cfId -> IdentifierFactory.delegatedAuthorityDatasourceIds().contains(cfId)))
* Checks that a pid has a value and a qualifier with a classid.
* @param p the pid to check
* @return true if the pid is valid
private static boolean validPid(final StructuredProperty p) {
return p.getValue() != null && p.getQualifier() != null && p.getQualifier().getClassid() != null;
* Builds the lookup key of a pid, in the form classid::value, from its normalized value.
* @param pid the pid
* @return the key, or null if the pid is null
private static String extractKeyFromPid(final StructuredProperty pid) {
if (pid == null)
return null;
final StructuredProperty normalizedPid = CleaningFunctions.normalizePidValue(pid);
return String.format("%s::%s", normalizedPid.getQualifier().getClassid(), normalizedPid.getValue());
* This utility method finds the list of enrichment instances
* that match one or more PIDs in the input list
* @param pids the list of PIDs
* @param enrichments the map of enrichment instances, keyed by normalized pid
* @return the list
private static List<Instance> findEnrichmentsByPID(final List<StructuredProperty> pids,
final Map<String, Instance> enrichments) {
if (pids == null || enrichments == null)
return null;
return pids
* Checks whether the provenance action of an entity marks it as an enrichment.
* @param e the entity
* @return true if the entity is an enrichment
public static boolean isAnEnrichment(Entity e) {
return e.getDataInfo() != null &&
e.getDataInfo().getProvenanceaction() != null
&& ModelConstants.PROVENANCE_ENRICH.equalsIgnoreCase(e.getDataInfo().getProvenanceaction().getClassid());
* The enrichment sets a single-valued attribute only when
* it is missing in the current instance.
* The only repeatable field enriched is measures.
* @param currentInstance the current instance
* @param enrichment the enrichment instance
private static void applyEnrichment(final Instance currentInstance, final Instance enrichment) {
if (currentInstance == null || enrichment == null)
// ENRICH accessright
if (enrichment.getAccessright() != null && currentInstance.getAccessright() == null)
// ENRICH license
if (enrichment.getLicense() != null && currentInstance.getLicense() == null)
// ENRICH instanceType
if (enrichment.getInstancetype() != null && currentInstance.getInstancetype() == null)
// ENRICH hostedby
if (enrichment.getHostedby() != null && currentInstance.getHostedby() == null)
// ENRICH distributionlocation
if (enrichment.getDistributionlocation() != null && currentInstance.getDistributionlocation() == null)
// ENRICH collectedfrom
if (enrichment.getCollectedfrom() != null && currentInstance.getCollectedfrom() == null)
// ENRICH dateofacceptance
if (enrichment.getDateofacceptance() != null && currentInstance.getDateofacceptance() == null)
// ENRICH processingchargeamount
if (enrichment.getProcessingchargeamount() != null && currentInstance.getProcessingchargeamount() == null)
// ENRICH refereed
if (enrichment.getRefereed() != null && currentInstance.getRefereed() == null)
// TODO check the other Instance fields
return Arrays
private static int compareTrust(Entity a, Entity b) {
return Float

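The selection rule in `longestLists` above can be sketched as a standalone method. Several lines of the original are not visible in this excerpt, so two points are assumptions here: that `msa` and `msb` are the maximum string lengths of each list, and that lists of different sizes are resolved by returning the larger one. The class name `LongestListsSketch` is hypothetical.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical class, for illustration only.
class LongestListsSketch {

    // Assumption: the tie-break value is the length of the longest non-null string.
    static int maxLength(List<String> values) {
        return values
            .stream()
            .filter(Objects::nonNull)
            .mapToInt(String::length)
            .max()
            .orElse(0);
    }

    static List<String> longestLists(List<String> a, List<String> b) {
        if (a == null || b == null) {
            return a == null ? b : a;
        }
        if (a.size() == b.size()) {
            // Same number of items: prefer the list holding the longest string; b wins ties.
            return maxLength(a) > maxLength(b) ? a : b;
        }
        // Assumption: this branch is not shown in the excerpt.
        return a.size() > b.size() ? a : b;
    }
}
```

In `mergeResult` this rule is applied to the description lists, so the merged result keeps one whole list instead of the union of both.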
@ -0,0 +1,25 @@
package eu.dnetlib.dhp.schema.oaf.utils;
public class ModelHardLimits {
private ModelHardLimits() {
public static final String LAYOUT = "index";
public static final String INTERPRETATION = "openaire";
public static final String SEPARATOR = "-";
public static final int MAX_EXTERNAL_ENTITIES = 50;
public static final int MAX_AUTHORS = 200;
public static final int MAX_AUTHOR_FULLNAME_LENGTH = 1000;
public static final int MAX_TITLE_LENGTH = 5000;
public static final int MAX_TITLES = 10;
public static final int MAX_ABSTRACT_LENGTH = 150000;
public static final int MAX_INSTANCES = 10;
public static String getCollectionName(String format) {

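How such limits might be enforced on a record can be sketched as follows. This is an assumption about usage, not code from this commit: `HardLimitsSketch`, `truncate`, `cap`, and `limitTitles` are hypothetical names, and only the two constant values are taken from `ModelHardLimits`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical class, for illustration only.
class HardLimitsSketch {

    // Values copied from ModelHardLimits.
    static final int MAX_TITLE_LENGTH = 5000;
    static final int MAX_TITLES = 10;

    // Cuts a string down to at most maxLength characters.
    static String truncate(String s, int maxLength) {
        if (s == null || s.length() <= maxLength) {
            return s;
        }
        return s.substring(0, maxLength);
    }

    // Keeps at most the first maxItems elements of a list.
    static <T> List<T> cap(List<T> items, int maxItems) {
        if (items == null) {
            return null;
        }
        return items.stream().limit(maxItems).collect(Collectors.toList());
    }

    // Applies both limits to a list of titles.
    static List<String> limitTitles(List<String> titles) {
        if (titles == null) {
            return null;
        }
        return cap(titles, MAX_TITLES)
            .stream()
            .map(t -> truncate(t, MAX_TITLE_LENGTH))
            .collect(Collectors.toList());
    }
}
```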
@ -13,74 +13,14 @@ import;
import org.apache.commons.lang3.StringUtils; import org.apache.commons.lang3.StringUtils;
import eu.dnetlib.dhp.schema.common.AccessRightComparator;
import eu.dnetlib.dhp.schema.common.ModelSupport;
import eu.dnetlib.dhp.schema.oaf.*; import eu.dnetlib.dhp.schema.oaf.*;
import eu.dnetlib.dhp.schema.oaf.common.AccessRightComparator;
public class OafMapperUtils { public class OafMapperUtils {
private OafMapperUtils() { private OafMapperUtils() {
} }
public static Oaf merge(final Oaf left, final Oaf right) {
if (ModelSupport.isSubClass(left, OafEntity.class)) {
return mergeEntities((OafEntity) left, (OafEntity) right);
} else if (ModelSupport.isSubClass(left, Relation.class)) {
((Relation) left).mergeFrom((Relation) right);
} else {
throw new IllegalArgumentException("invalid Oaf type:" + left.getClass().getCanonicalName());
return left;
public static OafEntity mergeEntities(OafEntity left, OafEntity right) {
if (ModelSupport.isSubClass(left, Result.class)) {
return mergeResults((Result) left, (Result) right);
} else if (ModelSupport.isSubClass(left, Datasource.class)) {
} else if (ModelSupport.isSubClass(left, Organization.class)) {
} else if (ModelSupport.isSubClass(left, Project.class)) {
} else {
throw new IllegalArgumentException("invalid OafEntity subtype:" + left.getClass().getCanonicalName());
return left;
public static Result mergeResults(Result left, Result right) {
final boolean leftFromDelegatedAuthority = isFromDelegatedAuthority(left);
final boolean rightFromDelegatedAuthority = isFromDelegatedAuthority(right);
if (leftFromDelegatedAuthority && !rightFromDelegatedAuthority) {
return left;
if (!leftFromDelegatedAuthority && rightFromDelegatedAuthority) {
return right;
if (new ResultTypeComparator().compare(left, right) < 0) {
return left;
} else {
return right;
private static boolean isFromDelegatedAuthority(Result r) {
return Optional
instance -> instance
.filter(i -> Objects.nonNull(i.getCollectedfrom()))
.map(i -> i.getCollectedfrom().getKey())
public static KeyValue keyValue(final String k, final String v) { public static KeyValue keyValue(final String k, final String v) {
final KeyValue kv = new KeyValue(); final KeyValue kv = new KeyValue();
kv.setKey(k); kv.setKey(k);
@ -100,26 +40,6 @@ public class OafMapperUtils {
return list; return list;
} }
public static <T> Field<T> field(final T value, final DataInfo info) {
if (value == null || StringUtils.isBlank(value.toString())) {
return null;
final Field<T> field = new Field<>();
return field;
public static List<Field<String>> listFields(final DataInfo info, final String... values) {
return Arrays
.map(v -> field(v, info))
public static <T> List<T> listValues(Array values) throws SQLException { public static <T> List<T> listValues(Array values) throws SQLException {
if (Objects.isNull(values)) { if (Objects.isNull(values)) {
return null; return null;
@ -131,60 +51,26 @@ public class OafMapperUtils {
.collect(Collectors.toList()); .collect(Collectors.toList());
} }
public static List<Field<String>> listFields(final DataInfo info, final List<String> values) { public static Qualifier unknown(final String schemeid) {
return values return qualifier(UNKNOWN, "Unknown", schemeid);
.map(v -> field(v, info))
} }
public static InstanceTypeMapping instanceTypeMapping(String originalType, String code, String label, public static AccessRight accessRight(
String vocabularyName) { final String classid,
final InstanceTypeMapping m = new InstanceTypeMapping(); final String classname,
m.setVocabularyName(vocabularyName); final String schemeid) {
m.setOriginalType(originalType); return accessRight(classid, classname, schemeid, null);
return m;
return instanceTypeMapping(originalType, term.getClassid(), term.getClassname(), term.getSchemeid());
public static InstanceTypeMapping instanceTypeMapping(String originalType) {
return instanceTypeMapping(originalType, null, null, null);
public static InstanceTypeMapping instanceTypeMapping(String originalType, String vocabularyName) {
return instanceTypeMapping(originalType, null, null, vocabularyName);
public static Qualifier unknown(final String schemeid, final String schemename) {
return qualifier(UNKNOWN, "Unknown", schemeid, schemename);
} }
public static AccessRight accessRight( public static AccessRight accessRight(
final String classid, final String classid,
final String classname, final String classname,
final String schemeid, final String schemeid,
final String schemename) {
return accessRight(classid, classname, schemeid, schemename, null);
public static AccessRight accessRight(
final String classid,
final String classname,
final String schemeid,
final String schemename,
final OpenAccessRoute openAccessRoute) { final OpenAccessRoute openAccessRoute) {
final AccessRight accessRight = new AccessRight(); final AccessRight accessRight = new AccessRight();
accessRight.setClassid(classid); accessRight.setClassid(classid);
accessRight.setClassname(classname); accessRight.setClassname(classname);
accessRight.setSchemeid(schemeid); accessRight.setSchemeid(schemeid);
accessRight.setOpenAccessRoute(openAccessRoute); accessRight.setOpenAccessRoute(openAccessRoute);
return accessRight; return accessRight;
} }
public static Qualifier qualifier( public static Qualifier qualifier(
final String classid, final String classid,
final String classname, final String classname,
final String schemeid, final String schemeid) {
final String schemename) {
final Qualifier q = new Qualifier(); final Qualifier q = new Qualifier();
q.setClassid(classid); q.setClassid(classid);
q.setClassname(classname); q.setClassname(classname);
q.setSchemeid(schemeid); q.setSchemeid(schemeid);
return q; return q;
@ -207,7 +91,6 @@ public class OafMapperUtils {
q.setClassid(qualifier.getClassid()); q.setClassid(qualifier.getClassid());
q.setClassname(qualifier.getClassname()); q.setClassname(qualifier.getClassname());
q.setSchemeid(qualifier.getSchemeid()); q.setSchemeid(qualifier.getSchemeid());
return q; return q;
@ -216,21 +99,18 @@ public class OafMapperUtils {
final String classid, final String classid,
final String classname, final String classname,
final String schemeid, final String schemeid,
final String schemename,
final DataInfo dataInfo) { final DataInfo dataInfo) {
return subject(value, qualifier(classid, classname, schemeid, schemename), dataInfo); return subject(value, qualifier(classid, classname, schemeid), dataInfo);
} }
public static StructuredProperty structuredProperty( public static StructuredProperty structuredProperty(
final String value, final String value,
final String classid, final String classid,
final String classname, final String classname,
final String schemeid, final String schemeid) {
final String schemename,
final DataInfo dataInfo) {
return structuredProperty(value, qualifier(classid, classname, schemeid, schemename), dataInfo); return structuredProperty(value, qualifier(classid, classname, schemeid));
public static Subject subject( public static Subject subject(
@ -249,18 +129,57 @@ public class OafMapperUtils {
public static StructuredProperty structuredProperty( public static StructuredProperty structuredProperty(
final String value, final String value,
final Qualifier qualifier, final Qualifier qualifier) {
final DataInfo dataInfo) {
if (value == null) { if (value == null) {
return null; return null;
} }
final StructuredProperty sp = new StructuredProperty(); final StructuredProperty sp = new StructuredProperty();
sp.setValue(value); sp.setValue(value);
sp.setQualifier(qualifier); sp.setQualifier(qualifier);
} }
public static Publisher publisher(final String name) {
final Publisher p = new Publisher();
return p;
public static License license(final String url) {
final License l = new License();
return l;
public static AuthorPid authorPid(
final String value,
final Qualifier qualifier,
final DataInfo dataInfo) {
if (value == null) {
return null;
final AuthorPid ap = new AuthorPid();
return ap;
public static AuthorPid authorPid(
final String value,
final String classid,
final String schemeid,
final DataInfo dataInfo) {
if (value == null) {
return null;
final AuthorPid ap = new AuthorPid();
ap.setQualifier(qualifier(classid, classid, schemeid));
public static ExtraInfo extraInfo( public static ExtraInfo extraInfo(
final String name, final String name,
final String value, final String value,
@ -302,8 +221,7 @@ public class OafMapperUtils {
final String name, final String name,
final String issnPrinted, final String issnPrinted,
final String issnOnline, final String issnOnline,
final String issnLinking, final String issnLinking) {
final DataInfo dataInfo) {
return hasIssn(issnPrinted, issnOnline, issnLinking) ? journal( return hasIssn(issnPrinted, issnOnline, issnLinking) ? journal(
name, name,
@ -316,8 +234,7 @@ public class OafMapperUtils {
null, null,
null, null,
null, null,
null, null) : null;
dataInfo) : null;
} }
public static Journal journal( public static Journal journal(
@ -331,8 +248,7 @@ public class OafMapperUtils {
final String vol, final String vol,
final String edition, final String edition,
final String conferenceplace, final String conferenceplace,
final String conferencedate, final String conferencedate) {
final DataInfo dataInfo) {
if (StringUtils.isNotBlank(name) || hasIssn(issnPrinted, issnOnline, issnLinking)) { if (StringUtils.isNotBlank(name) || hasIssn(issnPrinted, issnOnline, issnLinking)) {
final Journal j = new Journal(); final Journal j = new Journal();
@ -347,7 +263,6 @@ public class OafMapperUtils {
j.setEdition(edition); j.setEdition(edition);
j.setConferenceplace(conferenceplace); j.setConferenceplace(conferenceplace);
return j; return j;
} else { } else {
return null; return null;
@ -361,53 +276,33 @@ public class OafMapperUtils {
} }
public static DataInfo dataInfo( public static DataInfo dataInfo(
final Boolean deletedbyinference, final float trust,
final String inferenceprovenance, final String inferenceprovenance,
final Boolean inferred, final boolean inferred,
final Boolean invisible, final Qualifier provenanceaction) {
final Qualifier provenanceaction,
final String trust) {
final DataInfo d = new DataInfo(); final DataInfo d = new DataInfo();
d.setDeletedbyinference(deletedbyinference); d.setTrust(trust);
d.setInferenceprovenance(inferenceprovenance); d.setInferenceprovenance(inferenceprovenance);
d.setInferred(inferred); d.setInferred(inferred);
d.setProvenanceaction(provenanceaction); d.setProvenanceaction(provenanceaction);
return d; return d;
} }
public static String createOpenaireId( public static EntityDataInfo dataInfo(
final int prefix, final boolean invisible,
final String originalId, final boolean deletedbyinference,
final boolean to_md5) { final float trust,
if (StringUtils.isBlank(originalId)) { final String inferenceprovenance,
return null; final boolean inferred,
} else if (to_md5) { final Qualifier provenanceaction) {
final String nsPrefix = StringUtils.substringBefore(originalId, "::"); final EntityDataInfo d = new EntityDataInfo();
final String rest = StringUtils.substringAfter(originalId, "::"); d.setTrust(trust);
return String.format("%s|%s::%s", prefix, nsPrefix, IdentifierFactory.md5(rest)); d.setInvisible(invisible);
} else { d.setDeletedbyinference(deletedbyinference);
return String.format("%s|%s", prefix, originalId); d.setInferenceprovenance(inferenceprovenance);
} d.setInferred(inferred);
} d.setProvenanceaction(provenanceaction);
return d;
final String type,
final String originalId,
final boolean to_md5) {
switch (type) {
case "datasource":
return createOpenaireId(10, originalId, to_md5);
case "organization":
return createOpenaireId(20, originalId, to_md5);
case "person":
return createOpenaireId(30, originalId, to_md5);
case "project":
return createOpenaireId(40, originalId, to_md5);
} }
public static String asString(final Object o) { public static String asString(final Object o) {
@ -443,71 +338,77 @@ public class OafMapperUtils {
if (StringUtils.isBlank(rights.getSchemeid())) { if (StringUtils.isBlank(rights.getSchemeid())) {
rights.setSchemeid(DNET_ACCESS_MODES); rights.setSchemeid(DNET_ACCESS_MODES);
} }
if (StringUtils.isBlank(rights.getSchemename())) {
} }
return null; return null;
} }
public static KeyValue newKeyValueInstance(String key, String value, DataInfo dataInfo) {
KeyValue kv = new KeyValue();
return kv;
public static Measure newMeasureInstance(String id, String value, String key, DataInfo dataInfo) { public static Measure newMeasureInstance(String id, String value, String key, DataInfo dataInfo) {
Measure m = new Measure(); Measure m = new Measure();
m.setId(id); m.setId(id);
m.setUnit(Arrays.asList(newKeyValueInstance(key, value, dataInfo))); m.setUnit(Arrays.asList(unit(key, value, dataInfo)));
return m; return m;
} }
public static MeasureUnit unit(String key, String value, DataInfo dataInfo) {
MeasureUnit unit = new MeasureUnit();
public static Relation getRelation(final String source, public static Relation getRelation(final String source,
final String target, final String target,
final String relType, final Relation.RELTYPE relType,
final String subRelType, final Relation.SUBRELTYPE subRelType,
final String relClass, final Relation.RELCLASS relClass,
final OafEntity entity) { final Entity entity) {
return getRelation(source, target, relType, subRelType, relClass, entity, null); return getRelation(source, target, relType, subRelType, relClass, entity, null);
public static Relation getRelation(final String source, public static Relation getRelation(final String source,
final String target, final String target,
final String relType, final Relation.RELTYPE relType,
final String subRelType, final Relation.SUBRELTYPE subRelType,
final String relClass, final Relation.RELCLASS relClass,
final OafEntity entity, final Entity entity,
final String validationDate) { final String validationDate) {
final List<Provenance> provenance = getProvenance(
entity.getCollectedfrom(), fromEntityDataInfo(entity.getDataInfo()));
return getRelation( return getRelation(
source, target, relType, subRelType, relClass, entity.getCollectedfrom(), entity.getDataInfo(), source, target, relType, subRelType, relClass, provenance, validationDate, null);
entity.getLastupdatetimestamp(), validationDate, null);
} }
public static Relation getRelation(final String source, public static Relation getRelation(final String source,
final String target, final String target,
final String relType, final Relation.RELTYPE relType,
final String subRelType, final Relation.SUBRELTYPE subRelType,
final String relClass, final Relation.RELCLASS relClass,
final List<KeyValue> collectedfrom, final List<Provenance> provenance) {
final DataInfo dataInfo,
final Long lastupdatetimestamp) {
return getRelation( return getRelation(
source, target, relType, subRelType, relClass, collectedfrom, dataInfo, lastupdatetimestamp, null, null); source, target, relType, subRelType, relClass, provenance, null, null);
public static Relation getRelation(final String source, public static Relation getRelation(final String source,
final String target, final String target,
final String relType, final Relation.RELTYPE relType,
final String subRelType, final Relation.SUBRELTYPE subRelType,
final String relClass, final Relation.RELCLASS relClass,
final List<KeyValue> collectedfrom, final List<Provenance> provenance,
final DataInfo dataInfo, final List<KeyValue> properties) {
final Long lastupdatetimestamp, return getRelation(
source, target, relType, subRelType, relClass, provenance, null, properties);
public static Relation getRelation(final String source,
final String target,
final Relation.RELTYPE relType,
final Relation.SUBRELTYPE subRelType,
final Relation.RELCLASS relClass,
final List<Provenance> provenance,
final String validationDate, final String validationDate,
final List<KeyValue> properties) { final List<KeyValue> properties) {
final Relation rel = new Relation(); final Relation rel = new Relation();
@ -516,15 +417,27 @@ public class OafMapperUtils {
rel.setRelClass(relClass); rel.setRelClass(relClass);
rel.setSource(source); rel.setSource(source);
rel.setTarget(target); rel.setTarget(target);
rel.setCollectedfrom(collectedfrom); rel.setProvenance(provenance);
rel.setValidated(StringUtils.isNotBlank(validationDate)); rel.setValidated(StringUtils.isNotBlank(validationDate));
rel.setValidationDate(StringUtils.isNotBlank(validationDate) ? validationDate : null); rel.setValidationDate(StringUtils.isNotBlank(validationDate) ? validationDate : null);
rel.setProperties(properties); rel.setProperties(properties);
return rel; return rel;
public static List<Provenance> getProvenance(final List<KeyValue> collectedfrom, final DataInfo dataInfo) {
return collectedfrom
.map(cf -> getProvenance(cf, dataInfo))
public static Provenance getProvenance(final KeyValue collectedfrom, final DataInfo dataInfo) {
final Provenance prov = new Provenance();
return prov;
public static String getProvenance(DataInfo dataInfo) { public static String getProvenance(DataInfo dataInfo) {
return Optional return Optional
.ofNullable(dataInfo) .ofNullable(dataInfo)
@ -535,4 +448,13 @@ public class OafMapperUtils {
.orElse("")) .orElse(""))
.orElse(""); .orElse("");
public static DataInfo fromEntityDataInfo(EntityDataInfo entityDataInfo) {
DataInfo dataInfo = new DataInfo();
return dataInfo;
} }

@ -0,0 +1,38 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Comparator;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
public class OrganizationPidComparator implements Comparator<StructuredProperty> {
public int compare(StructuredProperty left, StructuredProperty right) {
PidType lClass = PidType.tryValueOf(left.getQualifier().getClassid());
PidType rClass = PidType.tryValueOf(right.getQualifier().getClassid());
if (lClass.equals(PidType.openorgs))
return -1;
if (rClass.equals(PidType.openorgs))
return 1;
if (lClass.equals(PidType.GRID))
return -1;
if (rClass.equals(PidType.GRID))
return 1;
if (lClass.equals(PidType.mag_id))
return -1;
if (rClass.equals(PidType.mag_id))
return 1;
if (lClass.equals(PidType.urn))
return -1;
if (rClass.equals(PidType.urn))
return 1;
return 0;
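
Review note: the cascade of checks above encodes a fixed priority order (openorgs, GRID, mag_id, urn). A minimal sketch of the same ordering expressed as a rank lookup follows. The enum `OrgPid` and the class `RankedPidOrder` are hypothetical stand-ins, not part of this change set. One difference to be aware of: here two values of the same type compare as 0, whereas the cascade above returns -1 as soon as the left side matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the project's PidType; only the constant names come from the diff.
enum OrgPid { openorgs, GRID, mag_id, urn, original }

class RankedPidOrder {

    // Priority order read off the if-cascade: earlier entries sort first.
    static final List<OrgPid> PRIORITY = Arrays.asList(OrgPid.openorgs, OrgPid.GRID, OrgPid.mag_id, OrgPid.urn);

    // Types outside the priority list all share the lowest rank.
    static int rank(OrgPid t) {
        int i = PRIORITY.indexOf(t);
        return i >= 0 ? i : PRIORITY.size();
    }

    static final Comparator<OrgPid> BY_RANK = Comparator.comparingInt(RankedPidOrder::rank);

    // Returns a sorted copy of the given types.
    static List<OrgPid> sorted(OrgPid... types) {
        List<OrgPid> l = new ArrayList<>(Arrays.asList(types));
        l.sort(BY_RANK);
        return l;
    }

    public static void main(String[] args) {
        // prints [openorgs, GRID, urn, original]
        System.out.println(sorted(OrgPid.urn, OrgPid.original, OrgPid.GRID, OrgPid.openorgs));
    }
}
```

Keeping the order in one list means a new PID type can be added by editing `PRIORITY` rather than adding another pair of `if` statements.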

@ -1,21 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PICCleaningRule {
public static final Pattern PATTERN = Pattern.compile("\\d{9}");
public static String clean(final String pic) {
Matcher m = PATTERN.matcher(pic);
if (m.find()) {
} else {
return "";

@ -0,0 +1,8 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.HashMap;
import java.util.HashSet;
public class PidBlacklist extends HashMap<String, HashSet<String>> {

@ -0,0 +1,40 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import com.fasterxml.jackson.databind.ObjectMapper;
public class PidBlacklistProvider {
private static final PidBlacklist blacklist;
static {
try {
String json = IOUtils.toString(IdentifierFactory.class.getResourceAsStream("pid_blacklist.json"));
blacklist = new ObjectMapper().readValue(json, PidBlacklist.class);
} catch (IOException e) {
throw new RuntimeException(e);
public static PidBlacklist getBlacklist() {
return blacklist;
public static Set<String> getBlacklist(String pidType) {
return Optional
.orElse(new HashSet<>());
private PidBlacklistProvider() {
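
Review note: the body of `getBlacklist(String pidType)` is only partly visible in this diff. The sketch below shows one way such a lookup could behave, assuming it wraps a map lookup in `Optional` and falls back to an empty set. The class `BlacklistLookupDemo` and its sample entry are hypothetical; the real data comes from `pid_blacklist.json`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class BlacklistLookupDemo {

    // Hypothetical in-memory blacklist keyed by PID type.
    static final Map<String, HashSet<String>> BLACKLIST = new HashMap<>();

    static {
        BLACKLIST.put("doi", new HashSet<>(Arrays.asList("10.0000/placeholder")));
    }

    // Assumption: an unknown PID type yields an empty set rather than null.
    static Set<String> getBlacklist(String pidType) {
        return Optional
            .ofNullable(BLACKLIST.get(pidType))
            .orElse(new HashSet<>());
    }

    public static void main(String[] args) {
        System.out.println(getBlacklist("doi").contains("10.0000/placeholder")); // prints true
        System.out.println(getBlacklist("pmid").isEmpty());                      // prints true
    }
}
```

Returning an empty set keeps callers free of null checks when they test whether a PID value is blacklisted.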

@ -1,62 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Optional;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
public class PidCleaner {
* Utility method that normalises PID values on a per-type basis.
* @param pid the PID whose value will be normalised.
* @return the PID containing the normalised value.
public static StructuredProperty normalizePidValue(StructuredProperty pid) {
return pid;
public static String normalizePidValue(String pidType, String pidValue) {
String value = Optional
.orElseThrow(() -> new IllegalArgumentException("PID value cannot be empty"));
switch (pidType) {
// TODO add cleaning for more PID types as needed
// Result
case "doi":
return DoiCleaningRule.clean(value);
case "pmid":
return PmidCleaningRule.clean(value);
case "pmc":
return PmcCleaningRule.clean(value);
case "handle":
case "arXiv":
return value;
// Organization
case "GRID":
return GridCleaningRule.clean(value);
case "ISNI":
return ISNICleaningRule.clean(value);
case "ROR":
return RorCleaningRule.clean(value);
case "PIC":
return PICCleaningRule.clean(value);
case "FundRef":
return FundRefCleaningRule.clean(value);
return value;

@ -0,0 +1,48 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Comparator;
import eu.dnetlib.dhp.schema.oaf.Entity;
import eu.dnetlib.dhp.schema.oaf.Organization;
import eu.dnetlib.dhp.schema.oaf.Result;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
import eu.dnetlib.dhp.schema.oaf.common.ModelSupport;
public class PidComparator<T extends Entity> implements Comparator<StructuredProperty> {
private final T entity;
public PidComparator(T entity) {
this.entity = entity;
public int compare(StructuredProperty left, StructuredProperty right) {
if (left == null && right == null)
return 0;
if (left == null)
return 1;
if (right == null)
return -1;
if (ModelSupport.isSubClass(entity, Result.class)) {
return compareResultPids(left, right);
if (ModelSupport.isSubClass(entity, Organization.class)) {
return compareOrganizationtPids(left, right);
// Else (but unlikely), lexicographical ordering will do.
return left.getQualifier().getClassid().compareTo(right.getQualifier().getClassid());
private int compareResultPids(StructuredProperty left, StructuredProperty right) {
return new ResultPidComparator().compare(left, right);
private int compareOrganizationtPids(StructuredProperty left, StructuredProperty right) {
return new OrganizationPidComparator().compare(left, right);

@ -0,0 +1,79 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import org.apache.commons.lang3.EnumUtils;
public enum PidType {
* The DOI syntax shall be made up of a DOI prefix and a DOI suffix separated by a forward slash.
* There is no defined limit on the length of the DOI name, or of the DOI prefix or DOI suffix.
* The DOI name is case-insensitive and can incorporate any printable characters from the legal graphic characters
* of Unicode. Further constraints on character use (e.g. use of language-specific alphanumeric characters) can be
* defined for an application by the ISO 26324 Registration Authority.
* DOI prefix: The DOI prefix shall be composed of a directory indicator followed by a registrant code.
* These two components shall be separated by a full stop (period). The directory indicator shall be "10" and
* distinguishes the entire set of character strings (prefix and suffix) as digital object identifiers within the
* resolution system.
* Registrant code: The second element of the DOI prefix shall be the registrant code. The registrant code is a
* unique string assigned to a registrant.
* DOI suffix: The DOI suffix shall consist of a character string of any length chosen by the registrant.
* Each suffix shall be unique to the prefix element that precedes it. The unique suffix can be a sequential number,
* or it might incorporate an identifier generated from or based on another system used by the registrant
* (e.g. ISAN, ISBN, ISRC, ISSN, ISTC, ISNI; in such cases, a preferred construction for such a suffix can be
* specified, as in Example 1).
* Source:
* PubMed Unique Identifier (PMID)
* This field is a 1-to-8 digit accession number with no leading zeros. It is present on all records and is the
* accession number for managing and disseminating records. PMIDs are not reused after records are deleted.
* Beginning in February 2012 PMIDs include extensions following a decimal point to account for article versions
* (e.g., 21804956.2). All citations are considered version 1 until replaced. The extended PMID is not displayed
* on the MEDLINE format.
* View the citation in abstract format in PubMed to access additional versions when available (see the article in
* the Jan-Feb 2012 NLM Technical Bulletin).
* Source:
* This field contains the unique identifier for the cited article in PubMed Central. The identifier begins with the
* prefix PMC.
* Source:
pmc, handle, arXiv, nct, pdb, w3id,
// Organization
openorgs, corda, corda_h2020, GRID, mag_id, urn,
// Used by dedup
undefined, original;
public static boolean isValid(String type) {
return EnumUtils.isValidEnum(PidType.class, type);
public static PidType tryValueOf(String s) {
try {
return PidType.valueOf(s);
} catch (Exception e) {
return PidType.original;
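
Review note: `tryValueOf` above falls back to `original` whenever the name does not match an enum constant. The sketch below reproduces that shape on a hypothetical enum `SamplePid` to make the consequences visible: `Enum.valueOf` is case-sensitive, so "DOI" does not resolve to `doi`, and a null name also ends up in the fallback.

```java
// Hypothetical enum; only the fallback idea is taken from the diff.
enum SamplePid {
    doi, pmid, original;

    // Unknown or null names fall back to `original`.
    static SamplePid tryValueOf(String s) {
        try {
            return SamplePid.valueOf(s);
        } catch (IllegalArgumentException | NullPointerException e) {
            return original;
        }
    }
}

class TryValueOfDemo {
    public static void main(String[] args) {
        System.out.println(SamplePid.tryValueOf("doi"));  // prints doi
        System.out.println(SamplePid.tryValueOf("DOI"));  // prints original
        System.out.println(SamplePid.tryValueOf(null));   // prints original
    }
}
```

Since the comparators in this change set call `tryValueOf` on `getClassid()`, any class id with unexpected casing would be ranked as `original`.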

@ -0,0 +1,33 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Comparator;
import java.util.Optional;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
public class PidValueComparator implements Comparator<StructuredProperty> {
public int compare(StructuredProperty left, StructuredProperty right) {
if (left == null && right == null)
return 0;
if (left == null)
return 1;
if (right == null)
return -1;
StructuredProperty l = CleaningFunctions.normalizePidValue(left);
StructuredProperty r = CleaningFunctions.normalizePidValue(right);
return Optional
lv -> Optional
.map(rv -> lv.compareTo(rv))
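
Review note: the three null checks at the top of `compare` sort null values last. As a sketch, the same behaviour is available from `Comparator.nullsLast`. The example uses plain strings in place of `StructuredProperty`, and the class name `NullsLastDemo` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class NullsLastDemo {

    // nullsLast mirrors the null checks: both null -> 0, left null -> 1, right null -> -1.
    static final Comparator<String> BY_VALUE = Comparator.nullsLast(Comparator.<String>naturalOrder());

    // Returns a sorted copy of the given values.
    static List<String> sorted(String... values) {
        List<String> l = new ArrayList<>(Arrays.asList(values));
        l.sort(BY_VALUE);
        return l;
    }

    public static void main(String[] args) {
        // prints [10.5281/a, 10.5281/b, null]
        System.out.println(sorted("10.5281/b", null, "10.5281/a"));
    }
}
```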

@ -1,24 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PmcCleaningRule {
public static final Pattern PATTERN = Pattern.compile("PMC\\d{1,8}");
public static String clean(String pmc) {
String s = pmc
.replaceAll("\\s", "")
final Matcher m = PATTERN.matcher(s);
if (m.find()) {
return "";

@ -1,25 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PmidCleaningRule {
public static final Pattern PATTERN = Pattern.compile("0*(\\d{1,8})");
public static String clean(String pmid) {
String s = pmid
.replaceAll("\\s", "");
final Matcher m = PATTERN.matcher(s);
if (m.find()) {
return "";
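
Review note: the removed rule relies on the pattern `0*(\d{1,8})` to drop leading zeros from a PMID. The sketch below exercises that pattern in isolation. The return of capture group 1 is an assumption, since that line is not visible in the diff, and the class name `PmidPatternDemo` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PmidPatternDemo {

    // Pattern copied from the removed PmidCleaningRule.
    static final Pattern PATTERN = Pattern.compile("0*(\\d{1,8})");

    // Assumption: on a match, capture group 1 is the cleaned value; otherwise an empty string.
    static String clean(String pmid) {
        String s = pmid.replaceAll("\\s", "");
        Matcher m = PATTERN.matcher(s);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(clean(" 0002 1804956 "));  // prints 21804956
        System.out.println(clean("n/a").isEmpty());   // prints true
    }
}
```

Because `find()` is used rather than `matches()`, an input with more than eight significant digits is truncated to its first eight under this sketch instead of being rejected.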

@ -0,0 +1,53 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Comparator;
import eu.dnetlib.dhp.schema.oaf.StructuredProperty;
public class ResultPidComparator implements Comparator<StructuredProperty> {
public int compare(StructuredProperty left, StructuredProperty right) {
PidType lClass = PidType.tryValueOf(left.getQualifier().getClassid());
PidType rClass = PidType.tryValueOf(right.getQualifier().getClassid());
if (lClass.equals(PidType.doi))
return -1;
if (rClass.equals(PidType.doi))
return 1;
if (lClass.equals(PidType.pmid))
return -1;
if (rClass.equals(PidType.pmid))
return 1;
if (lClass.equals(PidType.pmc))
return -1;
if (rClass.equals(PidType.pmc))
return 1;
if (lClass.equals(PidType.handle))
return -1;
if (rClass.equals(PidType.handle))
return 1;
if (lClass.equals(PidType.arXiv))
return -1;
if (rClass.equals(PidType.arXiv))
return 1;
if (lClass.equals(PidType.nct))
return -1;
if (rClass.equals(PidType.nct))
return 1;
if (lClass.equals(PidType.pdb))
return -1;
if (rClass.equals(PidType.pdb))
return 1;
return 0;

@ -0,0 +1,77 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static eu.dnetlib.dhp.schema.common.ModelConstants.CROSSREF_ID;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Optional;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Result;
public class ResultTypeComparator implements Comparator<Result> {
public int compare(Result left, Result right) {
if (left == null && right == null)
return 0;
if (left == null)
return 1;
if (right == null)
return -1;
HashSet<String> lCf = getCollectedFromIds(left);
HashSet<String> rCf = getCollectedFromIds(right);
if (lCf.contains(CROSSREF_ID) && !rCf.contains(CROSSREF_ID)) {
return -1;
if (!lCf.contains(CROSSREF_ID) && rCf.contains(CROSSREF_ID)) {
return 1;
Result.RESULTTYPE lType = left.getResulttype();
Result.RESULTTYPE rType = right.getResulttype();
if (lType.equals(rType))
return 0;
if (lType.equals(Result.RESULTTYPE.publication))
return -1;
if (rType.equals(Result.RESULTTYPE.publication))
return 1;
if (lType.equals(Result.RESULTTYPE.dataset))
return -1;
if (rType.equals(Result.RESULTTYPE.dataset))
return 1;
if (lType.equals(
return -1;
if (rType.equals(
return 1;
if (lType.equals(Result.RESULTTYPE.otherresearchproduct))
return -1;
if (rType.equals(Result.RESULTTYPE.otherresearchproduct))
return 1;
// Else (but unlikely), lexicographical ordering will do.
return lType.compareTo(rType);
protected HashSet<String> getCollectedFromIds(Result left) {
return Optional
cf -> cf
.orElse(new HashSet<>());

@ -1,27 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RorCleaningRule {
public static final String ROR_PREFIX = "";
private static final Pattern PATTERN = Pattern.compile("(?<ror>0[a-hj-km-np-tv-z|0-9]{6}[0-9]{2})");
public static String clean(String ror) {
String s = ror
.replaceAll("\\s", "")
Matcher m = PATTERN.matcher(s);
if (m.find()) {
return ROR_PREFIX + m.group("ror");
return "";

@ -1,25 +0,0 @@
package eu.dnetlib.scholexplorer.relation;
public class RelInfo implements Serializable {
private String original;
private String inverse;
public String getOriginal() {
return original;
public void setOriginal(String original) {
this.original = original;
public String getInverse() {
return inverse;
public void setInverse(String inverse) {
this.inverse = inverse;

@ -1,20 +0,0 @@
package eu.dnetlib.scholexplorer.relation;
import java.util.HashMap;
import com.fasterxml.jackson.databind.ObjectMapper;
public class RelationMapper extends HashMap<String, RelInfo> implements Serializable {
public static RelationMapper load() throws Exception {
final String json = IOUtils.toString(RelationMapper.class.getResourceAsStream("relations.json"));
ObjectMapper mapper = new ObjectMapper();
return mapper.readValue(json, RelationMapper.class);

@ -6,21 +6,21 @@
"paramRequired": false "paramRequired": false
}, },
{ {
"paramName": "cip", "paramName": "i",
"paramLongName": "crossrefInputPath", "paramLongName": "inputPath",
"paramDescription": "the path to get the input data from Crossref", "paramDescription": "the source path",
"paramRequired": true
"paramName": "pip",
"paramLongName": "pubmedInputPath",
"paramDescription": "the path to get the input data from Pubmed",
"paramRequired": true "paramRequired": true
}, },
{ {
"paramName": "o", "paramName": "o",
"paramLongName": "outputPath", "paramLongName": "outputPath",
"paramDescription": "the path of the new ActionSet", "paramDescription": "path of the output graph",
"paramRequired": true
"paramName": "c",
"paramLongName": "graphTableClassName",
"paramDescription": "the graph entity class name",
"paramRequired": true "paramRequired": true
} }
] ]

@ -8,31 +8,13 @@
{ {
"paramName": "gin", "paramName": "gin",
"paramLongName": "graphInputPath", "paramLongName": "graphInputPath",
"paramDescription": "the input graph root path", "paramDescription": "the graph root path",
"paramRequired": true
"paramName": "cp",
"paramLongName": "checkpointPath",
"paramDescription": "checkpoint directory",
"paramRequired": true "paramRequired": true
}, },
{ {
"paramName": "out", "paramName": "out",
"paramLongName": "outputPath", "paramLongName": "outputPath",
"paramDescription": "the output graph root path", "paramDescription": "the output merged graph root path",
"paramRequired": true
"paramName": "fi",
"paramLongName": "filterInvisible",
"paramDescription": "if true filters out invisible entities",
"paramRequired": true
"paramName": "isu",
"paramLongName": "isLookupUrl",
"paramDescription": "url to the ISLookup Service",
"paramRequired": true "paramRequired": true
} }
] ]

@ -1,20 +0,0 @@
"paramName": "issm",
"paramLongName": "isSparkSessionManaged",
"paramDescription": "when true will stop SparkSession after job execution",
"paramRequired": false
"paramName": "hmu",
"paramLongName": "hiveMetastoreUris",
"paramDescription": "the hive metastore uris",
"paramRequired": true
"paramName": "sql",
"paramLongName": "sql",
"paramDescription": "sql script to execute",
"paramRequired": true

@ -1,158 +0,0 @@

@ -1,10 +0,0 @@
package eu.dnetlib.dhp.application.dedup.log
case class DedupLogModel(
tag: String,
configuration: String,
entity: String,
startTS: Long,
endTS: Long,
totalMs: Long
) {}

@ -1,14 +0,0 @@
package eu.dnetlib.dhp.application.dedup.log
import org.apache.spark.sql.{SaveMode, SparkSession}
class DedupLogWriter(path: String) {
def appendLog(dedupLogModel: DedupLogModel, spark: SparkSession): Unit = {
import spark.implicits._
val df = spark.createDataset[DedupLogModel](data = List(dedupLogModel))

@ -22,19 +22,6 @@ object ScholixUtils extends Serializable {
case class RelatedEntities(id: String, relatedDataset: Long, relatedPublication: Long) {} case class RelatedEntities(id: String, relatedDataset: Long, relatedPublication: Long) {}
val relations: Map[String, RelationVocabulary] = {
val input = Source
implicit lazy val formats: DefaultFormats.type = org.json4s.DefaultFormats
lazy val json: json4s.JValue = parse(input)
json.extract[Map[String, RelationVocabulary]]
def extractRelationDate(relation: Relation): String = { def extractRelationDate(relation: Relation): String = {
if (relation.getProperties == null || !relation.getProperties.isEmpty) if (relation.getProperties == null || !relation.getProperties.isEmpty)
@ -175,12 +162,11 @@ object ScholixUtils extends Serializable {
} }
def extractCollectedFrom(relation: Relation): List[ScholixEntityId] = { def extractCollectedFrom(relation: Relation): List[ScholixEntityId] = {
if (relation.getCollectedfrom != null && !relation.getCollectedfrom.isEmpty) { if (relation.getProvenance != null && !relation.getProvenance.isEmpty) {
val l: List[ScholixEntityId] = { p =>
val l: List[ScholixEntityId] = { c =>
new ScholixEntityId( new ScholixEntityId(
c.getValue, p.getCollectedfrom.getValue,
List(new ScholixIdentifier(c.getKey, DNET_IDENTIFIER_SCHEMA, null)).asJava List(new ScholixIdentifier(p.getCollectedfrom.getKey, DNET_IDENTIFIER_SCHEMA, null)).asJava
) )
}.toList }.toList
l l
@ -289,11 +275,8 @@ object ScholixUtils extends Serializable {
s.setPublisher(source.getPublisher) s.setPublisher(source.getPublisher)
} }
val semanticRelation = relations.getOrElse(relation.getRelClass.toLowerCase, null)
if (semanticRelation == null)
return null
s.setRelationship( s.setRelationship(
new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse) new ScholixRelationship(relation.getRelClass.toString, "datacite", relation.getRelClass.getInverse.toString)
) )
s.setSource(source) s.setSource(source)
@ -331,12 +314,10 @@ object ScholixUtils extends Serializable {
s.setPublisher(l.asJava) s.setPublisher(l.asJava)
} }
val semanticRelation = relations.getOrElse(relation.getRelClass.toLowerCase, null)
if (semanticRelation == null)
return null
s.setRelationship( s.setRelationship(
new ScholixRelationship(semanticRelation.original, "datacite", semanticRelation.inverse) new ScholixRelationship(relation.getRelClass.toString, "datacite", relation.getRelClass.getInverse.toString)
) )
s.setSource(generateScholixResourceFromSummary(source)) s.setSource(generateScholixResourceFromSummary(source))
s s
@ -402,15 +383,15 @@ object ScholixUtils extends Serializable {
.getInstance() .getInstance()
.asScala .asScala
.filter(i => i.getDateofacceptance != null) .filter(i => i.getDateofacceptance != null)
.toList .toList
if (dt.nonEmpty) if (dt.nonEmpty)
s.setDate(dt.distinct.asJava) s.setDate(dt.distinct.asJava)
} }
if (r.getDescription != null && !r.getDescription.isEmpty) { if (r.getDescription != null && !r.getDescription.isEmpty) {
val d = r.getDescription.asScala.find(f => f != null && f.getValue != null) val d = r.getDescription.asScala.find(f => f != null)
if (d.isDefined) if (d.isDefined)
s.setDescription(d.get.getValue) s.setDescription(d.get)
} }
if (r.getSubject != null && !r.getSubject.isEmpty) { if (r.getSubject != null && !r.getSubject.isEmpty) {
@ -422,7 +403,7 @@ object ScholixUtils extends Serializable {
} }
if (r.getPublisher != null) if (r.getPublisher != null)
s.setPublisher(List(r.getPublisher.getValue).asJava) s.setPublisher(List(r.getPublisher.getName).asJava)
if (r.getCollectedfrom != null && !r.getCollectedfrom.isEmpty) { if (r.getCollectedfrom != null && !r.getCollectedfrom.isEmpty) {
val cf: List[CollectedFromType] = r.getCollectedfrom.asScala val cf: List[CollectedFromType] = r.getCollectedfrom.asScala
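
Review note: the hunks above replace the `relations` lookup map with `relation.getRelClass.getInverse`, so each relation class now carries its own inverse. A minimal Java sketch of that idea follows. The enum `RelClassSketch` and its four constants are hypothetical and do not reproduce the project's `Relation.RELCLASS`.

```java
// Hypothetical enum illustrating a relation class that knows its inverse.
enum RelClassSketch {
    Cites, IsCitedBy, References, IsReferencedBy;

    // A switch avoids the forward-reference problem of passing the inverse to the constructor.
    RelClassSketch getInverse() {
        switch (this) {
            case Cites:
                return IsCitedBy;
            case IsCitedBy:
                return Cites;
            case References:
                return IsReferencedBy;
            default:
                return References;
        }
    }
}

class InverseDemo {
    public static void main(String[] args) {
        RelClassSketch rel = RelClassSketch.Cites;
        System.out.println(rel + " / " + rel.getInverse()); // prints Cites / IsCitedBy
    }
}
```

With the inverse attached to the enum, every relation class has a defined inverse, which is consistent with the removal of the `semanticRelation == null` early return in the hunks above.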


@ -15,7 +15,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;
public class MdStoreClientTest { public class MdStoreClientTest {
// @Test
public void testMongoCollection() throws IOException { public void testMongoCollection() throws IOException {
final MdstoreClient client = new MdstoreClient("mongodb://localhost:27017", "mdstore"); final MdstoreClient client = new MdstoreClient("mongodb://localhost:27017", "mdstore");

@ -0,0 +1,109 @@
package eu.dnetlib.dhp.common.api;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
class ZenodoAPIClientTest {
private final String URL_STRING = "";
private final String ACCESS_TOKEN = "";
private final String CONCEPT_REC_ID = "657113";
private final String depositionId = "674915";
void testUploadOldDeposition() throws IOException, MissingConceptDoiException {
ZenodoAPIClient client = new ZenodoAPIClient(URL_STRING,
Assertions.assertEquals(200, client.uploadOpenDeposition(depositionId));
File file = new File(getClass()
InputStream is = new FileInputStream(file);
Assertions.assertEquals(200, client.uploadIS(is, "COVID-19.json.gz"));
String metadata = IOUtils.toString(getClass().getResourceAsStream("/eu/dnetlib/dhp/common/api/metadata.json"));
Assertions.assertEquals(200, client.sendMretadata(metadata));
Assertions.assertEquals(202, client.publish());
void testNewDeposition() throws IOException {
ZenodoAPIClient client = new ZenodoAPIClient(URL_STRING,
Assertions.assertEquals(201, client.newDeposition());
File file = new File(getClass()
InputStream is = new FileInputStream(file);
Assertions.assertEquals(200, client.uploadIS(is, "COVID-19.json.gz"));
String metadata = IOUtils.toString(getClass().getResourceAsStream("/eu/dnetlib/dhp/common/api/metadata.json"));
Assertions.assertEquals(200, client.sendMretadata(metadata));
Assertions.assertEquals(202, client.publish());
void testNewVersionNewName() throws IOException, MissingConceptDoiException {
ZenodoAPIClient client = new ZenodoAPIClient(URL_STRING,
Assertions.assertEquals(201, client.newVersion(CONCEPT_REC_ID));
File file = new File(getClass()
InputStream is = new FileInputStream(file);
Assertions.assertEquals(200, client.uploadIS(is, "newVersion_deposition"));
Assertions.assertEquals(202, client.publish());
void testNewVersionOldName() throws IOException, MissingConceptDoiException {
ZenodoAPIClient client = new ZenodoAPIClient(URL_STRING,
Assertions.assertEquals(201, client.newVersion(CONCEPT_REC_ID));
File file = new File(getClass()
InputStream is = new FileInputStream(file);
Assertions.assertEquals(200, client.uploadIS(is, "newVersion_deposition"));
Assertions.assertEquals(202, client.publish());

@ -0,0 +1,76 @@
package eu.dnetlib.dhp.common.vocabulary;
import eu.dnetlib.dhp.schema.oaf.Qualifier;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;
import java.util.List;
import java.util.Objects;
import static org.mockito.Mockito.lenient;
public class VocabularyTest {
protected ISLookUpService isLookUpService;
protected VocabularyGroup vocabularies;
public void setUpVocabulary() throws ISLookUpException, IOException {
vocabularies = VocabularyGroup.loadVocsFromIS(isLookUpService);
private static List<String> vocs() throws IOException {
return IOUtils
private static List<String> synonyms() throws IOException {
return IOUtils
void testVocabularyMatch() throws Exception {
final String s = IOUtils.toString(this.getClass().getResourceAsStream("terms"));
for (String s1 : s.split("\n")) {
final Qualifier t1 = vocabularies.getSynonymAsQualifier("dnet:publication_resource", s1);
if (t1 == null) {
System.err.println(s1 + " Missing");
} else {
System.out.println("syn=" + s1 + " term = " + t1.getClassid() + " " + t1.getClassname());
Qualifier synonymAsQualifier = vocabularies.getSynonymAsQualifier("dnet:result_typologies", t1.getClassid());
if (synonymAsQualifier!= null)

@ -0,0 +1,39 @@
package eu.dnetlib.dhp.schema.oaf.common;
import static org.junit.jupiter.api.Assertions.*;
import org.junit.jupiter.api.Nested;
import org.junit.jupiter.api.Test;
import eu.dnetlib.dhp.schema.oaf.Entity;
import eu.dnetlib.dhp.schema.oaf.Relation;
import eu.dnetlib.dhp.schema.oaf.Result;
public class ModelSupportTest {
class IsSubClass {
void shouldReturnFalseWhenSubClassDoesNotExtendSuperClass() {
// when
Boolean result = ModelSupport.isSubClass(Relation.class, Entity.class);
// then
void shouldReturnTrueWhenSubClassExtendsSuperClass() {
// when
Boolean result = ModelSupport.isSubClass(Result.class, Entity.class);
// then

@ -0,0 +1,21 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import java.util.Set;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
class BlackListProviderTest {
void blackListTest() {
Assertions.assertTrue(PidBlacklistProvider.getBlacklist().get("doi").size() > 0);
final Set<String> xxx = PidBlacklistProvider.getBlacklist("xxx");
Assertions.assertEquals(0, xxx.size());

@ -1,18 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class GridCleaningRuleTest {
void testCleaning() {
assertEquals("grid.493784.5", GridCleaningRule.clean("grid.493784.5"));
assertEquals("grid.493784.5x", GridCleaningRule.clean("grid.493784.5x"));
assertEquals("grid.493784.5x", GridCleaningRule.clean("493784.5x"));
assertEquals("", GridCleaningRule.clean("493x784.5x"));
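
The four assertions above pin down the observable behaviour of the removed GRID rule. The following is only a minimal sketch of a function consistent with them, not the project's `GridCleaningRule`: the class name `GridCleaningSketch`, the regular expression, and the lowercasing step are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the GridCleaningRule removed in this diff.
final class GridCleaningSketch {

    // Assumed shape of the numeric part: 4 to 6 digits, a dot, then 1 or 2 alphanumerics.
    private static final Pattern GRID = Pattern.compile("\\d{4,6}\\.[0-9a-z]{1,2}");

    static String clean(String grid) {
        if (grid == null) {
            return "";
        }
        // Normalise case and drop whitespace before matching.
        String compact = grid.toLowerCase(Locale.ROOT).replaceAll("\\s", "");
        Matcher m = GRID.matcher(compact);
        // Re-add the "grid." prefix whether or not the input carried it.
        return m.find() ? "grid." + m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(clean("grid.493784.5"));
        System.out.println(clean("493784.5x"));
        System.out.println(clean("493x784.5x").isEmpty());
    }
}
```

Matching only the numeric part and then prepending `grid.` is what lets the prefixed and unprefixed inputs converge on the same value.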

@ -1,19 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class ISNICleaningRuleTest {
void testCleaning() {
assertEquals("0000000463436020", ISNICleaningRule.clean("0000 0004 6343 6020"));
assertEquals("0000000463436020", ISNICleaningRule.clean("0000000463436020"));
assertEquals("", ISNICleaningRule.clean("Q30256598"));
assertEquals("0000000493403529", ISNICleaningRule.clean("ISNI:0000000493403529"));
assertEquals("000000008614884X", ISNICleaningRule.clean("0000 0000 8614 884X"));
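
The ISNI assertions above suggest a rule that ignores spacing and prefixes and keeps a 16-character identifier starting with four zeros. Here is a minimal sketch consistent with those five cases; `IsniCleaningSketch` and its pattern are assumptions for illustration and are not taken from the removed `ISNICleaningRule`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the ISNICleaningRule removed in this diff.
final class IsniCleaningSketch {

    // Assumed shape: four zeros, eleven digits, then a digit or the check character X.
    private static final Pattern ISNI = Pattern.compile("0{4}\\d{11}[\\dX]");

    static String clean(String isni) {
        if (isni == null) {
            return "";
        }
        // Remove whitespace so "0000 0004 6343 6020" becomes one 16-character token.
        String compact = isni.replaceAll("\\s", "");
        Matcher m = ISNI.matcher(compact);
        // find() skips over a leading label such as "ISNI:".
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(clean("0000 0004 6343 6020"));
        System.out.println(clean("ISNI:0000000493403529"));
        System.out.println(clean("Q30256598").isEmpty());
    }
}
```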

@ -0,0 +1,103 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.*;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.oaf.Publication;
class IdentifierFactoryTest {
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
void testCreateIdentifierForPublication() throws IOException {
"publication_doi1.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true);
"publication_doi2.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true);
"publication_doi3.json", "50|pmc_________::94e4cb08c93f8733b48e2445d04002ac", true);
"publication_doi4.json", "50|od______2852::38861c44e6052a8d49f59a4c39ba5e66", true);
"publication_doi5.json", "50|doi_________::3bef95c0ca26dd55451fc8839ea69d27", true);
"publication_pmc1.json", "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", true);
"publication_pmc2.json", "50|pmc_________::94e4cb08c93f8733b48e2445d04002ac", true);
"publication_openapc.json", "50|doi_________::79dbc7a2a56dc1532659f9038843256e", true);
final String defaultID = "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f";
verifyIdentifier("publication_3.json", defaultID, true);
verifyIdentifier("publication_4.json", defaultID, true);
verifyIdentifier("publication_5.json", defaultID, true);
void testCreateIdentifierForPublicationNoHash() throws IOException {
verifyIdentifier("publication_doi1.json", "50|doi_________::10.1016/j.cmet.2010.03.013", false);
verifyIdentifier("publication_doi2.json", "50|doi_________::10.1016/j.cmet.2010.03.013", false);
verifyIdentifier("publication_pmc1.json", "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", false);
"publication_urn1.json", "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f", false);
final String defaultID = "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1f";
verifyIdentifier("publication_3.json", defaultID, false);
verifyIdentifier("publication_4.json", defaultID, false);
verifyIdentifier("publication_5.json", defaultID, false);
void testCreateIdentifierForROHub() throws IOException {
"orp-rohub.json", "50|w3id________::afc7592914ae190a50570db90f55f9c2", true);
protected void verifyIdentifier(String filename, String expectedID, boolean md5) throws IOException {
final String json = IOUtils.toString(getClass().getResourceAsStream(filename));
final Publication pub = OBJECT_MAPPER.readValue(json, Publication.class);
assertEquals(expectedID, IdentifierFactory.createIdentifier(pub, md5));
void testNormalizeDOI() throws Exception {
final String doi = "10.1042/BCJ20160876";
assertEquals(CleaningFunctions.normalizePidValue("doi", doi), doi.toLowerCase());
final String doi2 = "0.1042/BCJ20160876";
assertThrows(IllegalArgumentException.class, () -> CleaningFunctions.normalizePidValue("doi", doi2));
final String doi3 = "";
assertThrows(IllegalArgumentException.class, () -> CleaningFunctions.normalizePidValue("doi", doi3));
final String doi4 = "";
assertEquals(CleaningFunctions.normalizePidValue("doi", doi4), "10.1042/BCJ20160876".toLowerCase());
final String doi5 = " BCJ20160876";
assertEquals(CleaningFunctions.normalizePidValue("doi", doi5), "10.1042/BCJ20160876".toLowerCase());

@ -0,0 +1,110 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.*;
import static org.junit.jupiter.api.Assertions.assertEquals;
import java.util.HashSet;
import java.util.List;
import org.junit.jupiter.api.Test;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.schema.common.ModelConstants;
import eu.dnetlib.dhp.schema.oaf.Dataset;
import eu.dnetlib.dhp.schema.oaf.KeyValue;
import eu.dnetlib.dhp.schema.oaf.Publication;
import eu.dnetlib.dhp.schema.oaf.Result;
public class MergeUtilsTest {
private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
void testMergePubs() throws IOException {
Publication p1 = read("publication_1.json", Publication.class);
Publication p2 = read("publication_2.json", Publication.class);
Dataset d1 = read("dataset_1.json", Dataset.class);
Dataset d2 = read("dataset_2.json", Dataset.class);
assertEquals(1, p1.getCollectedfrom().size());
assertEquals(ModelConstants.CROSSREF_ID, p1.getCollectedfrom().get(0).getKey());
assertEquals(1, d2.getCollectedfrom().size());
assertEquals(1, p2.getCollectedfrom().size());
assertEquals(1, d1.getCollectedfrom().size());
final Result p1d2 = MergeUtils.merge(p1, d2);
assertEquals(Result.RESULTTYPE.publication, p1d2.getResulttype());
assertTrue(p1d2 instanceof Publication);
assertEquals(p1.getId(), p1d2.getId());
void testMergePubs_1() throws IOException {
Publication p2 = read("publication_2.json", Publication.class);
Dataset d1 = read("dataset_1.json", Dataset.class);
final Result p2d1 = MergeUtils.merge(p2, d1);
assertEquals(Result.RESULTTYPE.dataset, p2d1.getResulttype());
assertTrue(p2d1 instanceof Dataset);
assertEquals(d1.getId(), p2d1.getId());
assertEquals(2, p2d1.getCollectedfrom().size());
void testMergePubs_2() throws IOException {
Publication p1 = read("publication_1.json", Publication.class);
Publication p2 = read("publication_2.json", Publication.class);
Result p1p2 = MergeUtils.merge(p1, p2);
assertTrue(p1p2 instanceof Publication);
assertEquals(p1.getId(), p1p2.getId());
assertEquals(2, p1p2.getCollectedfrom().size());
void testDelegatedAuthority_1() throws IOException {
Dataset d1 = read("dataset_2.json", Dataset.class);
Dataset d2 = read("dataset_delegated.json", Dataset.class);
assertEquals(1, d2.getCollectedfrom().size());
Result res = MergeUtils.merge(d1, d2, true);
assertEquals(d2, res);
void testDelegatedAuthority_2() throws IOException {
Dataset p1 = read("publication_1.json", Dataset.class);
Dataset d2 = read("dataset_delegated.json", Dataset.class);
assertEquals(1, d2.getCollectedfrom().size());
Result res = MergeUtils.merge(p1, d2, true);
assertEquals(d2, res);
protected HashSet<String> cfId(List<KeyValue> collectedfrom) {
protected <T extends Result> T read(String filename, Class<T> clazz) throws IOException {
final String json = IOUtils.toString(getClass().getResourceAsStream(filename));
return OBJECT_MAPPER.readValue(json, clazz);

@ -142,66 +142,13 @@ class OafMapperUtilsTest {
assertEquals("2013-11-12", GraphCleaningFunctions.cleanDate("1384216367189"));
assertEquals("2013-11-12", GraphCleaningFunctions.cleanDate("1384216367111222"));
assertEquals("2013-11-12", GraphCleaningFunctions.cleanDate("1384216367111222333"));
}
@Test
void testDate() {
final String date = GraphCleaningFunctions.cleanDate("23-FEB-1998");
assertNotNull(date);
System.out.println(date); assertEquals("1998-02-23", date);
void testMergePubs() throws IOException {
Publication p1 = read("publication_1.json", Publication.class);
Publication p2 = read("publication_2.json", Publication.class);
Dataset d1 = read("dataset_1.json", Dataset.class);
Dataset d2 = read("dataset_2.json", Dataset.class);
assertEquals(1, p1.getCollectedfrom().size());
assertEquals(ModelConstants.CROSSREF_ID, p1.getCollectedfrom().get(0).getKey());
assertEquals(1, d2.getCollectedfrom().size());
.mergeResults(p1, d2)
assertEquals(1, p2.getCollectedfrom().size());
assertEquals(1, d1.getCollectedfrom().size());
.mergeResults(p2, d1)
void testDelegatedAuthority() throws IOException {
Dataset d1 = read("dataset_2.json", Dataset.class);
Dataset d2 = read("dataset_delegated.json", Dataset.class);
assertEquals(1, d2.getCollectedfrom().size());
Result res = OafMapperUtils.mergeResults(d1, d2);
assertEquals(d2, res);
protected HashSet<String> cfId(List<KeyValue> collectedfrom) {
}
protected <T extends Result> T read(String filename, Class<T> clazz) throws IOException {

@ -1,19 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PICCleaningRuleTest {
void testCleaning() {
assertEquals("887624982", PICCleaningRule.clean("887624982"));
assertEquals("", PICCleaningRule.clean("887 624982"));
assertEquals("887624982", PICCleaningRule.clean(" 887624982 "));
assertEquals("887624982", PICCleaningRule.clean(" 887624982x "));
assertEquals("887624982", PICCleaningRule.clean(" 88762498200 "));

@ -1,19 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PmcCleaningRuleTest {
void testCleaning() {
assertEquals("PMC1234", PmcCleaningRule.clean("PMC1234"));
assertEquals("PMC1234", PmcCleaningRule.clean(" PMC1234"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC12345678"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC123456789"));
assertEquals("PMC12345678", PmcCleaningRule.clean("PMC 12345678"));

@ -1,24 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class PmidCleaningRuleTest {
void testCleaning() {
// leading zeros are removed
assertEquals("1234", PmidCleaningRule.clean("01234"));
// tolerant to spaces in the middle
assertEquals("1234567", PmidCleaningRule.clean("0123 4567"));
// stop parsing at first not numerical char
assertEquals("123", PmidCleaningRule.clean("0123x4567"));
// invalid id leading to empty result
assertEquals("", PmidCleaningRule.clean("abc"));
// valid id with zeroes in the number
assertEquals("20794075", PmidCleaningRule.clean("20794075"));
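
The comments in the test above describe four behaviours: leading zeros are removed, spaces in the middle are tolerated, parsing stops at the first non-numeric character, and an invalid id yields an empty string. The sketch below is one way to satisfy them; it is not the project's `PmidCleaningRule`. The class name `PmidCleaningSketch`, the 8-digit cap, and anchoring the match at the start of the input are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the PmidCleaningRule removed in this diff.
final class PmidCleaningSketch {

    // Optional leading zeros, then 1 to 8 digits captured as the identifier (cap is assumed).
    private static final Pattern PMID = Pattern.compile("0*(\\d{1,8})");

    static String clean(String pmid) {
        if (pmid == null) {
            return "";
        }
        // Drop every whitespace character so "0123 4567" is read as a single number.
        String compact = pmid.replaceAll("\\s", "");
        Matcher m = PMID.matcher(compact);
        // lookingAt() anchors at the start and ignores anything after the digits.
        return m.lookingAt() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(clean("01234"));
        System.out.println(clean("0123 4567"));
        System.out.println(clean("0123x4567"));
        System.out.println(clean("abc").isEmpty());
    }
}
```

Using `lookingAt()` rather than `find()` means an input such as `abc123` is rejected instead of being salvaged; the test does not cover that case, so either choice would be compatible with it.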

@ -1,17 +0,0 @@
package eu.dnetlib.dhp.schema.oaf.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;
class RorCleaningRuleTest {
void testCleaning() {
assertEquals("", RorCleaningRule.clean(""));
assertEquals("", RorCleaningRule.clean("05rpz9w55"));
assertEquals("", RorCleaningRule.clean("05rpz9w_55"));

@ -1,114 +0,0 @@
package eu.dnetlib.oa.merge;
import static org.junit.jupiter.api.Assertions.*;
import java.util.List;
import java.util.Objects;
import org.junit.jupiter.api.Test;
import org.junit.platform.commons.util.StringUtils;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import eu.dnetlib.dhp.oa.merge.AuthorMerger;
import eu.dnetlib.dhp.schema.oaf.Author;
public class AuthorMergerTest {
public void testEnrcichAuthor() throws Exception {
final ObjectMapper mapper = new ObjectMapper();
BufferedReader pr = new BufferedReader(new InputStreamReader(
BufferedReader or = new BufferedReader(new InputStreamReader(
TypeReference<List<Author>> aclass = new TypeReference<List<Author>>() {
String pubLine;
int i = 0;
while ((pubLine = pr.readLine()) != null) {
final String pubId = pubLine;
final String MatchPidOrcid = or.readLine();
final String pubOrcid = or.readLine();
final String data = pr.readLine();
if (StringUtils.isNotBlank(data)) {
List<Author> publicationAuthors = mapper.readValue(data, aclass);
List<Author> orcidAuthors = mapper.readValue(or.readLine(), aclass);
System.out.printf("OAF ID = %s \n", pubId);
System.out.printf("ORCID Intersected ID = %s \n", pubOrcid);
System.out.printf("OAF Author Size = %d \n", publicationAuthors.size());
System.out.printf("Oricd Author Size = %d \n", orcidAuthors.size());
System.out.printf("Oricd Matched PID = %s \n", MatchPidOrcid);
long originalAuthorWithPiD = publicationAuthors
a -> a.getPid() != null && a
p -> p.getQualifier() != null
&& p.getQualifier().getClassid().toLowerCase().contains("orcid")))
long start = System.currentTimeMillis();
// final List<Author> enrichedList = AuthorMerger.enrichOrcid(publicationAuthors, orcidAuthors);
final List<Author> enrichedList = AuthorMerger.enrichOrcid(publicationAuthors, orcidAuthors);
long enrichedAuthorWithPid = enrichedList
a -> a.getPid() != null && a
p -> p.getQualifier() != null
&& p.getQualifier().getClassid().toLowerCase().contains("orcid")))
long totalTime = (System.currentTimeMillis() - start) / 1000;
"Enriched authors in %d seconds from %d pid to %d pid \n", totalTime, originalAuthorWithPiD,
public void checkSimilarityTest() {
final Author left = new Author();
left.setFullname("Anand, Rachna");
final Author right = new Author();
right.setFullname("Rachna, Anand");
// System.out.println(AuthorMerger.normalize(right.getFullname()));
boolean same = AuthorMerger.checkORCIDSimilarity(left, right);

@ -1,14 +0,0 @@
package eu.dnetlib.scholexplorer.relation;
import org.junit.jupiter.api.Test;
class RelationMapperTest {
void testLoadRels() throws Exception {
RelationMapper relationMapper = RelationMapper.load();

@ -1 +1,28 @@
{"id":"50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g", "resuttype" : { "classid" : "dataset" }, "pid":[{"qualifier":{"classid":"doi"},"value":"10.1016/j.cmet.2011.03.013"},{"qualifier":{"classid":"urn"},"value":"urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"},{"qualifier":{"classid":"scp-number"},"value":"79953761260"},{"qualifier":{"classid":"pmc"},"value":"21459329"}], "collectedfrom" : [ { "key" : "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2", "value" : "Crossref"} ]} {
"id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g",
"resuttype": "dataset",
"pid": [
"qualifier": {"classid": "doi"},
"value": "10.1016/j.cmet.2011.03.013"
"qualifier": {"classid": "urn"},
"value": "urn:nbn:nl:ui:29-f3ed5f9e-edf6-457e-8848-61b58a4075e2"
"qualifier": {"classid": "scp-number"},
"value": "79953761260"
"qualifier": {"classid": "pmc"},
"value": "21459329"
"collectedfrom": [
"key": "10|openaire____::081b82f96300b6a6e3d282bad31cb6e2",
"value": "Crossref"


@ -1,6 +1,6 @@
{ {
"id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g", "id": "50|DansKnawCris::0829b5191605bdbea36d6502b8c1ce1g",
"resuttype": {"classid": "dataset"}, "resuttype": "dataset",
"pid": [ "pid": [
{ {
"qualifier": {"classid": "doi"}, "qualifier": {"classid": "doi"},
@ -30,8 +30,7 @@
"refereed": { "refereed": {
"classid": "0000", "classid": "0000",
"classname": "UNKNOWN", "classname": "UNKNOWN",
"schemeid": "dnet:review_levels", "schemeid": "dnet:review_levels"
"schemename": "dnet:review_levels"
}, },
"hostedby": { "hostedby": {
"key": "10|opendoar____::358aee4cc897452c00244351e4d91f69", "key": "10|opendoar____::358aee4cc897452c00244351e4d91f69",
@ -40,45 +39,15 @@
"accessright": { "accessright": {
"classid": "OPEN", "classid": "OPEN",
"classname": "Open Access", "classname": "Open Access",
"schemeid": "dnet:access_modes", "schemeid": "dnet:access_modes"
"schemename": "dnet:access_modes"
"processingchargecurrency": {
"dataInfo": {
"provenanceaction": {
"classid": "sysimport:crosswalk:datasetarchive",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
"deletedbyinference": false,
"inferred": false,
"inferenceprovenance": "",
"invisible": true,
"trust": "0.9"
"value": "EUR"
}, },
"processingchargecurrency": "EUR",
"pid": [ "pid": [
{ {
"dataInfo": {
"provenanceaction": {
"classid": "sysimport:crosswalk:datasetarchive",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
"deletedbyinference": false,
"inferred": false,
"inferenceprovenance": "",
"invisible": true,
"trust": "0.9"
"qualifier": { "qualifier": {
"classid": "doi", "classid": "doi",
"classname": "Digital Object Identifier", "classname": "Digital Object Identifier",
"schemeid": "dnet:pid_types", "schemeid": "dnet:pid_types"
"schemename": "dnet:pid_types"
}, },
"value": "10.1371/journal.pone.0085605" "value": "10.1371/journal.pone.0085605"
} }
@ -87,24 +56,10 @@
"url": [""], "url": [""],
"alternateIdentifier": [ "alternateIdentifier": [
{ {
"dataInfo": {
"provenanceaction": {
"classid": "sysimport:crosswalk:datasetarchive",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
"deletedbyinference": false,
"inferred": false,
"inferenceprovenance": "",
"invisible": true,
"trust": "0.9"
"qualifier": { "qualifier": {
"classid": "pmid", "classid": "pmid",
"classname": "PubMed ID", "classname": "PubMed ID",
"schemeid": "dnet:pid_types", "schemeid": "dnet:pid_types"
"schemename": "dnet:pid_types"
}, },
"value": "24454899.0" "value": "24454899.0"
} }
@ -113,27 +68,11 @@
"key": "10|openaire____::081b82f96300b6a6e3d282bad31cb6e3", "key": "10|openaire____::081b82f96300b6a6e3d282bad31cb6e3",
"value": "Repository B" "value": "Repository B"
}, },
"processingchargeamount": { "processingchargeamount": "1022.02",
"dataInfo": {
"provenanceaction": {
"classid": "sysimport:crosswalk:datasetarchive",
"classname": "Harvested",
"schemeid": "dnet:provenanceActions",
"schemename": "dnet:provenanceActions"
"deletedbyinference": false,
"inferred": false,
"inferenceprovenance": "",
"invisible": true,
"trust": "0.9"
"value": "1022.02"
"instancetype": { "instancetype": {
"classid": "0004", "classid": "0004",
"classname": "Conference object", "classname": "Conference object",
"schemeid": "dnet:publication_resource", "schemeid": "dnet:publication_resource"
"schemename": "dnet:publication_resource"
} }
} }
] ]

Some files were not shown because too many files have changed in this diff