adding context information to projects and datasources #407
|
@ -0,0 +1,43 @@
|
||||||
|
# Contributor Code of Conduct
|
||||||
|
|
||||||
|
Openness, transparency and our community-driven participatory approach guide us in our day-to-day interactions and decision-making. Our open source projects are no exception. Trust, respect, collaboration and transparency are core values we believe should live and breathe within our projects. Our community welcomes participants from around the world with different experiences, unique perspectives, and great ideas to share.
|
||||||
|
|
||||||
|
## Our Pledge
|
||||||
|
|
||||||
|
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
|
||||||
|
|
||||||
|
## Our Standards
|
||||||
|
|
||||||
|
Examples of behavior that contributes to creating a positive environment include:
|
||||||
|
|
||||||
|
- Using welcoming and inclusive language
|
||||||
|
- Being respectful of differing viewpoints and experiences
|
||||||
|
- Gracefully accepting constructive criticism
|
||||||
|
- Attempting collaboration before conflict
|
||||||
|
- Focusing on what is best for the community
|
||||||
|
- Showing empathy towards other community members
|
||||||
|
|
||||||
|
Examples of unacceptable behavior by participants include:
|
||||||
|
|
||||||
|
- Violence, threats of violence, or inciting others to commit self-harm
|
||||||
|
- The use of sexualized language or imagery and unwelcome sexual attention or advances
|
||||||
|
- Trolling, intentionally spreading misinformation, insulting/derogatory comments, and personal or political attacks
|
||||||
|
- Public or private harassment
|
||||||
|
- Publishing others' private information, such as a physical or electronic address, without explicit permission
|
||||||
|
- Abuse of the reporting process to intentionally harass or exclude others
|
||||||
|
- Advocating for, or encouraging, any of the above behavior
|
||||||
|
- Other conduct which could reasonably be considered inappropriate in a professional setting
|
||||||
|
|
||||||
|
## Our Responsibilities
|
||||||
|
|
||||||
|
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
|
||||||
|
|
||||||
|
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
|
||||||
|
This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
|
||||||
|
|
||||||
|
## Attribution
|
||||||
|
|
||||||
|
This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/), [version 1.4](https://www.contributor-covenant.org/version/1/4/code-of-conduct.html).
|
|
@ -0,0 +1,10 @@
|
||||||
|
# Contributing to D-Net Hadoop
|
||||||
|
|
||||||
|
:+1::tada: First off, thanks for taking the time to contribute! :tada::+1:
|
||||||
|
|
||||||
|
This project and everyone participating in it is governed by our [Code of Conduct](CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
|
||||||
|
|
||||||
|
The following is a set of guidelines for contributing to this project and its packages. These are mostly guidelines, not rules, and they apply to the project as a whole, including all its sub-modules.
|
||||||
|
Use your best judgment, and feel free to propose changes to this document in a pull request.
|
||||||
|
|
||||||
|
All contributions are welcome and will be considered to be contributed under the [project license](LICENSE.md).
|
|
@ -2,6 +2,11 @@
|
||||||
|
|
||||||
Dnet-hadoop is the project that defines all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing, and provisioning.
|
Dnet-hadoop is the project that defines all the [OOZIE workflows](https://oozie.apache.org/) for the OpenAIRE Graph construction, processing, and provisioning.
|
||||||
|
|
||||||
|
This project adheres to the Contributor Covenant [code of conduct](CODE_OF_CONDUCT.md).
|
||||||
|
By participating, you are expected to uphold this code. Please report unacceptable behavior to [dnet-team@isti.cnr.it](mailto:dnet-team@isti.cnr.it).
|
||||||
|
|
||||||
|
This project is licensed under the [AGPL v3 or later version](LICENSE.md).
|
||||||
|
|
||||||
How to build, package and run oozie workflows
|
How to build, package and run oozie workflows
|
||||||
====================
|
====================
|
||||||
|
|
||||||
|
|
|
@ -0,0 +1,39 @@
|
||||||
|
|
||||||
|
package eu.dnetlib.dhp.common.api.context;
|
||||||
|
|
||||||
|
public class CategorySummary {
|
||||||
|
|
||||||
|
private String id;
|
||||||
|
|
||||||
|
private String label;
|
||||||
|
|
||||||
|
private boolean hasConcept;
|
||||||
|
|
||||||
|
public String getId() {
|
||||||
|
return id;
|
||||||
|
}
|
||||||
|
|
||||||
|
public String getLabel() {
|
||||||
|
return label;
|
||||||
|
}
|
||||||
|
|
||||||
|
public boolean isHasConcept() {
|
||||||
|
return hasConcept;
|
||||||
|
}
|
||||||
|
|
||||||
|
public CategorySummary setId(final String id) {
|
||||||
|
this.id = id;
|
||||||
|
return this;
|
||||||
|
}
|
||||||
|
|
||||||
|
public CategorySummary setLabel(final String label) {
|
||||||
|
this.label = label;
|
||||||
|
return this;
|
||||||
|
}
|
||||||
|
|
||||||
|
public CategorySummary setHasConcept(final boolean hasConcept) {
|
||||||
|
this.hasConcept = hasConcept;
|
||||||
|
return this;
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
|
@ -0,0 +1,7 @@
|
||||||
|
|
||||||
|
package eu.dnetlib.dhp.common.api.context;
|
||||||
|
|
||||||
|
import java.util.ArrayList;
|
||||||
|
|
||||||
|
public class CategorySummaryList extends ArrayList<CategorySummary> {
|
||||||
|
}
|
|
@ -0,0 +1,52 @@
|
||||||
|
|
||||||
|
package eu.dnetlib.dhp.common.api.context;
|
||||||
|
|
||||||
|
import java.util.List;
|
||||||
|
|
||||||
|
public class ConceptSummary {
|
||||||
|
|
||||||
|
private String id;
|
||||||
|
|
||||||
|
private String label;
|
||||||
|
|
||||||
|
private boolean hasSubConcept;
|
||||||
|
|
||||||
|
private List<ConceptSummary> concepts;
|
||||||
|
|
||||||
|
public String getId() {
|
||||||
|
return id;
|
||||||
|
}
|
||||||
|
|
||||||
|
public String getLabel() {
|
||||||
|
return label;
|
||||||
|
}
|
||||||
|
|
||||||
|
public List<ConceptSummary> getConcepts() {
|
||||||
|
return concepts;
|
||||||
|
}
|
||||||
|
|
||||||
|
public ConceptSummary setId(final String id) {
|
||||||
|
this.id = id;
|
||||||
|
return this;
|
||||||
|
}
|
||||||
|
|
||||||
|
public ConceptSummary setLabel(final String label) {
|
||||||
|
this.label = label;
|
||||||
|
return this;
|
||||||
|
}
|
||||||
|
|
||||||
|
public boolean isHasSubConcept() {
|
||||||
|
return hasSubConcept;
|
||||||
|
}
|
||||||
|
|
||||||
|
public ConceptSummary setHasSubConcept(final boolean hasSubConcept) {
|
||||||
|
this.hasSubConcept = hasSubConcept;
|
||||||
|
return this;
|
||||||
|
}
|
||||||
|
|
||||||
|
public ConceptSummary setConcept(final List<ConceptSummary> concepts) {
|
||||||
|
this.concepts = concepts;
|
||||||
|
return this;
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
|
@ -0,0 +1,7 @@
|
||||||
|
|
||||||
|
package eu.dnetlib.dhp.common.api.context;
|
||||||
|
|
||||||
|
import java.util.ArrayList;
|
||||||
|
|
||||||
|
public class ConceptSummaryList extends ArrayList<ConceptSummary> {
|
||||||
|
}
|
|
@ -0,0 +1,50 @@
|
||||||
|
|
||||||
|
package eu.dnetlib.dhp.common.api.context;
|
||||||
|
|
||||||
|
public class ContextSummary {
|
||||||
|
|
||||||
|
private String id;
|
||||||
|
|
||||||
|
private String label;
|
||||||
|
|
||||||
|
private String type;
|
||||||
|
|
||||||
|
private String status;
|
||||||
|
|
||||||
|
public String getId() {
|
||||||
|
return id;
|
||||||
|
}
|
||||||
|
|
||||||
|
public String getLabel() {
|
||||||
|
return label;
|
||||||
|
}
|
||||||
|
|
||||||
|
public String getType() {
|
||||||
|
return type;
|
||||||
|
}
|
||||||
|
|
||||||
|
public String getStatus() {
|
||||||
|
return status;
|
||||||
|
}
|
||||||
|
|
||||||
|
public ContextSummary setId(final String id) {
|
||||||
|
this.id = id;
|
||||||
|
return this;
|
||||||
|
}
|
||||||
|
|
||||||
|
public ContextSummary setLabel(final String label) {
|
||||||
|
this.label = label;
|
||||||
|
return this;
|
||||||
|
}
|
||||||
|
|
||||||
|
public ContextSummary setType(final String type) {
|
||||||
|
this.type = type;
|
||||||
|
return this;
|
||||||
|
}
|
||||||
|
|
||||||
|
public ContextSummary setStatus(final String status) {
|
||||||
|
this.status = status;
|
||||||
|
return this;
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
|
@ -0,0 +1,7 @@
|
||||||
|
|
||||||
|
package eu.dnetlib.dhp.common.api.context;
|
||||||
|
|
||||||
|
import java.util.ArrayList;
|
||||||
|
|
||||||
|
public class ContextSummaryList extends ArrayList<ContextSummary> {
|
||||||
|
}
|
|
@ -8,10 +8,13 @@ import java.io.InputStream;
|
||||||
import java.net.*;
|
import java.net.*;
|
||||||
import java.util.List;
|
import java.util.List;
|
||||||
import java.util.Map;
|
import java.util.Map;
|
||||||
|
import java.util.concurrent.TimeUnit;
|
||||||
|
|
||||||
import org.apache.commons.io.IOUtils;
|
import org.apache.commons.io.IOUtils;
|
||||||
import org.apache.commons.lang3.math.NumberUtils;
|
import org.apache.commons.lang3.math.NumberUtils;
|
||||||
|
import org.apache.commons.lang3.time.DateUtils;
|
||||||
import org.apache.http.HttpHeaders;
|
import org.apache.http.HttpHeaders;
|
||||||
|
import org.joda.time.Instant;
|
||||||
import org.slf4j.Logger;
|
import org.slf4j.Logger;
|
||||||
import org.slf4j.LoggerFactory;
|
import org.slf4j.LoggerFactory;
|
||||||
|
|
||||||
|
@ -94,14 +97,16 @@ public class HttpConnector2 {
|
||||||
throw new CollectorException(msg);
|
throw new CollectorException(msg);
|
||||||
}
|
}
|
||||||
|
|
||||||
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
|
|
||||||
|
|
||||||
InputStream input = null;
|
InputStream input = null;
|
||||||
|
|
||||||
|
long start = System.currentTimeMillis();
|
||||||
try {
|
try {
|
||||||
if (getClientParams().getRequestDelay() > 0) {
|
if (getClientParams().getRequestDelay() > 0) {
|
||||||
backoffAndSleep(getClientParams().getRequestDelay());
|
backoffAndSleep(getClientParams().getRequestDelay());
|
||||||
}
|
}
|
||||||
|
|
||||||
|
log.info("Request attempt {} [{}]", retryNumber, requestUrl);
|
||||||
|
|
||||||
final HttpURLConnection urlConn = (HttpURLConnection) new URL(requestUrl).openConnection();
|
final HttpURLConnection urlConn = (HttpURLConnection) new URL(requestUrl).openConnection();
|
||||||
urlConn.setInstanceFollowRedirects(false);
|
urlConn.setInstanceFollowRedirects(false);
|
||||||
urlConn.setReadTimeout(getClientParams().getReadTimeOut() * 1000);
|
urlConn.setReadTimeout(getClientParams().getReadTimeOut() * 1000);
|
||||||
|
@ -115,9 +120,8 @@ public class HttpConnector2 {
|
||||||
urlConn.addRequestProperty(headerEntry.getKey(), headerEntry.getValue());
|
urlConn.addRequestProperty(headerEntry.getKey(), headerEntry.getValue());
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
if (log.isDebugEnabled()) {
|
|
||||||
logHeaderFields(urlConn);
|
logHeaderFields(urlConn);
|
||||||
}
|
|
||||||
|
|
||||||
int retryAfter = obtainRetryAfter(urlConn.getHeaderFields());
|
int retryAfter = obtainRetryAfter(urlConn.getHeaderFields());
|
||||||
String rateLimit = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT);
|
String rateLimit = urlConn.getHeaderField(Constants.HTTPHEADER_IETF_DRAFT_RATELIMIT_LIMIT);
|
||||||
|
@ -132,9 +136,7 @@ public class HttpConnector2 {
|
||||||
}
|
}
|
||||||
|
|
||||||
if (is2xx(urlConn.getResponseCode())) {
|
if (is2xx(urlConn.getResponseCode())) {
|
||||||
input = urlConn.getInputStream();
|
return getInputStream(urlConn, start);
|
||||||
responseType = urlConn.getContentType();
|
|
||||||
return input;
|
|
||||||
}
|
}
|
||||||
if (is3xx(urlConn.getResponseCode())) {
|
if (is3xx(urlConn.getResponseCode())) {
|
||||||
// REDIRECTS
|
// REDIRECTS
|
||||||
|
@ -144,6 +146,7 @@ public class HttpConnector2 {
|
||||||
.put(
|
.put(
|
||||||
REPORT_PREFIX + urlConn.getResponseCode(),
|
REPORT_PREFIX + urlConn.getResponseCode(),
|
||||||
String.format("Moved to: %s", newUrl));
|
String.format("Moved to: %s", newUrl));
|
||||||
|
logRequestTime(start);
|
||||||
urlConn.disconnect();
|
urlConn.disconnect();
|
||||||
if (retryAfter > 0) {
|
if (retryAfter > 0) {
|
||||||
backoffAndSleep(retryAfter);
|
backoffAndSleep(retryAfter);
|
||||||
|
@ -159,26 +162,50 @@ public class HttpConnector2 {
|
||||||
if (retryAfter > 0) {
|
if (retryAfter > 0) {
|
||||||
log
|
log
|
||||||
.warn(
|
.warn(
|
||||||
"{} - waiting and repeating request after suggested retry-after {} sec.",
|
"waiting and repeating request after suggested retry-after {} sec for URL {}",
|
||||||
requestUrl, retryAfter);
|
retryAfter, requestUrl);
|
||||||
backoffAndSleep(retryAfter * 1000);
|
backoffAndSleep(retryAfter * 1000);
|
||||||
} else {
|
} else {
|
||||||
log
|
log
|
||||||
.warn(
|
.warn(
|
||||||
"{} - waiting and repeating request after default delay of {} sec.",
|
"waiting and repeating request after default delay of {} sec for URL {}",
|
||||||
requestUrl, getClientParams().getRetryDelay());
|
getClientParams().getRetryDelay(), requestUrl);
|
||||||
backoffAndSleep(retryNumber * getClientParams().getRetryDelay() * 1000);
|
backoffAndSleep(retryNumber * getClientParams().getRetryDelay());
|
||||||
}
|
}
|
||||||
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
|
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
|
||||||
|
|
||||||
|
logRequestTime(start);
|
||||||
|
|
||||||
urlConn.disconnect();
|
urlConn.disconnect();
|
||||||
|
|
||||||
return attemptDownload(requestUrl, retryNumber + 1, report);
|
return attemptDownload(requestUrl, retryNumber + 1, report);
|
||||||
|
case 422: // UNPROCESSABLE ENTITY
|
||||||
|
report.put(REPORT_PREFIX + urlConn.getResponseCode(), requestUrl);
|
||||||
|
log.warn("waiting and repeating request after 10 sec for URL {}", requestUrl);
|
||||||
|
backoffAndSleep(10000);
|
||||||
|
urlConn.disconnect();
|
||||||
|
logRequestTime(start);
|
||||||
|
try {
|
||||||
|
return getInputStream(urlConn, start);
|
||||||
|
} catch (IOException e) {
|
||||||
|
log
|
||||||
|
.error(
|
||||||
|
"server returned 422 and got IOException accessing the response body from URL {}",
|
||||||
|
requestUrl);
|
||||||
|
log.error("IOException:", e);
|
||||||
|
return attemptDownload(requestUrl, retryNumber + 1, report);
|
||||||
|
}
|
||||||
default:
|
default:
|
||||||
|
log.error("got error {} from URL: {}", urlConn.getResponseCode(), urlConn.getURL());
|
||||||
|
log.error("response message: {}", urlConn.getResponseMessage());
|
||||||
report
|
report
|
||||||
.put(
|
.put(
|
||||||
REPORT_PREFIX + urlConn.getResponseCode(),
|
REPORT_PREFIX + urlConn.getResponseCode(),
|
||||||
String
|
String
|
||||||
.format(
|
.format(
|
||||||
"%s Error: %s", requestUrl, urlConn.getResponseMessage()));
|
"%s Error: %s", requestUrl, urlConn.getResponseMessage()));
|
||||||
|
logRequestTime(start);
|
||||||
|
urlConn.disconnect();
|
||||||
throw new CollectorException(urlConn.getResponseCode() + " error " + report);
|
throw new CollectorException(urlConn.getResponseCode() + " error " + report);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
@ -199,13 +226,27 @@ public class HttpConnector2 {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
private InputStream getInputStream(HttpURLConnection urlConn, long start) throws IOException {
|
||||||
|
InputStream input = urlConn.getInputStream();
|
||||||
|
responseType = urlConn.getContentType();
|
||||||
|
logRequestTime(start);
|
||||||
|
return input;
|
||||||
|
}
|
||||||
|
|
||||||
|
private static void logRequestTime(long start) {
|
||||||
|
log
|
||||||
|
.info(
|
||||||
|
"request time elapsed: {}sec",
|
||||||
|
TimeUnit.MILLISECONDS.toSeconds(System.currentTimeMillis() - start));
|
||||||
|
}
|
||||||
|
|
||||||
private void logHeaderFields(final HttpURLConnection urlConn) throws IOException {
|
private void logHeaderFields(final HttpURLConnection urlConn) throws IOException {
|
||||||
log.debug("StatusCode: {}", urlConn.getResponseMessage());
|
log.info("Response: {} - {}", urlConn.getResponseCode(), urlConn.getResponseMessage());
|
||||||
|
|
||||||
for (Map.Entry<String, List<String>> e : urlConn.getHeaderFields().entrySet()) {
|
for (Map.Entry<String, List<String>> e : urlConn.getHeaderFields().entrySet()) {
|
||||||
if (e.getKey() != null) {
|
if (e.getKey() != null) {
|
||||||
for (String v : e.getValue()) {
|
for (String v : e.getValue()) {
|
||||||
log.debug(" key: {} - value: {}", e.getKey(), v);
|
log.info(" key: {} - value: {}", e.getKey(), v);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
@ -225,7 +266,7 @@ public class HttpConnector2 {
|
||||||
for (String key : headerMap.keySet()) {
|
for (String key : headerMap.keySet()) {
|
||||||
if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (!headerMap.get(key).isEmpty())
|
if ((key != null) && key.equalsIgnoreCase(HttpHeaders.RETRY_AFTER) && (!headerMap.get(key).isEmpty())
|
||||||
&& NumberUtils.isCreatable(headerMap.get(key).get(0))) {
|
&& NumberUtils.isCreatable(headerMap.get(key).get(0))) {
|
||||||
return Integer.parseInt(headerMap.get(key).get(0)) + 10;
|
return Integer.parseInt(headerMap.get(key).get(0));
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
return -1;
|
return -1;
|
||||||
|
|
|
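The `Retry-After` handling touched by the last hunk can be sketched with the standard library only. This is a minimal illustration, not the project's implementation: the class name `RetryAfterSketch` is hypothetical, the header name is written as a literal instead of `HttpHeaders.RETRY_AFTER`, and non-integer values (for example an HTTP-date) are simply reported as `-1`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RetryAfterSketch {

	// Returns the Retry-After delay in seconds, or -1 when the header is absent or not a plain integer.
	public static int obtainRetryAfter(Map<String, List<String>> headerMap) {
		for (Map.Entry<String, List<String>> e : headerMap.entrySet()) {
			// HttpURLConnection exposes the status line under a null key, so guard against it.
			if (e.getKey() != null && e.getKey().equalsIgnoreCase("Retry-After") && !e.getValue().isEmpty()) {
				try {
					return Integer.parseInt(e.getValue().get(0).trim());
				} catch (NumberFormatException ex) {
					// e.g. an HTTP-date value; this sketch does not handle that form
					return -1;
				}
			}
		}
		return -1;
	}

	public static void main(String[] args) {
		Map<String, List<String>> headers = new HashMap<>();
		headers.put(null, Collections.singletonList("HTTP/1.1 429 Too Many Requests"));
		headers.put("retry-after", Arrays.asList("30"));
		// The caller would then sleep for this many seconds before repeating the request.
		System.out.println(obtainRetryAfter(headers));
	}
}
```

Catching `NumberFormatException` is a design choice of this sketch: it keeps a malformed header from aborting the retry loop, at the cost of falling back to the default delay.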
@ -0,0 +1,77 @@
|
||||||
|
|
||||||
|
package eu.dnetlib.dhp.oozie;
|
||||||
|
|
||||||
|
import static eu.dnetlib.dhp.common.SparkSessionSupport.runWithSparkHiveSession;
|
||||||
|
|
||||||
|
import java.net.URL;
|
||||||
|
import java.nio.charset.StandardCharsets;
|
||||||
|
import java.util.HashMap;
|
||||||
|
import java.util.Map;
|
||||||
|
import java.util.Optional;
|
||||||
|
|
||||||
|
import org.apache.commons.lang3.time.DurationFormatUtils;
|
||||||
|
import org.apache.commons.text.StringSubstitutor;
|
||||||
|
import org.apache.spark.SparkConf;
|
||||||
|
import org.slf4j.Logger;
|
||||||
|
import org.slf4j.LoggerFactory;
|
||||||
|
|
||||||
|
import com.google.common.io.Resources;
|
||||||
|
|
||||||
|
import eu.dnetlib.dhp.application.ArgumentApplicationParser;
|
||||||
|
|
||||||
|
public class RunSQLSparkJob {
|
||||||
|
private static final Logger log = LoggerFactory.getLogger(RunSQLSparkJob.class);
|
||||||
|
|
||||||
|
private final ArgumentApplicationParser parser;
|
||||||
|
|
||||||
|
public RunSQLSparkJob(ArgumentApplicationParser parser) {
|
||||||
|
this.parser = parser;
|
||||||
|
}
|
||||||
|
|
||||||
|
public static void main(String[] args) throws Exception {
|
||||||
|
|
||||||
|
Map<String, String> params = new HashMap<>();
|
||||||
|
for (int i = 0; i < args.length - 1; i++) {
|
||||||
|
if (args[i].startsWith("--")) {
|
||||||
|
params.put(args[i].substring(2), args[++i]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* String jsonConfiguration = IOUtils .toString( Objects .requireNonNull( RunSQLSparkJob.class
|
||||||
|
* .getResourceAsStream( "/eu/dnetlib/dhp/oozie/run_sql_parameters.json"))); final ArgumentApplicationParser
|
||||||
|
* parser = new ArgumentApplicationParser(jsonConfiguration); parser.parseArgument(args);
|
||||||
|
*/
|
||||||
|
|
||||||
|
Boolean isSparkSessionManaged = Optional
|
||||||
|
.ofNullable(params.get("isSparkSessionManaged"))
|
||||||
|
.map(Boolean::valueOf)
|
||||||
|
.orElse(Boolean.TRUE);
|
||||||
|
log.info("isSparkSessionManaged: {}", isSparkSessionManaged);
|
||||||
|
|
||||||
|
URL url = Resources.getResource(params.get("sql"));
|
||||||
|
String raw_sql = Resources.toString(url, StandardCharsets.UTF_8);
|
||||||
|
|
||||||
|
String sql = StringSubstitutor.replace(raw_sql, params);
|
||||||
|
log.info("sql: {}", sql);
|
||||||
|
|
||||||
|
SparkConf conf = new SparkConf();
|
||||||
|
conf.set("hive.metastore.uris", params.get("hiveMetastoreUris"));
|
||||||
|
|
||||||
|
runWithSparkHiveSession(
|
||||||
|
conf,
|
||||||
|
isSparkSessionManaged,
|
||||||
|
spark -> {
|
||||||
|
for (String statement : sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*")) {
|
||||||
|
log.info("executing: {}", statement);
|
||||||
|
long startTime = System.currentTimeMillis();
|
||||||
|
spark.sql(statement).show();
|
||||||
|
log
|
||||||
|
.info(
|
||||||
|
"executed in {}",
|
||||||
|
DurationFormatUtils.formatDuration(System.currentTimeMillis() - startTime, "HH:mm:ss.S"));
|
||||||
|
}
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
|
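The argument handling and statement splitting in `RunSQLSparkJob` can be tried out without Spark. The sketch below is an illustration under assumptions: the class name `RunSqlArgsSketch`, the `db` parameter and the sample script are hypothetical, and `substitute` is a simplified stand-in for `StringSubstitutor.replace` that only fills `${name}` placeholders. The separator regex is the one used in the job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RunSqlArgsSketch {

	// Collects "--name value" pairs, mirroring the loop in the job's main method.
	public static Map<String, String> parseArgs(String[] args) {
		Map<String, String> params = new HashMap<>();
		for (int i = 0; i < args.length - 1; i++) {
			if (args[i].startsWith("--")) {
				params.put(args[i].substring(2), args[++i]);
			}
		}
		return params;
	}

	// Simplified stand-in for StringSubstitutor.replace: fills ${name} placeholders only.
	public static String substitute(String template, Map<String, String> params) {
		String result = template;
		for (Map.Entry<String, String> e : params.entrySet()) {
			result = result.replace("${" + e.getKey() + "}", e.getValue());
		}
		return result;
	}

	// Splits on the same separator as the job: a ';' followed by an /* EOS */ comment.
	public static List<String> splitStatements(String sql) {
		List<String> statements = new ArrayList<>();
		for (String s : sql.split(";\\s*/\\*\\s*EOS\\s*\\*/\\s*")) {
			statements.add(s);
		}
		return statements;
	}

	public static void main(String[] args) {
		Map<String, String> params = parseArgs(new String[] { "--sql", "example.sql", "--db", "graph_db" });
		String script = "DROP TABLE IF EXISTS ${db}.t; /* EOS */\nCREATE TABLE ${db}.t AS SELECT 1; /* EOS */\n";
		for (String statement : splitStatements(substitute(script, params))) {
			System.out.println(statement);
		}
	}
}
```

Because the split happens only on `;` followed by the `/* EOS */` marker, a plain `;` inside a statement does not end it, which is presumably why the explicit marker is used.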
@ -312,7 +312,8 @@ public class GraphCleaningFunctions extends CleaningFunctions {
|
||||||
}
|
}
|
||||||
|
|
||||||
if (value instanceof Datasource) {
|
if (value instanceof Datasource) {
|
||||||
// nothing to evaluate here
|
final Datasource d = (Datasource) value;
|
||||||
|
return Objects.nonNull(d.getOfficialname()) && StringUtils.isNotBlank(d.getOfficialname().getValue());
|
||||||
} else if (value instanceof Project) {
|
} else if (value instanceof Project) {
|
||||||
final Project p = (Project) value;
|
final Project p = (Project) value;
|
||||||
return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
|
return Objects.nonNull(p.getCode()) && StringUtils.isNotBlank(p.getCode().getValue());
|
||||||
|
|
|
@ -0,0 +1,20 @@
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"paramName": "issm",
|
||||||
|
"paramLongName": "isSparkSessionManaged",
|
||||||
|
"paramDescription": "when true will stop SparkSession after job execution",
|
||||||
|
"paramRequired": false
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"paramName": "hmu",
|
||||||
|
"paramLongName": "hiveMetastoreUris",
|
||||||
|
"paramDescription": "the hive metastore uris",
|
||||||
|
"paramRequired": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"paramName": "sql",
|
||||||
|
"paramLongName": "sql",
|
||||||
|
"paramDescription": "sql script to execute",
|
||||||
|
"paramRequired": true
|
||||||
|
}
|
||||||
|
]
|
|
@ -0,0 +1,39 @@
|
||||||
|
/*
|
||||||
|
* Copyright (c) 2024.
|
||||||
|
* SPDX-FileCopyrightText: © 2023 Consiglio Nazionale delle Ricerche
|
||||||
|
* SPDX-License-Identifier: AGPL-3.0-or-later
|
||||||
|
*/
|
||||||
|
|
||||||
|
package eu.dnetlib.dhp.actionmanager.promote;
|
||||||
|
|
||||||
|
/** Encodes the Actionset promotion strategies */
|
||||||
|
public class PromoteAction {
|
||||||
|
|
||||||
|
/** The supported actionset promotion strategies
|
||||||
|
*
|
||||||
|
* ENRICH: promotes only records in the actionset matching another record in the
|
||||||
|
* graph and enriches them applying the given MergeAndGet strategy
|
||||||
|
* UPSERT: promotes all the records in an actionset, matching records are updated
|
||||||
|
* using the given MergeAndGet strategy, while non-matching records are inserted as they are.
|
||||||
|
*/
|
||||||
|
public enum Strategy {
|
||||||
|
ENRICH, UPSERT
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Returns the string representation of the join type implementing the given PromoteAction.
|
||||||
|
*
|
||||||
|
* @param strategy the strategy to be used to promote the Actionset contents
|
||||||
|
* @return the join type used to implement the promotion strategy
|
||||||
|
*/
|
||||||
|
public static String joinTypeForStrategy(PromoteAction.Strategy strategy) {
|
||||||
|
switch (strategy) {
|
||||||
|
case ENRICH:
|
||||||
|
return "left_outer";
|
||||||
|
case UPSERT:
|
||||||
|
return "full_outer";
|
||||||
|
default:
|
||||||
|
throw new IllegalStateException("unsupported PromoteAction: " + strategy.toString());
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
|
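The difference between the two promotion strategies comes down to the join type. The sketch below emulates it on plain maps, assuming the graph table is the left side of the join; the class `PromoteStrategySketch`, the `promote` helper and the string concatenation used as a "merge" are all hypothetical and only illustrate which records survive.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class PromoteStrategySketch {

	public enum Strategy {
		ENRICH, UPSERT
	}

	// In-memory analogue of the join: keys are record ids, values stand in for record payloads.
	public static Map<String, String> promote(Map<String, String> graph, Map<String, String> actions,
		Strategy strategy) {
		Map<String, String> out = new TreeMap<>();
		// Every graph row survives under both join types; rows with a matching action are merged.
		for (Map.Entry<String, String> row : graph.entrySet()) {
			String payload = actions.get(row.getKey());
			out.put(row.getKey(), payload == null ? row.getValue() : row.getValue() + "+" + payload);
		}
		// Only the full outer join (UPSERT) also keeps action payloads that match no graph row.
		if (strategy == Strategy.UPSERT) {
			for (Map.Entry<String, String> action : actions.entrySet()) {
				out.putIfAbsent(action.getKey(), action.getValue());
			}
		}
		return out;
	}

	public static void main(String[] args) {
		Map<String, String> graph = new LinkedHashMap<>();
		graph.put("id1", "row1");
		graph.put("id2", "row2");
		Map<String, String> actions = new LinkedHashMap<>();
		actions.put("id2", "action2");
		actions.put("id3", "action3");

		System.out.println(promote(graph, actions, Strategy.ENRICH)); // {id1=row1, id2=row2+action2}
		System.out.println(promote(graph, actions, Strategy.UPSERT)); // {id1=row1, id2=row2+action2, id3=action3}
	}
}
```

With `ENRICH`, the unmatched action `id3` is dropped; with `UPSERT`, it is inserted as is.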
@ -67,8 +67,9 @@ public class PromoteActionPayloadForGraphTableJob {
|
||||||
String outputGraphTablePath = parser.get("outputGraphTablePath");
|
String outputGraphTablePath = parser.get("outputGraphTablePath");
|
||||||
logger.info("outputGraphTablePath: {}", outputGraphTablePath);
|
logger.info("outputGraphTablePath: {}", outputGraphTablePath);
|
||||||
|
|
||||||
MergeAndGet.Strategy strategy = MergeAndGet.Strategy.valueOf(parser.get("mergeAndGetStrategy").toUpperCase());
|
MergeAndGet.Strategy mergeAndGetStrategy = MergeAndGet.Strategy
|
||||||
logger.info("strategy: {}", strategy);
|
.valueOf(parser.get("mergeAndGetStrategy").toUpperCase());
|
||||||
|
logger.info("mergeAndGetStrategy: {}", mergeAndGetStrategy);
|
||||||
|
|
||||||
Boolean shouldGroupById = Optional
|
Boolean shouldGroupById = Optional
|
||||||
.ofNullable(parser.get("shouldGroupById"))
|
.ofNullable(parser.get("shouldGroupById"))
|
||||||
|
@ -76,6 +77,12 @@ public class PromoteActionPayloadForGraphTableJob {
|
||||||
.orElse(true);
|
.orElse(true);
|
||||||
logger.info("shouldGroupById: {}", shouldGroupById);
|
logger.info("shouldGroupById: {}", shouldGroupById);
|
||||||
|
|
||||||
|
PromoteAction.Strategy promoteActionStrategy = Optional
|
||||||
|
.ofNullable(parser.get("promoteActionStrategy"))
|
||||||
|
.map(PromoteAction.Strategy::valueOf)
|
||||||
|
.orElse(PromoteAction.Strategy.UPSERT);
|
||||||
|
logger.info("promoteActionStrategy: {}", promoteActionStrategy);
|
||||||
|
|
||||||
@SuppressWarnings("unchecked")
|
@SuppressWarnings("unchecked")
|
||||||
Class<? extends Oaf> rowClazz = (Class<? extends Oaf>) Class.forName(graphTableClassName);
|
Class<? extends Oaf> rowClazz = (Class<? extends Oaf>) Class.forName(graphTableClassName);
|
||||||
@SuppressWarnings("unchecked")
|
@SuppressWarnings("unchecked")
|
||||||
|
@ -97,7 +104,8 @@ public class PromoteActionPayloadForGraphTableJob {
|
||||||
inputGraphTablePath,
|
inputGraphTablePath,
|
||||||
inputActionPayloadPath,
|
inputActionPayloadPath,
|
||||||
outputGraphTablePath,
|
outputGraphTablePath,
|
||||||
strategy,
|
mergeAndGetStrategy,
|
||||||
|
promoteActionStrategy,
|
||||||
rowClazz,
|
rowClazz,
|
||||||
actionPayloadClazz,
|
actionPayloadClazz,
|
||||||
shouldGroupById);
|
shouldGroupById);
|
||||||
|
@ -124,14 +132,16 @@ public class PromoteActionPayloadForGraphTableJob {
|
||||||
String inputGraphTablePath,
|
String inputGraphTablePath,
|
||||||
String inputActionPayloadPath,
|
String inputActionPayloadPath,
|
||||||
String outputGraphTablePath,
|
String outputGraphTablePath,
|
||||||
MergeAndGet.Strategy strategy,
|
MergeAndGet.Strategy mergeAndGetStrategy,
|
||||||
|
PromoteAction.Strategy promoteActionStrategy,
|
||||||
Class<G> rowClazz,
|
Class<G> rowClazz,
|
||||||
Class<A> actionPayloadClazz, Boolean shouldGroupById) {
|
Class<A> actionPayloadClazz, Boolean shouldGroupById) {
|
||||||
Dataset<G> rowDS = readGraphTable(spark, inputGraphTablePath, rowClazz);
|
Dataset<G> rowDS = readGraphTable(spark, inputGraphTablePath, rowClazz);
|
||||||
Dataset<A> actionPayloadDS = readActionPayload(spark, inputActionPayloadPath, actionPayloadClazz);
|
Dataset<A> actionPayloadDS = readActionPayload(spark, inputActionPayloadPath, actionPayloadClazz);
|
||||||
|
|
||||||
Dataset<G> result = promoteActionPayloadForGraphTable(
|
Dataset<G> result = promoteActionPayloadForGraphTable(
|
||||||
rowDS, actionPayloadDS, strategy, rowClazz, actionPayloadClazz, shouldGroupById)
|
rowDS, actionPayloadDS, mergeAndGetStrategy, promoteActionStrategy, rowClazz, actionPayloadClazz,
|
||||||
|
shouldGroupById)
|
||||||
.map((MapFunction<G, G>) value -> value, Encoders.bean(rowClazz));
|
.map((MapFunction<G, G>) value -> value, Encoders.bean(rowClazz));
|
||||||
|
|
||||||
saveGraphTable(result, outputGraphTablePath);
|
saveGraphTable(result, outputGraphTablePath);
|
||||||
|
@ -183,7 +193,8 @@ public class PromoteActionPayloadForGraphTableJob {
|
||||||
private static <G extends Oaf, A extends Oaf> Dataset<G> promoteActionPayloadForGraphTable(
|
private static <G extends Oaf, A extends Oaf> Dataset<G> promoteActionPayloadForGraphTable(
|
||||||
Dataset<G> rowDS,
|
Dataset<G> rowDS,
|
||||||
Dataset<A> actionPayloadDS,
|
Dataset<A> actionPayloadDS,
|
||||||
MergeAndGet.Strategy strategy,
|
MergeAndGet.Strategy mergeAndGetStrategy,
|
||||||
|
PromoteAction.Strategy promoteActionStrategy,
|
||||||
Class<G> rowClazz,
|
Class<G> rowClazz,
|
||||||
Class<A> actionPayloadClazz,
|
Class<A> actionPayloadClazz,
|
||||||
Boolean shouldGroupById) {
|
Boolean shouldGroupById) {
|
||||||
|
@ -195,8 +206,9 @@ public class PromoteActionPayloadForGraphTableJob {
|
||||||
|
|
||||||
SerializableSupplier<Function<G, String>> rowIdFn = ModelSupport::idFn;
|
SerializableSupplier<Function<G, String>> rowIdFn = ModelSupport::idFn;
|
||||||
SerializableSupplier<Function<A, String>> actionPayloadIdFn = ModelSupport::idFn;
|
SerializableSupplier<Function<A, String>> actionPayloadIdFn = ModelSupport::idFn;
|
||||||
SerializableSupplier<BiFunction<G, A, G>> mergeRowWithActionPayloadAndGetFn = MergeAndGet.functionFor(strategy);
|
SerializableSupplier<BiFunction<G, A, G>> mergeRowWithActionPayloadAndGetFn = MergeAndGet
|
||||||
SerializableSupplier<BiFunction<G, G, G>> mergeRowsAndGetFn = MergeAndGet.functionFor(strategy);
|
.functionFor(mergeAndGetStrategy);
|
||||||
|
SerializableSupplier<BiFunction<G, G, G>> mergeRowsAndGetFn = MergeAndGet.functionFor(mergeAndGetStrategy);
|
||||||
SerializableSupplier<G> zeroFn = zeroFn(rowClazz);
|
SerializableSupplier<G> zeroFn = zeroFn(rowClazz);
|
||||||
SerializableSupplier<Function<G, Boolean>> isNotZeroFn = PromoteActionPayloadForGraphTableJob::isNotZeroFnUsingIdOrSourceAndTarget;
|
SerializableSupplier<Function<G, Boolean>> isNotZeroFn = PromoteActionPayloadForGraphTableJob::isNotZeroFnUsingIdOrSourceAndTarget;
|
||||||
|
|
||||||
|
@ -207,6 +219,7 @@ public class PromoteActionPayloadForGraphTableJob {
|
||||||
rowIdFn,
|
rowIdFn,
|
||||||
actionPayloadIdFn,
|
actionPayloadIdFn,
|
||||||
mergeRowWithActionPayloadAndGetFn,
|
mergeRowWithActionPayloadAndGetFn,
|
||||||
|
promoteActionStrategy,
|
||||||
rowClazz,
|
rowClazz,
|
||||||
actionPayloadClazz);
|
actionPayloadClazz);
|
||||||
|
|
||||||
|
|
|
@ -34,6 +34,7 @@ public class PromoteActionPayloadFunctions {
|
||||||
* @param rowIdFn Function used to get the id of graph table row
|
* @param rowIdFn Function used to get the id of graph table row
|
||||||
* @param actionPayloadIdFn Function used to get id of action payload instance
|
* @param actionPayloadIdFn Function used to get id of action payload instance
|
||||||
* @param mergeAndGetFn Function used to merge graph table row and action payload instance
|
* @param mergeAndGetFn Function used to merge graph table row and action payload instance
|
||||||
|
* @param promoteActionStrategy the Actionset promotion strategy
|
||||||
* @param rowClazz Class of graph table
|
* @param rowClazz Class of graph table
|
||||||
* @param actionPayloadClazz Class of action payload
|
* @param actionPayloadClazz Class of action payload
|
||||||
* @param <G> Type of graph table row
|
* @param <G> Type of graph table row
|
||||||
|
@ -46,6 +47,7 @@ public class PromoteActionPayloadFunctions {
|
||||||
SerializableSupplier<Function<G, String>> rowIdFn,
|
SerializableSupplier<Function<G, String>> rowIdFn,
|
||||||
SerializableSupplier<Function<A, String>> actionPayloadIdFn,
|
SerializableSupplier<Function<A, String>> actionPayloadIdFn,
|
||||||
SerializableSupplier<BiFunction<G, A, G>> mergeAndGetFn,
|
SerializableSupplier<BiFunction<G, A, G>> mergeAndGetFn,
|
||||||
|
PromoteAction.Strategy promoteActionStrategy,
|
||||||
Class<G> rowClazz,
|
Class<G> rowClazz,
|
||||||
Class<A> actionPayloadClazz) {
|
Class<A> actionPayloadClazz) {
|
||||||
if (!isSubClass(rowClazz, actionPayloadClazz)) {
|
if (!isSubClass(rowClazz, actionPayloadClazz)) {
|
||||||
|
@ -61,7 +63,7 @@ public class PromoteActionPayloadFunctions {
|
||||||
.joinWith(
|
.joinWith(
|
||||||
actionPayloadWithIdDS,
|
actionPayloadWithIdDS,
|
||||||
rowWithIdDS.col("_1").equalTo(actionPayloadWithIdDS.col("_1")),
|
rowWithIdDS.col("_1").equalTo(actionPayloadWithIdDS.col("_1")),
|
||||||
"full_outer")
|
PromoteAction.joinTypeForStrategy(promoteActionStrategy))
|
||||||
.map(
|
.map(
|
||||||
(MapFunction<Tuple2<Tuple2<String, G>, Tuple2<String, A>>, G>) value -> {
|
(MapFunction<Tuple2<Tuple2<String, G>, Tuple2<String, A>>, G>) value -> {
|
||||||
Optional<G> rowOpt = Optional.ofNullable(value._1()).map(Tuple2::_2);
|
Optional<G> rowOpt = Optional.ofNullable(value._1()).map(Tuple2::_2);
|
||||||
|
|
|
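Note on the join change above: the hard-coded `"full_outer"` join is replaced by `PromoteAction.joinTypeForStrategy(promoteActionStrategy)`, whose body is not part of this diff. The following is only a minimal sketch of how such a mapping might look. `PromoteActionSketch` is a hypothetical stand-in, and the `ENRICH` to `"left_outer"` mapping is an assumption; only `UPSERT` as the default and `"full_outer"` as the previous join type come from the diff.

```java
import java.util.Locale;

// Hypothetical stand-in for PromoteAction; the real class is not shown in this diff.
final class PromoteActionSketch {

    enum Strategy { ENRICH, UPSERT }

    private PromoteActionSketch() {
    }

    // UPSERT keeps the "full_outer" join the job used before this change.
    // ENRICH -> "left_outer" is an assumption: only rows already present
    // in the graph table would survive the join.
    static String joinTypeForStrategy(Strategy strategy) {
        if (strategy == null) {
            return "full_outer";
        }
        switch (strategy) {
            case ENRICH:
                return "left_outer";
            case UPSERT:
            default:
                return "full_outer";
        }
    }

    // Parses the optional --promoteActionStrategy value, falling back to UPSERT.
    static Strategy parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Strategy.UPSERT;
        }
        return Strategy.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(joinTypeForStrategy(parse(null)));
        System.out.println(joinTypeForStrategy(parse("enrich")));
    }
}
```

Under these assumptions, an `ENRICH` run would only update rows that already exist in the graph table, while `UPSERT` would also add rows that appear only in the action set.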
@ -41,6 +41,12 @@
|
||||||
"paramDescription": "strategy for merging graph table objects with action payload instances, MERGE_FROM_AND_GET or SELECT_NEWER_AND_GET",
|
"paramDescription": "strategy for merging graph table objects with action payload instances, MERGE_FROM_AND_GET or SELECT_NEWER_AND_GET",
|
||||||
"paramRequired": true
|
"paramRequired": true
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"paramName": "pas",
|
||||||
|
"paramLongName": "promoteActionStrategy",
|
||||||
|
"paramDescription": "strategy for promoting the actionset contents into the graph tables, ENRICH or UPSERT (default)",
|
||||||
|
"paramRequired": false
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"paramName": "sgid",
|
"paramName": "sgid",
|
||||||
"paramLongName": "shouldGroupById",
|
"paramLongName": "shouldGroupById",
|
||||||
|
|
|
@ -115,6 +115,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Dataset</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Dataset</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${workingDir}/dataset</arg>
|
<arg>--outputGraphTablePath</arg><arg>${workingDir}/dataset</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="DecisionPromoteResultActionPayloadForDatasetTable"/>
|
<ok to="DecisionPromoteResultActionPayloadForDatasetTable"/>
|
||||||
|
@ -167,6 +168,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/dataset</arg>
|
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/dataset</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="End"/>
|
<ok to="End"/>
|
||||||
|
|
|
@ -106,6 +106,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Datasource</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Datasource</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/datasource</arg>
|
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/datasource</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="End"/>
|
<ok to="End"/>
|
||||||
<error to="Kill"/>
|
<error to="Kill"/>
|
||||||
|
|
|
@ -106,6 +106,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Organization</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Organization</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/organization</arg>
|
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/organization</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="End"/>
|
<ok to="End"/>
|
||||||
<error to="Kill"/>
|
<error to="Kill"/>
|
||||||
|
|
|
@ -114,6 +114,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.OtherResearchProduct</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.OtherResearchProduct</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${workingDir}/otherresearchproduct</arg>
|
<arg>--outputGraphTablePath</arg><arg>${workingDir}/otherresearchproduct</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="DecisionPromoteResultActionPayloadForOtherResearchProductTable"/>
|
<ok to="DecisionPromoteResultActionPayloadForOtherResearchProductTable"/>
|
||||||
|
@ -166,6 +167,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/otherresearchproduct</arg>
|
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/otherresearchproduct</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="End"/>
|
<ok to="End"/>
|
||||||
|
|
|
@ -106,6 +106,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Project</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Project</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/project</arg>
|
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/project</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="End"/>
|
<ok to="End"/>
|
||||||
<error to="Kill"/>
|
<error to="Kill"/>
|
||||||
|
|
|
@ -115,6 +115,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${workingDir}/publication</arg>
|
<arg>--outputGraphTablePath</arg><arg>${workingDir}/publication</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="DecisionPromoteResultActionPayloadForPublicationTable"/>
|
<ok to="DecisionPromoteResultActionPayloadForPublicationTable"/>
|
||||||
|
@ -167,6 +168,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/publication</arg>
|
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/publication</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="End"/>
|
<ok to="End"/>
|
||||||
|
|
|
@ -107,6 +107,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Relation</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Relation</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/relation</arg>
|
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/relation</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="End"/>
|
<ok to="End"/>
|
||||||
<error to="Kill"/>
|
<error to="Kill"/>
|
||||||
|
|
|
@ -114,6 +114,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Software</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Software</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${workingDir}/software</arg>
|
<arg>--outputGraphTablePath</arg><arg>${workingDir}/software</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="DecisionPromoteResultActionPayloadForSoftwareTable"/>
|
<ok to="DecisionPromoteResultActionPayloadForSoftwareTable"/>
|
||||||
|
@ -166,6 +167,7 @@
|
||||||
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
|
<arg>--actionPayloadClassName</arg><arg>eu.dnetlib.dhp.schema.oaf.Result</arg>
|
||||||
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/software</arg>
|
<arg>--outputGraphTablePath</arg><arg>${outputGraphRootPath}/software</arg>
|
||||||
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
<arg>--mergeAndGetStrategy</arg><arg>${mergeAndGetStrategy}</arg>
|
||||||
|
<arg>--promoteActionStrategy</arg><arg>${promoteActionStrategy}</arg>
|
||||||
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
<arg>--shouldGroupById</arg><arg>${shouldGroupById}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="End"/>
|
<ok to="End"/>
|
||||||
|
|
|
@ -54,7 +54,7 @@ public class PromoteActionPayloadFunctionsTest {
|
||||||
RuntimeException.class,
|
RuntimeException.class,
|
||||||
() -> PromoteActionPayloadFunctions
|
() -> PromoteActionPayloadFunctions
|
||||||
.joinGraphTableWithActionPayloadAndMerge(
|
.joinGraphTableWithActionPayloadAndMerge(
|
||||||
null, null, null, null, null, OafImplSubSub.class, OafImpl.class));
|
null, null, null, null, null, null, OafImplSubSub.class, OafImpl.class));
|
||||||
}
|
}
|
||||||
|
|
||||||
@Test
|
@Test
|
||||||
|
@ -104,6 +104,7 @@ public class PromoteActionPayloadFunctionsTest {
|
||||||
rowIdFn,
|
rowIdFn,
|
||||||
actionPayloadIdFn,
|
actionPayloadIdFn,
|
||||||
mergeAndGetFn,
|
mergeAndGetFn,
|
||||||
|
PromoteAction.Strategy.UPSERT,
|
||||||
OafImplSubSub.class,
|
OafImplSubSub.class,
|
||||||
OafImplSubSub.class)
|
OafImplSubSub.class)
|
||||||
.collectAsList();
|
.collectAsList();
|
||||||
|
@ -183,6 +184,7 @@ public class PromoteActionPayloadFunctionsTest {
|
||||||
rowIdFn,
|
rowIdFn,
|
||||||
actionPayloadIdFn,
|
actionPayloadIdFn,
|
||||||
mergeAndGetFn,
|
mergeAndGetFn,
|
||||||
|
PromoteAction.Strategy.UPSERT,
|
||||||
OafImplSubSub.class,
|
OafImplSubSub.class,
|
||||||
OafImplSub.class)
|
OafImplSub.class)
|
||||||
.collectAsList();
|
.collectAsList();
|
||||||
|
|
|
@ -1,6 +1,16 @@
|
||||||
|
|
||||||
package eu.dnetlib.dhp.oa.dedup;
|
package eu.dnetlib.dhp.oa.dedup;
|
||||||
|
|
||||||
|
import java.util.*;
|
||||||
|
import java.util.stream.Stream;
|
||||||
|
|
||||||
|
import org.apache.commons.beanutils.BeanUtils;
|
||||||
|
import org.apache.commons.lang3.StringUtils;
|
||||||
|
import org.apache.spark.api.java.function.FlatMapFunction;
|
||||||
|
import org.apache.spark.api.java.function.MapFunction;
|
||||||
|
import org.apache.spark.api.java.function.ReduceFunction;
|
||||||
|
import org.apache.spark.sql.*;
|
||||||
|
|
||||||
import eu.dnetlib.dhp.oa.dedup.model.Identifier;
|
import eu.dnetlib.dhp.oa.dedup.model.Identifier;
|
||||||
import eu.dnetlib.dhp.oa.merge.AuthorMerger;
|
import eu.dnetlib.dhp.oa.merge.AuthorMerger;
|
||||||
import eu.dnetlib.dhp.schema.common.ModelSupport;
|
import eu.dnetlib.dhp.schema.common.ModelSupport;
|
||||||
|
@ -8,19 +18,10 @@ import eu.dnetlib.dhp.schema.oaf.Author;
|
||||||
import eu.dnetlib.dhp.schema.oaf.DataInfo;
|
import eu.dnetlib.dhp.schema.oaf.DataInfo;
|
||||||
import eu.dnetlib.dhp.schema.oaf.OafEntity;
|
import eu.dnetlib.dhp.schema.oaf.OafEntity;
|
||||||
import eu.dnetlib.dhp.schema.oaf.Result;
|
import eu.dnetlib.dhp.schema.oaf.Result;
|
||||||
import org.apache.commons.beanutils.BeanUtils;
|
|
||||||
import org.apache.commons.lang3.StringUtils;
|
|
||||||
import org.apache.spark.api.java.function.FlatMapFunction;
|
|
||||||
import org.apache.spark.api.java.function.MapFunction;
|
|
||||||
import org.apache.spark.api.java.function.ReduceFunction;
|
|
||||||
import org.apache.spark.sql.*;
|
|
||||||
import scala.Tuple2;
|
import scala.Tuple2;
|
||||||
import scala.Tuple3;
|
import scala.Tuple3;
|
||||||
import scala.collection.JavaConversions;
|
import scala.collection.JavaConversions;
|
||||||
|
|
||||||
import java.util.*;
|
|
||||||
import java.util.stream.Stream;
|
|
||||||
|
|
||||||
public class DedupRecordFactory {
|
public class DedupRecordFactory {
|
||||||
public static final class DedupRecordReduceState {
|
public static final class DedupRecordReduceState {
|
||||||
public final String dedupId;
|
public final String dedupId;
|
||||||
|
@ -39,7 +40,8 @@ public class DedupRecordFactory {
|
||||||
} else {
|
} else {
|
||||||
if (Result.class.isAssignableFrom(entity.getClass())) {
|
if (Result.class.isAssignableFrom(entity.getClass())) {
|
||||||
Result result = (Result) entity;
|
Result result = (Result) entity;
|
||||||
if (result.getDateofacceptance() != null && StringUtils.isNotBlank(result.getDateofacceptance().getValue())) {
|
if (result.getDateofacceptance() != null
|
||||||
|
&& StringUtils.isNotBlank(result.getDateofacceptance().getValue())) {
|
||||||
acceptanceDate.add(result.getDateofacceptance().getValue());
|
acceptanceDate.add(result.getDateofacceptance().getValue());
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
@ -50,6 +52,7 @@ public class DedupRecordFactory {
|
||||||
return dedupId;
|
return dedupId;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
private static final int MAX_ACCEPTANCE_DATE = 20;
|
private static final int MAX_ACCEPTANCE_DATE = 20;
|
||||||
|
|
||||||
private DedupRecordFactory() {
|
private DedupRecordFactory() {
|
||||||
|
@ -90,8 +93,12 @@ public class DedupRecordFactory {
|
||||||
.join(entities, JavaConversions.asScalaBuffer(Collections.singletonList("id")), "left")
|
.join(entities, JavaConversions.asScalaBuffer(Collections.singletonList("id")), "left")
|
||||||
.select("dedupId", "id", "kryoObject")
|
.select("dedupId", "id", "kryoObject")
|
||||||
.as(Encoders.tuple(Encoders.STRING(), Encoders.STRING(), kryoEncoder))
|
.as(Encoders.tuple(Encoders.STRING(), Encoders.STRING(), kryoEncoder))
|
||||||
.map((MapFunction<Tuple3<String, String, OafEntity>, DedupRecordReduceState>) t -> new DedupRecordReduceState(t._1(), t._2(), t._3()), Encoders.kryo(DedupRecordReduceState.class))
|
.map(
|
||||||
.groupByKey((MapFunction<DedupRecordReduceState, String>) DedupRecordReduceState::getDedupId, Encoders.STRING())
|
(MapFunction<Tuple3<String, String, OafEntity>, DedupRecordReduceState>) t -> new DedupRecordReduceState(
|
||||||
|
t._1(), t._2(), t._3()),
|
||||||
|
Encoders.kryo(DedupRecordReduceState.class))
|
||||||
|
.groupByKey(
|
||||||
|
(MapFunction<DedupRecordReduceState, String>) DedupRecordReduceState::getDedupId, Encoders.STRING())
|
||||||
.reduceGroups(
|
.reduceGroups(
|
||||||
(ReduceFunction<DedupRecordReduceState>) (t1, t2) -> {
|
(ReduceFunction<DedupRecordReduceState>) (t1, t2) -> {
|
||||||
if (t1.entity == null) {
|
if (t1.entity == null) {
|
||||||
|
@ -105,10 +112,8 @@ public class DedupRecordFactory {
|
||||||
t1.entity = reduceEntity(t1.entity, t2.entity);
|
t1.entity = reduceEntity(t1.entity, t2.entity);
|
||||||
|
|
||||||
return t1;
|
return t1;
|
||||||
}
|
})
|
||||||
)
|
.flatMap((FlatMapFunction<Tuple2<String, DedupRecordReduceState>, OafEntity>) t -> {
|
||||||
.flatMap
|
|
||||||
((FlatMapFunction<Tuple2<String, DedupRecordReduceState>, OafEntity>) t -> {
|
|
||||||
String dedupId = t._1();
|
String dedupId = t._1();
|
||||||
DedupRecordReduceState agg = t._2();
|
DedupRecordReduceState agg = t._2();
|
||||||
|
|
||||||
|
@ -116,10 +121,21 @@ public class DedupRecordFactory {
|
||||||
return Collections.emptyIterator();
|
return Collections.emptyIterator();
|
||||||
}
|
}
|
||||||
|
|
||||||
return Stream.concat(Stream.of(agg.getDedupId()), agg.aliases.stream())
|
return Stream
|
||||||
.map(id -> {
|
.concat(
|
||||||
|
Stream
|
||||||
|
.of(agg.getDedupId())
|
||||||
|
.map(id -> createDedupOafEntity(id, agg.entity, dataInfo, ts)),
|
||||||
|
agg.aliases
|
||||||
|
.stream()
|
||||||
|
.map(id -> createMergedDedupAliasOafEntity(id, agg.entity, dataInfo, ts)))
|
||||||
|
.iterator();
|
||||||
|
}, beanEncoder);
|
||||||
|
}
|
||||||
|
|
||||||
|
private static OafEntity createDedupOafEntity(String id, OafEntity base, DataInfo dataInfo, long ts) {
|
||||||
try {
|
try {
|
||||||
OafEntity res = (OafEntity) BeanUtils.cloneBean(agg.entity);
|
OafEntity res = (OafEntity) BeanUtils.cloneBean(base);
|
||||||
res.setId(id);
|
res.setId(id);
|
||||||
res.setDataInfo(dataInfo);
|
res.setDataInfo(dataInfo);
|
||||||
res.setLastupdatetimestamp(ts);
|
res.setLastupdatetimestamp(ts);
|
||||||
|
@ -127,8 +143,18 @@ public class DedupRecordFactory {
|
||||||
} catch (Exception e) {
|
} catch (Exception e) {
|
||||||
throw new RuntimeException(e);
|
throw new RuntimeException(e);
|
||||||
}
|
}
|
||||||
}).iterator();
|
}
|
||||||
}, beanEncoder);
|
|
||||||
|
private static OafEntity createMergedDedupAliasOafEntity(String id, OafEntity base, DataInfo dataInfo, long ts) {
|
||||||
|
try {
|
||||||
|
OafEntity res = createDedupOafEntity(id, base, dataInfo, ts);
|
||||||
|
DataInfo ds = (DataInfo) BeanUtils.cloneBean(dataInfo);
|
||||||
|
ds.setDeletedbyinference(true);
|
||||||
|
res.setDataInfo(ds);
|
||||||
|
return res;
|
||||||
|
} catch (Exception e) {
|
||||||
|
throw new RuntimeException(e);
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
private static OafEntity reduceEntity(OafEntity entity, OafEntity duplicate) {
|
private static OafEntity reduceEntity(OafEntity entity, OafEntity duplicate) {
|
||||||
|
@ -137,7 +163,6 @@ public class DedupRecordFactory {
|
||||||
return entity;
|
return entity;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
int compare = new IdentifierComparator<>()
|
int compare = new IdentifierComparator<>()
|
||||||
.compare(Identifier.newInstance(entity), Identifier.newInstance(duplicate));
|
.compare(Identifier.newInstance(entity), Identifier.newInstance(duplicate));
|
||||||
|
|
||||||
|
|
|
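Note on the `DedupRecordFactory` change above: the dedup record and its aliases are now produced by two separate helpers, and the alias copies get a cloned `DataInfo` with `deletedbyinference` set to `true`. The sketch below illustrates that split with plain Java types. `DedupAliasSketch` and `Entity` are hypothetical simplifications, and it assumes the shared `DataInfo` passed to the dedup record is not itself flagged as deleted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical, simplified model of the record/alias split in DedupRecordFactory.
final class DedupAliasSketch {

    // Stand-in for OafEntity together with its DataInfo.deletedbyinference flag.
    static final class Entity {
        final String id;
        final boolean deletedByInference;

        Entity(String id, boolean deletedByInference) {
            this.id = id;
            this.deletedByInference = deletedByInference;
        }
    }

    private DedupAliasSketch() {
    }

    // The representative dedup record stays visible (assumption: flag is false).
    static Entity createDedupEntity(String id) {
        return new Entity(id, false);
    }

    // Alias copies are marked as deleted by inference, as in the diff.
    static Entity createMergedAliasEntity(String id) {
        return new Entity(id, true);
    }

    // Emits the dedup record first, followed by one flagged copy per alias.
    static List<Entity> emit(String dedupId, List<String> aliases) {
        return Stream
            .concat(
                Stream.of(dedupId).map(DedupAliasSketch::createDedupEntity),
                aliases.stream().map(DedupAliasSketch::createMergedAliasEntity))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (Entity e : emit("dedup::1", Arrays.asList("dedup::old-a", "dedup::old-b"))) {
            System.out.println(e.id + " deletedByInference=" + e.deletedByInference);
        }
    }
}
```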
@ -242,7 +242,8 @@ public class SparkCreateMergeRels extends AbstractSparkAction {
|
||||||
|
|
||||||
// this was a pivot in a previous graph but it has been merged into a new group with different
|
// this was a pivot in a previous graph but it has been merged into a new group with different
|
||||||
// pivot
|
// pivot
|
||||||
if (!r.isNullAt(r.fieldIndex("lastUsage")) && !pivot.equals(id) && !dedupId.equals(pivotDedupId)) {
|
if (!r.isNullAt(r.fieldIndex("lastUsage")) && !pivot.equals(id)
|
||||||
|
&& !dedupId.equals(pivotDedupId)) {
|
||||||
// materialize the previous dedup record as a merge relation with the new one
|
// materialize the previous dedup record as a merge relation with the new one
|
||||||
res.add(new Tuple3<>(dedupId, pivotDedupId, null));
|
res.add(new Tuple3<>(dedupId, pivotDedupId, null));
|
||||||
}
|
}
|
||||||
|
|
|
@ -0,0 +1,26 @@
|
||||||
|
<configuration>
|
||||||
|
<property>
|
||||||
|
<name>jobTracker</name>
|
||||||
|
<value>yarnRM</value>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>nameNode</name>
|
||||||
|
<value>hdfs://nameservice1</value>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>oozie.use.system.libpath</name>
|
||||||
|
<value>true</value>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>oozie.action.sharelib.for.spark</name>
|
||||||
|
<value>spark2</value>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>hiveMetastoreUris</name>
|
||||||
|
<value>thrift://iis-cdh5-test-m3.ocean.icm.edu.pl:9083</value>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>sparkSqlWarehouseDir</name>
|
||||||
|
<value>/user/hive/warehouse</value>
|
||||||
|
</property>
|
||||||
|
</configuration>
|
|
@ -0,0 +1,62 @@
|
||||||
|
|
||||||
|
CREATE TABLE `${pivot_history_db}`.`dataset_new` STORED AS PARQUET AS
|
||||||
|
WITH pivots (
|
||||||
|
SELECT property.value AS id, '${new_graph_date}' AS usedIn FROM `${new_graph_db}`.`relation`
|
||||||
|
LEFT SEMI JOIN `${new_graph_db}`.`dataset` ON relation.source = dataset.id
|
||||||
|
LATERAL VIEW EXPLODE(properties) AS property WHERE relClass = 'isMergedIn' AND property.key = 'pivot'
|
||||||
|
UNION
|
||||||
|
SELECT id, usedIn FROM `${pivot_history_db}`.`dataset` LATERAL VIEW EXPLODE(usages) AS usedIn
|
||||||
|
)
|
||||||
|
SELECT id, min(usedIn) as firstUsage, max(usedIn) as lastUsage, collect_set(usedIn) as usages
|
||||||
|
FROM pivots
|
||||||
|
GROUP BY id; /*EOS*/
|
||||||
|
CREATE TABLE `${pivot_history_db}`.`publication_new` STORED AS PARQUET AS
|
||||||
|
WITH pivots (
|
||||||
|
SELECT property.value AS id, '${new_graph_date}' AS usedIn FROM `${new_graph_db}`.`relation`
|
||||||
|
LEFT SEMI JOIN `${new_graph_db}`.`publication` ON relation.source = publication.id
|
||||||
|
LATERAL VIEW EXPLODE(properties) AS property WHERE relClass = 'isMergedIn' AND property.key = 'pivot'
|
||||||
|
UNION
|
||||||
|
SELECT id, usedIn FROM `${pivot_history_db}`.`publication` LATERAL VIEW EXPLODE(usages) AS usedIn
|
||||||
|
)
|
||||||
|
SELECT id, min(usedIn) as firstUsage, max(usedIn) as lastUsage, collect_set(usedIn) as usages
|
||||||
|
FROM pivots
|
||||||
|
GROUP BY id; /*EOS*/
|
||||||
|
CREATE TABLE `${pivot_history_db}`.`software_new` STORED AS PARQUET AS
|
||||||
|
WITH pivots (
|
||||||
|
SELECT property.value AS id, '${new_graph_date}' AS usedIn FROM `${new_graph_db}`.`relation`
|
||||||
|
LEFT SEMI JOIN `${new_graph_db}`.`software` ON relation.source = software.id
|
||||||
|
LATERAL VIEW EXPLODE(properties) AS property WHERE relClass = 'isMergedIn' AND property.key = 'pivot'
|
||||||
|
UNION
|
||||||
|
SELECT id, usedIn FROM `${pivot_history_db}`.`software` LATERAL VIEW EXPLODE(usages) AS usedIn
|
||||||
|
)
|
||||||
|
SELECT id, min(usedIn) as firstUsage, max(usedIn) as lastUsage, collect_set(usedIn) as usages
|
||||||
|
FROM pivots
|
||||||
|
GROUP BY id; /*EOS*/
|
||||||
|
CREATE TABLE `${pivot_history_db}`.`otherresearchproduct_new` STORED AS PARQUET AS
|
||||||
|
WITH pivots (
|
||||||
|
SELECT property.value AS id, '${new_graph_date}' AS usedIn FROM `${new_graph_db}`.`relation`
|
||||||
|
LEFT SEMI JOIN `${new_graph_db}`.`otherresearchproduct` ON relation.source = otherresearchproduct.id
|
||||||
|
LATERAL VIEW EXPLODE(properties) AS property WHERE relClass = 'isMergedIn' AND property.key = 'pivot'
|
||||||
|
UNION
|
||||||
|
SELECT id, usedIn FROM `${pivot_history_db}`.`otherresearchproduct` LATERAL VIEW EXPLODE(usages) AS usedIn
|
||||||
|
)
|
||||||
|
SELECT id, min(usedIn) as firstUsage, max(usedIn) as lastUsage, collect_set(usedIn) as usages
|
||||||
|
FROM pivots
|
||||||
|
GROUP BY id; /*EOS*/
|
||||||
|
|
||||||
|
|
||||||
|
DROP TABLE IF EXISTS `${pivot_history_db}`.`dataset_old`; /*EOS*/
|
||||||
|
ALTER TABLE `${pivot_history_db}`.`dataset` RENAME TO `${pivot_history_db}`.`dataset_old`; /*EOS*/
|
||||||
|
ALTER TABLE `${pivot_history_db}`.`dataset_new` RENAME TO `${pivot_history_db}`.`dataset`; /*EOS*/
|
||||||
|
|
||||||
|
DROP TABLE IF EXISTS `${pivot_history_db}`.`publication_old`; /*EOS*/
|
||||||
|
ALTER TABLE `${pivot_history_db}`.`publication` RENAME TO `${pivot_history_db}`.`publication_old`; /*EOS*/
|
||||||
|
ALTER TABLE `${pivot_history_db}`.`publication_new` RENAME TO `${pivot_history_db}`.`publication`; /*EOS*/
|
||||||
|
|
||||||
|
DROP TABLE IF EXISTS `${pivot_history_db}`.`software_old`; /*EOS*/
|
||||||
|
ALTER TABLE `${pivot_history_db}`.`software` RENAME TO `${pivot_history_db}`.`software_old`; /*EOS*/
|
||||||
|
ALTER TABLE `${pivot_history_db}`.`software_new` RENAME TO `${pivot_history_db}`.`software`; /*EOS*/
|
||||||
|
|
||||||
|
DROP TABLE IF EXISTS `${pivot_history_db}`.`otherresearchproduct_old`; /*EOS*/
|
||||||
|
ALTER TABLE `${pivot_history_db}`.`otherresearchproduct` RENAME TO `${pivot_history_db}`.`otherresearchproduct_old`; /*EOS*/
|
||||||
|
ALTER TABLE `${pivot_history_db}`.`otherresearchproduct_new` RENAME TO `${pivot_history_db}`.`otherresearchproduct`; /*EOS*/
|
|
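Note on the pivot history SQL above: each `CREATE TABLE ... _new` statement unions the pivots of the new graph (tagged with `${new_graph_date}`) with the exploded `usages` of the existing history table, then groups by `id` to compute `firstUsage`, `lastUsage` and `usages`. The sketch below mirrors that aggregation with plain Java collections. `PivotHistorySketch` and its method names are hypothetical, and it assumes dates are `yyyyMMdd` strings (as in the commented `20231208` example) so that string order matches date order.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory model of the per-pivot usage aggregation done in sql.sql.
final class PivotHistorySketch {

    static final class Usage {
        final String firstUsage;
        final String lastUsage;
        final SortedSet<String> usages;

        Usage(SortedSet<String> usages) {
            this.firstUsage = usages.first(); // min(usedIn)
            this.lastUsage = usages.last();   // max(usedIn)
            this.usages = usages;             // collect_set(usedIn); sorted here, unordered in SQL
        }
    }

    private PivotHistorySketch() {
    }

    // history: pivot id -> dates already recorded; currentPivots: pivots of the new graph.
    static Map<String, Usage> aggregate(
        Map<String, List<String>> history, Collection<String> currentPivots, String newGraphDate) {

        Map<String, TreeSet<String>> merged = new TreeMap<>();
        history.forEach((id, dates) -> merged.computeIfAbsent(id, k -> new TreeSet<>()).addAll(dates));
        for (String id : currentPivots) {
            merged.computeIfAbsent(id, k -> new TreeSet<>()).add(newGraphDate);
        }

        Map<String, Usage> result = new TreeMap<>();
        merged.forEach((id, dates) -> {
            if (!dates.isEmpty()) {
                result.put(id, new Usage(dates));
            }
        });
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> history = new TreeMap<>();
        List<String> dates = new ArrayList<>();
        dates.add("20230101");
        history.put("pivot-1", dates);

        List<String> current = new ArrayList<>();
        current.add("pivot-1");
        current.add("pivot-2");

        aggregate(history, current, "20231208")
            .forEach((id, u) -> System.out.println(id + " " + u.firstUsage + " " + u.lastUsage + " " + u.usages));
    }
}
```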
@ -0,0 +1,95 @@
|
||||||
|
<workflow-app name="Update pivot history" xmlns="uri:oozie:workflow:0.5">
|
||||||
|
<parameters>
|
||||||
|
<!-- properties used in SQL -->
|
||||||
|
<property>
|
||||||
|
<name>pivot_history_db</name>
|
||||||
|
<!-- <value>openaire_beta_pivots_test</value> -->
|
||||||
|
<description>Pivot history DB on hive</description>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>new_graph_db</name>
|
||||||
|
<!--<value>openaire_beta_20231208</value> -->
|
||||||
|
<description>New graph DB on hive</description>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>new_graph_date</name>
|
||||||
|
<!-- <value>20231208</value> -->
|
||||||
|
<description>Creation date of new graph db</description>
|
||||||
|
</property>
|
||||||
|
|
||||||
|
<!-- RunSQLSparkJob properties -->
|
||||||
|
<property>
|
||||||
|
<name>hiveMetastoreUris</name>
|
||||||
|
<description>hive server metastore URIs</description>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>sparkSqlWarehouseDir</name>
|
||||||
|
</property>
|
||||||
|
<!-- General oozie workflow properties -->
|
||||||
|
<property>
|
||||||
|
<name>sparkClusterOpts</name>
|
||||||
|
<value>--conf spark.network.timeout=600 --conf spark.extraListeners= --conf spark.sql.queryExecutionListeners= --conf spark.yarn.historyServer.address=http://iis-cdh5-test-m3.ocean.icm.edu.pl:18088 --conf spark.eventLog.dir=hdfs://nameservice1/user/spark/applicationHistory</value>
|
||||||
|
<description>spark cluster-wide options</description>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>sparkResourceOpts</name>
|
||||||
|
<value>--executor-memory=3G --conf spark.executor.memoryOverhead=3G --executor-cores=6 --driver-memory=8G --driver-cores=4</value>
|
||||||
|
<description>spark resource options</description>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>sparkApplicationOpts</name>
|
||||||
|
<value>--conf spark.sql.shuffle.partitions=3840</value>
|
||||||
|
<description>spark application options</description>
|
||||||
|
</property>
|
||||||
|
</parameters>
|
||||||
|
|
||||||
|
<global>
|
||||||
|
<job-tracker>${jobTracker}</job-tracker>
|
||||||
|
<name-node>${nameNode}</name-node>
|
||||||
|
<configuration>
|
||||||
|
<property>
|
||||||
|
<name>mapreduce.job.queuename</name>
|
||||||
|
<value>${queueName}</value>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>oozie.launcher.mapred.job.queue.name</name>
|
||||||
|
<value>${oozieLauncherQueueName}</value>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>oozie.action.sharelib.for.spark</name>
|
||||||
|
<value>${oozieActionShareLibForSpark2}</value>
|
||||||
|
</property>
|
||||||
|
</configuration>
|
||||||
|
</global>
|
||||||
|
|
||||||
|
<start to="UpgradePivotHistory"/>
|
||||||
|
|
||||||
|
<kill name="Kill">
|
||||||
|
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
|
||||||
|
</kill>
|
||||||
|
|
||||||
|
<action name="UpgradePivotHistory">
|
||||||
|
<spark xmlns="uri:oozie:spark-action:0.2">
|
||||||
|
<master>yarn</master>
|
||||||
|
<mode>cluster</mode>
|
||||||
|
<name>Upgrade Pivot History</name>
|
||||||
|
<class>eu.dnetlib.dhp.oozie.RunSQLSparkJob</class>
|
||||||
|
<jar>dhp-dedup-openaire-${projectVersion}.jar</jar>
|
||||||
|
<spark-opts>
|
||||||
|
--conf spark.sql.warehouse.dir=${sparkSqlWarehouseDir}
|
||||||
|
${sparkClusterOpts}
|
||||||
|
${sparkResourceOpts}
|
||||||
|
${sparkApplicationOpts}
|
||||||
|
</spark-opts>
|
||||||
|
<arg>--hiveMetastoreUris</arg><arg>${hiveMetastoreUris}</arg>
|
||||||
|
<arg>--sql</arg><arg>eu/dnetlib/dhp/oa/dedup/pivothistory/oozie_app/sql.sql</arg>
|
||||||
|
<arg>--pivot_history_db</arg><arg>${pivot_history_db}</arg>
|
||||||
|
<arg>--new_graph_db</arg><arg>${new_graph_db}</arg>
|
||||||
|
<arg>--new_graph_date</arg><arg>${new_graph_date}</arg>
|
||||||
|
</spark>
|
||||||
|
<ok to="End"/>
|
||||||
|
<error to="Kill"/>
|
||||||
|
</action>
|
||||||
|
|
||||||
|
<end name="End"/>
|
||||||
|
</workflow-app>
|
|
@@ -25,7 +25,7 @@ case class mappingAffiliation(name: String) {}
 
 case class mappingAuthor(
   given: Option[String],
-  family: String,
+  family: Option[String],
   sequence: Option[String],
   ORCID: Option[String],
   affiliation: Option[mappingAffiliation]
@@ -226,14 +226,14 @@ case object Crossref2Oaf {
 
     //Mapping Author
     val authorList: List[mappingAuthor] =
-      (json \ "author").extractOrElse[List[mappingAuthor]](List())
+      (json \ "author").extract[List[mappingAuthor]].filter(a => a.family.isDefined)
 
     val sorted_list = authorList.sortWith((a: mappingAuthor, b: mappingAuthor) =>
       a.sequence.isDefined && a.sequence.get.equalsIgnoreCase("first")
     )
 
     result.setAuthor(sorted_list.zipWithIndex.map { case (a, index) =>
-      generateAuhtor(a.given.orNull, a.family, a.ORCID.orNull, index)
+      generateAuhtor(a.given.orNull, a.family.get, a.ORCID.orNull, index)
     }.asJava)
 
     // Mapping instance
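Note on the `Crossref2Oaf` hunk above: `family` becomes optional and authors without a family name are filtered out before `generateAuhtor` is called. As far as can be inferred from the diff, a mandatory `family: String` presumably made extraction of the whole author list fail when a single entry lacked that field, so `extractOrElse` fell back to an empty list and the record lost all its authors. The Java sketch below illustrates the same filter-then-map idea only; the PR code is Scala, and `AuthorFilterSketch`, `RawAuthor` and `fullNames` are hypothetical names that do not appear in the PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

final class AuthorFilterSketch {

    // Hypothetical stand-in for the Scala case class mappingAuthor (only two fields kept).
    static final class RawAuthor {
        final Optional<String> given;
        final Optional<String> family;

        RawAuthor(final String given, final String family) {
            this.given = Optional.ofNullable(given);
            this.family = Optional.ofNullable(family);
        }
    }

    // Drops authors without a family name, then formats the rest as "given family".
    static List<String> fullNames(final List<RawAuthor> authors) {
        return authors.stream()
            .filter(a -> a.family.isPresent())
            .map(a -> a.given.map(g -> g + " ").orElse("") + a.family.get())
            .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        final List<RawAuthor> authors = Arrays.asList(
            new RawAuthor("Ada", "Lovelace"),
            new RawAuthor(null, null),   // entry without a family name: skipped
            new RawAuthor(null, "Turing"));
        System.out.println(fullNames(authors));
    }
}
```

Calling `family.get()` is safe here only because the `filter` step runs first, which mirrors the order of `.filter(a => a.family.isDefined)` and `a.family.get` in the hunk.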
File diff suppressed because one or more lines are too long
@@ -22,6 +22,13 @@ class CrossrefMappingTest {
   val logger: Logger = LoggerFactory.getLogger(Crossref2Oaf.getClass)
   val mapper = new ObjectMapper()
 
+  @Test
+  def testMissingAuthorParser():Unit = {
+    val json: String = Source.fromInputStream(getClass.getResourceAsStream("/eu/dnetlib/doiboost/crossref/s41567-022-01757-y.json")).mkString
+    val result = Crossref2Oaf.convert(json)
+    result.filter(o => o.isInstanceOf[Publication]).map(p=> p.asInstanceOf[Publication]).foreach(p =>assertTrue(p.getAuthor.size()>0))
+  }
+
   @Test
   def testFunderRelationshipsMapping(): Unit = {
     val template = Source
@@ -114,7 +114,7 @@
             <arg>--sourcePath</arg><arg>${sourcePath}</arg>
             <arg>--hive_metastore_uris</arg><arg>${hive_metastore_uris}</arg>
             <arg>--resultTableName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>
-            <arg>--outputPath</arg><arg>${workingDir}/orcid/preparedInfo/targetOrcidAssoc</arg>
+            <arg>--outputPath</arg><arg>${workingDir}/orcid/targetOrcidAssoc</arg>
             <arg>--allowedsemrels</arg><arg>${allowedsemrels}</arg>
         </spark>
         <ok to="wait"/>
@@ -142,7 +142,7 @@
             <arg>--sourcePath</arg><arg>${sourcePath}</arg>
             <arg>--hive_metastore_uris</arg><arg>${hive_metastore_uris}</arg>
             <arg>--resultTableName</arg><arg>eu.dnetlib.dhp.schema.oaf.Dataset</arg>
-            <arg>--outputPath</arg><arg>${workingDir}/orcid/preparedInfo/targetOrcidAssoc</arg>
+            <arg>--outputPath</arg><arg>${workingDir}/orcid/targetOrcidAssoc</arg>
             <arg>--allowedsemrels</arg><arg>${allowedsemrels}</arg>
         </spark>
         <ok to="wait"/>
@@ -170,7 +170,7 @@
             <arg>--sourcePath</arg><arg>${sourcePath}</arg>
             <arg>--hive_metastore_uris</arg><arg>${hive_metastore_uris}</arg>
             <arg>--resultTableName</arg><arg>eu.dnetlib.dhp.schema.oaf.OtherResearchProduct</arg>
-            <arg>--outputPath</arg><arg>${workingDir}/orcid/preparedInfo/targetOrcidAssoc</arg>
+            <arg>--outputPath</arg><arg>${workingDir}/orcid/targetOrcidAssoc</arg>
             <arg>--allowedsemrels</arg><arg>${allowedsemrels}</arg>
         </spark>
         <ok to="wait"/>
@@ -198,7 +198,7 @@
             <arg>--sourcePath</arg><arg>${sourcePath}</arg>
             <arg>--hive_metastore_uris</arg><arg>${hive_metastore_uris}</arg>
             <arg>--resultTableName</arg><arg>eu.dnetlib.dhp.schema.oaf.Software</arg>
-            <arg>--outputPath</arg><arg>${workingDir}/orcid/preparedInfo/targetOrcidAssoc</arg>
+            <arg>--outputPath</arg><arg>${workingDir}/orcid/targetOrcidAssoc</arg>
             <arg>--allowedsemrels</arg><arg>${allowedsemrels}</arg>
         </spark>
         <ok to="wait"/>
@@ -225,8 +225,8 @@
                 --conf spark.dynamicAllocation.enabled=true
                 --conf spark.dynamicAllocation.maxExecutors=${spark2MaxExecutors}
             </spark-opts>
-            <arg>--sourcePath</arg><arg>${workingDir}/orcid/orcidprop</arg>
-            <arg>--outputPath</arg><arg>${workingDir}/orcid/orcidprop/mergedOrcidAssoc</arg>
+            <arg>--sourcePath</arg><arg>${workingDir}/orcid/targetOrcidAssoc</arg>
+            <arg>--outputPath</arg><arg>${workingDir}/orcid/mergedOrcidAssoc</arg>
         </spark>
         <ok to="fork-join-exec-propagation"/>
         <error to="Kill"/>
@@ -247,9 +247,10 @@
             <class>eu.dnetlib.dhp.orcidtoresultfromsemrel.SparkOrcidToResultFromSemRelJob</class>
             <jar>dhp-enrichment-${projectVersion}.jar</jar>
             <spark-opts>
-                --executor-cores=${sparkExecutorCores}
-                --executor-memory=${sparkExecutorMemory}
+                --executor-cores=4
+                --executor-memory=4G
                 --driver-memory=${sparkDriverMemory}
+                --conf spark.executor.memoryOverhead=5G
                 --conf spark.extraListeners=${spark2ExtraListeners}
                 --conf spark.sql.queryExecutionListeners=${spark2SqlQueryExecutionListeners}
                 --conf spark.yarn.historyServer.address=${spark2YarnHistoryServerAddress}
@@ -259,9 +260,9 @@
                 --conf spark.speculation=false
                 --conf spark.hadoop.mapreduce.map.speculative=false
                 --conf spark.hadoop.mapreduce.reduce.speculative=false
-                --conf spark.sql.shuffle.partitions=3840
+                --conf spark.sql.shuffle.partitions=15000
             </spark-opts>
-            <arg>--possibleUpdatesPath</arg><arg>${workingDir}/orcid/orcidprop/mergedOrcidAssoc</arg>
+            <arg>--possibleUpdatesPath</arg><arg>${workingDir}/orcid/mergedOrcidAssoc</arg>
             <arg>--sourcePath</arg><arg>${sourcePath}/publication</arg>
             <arg>--resultTableName</arg><arg>eu.dnetlib.dhp.schema.oaf.Publication</arg>
             <arg>--outputPath</arg><arg>${outputPath}/publication</arg>
@@ -291,7 +292,7 @@
                 --conf spark.hadoop.mapreduce.map.speculative=false
                 --conf spark.hadoop.mapreduce.reduce.speculative=false
             </spark-opts>
-            <arg>--possibleUpdatesPath</arg><arg>${workingDir}/orcid/orcidprop/mergedOrcidAssoc</arg>
+            <arg>--possibleUpdatesPath</arg><arg>${workingDir}/orcid/mergedOrcidAssoc</arg>
             <arg>--sourcePath</arg><arg>${sourcePath}/dataset</arg>
             <arg>--resultTableName</arg><arg>eu.dnetlib.dhp.schema.oaf.Dataset</arg>
             <arg>--outputPath</arg><arg>${outputPath}/dataset</arg>
@@ -321,7 +322,7 @@
                 --conf spark.hadoop.mapreduce.map.speculative=false
                 --conf spark.hadoop.mapreduce.reduce.speculative=false
             </spark-opts>
-            <arg>--possibleUpdatesPath</arg><arg>${workingDir}/orcid/orcidprop/mergedOrcidAssoc</arg>
+            <arg>--possibleUpdatesPath</arg><arg>${workingDir}/orcid/mergedOrcidAssoc</arg>
             <arg>--sourcePath</arg><arg>${sourcePath}/otherresearchproduct</arg>
             <arg>--resultTableName</arg><arg>eu.dnetlib.dhp.schema.oaf.OtherResearchProduct</arg>
             <arg>--outputPath</arg><arg>${outputPath}/otherresearchproduct</arg>
@@ -351,7 +352,7 @@
                 --conf spark.hadoop.mapreduce.map.speculative=false
                 --conf spark.hadoop.mapreduce.reduce.speculative=false
             </spark-opts>
-            <arg>--possibleUpdatesPath</arg><arg>${workingDir}/orcid/orcidprop/mergedOrcidAssoc</arg>
+            <arg>--possibleUpdatesPath</arg><arg>${workingDir}/orcid/mergedOrcidAssoc</arg>
             <arg>--sourcePath</arg><arg>${sourcePath}/software</arg>
             <arg>--resultTableName</arg><arg>eu.dnetlib.dhp.schema.oaf.Software</arg>
             <arg>--outputPath</arg><arg>${outputPath}/software</arg>
@@ -238,11 +238,23 @@ public class OdfToOafMapper extends AbstractMdRecordToOafMapper {
                 (Element) doc
                     .selectSingleNode(
                         "//*[local-name()='metadata']/*[local-name() = 'resource']/*[local-name() = 'resourceType']"))
-            .map(element -> {
-                final String resourceTypeURI = element.attributeValue("uri");
-                final String resourceTypeAnyURI = element.attributeValue("anyURI");
-                final String resourceTypeTxt = element.getText();
-                final String resourceTypeGeneral = element.attributeValue("resourceTypeGeneral");
+            .map(e -> {
+                final String resourceTypeURI = Optional
+                    .ofNullable(e.attributeValue("uri"))
+                    .filter(StringUtils::isNotBlank)
+                    .orElse(null);
+                final String resourceTypeAnyURI = Optional
+                    .ofNullable(e.attributeValue("anyURI"))
+                    .filter(StringUtils::isNotBlank)
+                    .orElse(null);
+                final String resourceTypeTxt = Optional
+                    .ofNullable(e.getText())
+                    .filter(StringUtils::isNotBlank)
+                    .orElse(null);
+                final String resourceTypeGeneral = Optional
+                    .ofNullable(e.attributeValue("resourceTypeGeneral"))
+                    .filter(StringUtils::isNotBlank)
+                    .orElse(null);
 
                 return ObjectUtils
                     .firstNonNull(resourceTypeURI, resourceTypeAnyURI, resourceTypeTxt, resourceTypeGeneral);
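Note on the `OdfToOafMapper` hunk above: each candidate value is now normalised with `Optional.ofNullable(...).filter(StringUtils::isNotBlank).orElse(null)` before `ObjectUtils.firstNonNull` picks one, so a blank value can no longer win over a later, meaningful one. Judging from the `<resourceType resourceTypeGeneral="JournalArticle"/>` element in one of the XML fixtures added by this PR, the motivating case is presumably an element whose empty text was previously selected ahead of `resourceTypeGeneral`. The JDK-only Java sketch below illustrates the pattern under those assumptions; `ResourceTypeSketch`, `blankToNull` and `firstNonBlank` are hypothetical names, and `trim().isEmpty()` only approximates `StringUtils.isNotBlank`.

```java
import java.util.Arrays;
import java.util.Objects;
import java.util.Optional;

final class ResourceTypeSketch {

    // Maps null, empty, and whitespace-only strings to null; other values pass through.
    static String blankToNull(final String s) {
        return Optional.ofNullable(s)
            .filter(v -> !v.trim().isEmpty())
            .orElse(null);
    }

    // Returns the first candidate that is not blank, or null when every candidate is blank.
    static String firstNonBlank(final String... candidates) {
        return Arrays.stream(candidates)
            .map(ResourceTypeSketch::blankToNull)
            .filter(Objects::nonNull)
            .findFirst()
            .orElse(null);
    }

    public static void main(final String[] args) {
        // Candidate order follows the hunk: uri, anyURI, element text, resourceTypeGeneral.
        System.out.println(firstNonBlank(null, null, "", "JournalArticle"));
    }
}
```

The candidate order matters: a non-blank `uri` still takes precedence over the element text and `resourceTypeGeneral`, exactly as the argument order of `firstNonNull` in the hunk implies.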
@@ -43,6 +43,17 @@
             <arg>--graphPath</arg><arg>${graphPath}</arg>
             <arg>--master</arg><arg>yarn</arg>
         </spark>
+        <ok to="reset_outputpath"/>
+        <error to="Kill"/>
+    </action>
+
+    <action name="reset_outputpath">
+        <fs>
+            <delete path="${targetPath}/datasource"/>
+            <delete path="${targetPath}/organization"/>
+            <delete path="${targetPath}/project"/>
+            <delete path="${targetPath}/relation"/>
+        </fs>
         <ok to="copy_datasource"/>
         <error to="Kill"/>
     </action>
@@ -1171,6 +1171,34 @@ class MappersTest {
 
     }
 
+    @Test
+    void test_Zenodo2() throws IOException {
+        final String xml = IOUtils.toString(Objects.requireNonNull(getClass().getResourceAsStream("odf_zenodo2.xml")));
+        final List<Oaf> list = new OdfToOafMapper(vocs, false, true).processMdRecord(xml);
+
+        assertEquals(3, list.size());
+        Publication p = cleanup((Publication) list.get(0), vocs);
+
+        assertNotNull(p.getInstance());
+        assertEquals(1, p.getInstance().size());
+
+        final Instance instance = p.getInstance().get(0);
+
+        assertNotNull(instance.getInstanceTypeMapping());
+        assertEquals(1, instance.getInstanceTypeMapping().size());
+
+        Optional<InstanceTypeMapping> coarType = instance
+            .getInstanceTypeMapping()
+            .stream()
+            .filter(itm -> ModelConstants.OPENAIRE_COAR_RESOURCE_TYPES_3_1.equals(itm.getVocabularyName()))
+            .findFirst();
+
+        assertTrue(coarType.isPresent());
+        assertNotNull(coarType.get().getOriginalType());
+        assertNull(coarType.get().getTypeCode());
+        assertNull(coarType.get().getTypeLabel());
+    }
+
     @Test
     void testROHub2() throws IOException {
         final String xml = IOUtils
@@ -1229,7 +1257,7 @@ class MappersTest {
     }
 
     @Test
-    public void testD4ScienceTraining() throws IOException {
+    void testD4ScienceTraining() throws IOException {
         final String xml = IOUtils
             .toString(Objects.requireNonNull(getClass().getResourceAsStream("d4science-1-training.xml")));
         final List<Oaf> list = new OdfToOafMapper(vocs, false, true).processMdRecord(xml);
@@ -1240,7 +1268,7 @@ class MappersTest {
     }
 
     @Test
-    public void testD4ScienceDataset() throws IOException {
+    void testD4ScienceDataset() throws IOException {
         final String xml = IOUtils
             .toString(Objects.requireNonNull(getClass().getResourceAsStream("d4science-2-dataset.xml")));
         final List<Oaf> list = new OdfToOafMapper(vocs, false, true).processMdRecord(xml);
@@ -1250,6 +1278,21 @@ class MappersTest {
         System.out.println("***************");
     }
 
+    @Test
+    void testIRISPub() throws IOException, DocumentException {
+        final String xml = IOUtils.toString(Objects.requireNonNull(getClass().getResourceAsStream("iris-odf.xml")));
+        final List<Oaf> list = new OdfToOafMapper(vocs, false, true).processMdRecord(xml);
+        System.out.println("***************");
+        System.out.println(new ObjectMapper().writeValueAsString(list));
+        System.out.println("***************");
+        final Publication p = (Publication) list.get(0);
+        assertNotNull(p.getInstance().get(0).getUrl().get(0));
+        assertValidId(p.getId());
+        System.out.println(p.getInstance().get(0).getUrl());
+        p.getPid().forEach(x -> System.out.println(x.getValue()));
+        p.getInstance().get(0).getAlternateIdentifier().forEach(x -> System.out.println(x.getValue()));
+
+    }
     @Test
     void testNotWellFormed() throws IOException {
         final String xml = IOUtils
@@ -0,0 +1,215 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<record xmlns:datacite="http://datacite.org/schema/kernel-4"
+        xmlns:dc="http://purl.org/dc/elements/1.1/"
+        xmlns:dr="http://www.driver-repository.eu/namespace/dr"
+        xmlns:dri="http://www.driver-repository.eu/namespace/dri"
+        xmlns:oaf="http://namespace.openaire.eu/oaf"
+        xmlns:oai="http://www.openarchives.org/OAI/2.0/"
+        xmlns:oaire="http://namespace.openaire.eu/schema/oaire/"
+        xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
+    <oai:header xmlns="http://namespace.openaire.eu/" xmlns:prov="http://www.openarchives.org/OAI/2.0/provenance">
+        <identifier>oai:air.unimi.it:2434/907506</identifier>
+        <datestamp>2024-01-04T12:42:51Z</datestamp>
+        <setSpec>com_2434_73555</setSpec>
+        <setSpec>col_2434_73557</setSpec>
+        <setSpec>openaire</setSpec>
+        <dr:dateOfTransformation>2024-01-29T16:56:50.632Z</dr:dateOfTransformation>
+
+        <dri:objIdentifier>od______1261::ff2d9e058e7bea90a27f41c31078e601</dri:objIdentifier>
+        <dri:recordIdentifier>oai:air.unimi.it:2434/907506</dri:recordIdentifier>
+        <dri:dateOfCollection/>
+        <dri:mdFormat/>
+        <dri:mdFormatInterpretation/>
+        <dri:repositoryId/>
+        <oaf:datasourceprefix> od______1261</oaf:datasourceprefix>
+    </oai:header>
+    <metadata>
+        <oaire:resource xmlns:oaire="http://namespace.openaire.eu/schema/oaire/"
+                        xmlns:exslt="http://exslt.org/common"
+                        xmlns:xs="http://www.w3.org/2001/XMLSchema"
+                        xmlns:rdf="http://www.w3.org/TR/rdf-concepts/"
+                        xmlns:doc="http://www.lyncode.com/xoai"
+                        xmlns:dc="http://purl.org/dc/elements/1.1/"
+                        xmlns:datacite="http://datacite.org/schema/kernel-4"
+                        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+                        xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning"
+                        xmlns="http://www.openarchives.org/OAI/2.0/"
+                        xsi:schemaLocation="http://namespace.openaire.eu/schema/oaire/ https://www.openaire.eu/schema/repo-lit/4.0/openaire.xsd">
+            <datacite:titles>
+                <datacite:title xml:lang="en">Ensuring tests of conservation interventions build on existing literature</datacite:title>
+            </datacite:titles>
+            <datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>W.J. Sutherland</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>S.T. Alvarez-Castaneda</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>T. Amano</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>R. Ambrosini</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>P. Atkinson</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>J.M. Baxter</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>A.L. Bond</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>P.J. Boon</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>K.L. Buchanan</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>J. Barlow</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>G. Bogliani</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>O.M. Bragg</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>M. Burgman</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>M.W. Cadotte</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>M. Calver</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>S.J. Cooke</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>R.T. Corlett</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>V. Devictor</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>J.G. Ewen</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>M. Fisher</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>G. Freeman</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>E. Game</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>B.J. Godley</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>C. Gortazar</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>I.R. Hartley</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>D.L. Hawksworth</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>K.A. Hobson</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>M.-. Lu</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>B. Martin-Lopez</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>K. Ma</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>A. Machado</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>D. Mae</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>M. Mangiacotti</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>D.J. Mccafferty</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>V. Melfi</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>S. Molur</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>A.J. Moore</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>S.D. Murphy</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>D. Norri</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>A.P.E. van Oudenhoven</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>J. Power</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>E.C. Ree</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>M.W. Schwartz</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>I. Storch</datacite:creatorName>
+                </datacite:creator>
+                <datacite:creator>
+                    <datacite:creatorName>C. Wordley</datacite:creatorName>
+                </datacite:creator>
+            </datacite:creator>
+            <datacite:relatedIdentifiers>
+            </datacite:relatedIdentifiers>
+            <datacite:dates>
+                <datacite:date dateType="Accepted">2020</datacite:date>
+                <datacite:date dateType="Issued">2020</datacite:date>
+                <datacite:date dateType="Available">2022-06-20</datacite:date>
+            </datacite:dates>
+            <dc:language>eng</dc:language>
+            <dc:publisher>Wiley Blackwell Publishing</dc:publisher>
+            <oaire:resourceType resourceTypeGeneral="literature"
+                                uri="http://purl.org/coar/resource_type/c_6501">journal article</oaire:resourceType>
+            <dc:format>application/pdf</dc:format>
+            <datacite:identifier xmlns:datacite="http://datacite.org/schema/kernel-3"
+                                 identifierType="Handle">2434/907506</datacite:identifier>
+            <datacite:rights rightsURI="http://purl.org/coar/access_right/c_abf2">open access</datacite:rights>
+            <datacite:subjects>
+                <datacite:subject>Conservation of Natural Resources</datacite:subject>
+            </datacite:subjects>
+            <datacite:sizes/>
+            <datacite:sizes/>
+            <datacite:sizes>
+                <datacite:size>191802 bytes</datacite:size>
+            </datacite:sizes>
+            <oaire:file accessRightsURI="" mimeType="application/pdf" objectType="fulltext">https://air.unimi.it/bitstream/2434/907506/4/Full%20manuscript%20resubmitted.pdf</oaire:file>
+        </oaire:resource>
+        <oaf:identifier identifierType="DOI">10.1111/cobi.13555</oaf:identifier>
+        <oaf:identifier identifierType="PMID">32779884</oaf:identifier>
+        <oaf:fulltext>https://air.unimi.it/bitstream/2434/907506/4/Full%20manuscript%20resubmitted.pdf</oaf:fulltext>
+        <dr:CobjCategory type="publication">0001</dr:CobjCategory>
+        <oaf:dateAccepted>2020-01-01</oaf:dateAccepted>
+        <oaf:accessrights>OPEN</oaf:accessrights>
+        <oaf:language>eng</oaf:language>
+        <oaf:hostedBy name="Archivio Istituzionale della Ricerca dell'Università degli Studi di Milano"
+                      id="opendoar____::1261"/>
+        <oaf:collectedFrom name="Archivio Istituzionale della Ricerca dell'Università degli Studi di Milano"
+                           id="opendoar____::1261"/>
+    </metadata>
+</record>
@@ -0,0 +1,59 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<record xmlns:oaf="http://namespace.openaire.eu/oaf"
+        xmlns:oai="http://www.openarchives.org/OAI/2.0/"
+        xmlns:datacite="http://datacite.org/schema/kernel-3"
+        xmlns:dr="http://www.driver-repository.eu/namespace/dr"
+        xmlns:dri="http://www.driver-repository.eu/namespace/dri">
+    <header xmlns="http://www.openarchives.org/OAI/2.0/">
+        <identifier>oai:zenodo.org:1596086</identifier>
+        <datestamp>2020-01-20T13:50:28Z</datestamp>
+        <setSpec>openaire</setSpec>
+        <dr:dateOfTransformation>2024-02-08T11:03:10.994Z</dr:dateOfTransformation>
+        <dri:objIdentifier>od______2659::036d5555a6688ed00c8d0da97bdece3b</dri:objIdentifier>
+        <dri:dateOfCollection>2024-02-08T11:03:10.994Z</dri:dateOfCollection>
+        <dri:dateOfTransformation>2024-02-08T11:03:10.994Z</dri:dateOfTransformation>
+    </header>
+    <metadata>
+        <resource xmlns="http://datacite.org/schema/kernel-4"
+                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+                  xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
+            <identifier identifierType="URL">https://zenodo.org/record/1596086</identifier>
+            <alternateIdentifiers xmlns="http://datacite.org/schema/kernel-3"/>
+            <creators>
+                <creator>
+                    <creatorName>Bonney, T. G.</creatorName>
+                    <givenName>T. G.</givenName>
+                    <familyName>Bonney</familyName>
+                </creator>
+            </creators>
+            <titles>
+                <title>Ice Blocks on a Moraine</title>
+            </titles>
+            <publisher>Zenodo</publisher>
+            <publicationYear>1889</publicationYear>
+            <dates>
+                <date dateType="Issued">1889-08-22</date>
+            </dates>
+            <resourceType resourceTypeGeneral="JournalArticle"/>
+            <relatedIdentifiers>
+                <relatedIdentifier relatedIdentifierType="DOI" relationType="IsIdenticalTo"
+                    >10.1038/040391a0</relatedIdentifier>
+            </relatedIdentifiers>
+            <rightsList>
+                <rights rightsURI="https://creativecommons.org/publicdomain/zero/1.0/legalcode"
+                    >Creative Commons Zero v1.0 Universal</rights>
+                <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
+            </rightsList>
+            <descriptions>
+                <description descriptionType="Abstract">n/a</description>
+            </descriptions>
+        </resource>
+        <dr:CobjCategory type="publication">0001</dr:CobjCategory>
+        <oaf:dateAccepted>1889-08-22</oaf:dateAccepted>
+        <oaf:accessrights>OPEN</oaf:accessrights>
+        <oaf:license>http://creativecommons.org/publicdomain/zero/1.0/legalcode</oaf:license>
+        <oaf:language/>
+        <oaf:hostedBy name="ZENODO" id="opendoar____::2659"/>
+        <oaf:collectedFrom name="ZENODO" id="opendoar____::2659"/>
+    </metadata>
+</record>
@@ -62,8 +62,8 @@ public class XmlConverterJob {
         final String outputPath = parser.get("outputPath");
         log.info("outputPath: {}", outputPath);
 
-        final String isLookupUrl = parser.get("isLookupUrl");
-        log.info("isLookupUrl: {}", isLookupUrl);
+        final String contextApiBaseUrl = parser.get("contextApiBaseUrl");
+        log.info("contextApiBaseUrl: {}", contextApiBaseUrl);
 
         final SparkConf conf = new SparkConf();
         conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
@@ -71,7 +71,7 @@ public class XmlConverterJob {
 
         runWithSparkSession(conf, isSparkSessionManaged, spark -> {
             removeOutputDir(spark, outputPath);
-            convertToXml(spark, inputPath, outputPath, ContextMapper.fromIS(isLookupUrl));
+            convertToXml(spark, inputPath, outputPath, ContextMapper.fromAPI(contextApiBaseUrl));
         });
     }
 
@@ -1,18 +1,22 @@
 
 package eu.dnetlib.dhp.oa.provision.utils;
 
-import java.io.Serializable;
-import java.io.StringReader;
+import java.io.*;
+import java.net.HttpURLConnection;
+import java.net.URL;
 import java.util.HashMap;
 
 import org.dom4j.Document;
 import org.dom4j.DocumentException;
 import org.dom4j.Node;
 import org.dom4j.io.SAXReader;
+import org.jetbrains.annotations.NotNull;
 import org.xml.sax.SAXException;
 
 import com.google.common.base.Joiner;
 
+import eu.dnetlib.dhp.common.api.context.*;
+import eu.dnetlib.dhp.common.rest.DNetRestClient;
 import eu.dnetlib.dhp.utils.ISLookupClientFactory;
 import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpException;
 import eu.dnetlib.enabling.is.lookup.rmi.ISLookUpService;
@@ -23,6 +27,45 @@ public class ContextMapper extends HashMap<String, ContextDef> implements Serial
 
     private static final String XQUERY = "for $x in //RESOURCE_PROFILE[.//RESOURCE_TYPE/@value='ContextDSResourceType']//*[name()='context' or name()='category' or name()='concept'] return <entry id=\"{$x/@id}\" label=\"{$x/@label|$x/@name}\" name=\"{$x/name()}\" type=\"{$x/@type}\"/>";
 
+    public static ContextMapper fromAPI(final String baseURL) throws Exception {
+
+        final ContextMapper contextMapper = new ContextMapper();
+
+        for (ContextSummary ctx : DNetRestClient
+            .doGET(String.format("%s/contexts", baseURL), ContextSummaryList.class)) {
|
||||||
|
|
||||||
|
contextMapper.put(ctx.getId(), new ContextDef(ctx.getId(), ctx.getLabel(), "context", ctx.getType()));
|
||||||
|
|
||||||
|
for (CategorySummary cat : DNetRestClient
|
||||||
|
.doGET(String.format("%s/context/%s?all=true", baseURL, ctx.getId()), CategorySummaryList.class)) {
|
||||||
|
contextMapper.put(cat.getId(), new ContextDef(cat.getId(), cat.getLabel(), "category", ""));
|
||||||
|
if (cat.isHasConcept()) {
|
||||||
|
for (ConceptSummary c : DNetRestClient
|
||||||
|
.doGET(
|
||||||
|
String.format("%s/context/category/%s?all=true", baseURL, cat.getId()),
|
||||||
|
ConceptSummaryList.class)) {
|
||||||
|
contextMapper.put(c.getId(), new ContextDef(c.getId(), c.getLabel(), "concept", ""));
|
||||||
|
if (c.isHasSubConcept()) {
|
||||||
|
for (ConceptSummary cs : c.getConcepts()) {
|
||||||
|
contextMapper.put(cs.getId(), new ContextDef(cs.getId(), cs.getLabel(), "concept", ""));
|
||||||
|
if (cs.isHasSubConcept()) {
|
||||||
|
for (ConceptSummary css : cs.getConcepts()) {
|
||||||
|
contextMapper
|
||||||
|
.put(
|
||||||
|
css.getId(),
|
||||||
|
new ContextDef(css.getId(), css.getLabel(), "concept", ""));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return contextMapper;
|
||||||
|
}
|
||||||
|
|
||||||
|
@Deprecated
|
||||||
public static ContextMapper fromIS(final String isLookupUrl)
|
public static ContextMapper fromIS(final String isLookupUrl)
|
||||||
throws DocumentException, ISLookUpException, SAXException {
|
throws DocumentException, ISLookUpException, SAXException {
|
||||||
ISLookUpService isLookUp = ISLookupClientFactory.getLookUpService(isLookupUrl);
|
ISLookUpService isLookUp = ISLookupClientFactory.getLookUpService(isLookupUrl);
|
||||||
|
@ -32,6 +75,7 @@ public class ContextMapper extends HashMap<String, ContextDef> implements Serial
|
||||||
return fromXml(sb.toString());
|
return fromXml(sb.toString());
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@Deprecated
|
||||||
public static ContextMapper fromXml(final String xml) throws DocumentException, SAXException {
|
public static ContextMapper fromXml(final String xml) throws DocumentException, SAXException {
|
||||||
final ContextMapper contextMapper = new ContextMapper();
|
final ContextMapper contextMapper = new ContextMapper();
|
||||||
|
|
||||||
|
|
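Review note on `ContextMapper.fromAPI`: the nested loops in this hunk register sub-concepts down to a fixed depth (concept, sub-concept, sub-sub-concept). If the context API can return deeper nesting (an assumption, the diff does not say so), a recursive walk would cover any depth with less code. A minimal sketch, using a hypothetical `Concept` class as a stand-in for `ConceptSummary` and a plain `Map<String, String>` in place of `ContextMapper`/`ContextDef`:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ConceptFlattener {

	// Hypothetical stand-in for ConceptSummary: an id, a label and nested sub-concepts.
	static final class Concept {
		final String id;
		final String label;
		final List<Concept> children = new ArrayList<>();

		Concept(final String id, final String label) {
			this.id = id;
			this.label = label;
		}

		// Adds a sub-concept and returns this node, so trees can be built inline.
		Concept add(final Concept child) {
			children.add(child);
			return this;
		}
	}

	// Depth-first walk: registers the concept, then recurses into its sub-concepts.
	static void collect(final Concept c, final Map<String, String> out) {
		out.put(c.id, c.label);
		for (final Concept child : c.children) {
			collect(child, out);
		}
	}

	// Flattens a list of root concepts into an id -> label map, whatever the nesting depth.
	static Map<String, String> flatten(final List<Concept> roots) {
		final Map<String, String> out = new LinkedHashMap<>();
		for (final Concept root : roots) {
			collect(root, out);
		}
		return out;
	}

	public static void main(final String[] args) {
		// Illustrative ids only; four levels deep, one more than the fixed loops handle.
		final Concept root = new Concept("ctx::cat::c1", "Level 1")
			.add(
				new Concept("ctx::cat::c1::c2", "Level 2")
					.add(
						new Concept("ctx::cat::c1::c2::c3", "Level 3")
							.add(new Concept("ctx::cat::c1::c2::c3::c4", "Level 4"))));
		flatten(Collections.singletonList(root)).forEach((id, label) -> System.out.println(id + " -> " + label));
	}
}
```

In the real method, `collect` would call `contextMapper.put(c.getId(), new ContextDef(c.getId(), c.getLabel(), "concept", ""))` and recurse over `c.getConcepts()` when `c.isHasSubConcept()` is true.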
|
@ -12,9 +12,9 @@
|
||||||
"paramRequired": true
|
"paramRequired": true
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"paramName": "ilu",
|
"paramName": "cau",
|
||||||
"paramLongName": "isLookupUrl",
|
"paramLongName": "contextApiBaseUrl",
|
||||||
"paramDescription": "URL of the isLookUp Service",
|
"paramDescription": "URL of the context API",
|
||||||
"paramRequired": true
|
"paramRequired": true
|
||||||
}
|
}
|
||||||
]
|
]
|
||||||
|
|
|
@ -9,6 +9,10 @@
|
||||||
<name>isLookupUrl</name>
|
<name>isLookupUrl</name>
|
||||||
<description>URL for the isLookup service</description>
|
<description>URL for the isLookup service</description>
|
||||||
</property>
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>contextApiBaseUrl</name>
|
||||||
|
<description>context API URL</description>
|
||||||
|
</property>
|
||||||
<property>
|
<property>
|
||||||
<name>relPartitions</name>
|
<name>relPartitions</name>
|
||||||
<description>number or partitions for the relations Dataset</description>
|
<description>number or partitions for the relations Dataset</description>
|
||||||
|
@ -589,7 +593,7 @@
|
||||||
</spark-opts>
|
</spark-opts>
|
||||||
<arg>--inputPath</arg><arg>${workingDir}/join_entities</arg>
|
<arg>--inputPath</arg><arg>${workingDir}/join_entities</arg>
|
||||||
<arg>--outputPath</arg><arg>${workingDir}/xml</arg>
|
<arg>--outputPath</arg><arg>${workingDir}/xml</arg>
|
||||||
<arg>--isLookupUrl</arg><arg>${isLookupUrl}</arg>
|
<arg>--contextApiBaseUrl</arg><arg>${contextApiBaseUrl}</arg>
|
||||||
</spark>
|
</spark>
|
||||||
<ok to="should_index"/>
|
<ok to="should_index"/>
|
||||||
<error to="Kill"/>
|
<error to="Kill"/>
|
||||||
|
|
|
@ -244,4 +244,27 @@ public class XmlRecordFactoryTest {
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@Test
|
||||||
|
public void testIrisGuidelines4() throws DocumentException, IOException {
|
||||||
|
final ContextMapper contextMapper = new ContextMapper();
|
||||||
|
|
||||||
|
final XmlRecordFactory xmlRecordFactory = new XmlRecordFactory(contextMapper, false,
|
||||||
|
XmlConverterJob.schemaLocation);
|
||||||
|
|
||||||
|
final Publication p = OBJECT_MAPPER
|
||||||
|
.readValue(
|
||||||
|
IOUtils.toString(getClass().getResourceAsStream("iris-odf-4.json")),
|
||||||
|
Publication.class);
|
||||||
|
|
||||||
|
final String xml = xmlRecordFactory.build(new JoinedEntity<>(p));
|
||||||
|
|
||||||
|
assertNotNull(xml);
|
||||||
|
|
||||||
|
final Document doc = new SAXReader().read(new StringReader(xml));
|
||||||
|
|
||||||
|
assertNotNull(doc);
|
||||||
|
System.out.println(doc.asXML());
|
||||||
|
|
||||||
|
}
|
||||||
|
|
||||||
}
|
}
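Review note on `testIrisGuidelines4`: the test only checks that the XML is non-null and parseable, then prints it. Asserting on a value inside the record would catch regressions. A minimal sketch using only the JDK's XPath API; the element names in the sample XML are hypothetical, and the real test would use dom4j with the project's namespaces rather than this helper:

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XmlAssertionSketch {

	// Parses the XML string and returns the string value of the first node matching the XPath.
	static String valueOf(final String xml, final String xpathExpr) throws Exception {
		final Document doc = DocumentBuilderFactory
			.newInstance()
			.newDocumentBuilder()
			.parse(new InputSource(new StringReader(xml)));
		final XPath xpath = XPathFactory.newInstance().newXPath();
		return xpath.evaluate(xpathExpr, doc);
	}

	public static void main(final String[] args) throws Exception {
		// Hypothetical, namespace-free sample record.
		final String xml = "<record><result><accessrights>OPEN</accessrights></result></record>";
		final String actual = valueOf(xml, "/record/result/accessrights");
		if (!"OPEN".equals(actual)) {
			throw new AssertionError("unexpected access rights: " + actual);
		}
		System.out.println("access rights: " + actual);
	}
}
```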
|
||||||
|
|
File diff suppressed because one or more lines are too long
|
@ -8,6 +8,11 @@
|
||||||
<modelVersion>4.0.0</modelVersion>
|
<modelVersion>4.0.0</modelVersion>
|
||||||
<artifactId>dhp-stats-update</artifactId>
|
<artifactId>dhp-stats-update</artifactId>
|
||||||
<dependencies>
|
<dependencies>
|
||||||
|
<dependency>
|
||||||
|
<groupId>eu.dnetlib.dhp</groupId>
|
||||||
|
<artifactId>dhp-common</artifactId>
|
||||||
|
<version>${project.version}</version>
|
||||||
|
</dependency>
|
||||||
<dependency>
|
<dependency>
|
||||||
<groupId>org.apache.spark</groupId>
|
<groupId>org.apache.spark</groupId>
|
||||||
<artifactId>spark-core_${scala.binary.version}</artifactId>
|
<artifactId>spark-core_${scala.binary.version}</artifactId>
|
||||||
|
|
File diff suppressed because it is too large
|
@ -64,6 +64,26 @@
|
||||||
<name>hadoop_user_name</name>
|
<name>hadoop_user_name</name>
|
||||||
<description>user name of the wf owner</description>
|
<description>user name of the wf owner</description>
|
||||||
</property>
|
</property>
|
||||||
|
|
||||||
|
<property>
|
||||||
|
<name>sparkSqlWarehouseDir</name>
|
||||||
|
</property>
|
||||||
|
<!-- General oozie workflow properties -->
|
||||||
|
<property>
|
||||||
|
<name>sparkClusterOpts</name>
|
||||||
|
<value>--conf spark.network.timeout=600 --conf spark.extraListeners= --conf spark.sql.queryExecutionListeners= --conf spark.yarn.historyServer.address=http://iis-cdh5-test-m3.ocean.icm.edu.pl:18088 --conf spark.eventLog.dir=hdfs://nameservice1/user/spark/applicationHistory</value>
|
||||||
|
<description>spark cluster-wide options</description>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>sparkResourceOpts</name>
|
||||||
|
<value>--executor-memory=6G --conf spark.executor.memoryOverhead=4G --executor-cores=6 --driver-memory=8G --driver-cores=4</value>
|
||||||
|
<description>spark resource options</description>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>sparkApplicationOpts</name>
|
||||||
|
<value>--conf spark.sql.shuffle.partitions=3840</value>
|
||||||
|
<description>spark application options</description>
|
||||||
|
</property>
|
||||||
</parameters>
|
</parameters>
|
||||||
|
|
||||||
<global>
|
<global>
|
||||||
|
@ -78,6 +98,14 @@
|
||||||
<name>hive.txn.timeout</name>
|
<name>hive.txn.timeout</name>
|
||||||
<value>${hive_timeout}</value>
|
<value>${hive_timeout}</value>
|
||||||
</property>
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>hive.mapjoin.followby.gby.localtask.max.memory.usage</name>
|
||||||
|
<value>0.80</value>
|
||||||
|
</property>
|
||||||
|
<property>
|
||||||
|
<name>oozie.action.sharelib.for.spark</name>
|
||||||
|
<value>${oozieActionShareLibForSpark2}</value>
|
||||||
|
</property>
|
||||||
<property>
|
<property>
|
||||||
<name>mapred.job.queue.name</name>
|
<name>mapred.job.queue.name</name>
|
||||||
<value>analytics</value>
|
<value>analytics</value>
|
||||||
|
@ -318,12 +346,23 @@
|
||||||
</action>
|
</action>
|
||||||
|
|
||||||
<action name="Step16-createIndicatorsTables">
|
<action name="Step16-createIndicatorsTables">
|
||||||
<hive2 xmlns="uri:oozie:hive2-action:0.1">
|
<spark xmlns="uri:oozie:spark-action:0.2">
|
||||||
<jdbc-url>${hive_jdbc_url}</jdbc-url>
|
<master>yarn</master>
|
||||||
<script>scripts/step16-createIndicatorsTables.sql</script>
|
<mode>cluster</mode>
|
||||||
<param>stats_db_name=${stats_db_name}</param>
|
<name>Step16-createIndicatorsTables</name>
|
||||||
<param>external_stats_db_name=${external_stats_db_name}</param>
|
<class>eu.dnetlib.dhp.oozie.RunSQLSparkJob</class>
|
||||||
</hive2>
|
<jar>dhp-stats-update-${projectVersion}.jar</jar>
|
||||||
|
<spark-opts>
|
||||||
|
--conf spark.sql.warehouse.dir=${sparkSqlWarehouseDir}
|
||||||
|
${sparkClusterOpts}
|
||||||
|
${sparkResourceOpts}
|
||||||
|
${sparkApplicationOpts}
|
||||||
|
</spark-opts>
|
||||||
|
<arg>--hiveMetastoreUris</arg><arg>${hive_metastore_uris}</arg>
|
||||||
|
<arg>--sql</arg><arg>eu/dnetlib/dhp/oa/graph/stats/oozie_app/scripts/step16-createIndicatorsTables.sql</arg>
|
||||||
|
<arg>--stats_db_name</arg><arg>${stats_db_name}</arg>
|
||||||
|
<arg>--external_stats_db_name</arg><arg>${external_stats_db_name}</arg>
|
||||||
|
</spark>
|
||||||
<ok to="Step16_1-definitions"/>
|
<ok to="Step16_1-definitions"/>
|
||||||
<error to="Kill"/>
|
<error to="Kill"/>
|
||||||
</action>
|
</action>
|
||||||
|
|